Consistent Ensemble Distillation for Audio Tagging (CED)

March 20, 2026 · View on GitHub

This repo is the source for the ICASSP 2024 Paper Consistent Ensemble Distillation for Audio Tagging.

Framework

Model	Parameters (M)	AS-20K (mAP)	AS-2M (mAP)
CED-Tiny	5.5	36.5	48.1
CED-Mini	9.6	38.5	49.0
CED-Small	22	41.6	49.6
CED-Base	86	44.0	50.0

All models work with 16 kHz audio and use 64-dim Mel-spectrograms, making them very fast. CED-Tiny should be faster than MobileNets on a single x86 CPU (even though MACs/FLops would indicate otherwise).

Pretrained models could be downloaded from Zenodo or Hugging Face.

	Zenodo	Hugging Face
CED-Tiny	Link	Link
CED-Mini	Link	Link
CED-Small	Link	Link
CED-Base	Link	Link

Demo

We have an online demo available here for CED-Base.

Inference/Usage

Huggingface Transformers

Inference (Onnx, Recommended)

>>> from optimum.onnxruntime import ORTModelForAudioClassification

>>> model_name = "mispeech/ced-mini"
>>> model = ORTModelForAudioClassification.from_pretrained(model_name, trust_remote_code=True)

>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("/path-to/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> input_name = model.session.get_inputs()[0].name
>>> output = model(**{input_name: torch.randn(1, 16000)})
>>> logits = output.logits.squeeze()
>>> for idx in logits.argsort()[-2:][::-1]:
>>>   print(f"{model.config.id2label[idx]}: {logits[idx]:.4f}")
'Finger snapping: 0.9155'
'Slap: 0.0567'

Or standard Huggingface Transformers

# pip install transformers
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

model_name = "mispeech/ced-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForAudioClassification.from_pretrained(model_name, trust_remote_code=True)

import torchaudio
audio, sampling_rate = torchaudio.load("/path-to/JeD5V5aaaoI_931_932.wav")
assert sampling_rate == 16000
inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")

import torch
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = torch.argmax(logits, dim=-1).item()
model.config.id2label[predicted_class_id]

Locally

To just use the CED models for inference, simply run:

git clone https://github.com/Richermans/CED/
cd CED/
pip3 install -r requirements.txt
python3 inference.py resources/*

Note that I experienced some problems with higher versions of hdf5, so if possible please use 1.12.1.

By default we use CED-mini here, which offers a good trade-off between performance and speed. One can switch the models with the -m flag:

python3 inference.py -m ced_tiny resources/*
python3 inference.py -m ced_mini resources/*
python3 inference.py -m ced_small resources/*
python3 inference.py -m ced_base resources/*

You can also use the models directly from Hugging Face, see here for usage instructions.

Training/Reproducing results

1. Preparing data

First, one needs to download Audioset. One might use one of our own scripts.

For example, one can put the downloaded files into a folder named data/balanced and data/unbalanced, data/eval such as:

data/balanced/
├── -0DdlOuIFUI_50.000.wav
├── -0DLPzsiXXE_30.000.wav
├── -0FHUc78Gqo_30.000.wav
├── -0mjrMposBM_80.000.wav
├── -0O3e95y4gE_100.000.wav
…

data/unbalanced/
├── --04kMEQOAs_0.000_10.000.wav
├── --0aJtOMp2M_30.000_40.000.wav
├── --0AzKXCHj8_22.000_32.000.wav
├── --0B3G_C3qc_10.000_20.000.wav
├── --0bntG9i7E_30.000_40.000.wav
…

data/eval/
├── 007P6bFgRCU_10.000_20.000.wav
├── 00AGIhlv-w0_300.000_310.000.wav
├── 00FBAdjlF4g_30.000_40.000.wav
├── 00G2vNrTnCc_10.000_20.000.wav
├── 00KM53yZi2A_30.000_40.000.wav
├── 00XaUxjGuX8_170.000_180.000.wav
├── 0-2Onbywljo_380.000_390.000.wav

Then just generate a .tsv file with:

find data/balanced/ -type f | awk 'BEGIN{print "filename"}{print}' > data/balanced.tsv

Then dump the data as hdf5 files using scripts/wavlist_to_hdf5.py:

python3 scripts/wavlist_to_hdf5.py data/balanced.tsv data/balanced_train/

This will generate a training datafile data/balanced_train/labels/balanced.tsv.

For the eval data, please use this script to download.

The resulting eval.tsv should look like this:

filename	labels	hdf5path
data/eval/--4gqARaEJE.wav	73;361;74;72	data/eval_data/hdf5/eval_0.h5
data/eval/--BfvyPmVMo.wav	419	data/eval_data/hdf5/eval_0.h5
data/eval/--U7joUcTCo.wav	47	data/eval_data/hdf5/eval_0.h5
data/eval/-0BIyqJj9ZU.wav	21;20;17	data/eval_data/hdf5/eval_0.h5
data/eval/-0Gj8-vB1q4.wav	273;268;137	data/eval_data/hdf5/eval_0.h5
data/eval/-0RWZT-miFs.wav	379;307	data/eval_data/hdf5/eval_0.h5
data/eval/-0YUDn-1yII.wav	268;137	data/eval_data/hdf5/eval_0.h5
data/eval/-0jeONf82dE.wav	87;137;89;0;72	data/eval_data/hdf5/eval_0.h5
data/eval/-0nqfRcnAYE.wav	364	data/eval_data/hdf5/eval_0.h5

2. Download logits

Download the logits used in the paper from Zenodo:

wget https://zenodo.org/record/8275347/files/logits.zip?download=1 -O logits.zip
unzip logits.zip

This will create:

logits/
└── ensemble5014
    ├── balanced
    │   └── chunk_10
    └── full
        └── chunk_10

3. Train

python3 run.py train trainconfig/balanced_mixup_tiny_T_ensemble5014_chunk10.yaml

Export ONNX and Inference ONNX

python3 export_onnx.py -m ced_tiny
#or ced_mini ced_small ced_base

python3 onnx_inference_with_kaldi.py test.wav -m ced_tiny.onnx
python3 onnx_inference_with_torchaudio.py test.wav -m ced_tiny.onnx

Why use Kaldi to calculate Mel features? Because it has ready-made C++ implementation code, which can be found here: https://github.com/csukuangfj/kaldi-native-fbank/tree/master

Training on your own data

This is a label-free framework, meaning that any data can be used for optimization. To use your own data, do the follwing:

Put your data somewhere and generate a .tsv file with a single header filename, such as:

find some_directory -type f | awk 'BEGIN{print "filename"}{print}' > my_data.tsv

Then dump the corresponding hdf5 file using scripts/wavlist_to_hdf5.py:

python3 scripts/wavlist_to_hdf5.py my_data.tsv my_data_hdf5/

Then run the script save_logits.py as:

torchrun save_logits.py logitconfig/balanced_base_chunk10s_topk20.yaml --train_data my_data_hdf5/labels/my_data.tsv

Finally you can train your own model on that augmented dataset with:

python3 run.py train trainconfig/balanced_mixup_base_T_ensemble5014_chunk10.yaml --logitspath YOUR_LOGITS_PATH --train_data YOUR_TRAIN_DATA.tsv

We also submitted the models for the HEAR benchmark evaluation. Hear uses a simple linear downstream evaluation protocol across 19 tasks. We simply extracted the features from all ced-models from the penultimate layer. The repo can be found here.

Model	Beehive States Avg	Beijing Opera Percussion	CREMA-D	DCASE16	ESC-50	FSD50K	GTZAN Genre	GTZAN Music Speech	Gunshot Triangulation	LibriCount	MAESTRO 5hr	Mridangam Stroke	Mridangam Tonic	NSynth Pitch 50hr	NSynth Pitch 5hr	Speech Commands 5hr	Speech Commands Full	Vocal Imitations	VoxLingua107 Top10
ced-tiny	38.345	94.90	62.52	88.02	95.80	62.73	89.20	93.01	91.67	61.26	4.81	96.13	90.74	69.19	44.00	70.53	77.10	19.18	33.64
ced-mini	59.17	96.18	65.26	90.66	95.35	63.88	90.30	94.49	86.01	64.02	8.29	96.56	93.32	75.20	55.60	77.38	81.96	20.37	34.67
ced-small	51.70	96.60	66.64	91.63	95.95	64.33	89.50	91.22	93.45	65.59	10.96	96.82	93.94	79.95	60.20	80.92	85.19	21.92	36.53
ced-base	48.35	96.60	69.10	92.19	96.65	65.48	88.60	94.36	89.29	67.85	14.76	97.43	96.55	82.81	68.20	86.93	89.67	22.69	38.57

Android APks

Thanks to csukuangfj, there are also pre-compiled android binaries using sherpa-onnx.

Binaries are available on the k2-fsa sherpa page, https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk.html.

Citation

Please cite our paper if you find this work useful:

@inproceedings{dinkel2023ced,
  title={CED: Consistent ensemble distillation for audio tagging},
  author={Dinkel, Heinrich and Wang, Yongqing and Yan, Zhiyong and Zhang, Junbo and Wang, Yujun},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024}
}