October 13, 2023

ESC-50: Dataset for Environmental Sound Classification

Overview | Download | Results | Repository content | License | Citing | Caveats | Changelog

ESC-50 clip preview

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class) loosely arranged into 5 major categories:

| Animals | Natural soundscapes & water sounds | Human, non-speech sounds | Interior/domestic sounds | Exterior/urban noises |
| --- | --- | --- | --- | --- |
| Dog | Rain | Crying baby | Door knock | Helicopter |
| Rooster | Sea waves | Sneezing | Mouse click | Chainsaw |
| Pig | Crackling fire | Clapping | Keyboard typing | Siren |
| Cow | Crickets | Breathing | Door, wood creaks | Car horn |
| Frog | Chirping birds | Coughing | Can opening | Engine |
| Cat | Water drops | Footsteps | Washing machine | Train |
| Hen | Wind | Laughing | Vacuum cleaner | Church bells |
| Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane |
| Sheep | Toilet flush | Snoring | Clock tick | Fireworks |
| Crow | Thunderstorm | Drinking, sipping | Glass breaking | Hand saw |

Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project. The dataset has been prearranged into 5 folds for comparable cross-validation, making sure that fragments from the same original source file are contained in a single fold.
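As a rough illustration of how the predefined folds might be used, the sketch below builds leave-one-fold-out splits and checks that every source recording stays within a single fold. The rows are illustrative placeholders rather than actual dataset contents, and the helper names `folds_by_source` and `leave_one_fold_out` are hypothetical; in practice the fold and source information would come from the dataset's metadata.

```python
from collections import defaultdict

# Illustrative (filename, fold, src_file) rows; placeholders, not real metadata.
ROWS = [
    ("1-100032-A-0.wav", 1, "100032"),
    ("1-100038-A-14.wav", 1, "100038"),
    ("2-100786-A-1.wav", 2, "100786"),
    ("2-100786-B-1.wav", 2, "100786"),
    ("3-100018-A-18.wav", 3, "100018"),
    ("4-100024-A-9.wav", 4, "100024"),
    ("5-100031-A-7.wav", 5, "100031"),
]


def folds_by_source(rows):
    """Map each source recording to the set of folds its fragments appear in."""
    seen = defaultdict(set)
    for _filename, fold, src_file in rows:
        seen[src_file].add(fold)
    return dict(seen)


def leave_one_fold_out(rows):
    """Yield (test_fold, train_files, test_files) for every fold present in rows."""
    for test_fold in sorted({fold for _, fold, _ in rows}):
        train = [name for name, fold, _ in rows if fold != test_fold]
        test = [name for name, fold, _ in rows if fold == test_fold]
        yield test_fold, train, test


if __name__ == "__main__":
    # Fragments of one source file should never be split across folds.
    assert all(len(folds) == 1 for folds in folds_by_source(ROWS).values())
    for test_fold, train, test in leave_one_fold_out(ROWS):
        print(test_fold, len(train), len(test))
```

Holding out one whole fold at a time keeps fragments of the same original recording out of both the training and test sets, which is the property the predefined folds are meant to guarantee.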

A more thorough description of the dataset is available in the original paper with some supplementary materials on GitHub: ESC: Dataset for Environmental Sound Classification - paper replication data.

Download

The dataset can be downloaded as a single .zip file (~600 MB):

Download ESC-50 dataset

Results

Supervised Methods

Numerous machine learning & signal processing approaches have been evaluated on the ESC-50 dataset. Most of them are listed here. If you know of some other reference, you can message me or open a Pull Request directly.

Terms used in the table:

• CNN - Convolutional Neural Network
• CRNN - Convolutional Recurrent Neural Network
• GMM - Gaussian Mixture Model
• GTCC - Gammatone Cepstral Coefficients
• GTSC - Gammatone Spectral Coefficients
• k-NN - k-Nearest Neighbors
• MFCC - Mel-Frequency Cepstral Coefficients
• MLP - Multi-Layer Perceptron
• RBM - Restricted Boltzmann Machine
• RNN - Recurrent Neural Network
• SVM - Support Vector Machine
• TEO - Teager Energy Operator
• ZCR - Zero-Crossing Rate

| Title | Notes | Accuracy | Paper | Code |
| --- | --- | --- | --- | --- |
| Natural Language Supervision for General-Purpose Audio Representations | HTSAT-22 model pretrained by natural language supervision | 98.25% | msclap2023 | :scroll: |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | Transformer model pretrained with acoustic tokenizers | 98.10% | chen2022 | :scroll: |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | :scroll: |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | :scroll: |
| CAT: Causal Audio Transformer for Audio Classification | Transformer model with MFMR features and a causal module | 96.4% | liu2023 | |
| AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet | 95.70% | gong2021 | :scroll: |
| Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | A Transformer model pretrained w/ visual image supervision | 95.70% | zhao2022 | :scroll: |
| A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | Multi-stage sequential learning with knowledge transfer from Audioset | 94.10% | kumar2020 | |
| Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications | CNN model pretrained on AudioSet | 92.32% | lopez-meyer2021 | |
| Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks | Pretrained model with multi-channel features | 89.50% | kim2020 | :scroll: |
| An Ensemble of Convolutional Neural Networks for Audio Classification | CNN ensemble with data augmentation | 88.65% | nanni2020 | :scroll: |
| Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | CNN model (ACDNet) with potential compression | 87.1% | mohaimenuzzaman2021 | :scroll: |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies | 86.50% | sailor2017 | |
| AclNet: efficient end-to-end audio classification CNN | CNN with mixup and data augmentation | 85.65% | huang2018 | |
| On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications | x-vector network with openll3 embeddings | 85.00% | wilkinghoff2020 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning | 84.90% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with Mel energies | 84.15% | tak2017 | |
| Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes | CNN pretrained on AudioSet | 83.50% | kumar2017 | :scroll: |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC | 83.00% | sailor2017 | |
| Deep Multimodal Clustering for Unsupervised Audiovisual Learning | CNN + unsupervised audio-visual learning | 82.60% | hu2019 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of GTSC & TEO-GTSC with CNN | 81.95% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + Between-Class learning | 81.80% | tokozume2017b | |
| :headphones: Human accuracy | Crowdsourcing experiment in classifying ESC-50 by human listeners | 81.30% | piczak2015a | :scroll: |
| Objects that Sound | Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule | 79.80% | arandjelovic2017b | |
| Look, Listen and Learn | 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task | 79.30% | arandjelovic2017a | |
| Learning Environmental Sounds with Multi-scale Convolutional Neural Network | Multi-scale convolutions with feature fusion (waveform + spectrogram) | 79.10% | zhu2018 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | GTSC with CNN | 79.10% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation | 78.80% | tokozume2017b | |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM | 78.45% | sailor2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization + Between-Class learning | 76.90% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTSC with CNN | 74.85% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) | 74.40% | tokozume2017b | |
| Soundnet: Learning sound representations from unlabeled video | 8-layer CNN (raw audio) with transfer learning from unlabeled videos | 74.20% | aytar2016 | :scroll: |
| Learning from Between-class Examples for Deep Sound Recognition | 18-layer CNN on raw waveforms (dai2016) + Between-Class learning | 73.30% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs) | 73.25% | tak2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, GoogLeNet on spectrograms (40 ms frame length) | 73.20% | boddapati2017 | :scroll: |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization | 72.40% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of MFCC & TEO-GTCC with GMM | 72.25% | agrawal2017 | |
| Learning environmental sounds with end-to-end convolutional neural network (EnvNet) | Combination of spectrogram and raw waveform CNN | 71.00% | tokozume2017a | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTCC with GMM | 68.85% | agrawal2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 68.70% | boddapati2017 | :scroll: |
| Very Deep Convolutional Neural Networks for Raw Waveforms | 18-layer CNN on raw waveforms | 68.50% | dai2016, tokozume2017b | :scroll: |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, GoogLeNet on spectrograms (30 ms frame length) | 67.80% | boddapati2017 | :scroll: |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 100x model compression | 66.25% | jin2017 | |
| Soundnet: Learning sound representations from unlabeled video | 5-layer CNN (raw audio) with transfer learning from unlabeled videos | 66.10% | aytar2016 | :scroll: |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 180x model compression | 65.80% | jin2017 | |
| Soundnet: Learning sound representations from unlabeled video | 5-layer CNN trained on raw audio of ESC-50 only | 65.00% | aytar2016 | :scroll: |
| :bar_chart: Environmental Sound Classification with Convolutional Neural Networks - CNN baseline | CNN with 2 convolutional and 2 fully-connected layers, mel-spectrograms as input, vertical filters in the first layer | 64.50% | piczak2015b | :scroll: |
| auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks | MLP classifier on features extracted with an RNN autoencoder | 64.30% | freitag2017 | :scroll: |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 63.20% | boddapati2017 | :scroll: |
| Classifying environmental sounds using image recognition networks | CRNN | 60.30% | boddapati2017 | :scroll: |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 56.37% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with square filters on wideband mel-STFT (median accuracy) | 54.00% | huzaifah2017 | |
| Soundnet: Learning sound representations from unlabeled video | 8-layer CNN trained on raw audio of ESC-50 only | 51.10% | aytar2016 | :scroll: |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with square filters on wideband mel-STFT (median accuracy) | 50.87% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 46.25% | huzaifah2017 | |
| :bar_chart: Baseline - random forest | Baseline ML approach (MFCC & ZCR + random forest) | 44.30% | piczak2015a | :scroll: |
| Soundnet: Learning sound representations from unlabeled video | Convolutional autoencoder trained on unlabeled videos | 39.90% | aytar2016 | :scroll: |
| :bar_chart: Baseline - SVM | Baseline ML approach (MFCC & ZCR + SVM) | 39.60% | piczak2015a | :scroll: |
| :bar_chart: Baseline - k-NN | Baseline ML approach (MFCC & ZCR + k-NN) | 32.20% | piczak2015a | :scroll: |
| A mixture model-based real-time audio sources classification method | Dictionary of sound models used for classification (accuracy is computed on segments instead of files) | 94.00% | baelde2017 | |
| NELS - Never-Ending Learner of Sounds | Large-scale audio crawling with classifiers trained on AED datasets (including ESC-50) | N/A | elizalde2017 | :scroll: |
| Utilizing Domain Knowledge in End-to-End Audio Processing | End-to-end CNN with learned mel-spectrogram transformation | N/A | tax2017 | :scroll: |
| Deep Neural Network based learning and transferring mid-level audio features for acoustic scene classification | Transfer learning from various datasets, including ESC-50 | N/A | mun2017 | |
| Features and Kernels for Audio Event Recognition | MFCC, GMM, SVM | N/A | kumar2016b | |
| A real-time environmental sound recognition system for the Android OS | Real-time sound recognition for Android evaluated on ESC-10 | N/A | pillos2016 | |
| Comparing Time and Frequency Domain for Audio Event Recognition Using Deep Learning | Discriminatory effectiveness of different signal representations compared on ESC-10 and Freiburg-106 | N/A | hertel2016 | |
| Audio Event and Scene Recognition: A Unified Approach using Strongly and Weakly Labeled Data | Combination of weakly labeled data (YouTube) with strong labeling (ESC-10) for Acoustic Event Detection | N/A | kumar2016a | |

Unsupervised Methods

ESC-50 was also evaluated in unsupervised learning settings (Zhao et al., 2022):

  1. Zero shot (ZS): Human supervision (i.e., labeled audio-text pairs outside the ESC-50 domain) may be used in training.
  2. Zero resource (ZR): No human supervision (i.e., any form of manually labeled audio-text pairs) is used in training.

| Title | Notes | ZS | ZR | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | VIP-ANT: VIsually-Pivoted Audio and(N) Text -- A Transformer model pretrained w/ visual image supervision | 62.8% | 69.5% | zhao2022 | :scroll: |

Repository content

  • audio/*.wav

    2000 audio recordings in WAV format (5 seconds, 44.1 kHz, mono) with the following naming convention:

    {FOLD}-{CLIP_ID}-{TAKE}-{TARGET}.wav

    • {FOLD} - index of the cross-validation fold,
    • {CLIP_ID} - ID of the original Freesound clip,
    • {TAKE} - letter disambiguating between different fragments from the same Freesound clip,
    • {TARGET} - class in numeric format [0, 49].
  • meta/esc50.csv

    CSV file with the following structure:

    | filename | fold | target | category | esc10 | src_file | take |
    | --- | --- | --- | --- | --- | --- | --- |

    The esc10 column indicates if a given file belongs to the ESC-10 subset (10 selected classes, CC BY license).

  • meta/esc50-human.xlsx

    Additional data pertaining to the crowdsourcing experiment (human classification accuracy).
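The naming convention and the CSV layout described above can be handled with a few lines of standard-library Python. The following is a minimal sketch, not part of the repository: `parse_clip_name` and `esc10_filenames` are hypothetical helpers, the regular expression assumes a numeric fold, numeric clip ID, letter take and numeric target, the sample rows are illustrative, and the `True`/`False` encoding of the `esc10` column is an assumption.

```python
import csv
import io
import re
from typing import List, NamedTuple


class ClipName(NamedTuple):
    fold: int      # index of the cross-validation fold
    clip_id: str   # ID of the original Freesound clip
    take: str      # letter distinguishing fragments of the same clip
    target: int    # numeric class label, 0..49


# Assumed pattern: numeric fold, numeric clip ID, letter take, numeric target.
_NAME_PATTERN = re.compile(r"^(\d+)-(\d+)-([A-Za-z]+)-(\d+)\.wav$")


def parse_clip_name(filename: str) -> ClipName:
    """Split '{FOLD}-{CLIP_ID}-{TAKE}-{TARGET}.wav' into typed fields."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected clip name: {filename!r}")
    fold, clip_id, take, target = match.groups()
    return ClipName(int(fold), clip_id, take, int(target))


def esc10_filenames(csv_text: str) -> List[str]:
    """Return filenames whose esc10 column reads 'True' (assumed encoding)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["filename"] for row in reader if row["esc10"] == "True"]


# Illustrative rows only; the real metadata lives in meta/esc50.csv.
SAMPLE_CSV = """filename,fold,target,category,esc10,src_file,take
1-100032-A-0.wav,1,0,dog,True,100032,A
1-100038-A-14.wav,1,14,chirping_birds,False,100038,A
"""

if __name__ == "__main__":
    print(parse_clip_name("1-100032-A-0.wav"))
    print(esc10_filenames(SAMPLE_CSV))
```

To work with the actual metadata, the contents of `meta/esc50.csv` could be read from disk and passed in place of `SAMPLE_CSV`.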

License

The dataset is available under the terms of the Creative Commons Attribution Non-Commercial license.

A smaller subset (clips tagged as ESC-10) is distributed under CC BY (Attribution).

Attributions for each clip are available in the LICENSE file.

Citing

Download paper in PDF format

If you find this dataset useful in an academic setting, please cite:

K. J. Piczak. ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd Annual ACM Conference on Multimedia, Brisbane, Australia, 2015.

[DOI: http://dx.doi.org/10.1145/2733373.2806390]

@inproceedings{piczak2015dataset,
  title = {{ESC}: {Dataset} for {Environmental Sound Classification}},
  author = {Piczak, Karol J.},
  booktitle = {Proceedings of the 23rd {Annual ACM Conference} on {Multimedia}},
  date = {2015-10-13},
  url = {http://dl.acm.org/citation.cfm?doid=2733373.2806390},
  doi = {10.1145/2733373.2806390},
  location = {{Brisbane, Australia}},
  isbn = {978-1-4503-3459-4},
  publisher = {{ACM Press}},
  pages = {1015--1018}
}

Caveats

Please be aware of potential information leakage while training models on ESC-50, as some of the original Freesound recordings were already preprocessed in a manner that might be class dependent (mostly bandlimiting). Unfortunately, this issue went unnoticed when creating the original version of the dataset. Due to the number of methods already evaluated on ESC-50, no changes rectifying this issue will be made in order to preserve comparability.
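One way to get a feel for this issue is to measure how much of a clip's spectral energy lies above some cutoff frequency; a value near zero suggests the recording was band-limited. The sketch below is only a heuristic illustration, not a procedure used when building the dataset. The function name `high_band_energy_ratio`, the 10 kHz cutoff and the synthetic test tones are all assumptions, and the naive DFT is meant for short frames only.

```python
import cmath
import math


def high_band_energy_ratio(samples, sample_rate, cutoff_hz):
    """Fraction of one-sided DFT power (DC excluded) at or above cutoff_hz.

    Uses a naive O(n^2) DFT, so it is only practical for short frames.
    """
    n = len(samples)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):  # positive-frequency bins only
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n >= cutoff_hz:
            high += power
    return high / total if total > 0 else 0.0


SAMPLE_RATE = 44100  # matches the WAV files in this dataset
FRAME = 441          # 10 ms frame, giving 100 Hz wide DFT bins


def tone(freq_hz):
    """Synthetic sine frame used purely for demonstration."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(FRAME)]


# A frame with only low-frequency content versus one that also has a 15 kHz tone.
band_limited = tone(1000)
full_band = [a + b for a, b in zip(tone(1000), tone(15000))]

if __name__ == "__main__":
    print(high_band_energy_ratio(band_limited, SAMPLE_RATE, 10000))
    print(high_band_energy_ratio(full_band, SAMPLE_RATE, 10000))
```

If such a ratio differs systematically between classes, a model may be picking up on preprocessing artifacts rather than on the sounds themselves.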

Changelog

v2.0.0 (2017-12-13)

• Change to WAV version as default.

v2.0.0-pre (2016-10-10) (wav-files branch)

• Replace OGG recordings with cropped WAV files for easier loading and frame-level precision (some of the OGG recordings had a slightly different length when loaded).
• Move recordings to a single-directory structure with a meta CSV file.

v1.0.0 (2015-04-15)

• Initial version of the dataset (OGG format).