Masked Spectrogram Modeling using Masked Autoencoders (MSM-MAE)

February 20, 2026 ยท View on GitHub

key_visual

Masked Spectrogram Modeling using Masked Autoencoders (MSM-MAE)

๐Ÿšจ Important Notice: A newer, better implementation is available. ๐Ÿšจ

๐ŸŽ‰ The successor to this repository, Masked Modeling Duo (M2D), is now available. If you are starting a new project, please use M2D instead of this repository.

  • ๐Ÿ† M2D pre-trained weights significantly outperform those provided here. See the results below.
  • ๐Ÿ“ฆ The M2D repository also distributes MSM-MAE pre-trained weights, and they significantly outperform the weights in this repositoryโ€”even as MSM-MAE models.
  • ๐Ÿ” M2D supports training MSM-MAE models as well as M2D models.
  • ๐Ÿ—‚๏ธ This repository remains available as the reference implementation accompanying the original MSM-MAE paper.

๐Ÿ‘‰ Go to the M2D repository: https://github.com/nttcslab/m2d ๐Ÿ‘ˆ

๐Ÿค” Why M2D?

The table below compares EVAR benchmark results between MSM-MAE (this repository) and M2D models. M2D consistently and substantially outperforms MSM-MAE across all tasks.

new results

M2D improves upon MSM-MAE in two key ways:

  • ๐Ÿง  Instead of reconstructing masked patches in input space, M2D computes the loss in feature space using a momentum encoderโ€”leading to richer representations.
  • ๐Ÿ“ˆ M2D achieves higher scores across all downstream tasks evaluated in our papers.

For training MSM-MAE-style models, M2D's codebase supports MSM-MAE training as well, so there is no reason to use this older implementation for new work. ๐Ÿ™Œ


About This Repository (Reference / Archive)

This repository is the original implementation of Masked Spectrogram Modeling using Masked Autoencoders (MSM-MAE), a self-supervised learning method for general-purpose audio representation. It includes:

  • Training code that can pre-train models with arbitrary audio files.
  • Evaluation code to test models under two benchmarks, HEAR 2021 and EVAR.
  • Visualization examples and a notebook.
  • Pre-trained weights.

If you find MSM-MAE useful in your research, please use the following BibTeX entry for citation.

@InProceedings{niizumi2022masked,
    title     = {Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation},
    author    = {Niizumi, Daisuke and Takeuchi, Daiki and Ohishi, Yasunori and Harada, Noboru and Kashino, Kunio},
    booktitle = {HEAR: Holistic Evaluation of Audio Representations (NeurIPS 2021 Competition)},
    pages     = {1--24},
    year      = {2022},
    editor    = {Turian, Joseph and Schuller, Bjรถrn W. and Herremans, Dorien and Kirchoff, Katrin and Perera, Paola Garcia and Esling, Philippe},
    volume    = {166},
    series    = {Proceedings of Machine Learning Research},
    month     = {13--14 Dec},
    publisher = {PMLR},
    pdf       = {https://proceedings.mlr.press/v166/niizumi22a/niizumi22a.pdf},
    url       = {https://proceedings.mlr.press/v166/niizumi22a.html}
}

History

  • UPDATE (Apr, 2023): Masked Modeling Duo (M2D) is linked to this repository.
  • UPDATE (Jan, 2023): PMLR paper is out. Replaced BibTeX and URL links with the PMLR website. Also fixed a bug.
  • UPDATE (Dec, 2022): Added Vizualization & Audio example notebook. Now we can listen ๐Ÿ‘‚ to how the reconstruction results sound!?
  • UPDATE (Nov, 2022): Extended runtime inference 'encode_lms()' to output features for each layer.

1. Getting Started

The repository relies on the codes from facebookresearch/mae, and we patch our changes on these files.

  1. Download external source files from facebookresearch/mae, and apply a patches.
curl -o util/lars.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/lars.py
curl -o util/lr_decay.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/lr_decay.py
curl -o util/lr_sched.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/lr_sched.py
curl -o util/misc.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/misc.py
curl -o msm_mae/pos_embed.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/pos_embed.py
curl -o main_pretrain.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/main_pretrain.py
curl -o msm_mae/engine_pretrain.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/engine_pretrain.py
curl -o msm_mae/models_mae.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/models_mae.py
cd msm_mae
patch -p1 < patch_msm_mae.diff
cd ..
patch -p1 < patch_main.diff
  1. If you need a clean environment, the following anaconda example creates a new environment named ar:
conda create -n ar python==3.8
conda activate ar
  1. Install external modules listed on requirements.txt.

1-1. Quick example

We have a utility runtime model, RuntimeMAE, which helps you to load a pre-trained model and encode your audios.

from msm_mae.runtime import RuntimeMAE

device = torch.device('cuda')

# Prepare your batch of audios. This is a dummy  example of three 10s  waves.
batch_audio = 2 * torch.rand((3, 10 * 16000)) - 1.0 # input range = [-1., 1]
batch_audio = batch_audio.to(device)

# Create a model with pretrained weights.
runtime = RuntimeMAE(weight_file='80x512p16x16_paper/checkpoint-100.pth')
runtime = runtime.to(device)

# Encode raw audio into features. `encode()` will process automatically:
# 1. Convert the input `batch_audio` to log-mel spectrograms (LMS).
# 2. Normalize the batch LMS with mean and std calculated from the batch.
# 3. Encode to feature.
frame_level = runtime.encode(batch_audio)

# This example ends up with frame-level 3840-d feature vectors for 63 time frames.
# The `frame_level` will have a size of torch.Size([3, 63, 3840]).
print(frame_level.shape)

#  You can get clip-level features by taking average of time franes.
# The `clip_level` will have a size of torch.Size([3, 3840])
clip_level = torch.mean(frame_level, dim=1)
print(clip_level.shape)

To get the best features, you can normalize your audio with normalization statistics of your entire input data and use them in your pipeline.

# Calculate statistics in advance. This is an example with 10 random waves.
means, stds = [], []
for _ in range(10):
    lms = runtime.to_feature(torch.rand((10 * 16000)).to(device))
    means.append(lms.mean())
    stds.append(lms.std())
dataset_mean, dataset_std = torch.mean(torch.stack(means)), torch.mean(torch.stack(stds))
# These can be numbers [-5.4919195, 5.0389895], for example.

# The followings are an example pipeline.

# Convert your batch audios into LMS.
batch_lms = runtime.to_feature(batch_audio)
# Normalize them.
batch_lms = (batch_lms - dataset_mean) / (dataset_std + torch.finfo().eps)
# Encode them to feame-level features.
frame_level = model.encode_lms(batch_lms)
#  Calculate clip-level features if needed.
clip_level = torch.mean(frame_level, dim=1)

To get features per layer, you can add return_layers=True.

# Getting features per layer.
y = rt.encode_lms(torch.rand(10, 1, 80, 900), return_layers=True)
print(len(y), y[0].shape)
# --> 12 torch.Size([10, 57, 3840])

# As normal, getting finale-layer features.
y = rt.encode_lms(torch.rand(10, 1, 80, 900), return_layers=False)
print(y.shape)
# --> torch.Size([10, 57, 3840])

2. Evaluating MSM-MAE

2-1. Evaluating on HEAR 2021 NeurIPS Challenge Tasks

We evaluate our models on our paper using hear-eval-kit from on HEAR 2021 NeurIPS Challenge as follows.

NOTE: The folder hear has all the files we need to evaluate models on hear-eval-kit.

  1. Install hear-eval-kit as follows:

    pip install heareval
    
  2. Download your copy of downstream dataset for 16 kHz. See HEAR NeurIPS 2021 Datasets@zenodo for the detail.

  3. To evaluate our models, we need a local package which loads and runs our models. The followings shows an example for a model named 80x208p16x16_mymodel.

    cd hear/hear_msm
    cp sample.py 80x208p16x16_mymodel.py
    ** Edit the 80x208p16x16_mymodel.py here, so that the value of `model_path` points to your model with an absolute path. **
    cd ..
    pip install -e .
    
  4. We are on the folder hear. Run the hear-eval-kit with your pre-trained model.

    python -m heareval.embeddings.runner hear_msm.80x208p16x16_mymodel --tasks-dir /your_copy/hear/tasks
    CUBLAS_WORKSPACE_CONFIG=:4096:8 python -m heareval.predictions.runner embeddings/hear_msm.80x208p16x16_mymodel/*
    

2-2. Evaluating on EVAR

EVAR is an evaluation package for audio representations used by our research papers such as BYOL-A and Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model. It supports the following downstream tasks: ESC-50, US8K, FSD50K, SPCV1/V2, VoxForge, VoxCeleb1, CREMA-D, GTZAN, NSynth instrument family, Pitch Audio Dataset (Surge synthesizer).

The following steps setup MSM-MAE on the EVAR.

  1. Clone EVAR repository and prepare basic items.

    git clone https://github.com/nttcslab/eval-audio-repr.git evar
    cd evar
    curl https://raw.githubusercontent.com/daisukelab/general-learning/master/MLP/torch_mlp_clf2.py -o evar/utils/torch_mlp_clf2.py
    curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/sampler.py -o evar/sampler.py
    curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/cnn14_decoupled.py -o evar/cnn14_decoupled.py
    cd ..
    
  2. Install MSM-MAE files on the cloned evar folder.

    ln -s ../../to_evar/ar_msm_mae.py evar/evar
    ln -s ../../to_evar/msm_mae.yaml evar/config
    cd evar
    sed 's/import evar\.ar_byola/import evar\.ar_byola, evar\.ar_msm_mae/' -i lineareval.py
    cd ..
    
  3. Setup downstream task datasets according to Preparing-datasets.md. The following is an example for setting up CREMA-D dataset.

    cd evar
    python evar/utils/download_cremad.py downloads/cremad
    python prepare_wav.py downloads/cremad work/16k/cremad 16000
    cd ..
    

Once setup is complete, you can evaluate your models as follows.

  • For evaluating a model with an absolute path /your/path/to/model.pth.

    cd evar
    python lineareval.py config/msm_mae.yaml cremad weight_file=/your/path/to/model.pth
    
  • If you want to save GPU memory, set a fewer batch size as follows. This example sets it as 16.

    cd evar
    python lineareval.py config/msm_mae.yaml cremad batch_size=16,weight_file=/your/path/to/model.pth
    

3. Training From Scratch

First, you need to prepare pre-training data samples; then, you can pre-train your model. The following is an example to pre-train a model on FSD50K dataset.

  1. Preprocess .wav files into log-mel spectrogram .npy files. The following converts from a source folder /your/local/fsd50k/FSD50K.dev_audio to a new folder lms_fsd50kdev.

    python wav_to_lms.py /your/local/fsd50k/FSD50K.dev_audio lms_fsd50kdev
    
  2. Create a CSV file which is used as a list of pre-training samples containing a single column file_name. The following example creates trainingfiles.csv.

    echo file_name > trainingfiles.csv
    (cd lms_fsd50kdev && find . -name "*.npy") >> trainingfiles.csv
    
  3. Make some copies of samples for visualization. The following example makes 3 symbolic links. All the samples under the folder vis_samples will be used to visualize reconstruction results of a checkpoint during training.

    mkdir -p lms_fsd50kdev/vis_samples
    ln -s ../1240.npy lms_fsd50kdev/vis_samples/1240.npy
    ln -s ../106237.npy lms_fsd50kdev/vis_samples/106237.npy
    ln -s ../119949.npy lms_fsd50kdev/vis_samples/119949.npy
    
  4. Once your data is ready, start pre-training as follows.

    python main_pretrain.py 80x208p16x16_your_test --input_size 80x208 --data_path lms_fsd50kdev --dataset trainingfiles --model=mae_vit_base_patch16x16
    

3-1. About run folders

The training loop creates a folder called the run folder to store artifacts of your training run, such as a log, visualizations, and checkpoints. In the example above, the 80x208p16x16_your_test is the run folder name which has a format below:

<input_size>p<patch_size>_<your_free_string>
 - input_size: Two integers concatenated with `x`.
 - patch_size: Two integers concatenated with `x`.
 - your_free_string: Optional.

NOTE: When a model is evaluated by EVAR or hear-eval-kit, the input and patch size parameters written in the run name are used.

3-2. Evalidation during training

The training loop takes two actions for evaluating checkpoints during training: visualization and calling an external command quick_eval.sh (EVAR by default).

  • While training, main_pretrain.py will call a script named quick_eval.sh for every 20 epochs by default. You can edit quick_eval.sh for your purpose.

  • The training loop also visualizes reconstruction results using checkpoints. You can find them under the sub-folder of the run folder, such as 80x208p16x16_tttt/19. The folder contains raw results in .npy and image files in .png as the following examples.

    recon_126AbihZt28 recon_A1TNpj9hHW0 recon_v1Xn9GONUDE

4. Visualization & Audio Examples

๐Ÿ‘‰ Visualization example notebook and Vizualization & Audio example notebook are available. ๐Ÿ‘ˆ

The Visualization example notebook shows how to visualize reconstruction results as well as attention maps.

  • Download AudioSetWav16k_examples.zip that contains example wave samples from the releases and unzip the zip file in the misc folder beforehand.

Here are reconstruction examples:

recon512

Here are attention map examples:

attn512

In addition, Vizualization & Audio example notebook shows how we can invert these log-mel spectrograms to audios using librosa.feature.inverse.mel_to_audio. Two examples from the notebook follow:

5. Pre-trained Weights and Network Structure Details

Three pre-trained weights are published on the releases, and the followings are their EVAR task results:

weightesc50us8kspcv2vc1voxforgecremadgtzannsynthsurge
80x512p16x16_paper89.1%84.3%93.1%56.1%97.5%70.1%79.3%76.9%37.5%
80x512p16x16_042588.9%84.0%92.4%56.5%96.8%67.6%81.4%76.1%37.4%
80x208p16x16_042287.1%82.2%91.1%49.7%95.6%66.9%75.9%76.5%37.8%
  • 80x512p16x16_paper: This weight is the pre-trained weight used to calculate the numbers in the paper, trained on our old unpolished code. Cannot be used for visualizations due to a minor difference in the decoder structure.
  • 80x512p16x16_0425: This weight is a pre-trained weight that is trained with the current code to check reproducibility.
  • 80x208p16x16_0422: This weight is a pre-trained weight that is trained with the current code to check reproducibility.

FYI: You can check a notebook that summarizes EVAR task results.

5-1. Network Structure Details

Our ViT network implementation has three minor differences from the original MAE implementation, due to historical reason. (We have already been testing based on pengzhiliang/MAE-pytorch in late 2021, before the MAE implementation became available.)

  • Our implementation follows BEiT, where the k bias is always zero in the Attention calculation. Though it is explained to be equivalent in terms of calculation results, we keep this implementation in our network for better reproducibility.
  • [CLS] token is removable by the option --no_cls_token.
  • The decoder uses 1d positional embedding. You can switch to the 2d version by the option --dec_pos_2d.

6. License

See LICENSE for details.

Acknowledgements

We appreciate these publicly available implementations and all the modules our experiments heavily depend on!

References