MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
June 12, 2025
This repository contains the PyTorch implementation of the following paper:
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Authors: Jeong Hun Yeo*, Hyeongseop Rha*, Se Jin Park, Yong Man Ro (*Equal contribution)
Paper Link: http://arxiv.org/abs/2503.11315
Introduction
MMS-LLaMA is an efficient multimodal speech LLM framework for audio-visual speech recognition (AVSR) that minimizes the length of multimodal speech tokens while preserving their linguistic content.

Environment Setup
conda create -n mms-llama python=3.9 -y
conda activate mms-llama
git clone https://github.com/JeongHun0716/MMS-LLaMA
cd MMS-LLaMA
# PyTorch and related packages
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install numpy==1.23.5 scipy opencv-python
pip install editdistance python_speech_features einops soundfile sentencepiece tqdm tensorboard unidecode librosa
# If your pip version is newer than 24.1, first run: python3 -m pip install --upgrade pip==24.0
pip install omegaconf==2.0.6 hydra-core==1.0.7
pip install transformers==4.47.1 peft==0.14.0 bitsandbytes==0.45.0
cd fairseq
pip install --editable ./
Preparation
Before running any training or evaluation, you must update the dataset file paths in the tsv files. These tsv files contain placeholders (e.g., {LRS3_ROOT}) that need to be replaced with the absolute paths to your local copies of the datasets. The provided script (update_dataset_paths.py) automates this process, ensuring that all references in the tsv files point to the correct locations on your system.
The required datasets are:
- LRS3
- VoxCeleb2
Once you have downloaded these datasets, you should pre-process every video clip to crop the mouth regions. You can follow the pre-processing instructions provided in Auto-AVSR.
Note that for the LRS3 and VoxCeleb2 datasets, the facial landmarks are already provided in the Auto-AVSR repository.
After the pre-processing, run the provided script to update the tsv files with the absolute paths to your dataset directories:
python update_dataset_paths.py --input_dir ./ --vox2 'path for the VoxCeleb2 dataset' --lrs3 'path for the LRS3 dataset'
For example:
python update_dataset_paths.py --input_dir ./ --vox2 /Dataset/vox2 --lrs3 /Dataset/lrs3
The above command updates the placeholder paths in the tsv files to your absolute dataset paths.
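For reference, the substitution the script performs amounts to a plain string replacement over every tsv file. A minimal sketch of that step (the `{VOX2_ROOT}` placeholder name is an assumption; only `{LRS3_ROOT}` is shown above):

```python
from pathlib import Path

def update_paths(tsv_dir: str, lrs3_root: str, vox2_root: str) -> None:
    """Replace dataset placeholders in every tsv file under tsv_dir."""
    for tsv in Path(tsv_dir).glob("*.tsv"):
        text = tsv.read_text()
        text = text.replace("{LRS3_ROOT}", lrs3_root)
        # {VOX2_ROOT} is an assumed placeholder name for VoxCeleb2.
        text = text.replace("{VOX2_ROOT}", vox2_root)
        tsv.write_text(text)
```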
Each tsv file contains one line per data sample, with the following fields separated by a tab (\t):
- dataset name
- video_path
- audio_path
- num_video_frames
- num_audio_frames
- speech_rate
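A hypothetical reader for one manifest line, following the six fields above (the field types are assumptions inferred from the field names, not taken from the repository):

```python
from typing import NamedTuple

class Sample(NamedTuple):
    dataset: str            # dataset name
    video_path: str
    audio_path: str
    num_video_frames: int
    num_audio_frames: int
    speech_rate: float

def parse_line(line: str) -> Sample:
    """Parse one tab-separated manifest line into a Sample."""
    ds, vid, aud, nvf, naf, sr = line.rstrip("\n").split("\t")
    return Sample(ds, vid, aud, int(nvf), int(naf), float(sr))
```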
Below is the expected directory structure for the LRS3 dataset:
LRS3
lrs3/
├── lrs3_video_seg24s/
│   ├── pretrain/
│   ├── test/
│   └── trainval/
└── lrs3_text_seg24s/
    ├── pretrain/
    ├── test/
    └── trainval/
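An optional helper to confirm that the layout above is in place before training (the function name and check are illustrative, not part of the repository):

```python
from pathlib import Path

# Subdirectories expected under the LRS3 root, per the tree above.
EXPECTED = [
    "lrs3_video_seg24s/pretrain", "lrs3_video_seg24s/test", "lrs3_video_seg24s/trainval",
    "lrs3_text_seg24s/pretrain", "lrs3_text_seg24s/test", "lrs3_text_seg24s/trainval",
]

def missing_lrs3_dirs(root: str) -> list:
    """Return the expected subdirectories that do not exist under root."""
    return [p for p in EXPECTED if not (Path(root) / p).is_dir()]
```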
Training
Train a new MMS-LLaMA
bash scripts/train.sh
Note: On an 8-GPU RTX 3090 setup, the 433h model trains in approximately 6 hours, while the 1759h model requires about 20 hours.
Evaluation of the MMS-LLaMA
To evaluate the performance of MMS-LLaMA, execute the evaluation script by running:
bash scripts/eval.sh
Evaluation of MMS-LLaMA under noisy environments
To evaluate the performance of MMS-LLaMA in a noisy environment, run the evaluation script using:
bash scripts/eval_snr.sh
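The script's internals are not shown here, but noisy-condition evaluation is conventionally done by mixing noise into the audio waveform at a fixed signal-to-noise ratio. A generic sketch of that mixing step (an illustration of the standard technique, not code from the repository):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech, scaled so the mixture has the target SNR in dB."""
    # Tile or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```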
Pretrained Models
- Download the AV-HuBERT Large model from this link
- Download the Whisper-medium.en model from this link
- Download the LLaMA-3.2 3B model from this link

After downloading, make sure to place the models in the correct directories:

- The `large_vox_iter5.pt` (AV-HuBERT) model should be placed in the `pretrained_models/avhubert` folder.
- Place the 433h checkpoint in the `pretrained_models/mms-llama/433h` folder.
- Place the 1759h checkpoint in the `pretrained_models/mms-llama/1759h` folder.
- Place the speech rate predictor checkpoint in the `pretrained_models/sr_predictor` folder.
MMS-LLaMA
| Model | Used Datasets | Training data (# hours) | WER(%), Clean | WER(%), Noisy |
|---|---|---|---|---|
| ckpt.pt | LRS3 | 433 | 0.90 | 2.4 |
| ckpt.pt | LRS3, VoxCeleb2 | 1759 | 0.72 | 1.9 |
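WER in the table is the standard word error rate: the word-level edit distance between hypothesis and reference, divided by the number of reference words. A self-contained sketch of that metric (a generic implementation, not the repository's scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] = edit distance between ref[:i] and hyp[:j], updated row by row.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev = dp[0]
        dp[0] = i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1] / len(ref)
```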
You can download the pre-trained models using wget with the following command:
# 433h model
wget -O pretrained_models/mms-llama/433h/ckpt.pt "https://www.dropbox.com/scl/fi/uiaxa2lgjze4mt7tdi5wu/checkpoint_best.pt?rlkey=o62sc6ann8xm3gpkyj4yk3rwe&st=s5q385op&dl=1"
# 1759h model
wget -O pretrained_models/mms-llama/1759h/ckpt.pt "https://www.dropbox.com/scl/fi/ou28xe2k9ampxsf4ihoft/checkpoint_best.pt?rlkey=a4q1qgigodhrgwqi9lgsalj7f&st=ga8z79vc&dl=1"
Speech Rate Predictor
| Model | Used Datasets | Training data (# hours) |
|---|---|---|
| ckpt.pt | LRS3 | 433 |
wget -O pretrained_models/sr_predictor/ckpt.pt "https://www.dropbox.com/scl/fi/rc6jmbzdvj8afn84z47qt/checkpoint.pt?rlkey=aoa0ifkdydgm9gjmt2ljwpgrc&st=we9qoqtb&dl=1"
Citation
If you find this work useful in your research, please cite the paper:
@article{yeo2025mms,
  title={MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens},
  author={Yeo, Jeong Hun and Rha, Hyeongseop and Park, Se Jin and Ro, Yong Man},
  journal={arXiv preprint arXiv:2503.11315},
  year={2025}
}
Acknowledgement
This project is based on the avhubert, auto-avsr, and fairseq code. We would like to thank the developers of these projects for their contributions and the open-source community for making this work possible.