Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
July 13, 2025 · View on GitHub
This is the official repository of NeurIPS 2023 paper VALOR.

Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Yung-Hsuan Lai, Yen-Chun Chen, Yu-Chiang Frank Wang
Machine environment
- Ubuntu version: 20.04.6 LTS
- CUDA version: 11.4
- Testing GPU: NVIDIA GeForce RTX 3090
Requirements
A conda environment named valor can be created and activated with:
conda env create -f environment.yaml
conda activate valor
Dataset setup
We recommend directly work with our pre-extracted features for simplicity. If you still wish to run feature extraction and pseudo-label generation yourself, please follow the below optional instructions.
Video downloading (optional)
One may download the videos listed in a csv file with the following command. We provide an example.csv to demonstrate the process of downloading a video.python ./utils/download_dataset.py --videos_saved_dir ./data/raw_videos --label_all_dataset ./data/example.csv
After downloading the videos to data/raw_videos/, one should extract video frames in 8fps from each video with the following command. We show the frame extraction of the video downloaded from the above command.
python ./utils/extract_frames.py --video_path ./data/raw_videos --out_dir ./data/video_frames
However, there are some potential issues when using these downloaded videos. For more details, please see here.
LLP dataset annotations
Please download LLP dataset annotations (6 csv files) from AVVP-ECCV20 and put in data/.
Pre-extracted features
Please download audio features (VGGish), 2D visual features (ResNet152), and 3D visual features (ResNet (2+1)D) from AVVP-ECCV20 and put in data/feats/
CLIP features & segment-level pseudo labels
Option 1. Please download visual features and segment-level pseudo labels from CLIP from this link, put in data/, and unzip the file with the following command:
unzip CLIP.zip
Option 2
The visual features and pseudo labels can also be generated with the following command if the video frames from all videos are prepared in data/video_frames/:
python ./utils/clip_preprocess.py --print_progress
For example, the features and pseudo labels of the video downloaded above can be generated in data/example_v_feats/ and data/example_v_labels/, respectively, with the following command:
python ./utils/clip_preprocess.py --label_all_dataset ./data/example.csv --pseudo_labels_saved_dir ./data/example_v_labels --visual_feats_saved_dir ./data/example_v_feats --print_progress
CLAP features & segment-level pseudo labels
Option 1. Please download audio features and segment-level pseudo labels from CLAP from this link, put in data/, and unzip the file with the following command:
unzip CLAP.zip
Option 2
The audio features and pseudo labels can also be generated with the following command if raw videos are prepared in data/raw_videos/:
python ./utils/clap_preprocess.py --print_progress
For example, the features and pseudo labels of the video downloaded above can be generated in data/example_a_feats/ and data/example_a_labels/, respectively, with the following command:
python ./utils/clap_preprocess.py --label_all_dataset ./data/example.csv --pseudo_labels_saved_dir ./data/example_a_labels --audio_feats_saved_dir ./data/example_a_feats --print_progress
File structure for dataset and code
Please make sure that the file structure is the same as the following.
File structure
> data/
├── AVVP_dataset_full.csv
├── AVVP_eval_audio.csv
├── AVVP_eval_visual.csv
├── AVVP_test_pd.csv
├── AVVP_train.csv
├── AVVP_val_pd.csv
├── feats/
│ └── r2plus1d_18/
│ └── res152/
│ └── vggish/
├── CLIP/
│ └── features/
│ └── segment_pseudo_labels/
├── CLAP/
│ └── features/
│ └── segment_pseudo_labels/
├── video_frames/ (optional)
│ └── 00BDwKBD5i8/
│ └── 00fs8Gpipss/
│ └── ...
└── raw_videos/ (optional)
└── 00BDwKBD5i8.mp4
└── 00fs8Gpipss.mp4
└── ...
Download trained models
Please download the trained models from this link and put the models in their corresponding model directory.
File structure
> models/
├── model_VALOR/
│ └── checkpoint_best.pt
├── model_VALOR+/
│ └── checkpoint_best.pt
└── model_VALOR++/
└── checkpoint_best.pt
Training
We provide some sample scripts for training VALOR, VALOR+, and VALOR++.
VALOR
bash scripts/train_valor.sh
VALOR+
bash scripts/train_valor+.sh
VALOR++
bash scripts/train_valor++.sh
Evaluation
Note: If you are evaluating the trained models generated with the training scripts, change --model_name model_VALOR to --model_name model_VALOR_reproduce here.
VALOR
bash scripts/test_valor.sh
VALOR+
bash scripts/test_valor+.sh
VALOR++
bash scripts/test_valor++.sh
Acknowledgement
We build VALOR codebase heavily on the codebase of AVVP-ECCV20 and MM-Pyramid. We sincerely thank the authors for open-sourcing! We also thank CLIP and CLAP for open-sourcing pre-trained models.
Citation
If you find this code useful for your research, please consider citing:
@inproceedings{lai2023modality,
title={Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser},
author={Yung-Hsuan Lai, Yen-Chun Chen, Yu-Chiang Frank Wang},
booktitle={NeurIPS},
year={2023}
}
License
This project is released under the MIT License.