Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model
April 19, 2024
This repository contains the code and dataset accompanying the paper "Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model" by Dr. Jaeyong Kang, Prof. Soujanya Poria, and Prof. Dorien Herremans.
🔥 Live demo available on HuggingFace and Replicate.
Introduction
We propose a novel AI-powered multimodal music generation framework called Video2Music. The framework uses video features as conditioning input to a Transformer architecture that generates matching music. It aims to give video creators a seamless and efficient way to generate tailor-made background music.

Change Log
- 2023-11-28: add new input method (YouTube URL) on HuggingFace
Quickstart Guide
Generate music from video:
import IPython
from video2music import Video2music
input_video = "input.mp4"
input_primer = "C Am F G"
input_key = "C major"
video2music = Video2music()
output_filename = video2music.generate(input_video, input_primer, input_key)
IPython.display.Video(output_filename)
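Before calling `generate`, it can help to sanity-check the three inputs. The helper below is a minimal sketch and is not part of this repository: the name `check_quickstart_inputs` is hypothetical, and the assumption that the primer is a space-separated chord string is inferred only from the example value "C Am F G" above.

```python
from pathlib import Path


def check_quickstart_inputs(video_path, primer, key):
    """Return a list of problems; an empty list means the inputs look usable.

    Hypothetical helper, not part of Video2Music. It only checks that the
    video file exists and that the primer and key strings are non-empty.
    """
    problems = []
    if not Path(video_path).is_file():
        problems.append(f"video file not found: {video_path}")
    # Assumption: the primer is a whitespace-separated chord sequence.
    if not primer.split():
        problems.append("primer is empty; expected chords such as 'C Am F G'")
    if not key.strip():
        problems.append("key is empty; expected a string such as 'C major'")
    return problems


if __name__ == "__main__":
    for problem in check_quickstart_inputs("input.mp4", "C Am F G", "C major"):
        print("warning:", problem)
```

If the returned list is empty, the same three values can be passed to `video2music.generate` as in the snippet above.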
Installation
This repo was developed using Python 3.8.
apt-get update
apt-get install ffmpeg
apt-get install fluidsynth
git clone https://github.com/AMAAI-Lab/Video2Music
cd Video2Music
pip install -r requirements.txt
- Download the processed training data AMT.zip from HERE, extract it, and put the two extracted files directly under saved_models/AMT/.
- Download the soundfont file default_sound_font.sf2 from HERE and put it directly under soundfonts/.
- Our code is built on PyTorch 1.12.1 (torch==1.12.1 in requirements.txt), but you may need to choose a torch version that matches your CUDA version.
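The installation steps above can be verified with a short script. This is a sketch under stated assumptions, not a tool shipped with the repo: the function name `check_setup` is hypothetical, and it only checks what the steps describe (the two system tools are on PATH, saved_models/AMT/ contains at least one file, and soundfonts/default_sound_font.sf2 exists). It does not check file contents or the torch/CUDA pairing.

```python
import shutil
from pathlib import Path

# System packages installed with apt-get in the steps above.
REQUIRED_TOOLS = ("ffmpeg", "fluidsynth")


def check_setup(repo_root="."):
    """Map each installation requirement to True (found) or False (missing).

    Hypothetical helper; run it from the cloned Video2Music directory.
    """
    root = Path(repo_root)
    report = {}
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    amt_dir = root / "saved_models" / "AMT"
    report["saved_models/AMT has files"] = amt_dir.is_dir() and any(
        p.is_file() for p in amt_dir.iterdir()
    )
    soundfont = root / "soundfonts" / "default_sound_font.sf2"
    report["soundfonts/default_sound_font.sf2"] = soundfont.is_file()
    return report


if __name__ == "__main__":
    for item, ok in check_setup().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {item}")
```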
Dataset
- Obtain the dataset:
  - MuVi-Sync (Link)
- Put all directories in the dataset whose names start with vevo under dataset/.
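To confirm the dataset landed in the right place, one option is to list the vevo-prefixed directories under dataset/. The sketch below assumes only what the step above states (directory names start with "vevo"); the function name `find_vevo_dirs` is hypothetical and the script does not inspect what the directories contain.

```python
from pathlib import Path


def find_vevo_dirs(dataset_root="dataset"):
    """Return the sorted names of sub-directories whose names start with 'vevo'.

    Hypothetical helper, not part of the repository. Returns an empty list
    when the dataset folder does not exist.
    """
    root = Path(dataset_root)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_dir() and p.name.startswith("vevo")
    )


if __name__ == "__main__":
    found = find_vevo_dirs()
    if found:
        print("found:", ", ".join(found))
    else:
        print("no vevo* directories under dataset/")
```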
Directory Structure
- saved_models/: saved model files
- utilities/
  - run_model_vevo.py: code for running model (AMT)
  - run_model_regression.py: code for running model (bi-GRU)
- model/
  - video_music_transformer.py: Affective Multimodal Transformer (AMT) model
  - video_regression.py: Bi-GRU regression model used for predicting note density/loudness
  - positional_encoding.py: code for positional encoding
  - rpr.py: code for RPR (Relative Positional Representation)
- dataset/
  - vevo_dataset.py: dataset loader
- script/: code for extracting video/music features (semantic, motion, emotion, scene offset, loudness, and note density)
- train.py: training script (AMT)
- train_regression.py: training script (bi-GRU)
- evaluate.py: evaluation script
- generate.py: inference script
- video2music.py: Video2Music module that outputs video with generated background music from input video
- demo.ipynb: Jupyter notebook for Quickstart Guide
Training
python train.py
Inference
python generate.py
Subjective Evaluation by Listeners
| Model | Overall Music Quality ↑ | Music-Video Correspondence ↑ | Harmonic Matching ↑ | Rhythmic Matching ↑ | Loudness Matching ↑ |
|---|---|---|---|---|---|
| Music Transformer | 3.4905 | 2.7476 | 2.6333 | 2.8476 | 3.1286 |
| Video2Music | 4.2095 | 3.6667 | 3.4143 | 3.8714 | 3.8143 |
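As a worked reading of the table, the per-criterion gap between the two models can be computed directly from the reported scores. The snippet is illustrative arithmetic only: the values are copied from the table above, and the names `SCORES` and `score_gains` are introduced here for the example.

```python
# Listener scores copied from the table above.
SCORES = {
    "Music Transformer": {
        "Overall Music Quality": 3.4905,
        "Music-Video Correspondence": 2.7476,
        "Harmonic Matching": 2.6333,
        "Rhythmic Matching": 2.8476,
        "Loudness Matching": 3.1286,
    },
    "Video2Music": {
        "Overall Music Quality": 4.2095,
        "Music-Video Correspondence": 3.6667,
        "Harmonic Matching": 3.4143,
        "Rhythmic Matching": 3.8714,
        "Loudness Matching": 3.8143,
    },
}


def score_gains(baseline, proposed):
    """Return proposed minus baseline for each criterion, rounded to 4 places."""
    return {name: round(proposed[name] - baseline[name], 4) for name in baseline}


gains = score_gains(SCORES["Music Transformer"], SCORES["Video2Music"])
largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: +{gain:.4f}")
    print("largest gain:", largest)
    print("smallest gain:", smallest)
```

Video2Music scores higher on every criterion. The largest gap is in Rhythmic Matching (1.0238 points) and the smallest is in Loudness Matching (0.6857 points).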
TODO
- Add other instruments (e.g., drum) for live demo
Citation
If you find this resource useful, please cite the original work:
@article{KANG2024123640,
title = {Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model},
author = {Jaeyong Kang and Soujanya Poria and Dorien Herremans},
journal = {Expert Systems with Applications},
pages = {123640},
year = {2024},
issn = {0957-4174},
doi = {https://doi.org/10.1016/j.eswa.2024.123640},
}
Kang, J., Poria, S. & Herremans, D. (2024). Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model, Expert Systems with Applications (in press).
Acknowledgements
Our code is based on Music Transformer.