Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

April 19, 2024


Demo | Website and Examples | Paper | Dataset (MuVi-Sync)

This repository contains the code and dataset accompanying the paper "Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model" by Dr. Jaeyong Kang, Prof. Soujanya Poria, and Prof. Dorien Herremans.

🔥 Live demo available on HuggingFace and Replicate.

Introduction

We propose Video2Music, a multimodal music generation framework that uses video features as conditioning input to a Transformer architecture in order to generate music that matches the video. The system aims to give video creators an efficient way to generate tailor-made background music.

Change Log

2023-11-28: Added a new input method (YouTube URL) to the HuggingFace demo

Quickstart Guide

Generate music from video:

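import IPython
from video2music import Video2music

# Path to the input video
input_video = "input.mp4"

# Chord sequence used as the primer, and the musical key
input_primer = "C Am F G"
input_key = "C major"

# Load the model and generate a video with background music
video2music = Video2music()
output_filename = video2music.generate(input_video, input_primer, input_key)

# Display the resulting video in the notebook
IPython.display.Video(output_filename)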
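
To process several videos with the same primer and key, the Quickstart call can be wrapped in a small loop. The sketch below is illustrative only: the helper name `generate_for_videos` is hypothetical and not part of this repository, and it assumes nothing about the model beyond the `generate(input_video, input_primer, input_key)` call shown above.

```python
from pathlib import Path
from typing import Dict, Iterable


def generate_for_videos(generator, videos: Iterable[str],
                        primer: str = "C Am F G",
                        key: str = "C major") -> Dict[str, str]:
    """Run generator.generate(video, primer, key) for each existing video.

    `generator` is any object exposing the generate() call from the
    Quickstart (for example, Video2music()). This helper is a hypothetical
    convenience wrapper, not part of the repository.
    """
    outputs = {}
    for video in videos:
        # Skip missing inputs instead of passing them to the model.
        if not Path(video).is_file():
            print(f"skipping missing file: {video}")
            continue
        outputs[video] = generator.generate(video, primer, key)
    return outputs


# Example usage (requires the repository to be installed):
# from video2music import Video2music
# results = generate_for_videos(Video2music(), ["a.mp4", "b.mp4"])
```

Passing the generator in as an argument means the model is loaded once and reused for every video.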

Installation

This repo was developed with Python 3.8.

apt-get update
apt-get install ffmpeg
apt-get install fluidsynth
git clone https://github.com/AMAAI-Lab/Video2Music
cd Video2Music
pip install -r requirements.txt

Download the processed training data AMT.zip from HERE, extract it, and put the two extracted files directly under saved_models/AMT/.
Download the soundfont file default_sound_font.sf2 from HERE and put it directly under soundfonts/.
Our code is built on PyTorch 1.12.1 (torch==1.12.1 in requirements.txt), but you may need to install a different torch build that matches your CUDA version.
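
The installation steps can be sanity-checked with a short script. The following is a minimal sketch, not part of the repository: the function names `check_files` and `check_torch` are hypothetical, and the paths are the ones named in the steps above. It only reports what it finds and does not install or download anything.

```python
import shutil
from pathlib import Path


def check_files(repo_root: str = ".") -> dict:
    """Report whether the tools and files from the install steps are present."""
    root = Path(repo_root)
    amt_dir = root / "saved_models" / "AMT"
    # The install steps mention two extracted files under saved_models/AMT/.
    n_model_files = (
        sum(1 for p in amt_dir.iterdir() if p.is_file()) if amt_dir.is_dir() else 0
    )
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "fluidsynth": shutil.which("fluidsynth") is not None,
        "model_files": n_model_files >= 2,
        "soundfont": (root / "soundfonts" / "default_sound_font.sf2").is_file(),
    }


def check_torch() -> dict:
    """Report the installed torch version and the CUDA version it was built for."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


# Example usage from the repository root:
# print(check_files("."))
# print(check_torch())
```

If `cuda_available` is False on a machine with a GPU, the installed torch build probably does not match the local CUDA setup.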

Dataset

Obtain the dataset:
- MuVi-Sync (Link)
Put all directories starting with vevo from the dataset under dataset/.
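
To confirm the dataset landed in the right place, the directories can be listed from Python. This is a small sketch under the layout described above; the helper name `list_vevo_dirs` is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List


def list_vevo_dirs(dataset_root: str = "dataset") -> List[str]:
    """Return the sorted names of sub-directories whose names start with 'vevo'."""
    root = Path(dataset_root)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and p.name.startswith("vevo")
    )


# Example usage from the repository root:
# print(list_vevo_dirs("dataset"))
```

An empty list suggests the vevo directories were not copied, or were copied one level too deep.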

Directory Structure

saved_models/: saved model files
utilities/
- run_model_vevo.py: code for running model (AMT)
- run_model_regression.py: code for running model (bi-GRU)
model/
- video_music_transformer.py: Affective Multimodal Transformer (AMT) model
- video_regression.py: Bi-GRU regression model used for predicting note density/loudness
- positional_encoding.py: code for positional encoding
- rpr.py: code for RPR (Relative Positional Representation)
dataset/
- vevo_dataset.py: Dataset loader
script/: code for extracting video/music features (semantic, motion, emotion, scene offset, loudness, and note density)
train.py: training script (AMT)
train_regression.py: training script (bi-GRU)
evaluate.py: evaluation script
generate.py: inference script
video2music.py: Video2Music module that outputs video with generated background music from input video
demo.ipynb: Jupyter notebook for Quickstart Guide

Training

python train.py

Inference

python generate.py

Subjective Evaluation by Listeners

Model	Overall Music Quality ↑	Music-Video Correspondence ↑	Harmonic Matching ↑	Rhythmic Matching ↑	Loudness Matching ↑
Music Transformer	3.4905	2.7476	2.6333	2.8476	3.1286
Video2Music	4.2095	3.6667	3.4143	3.8714	3.8143
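
The per-criterion gap between the two models can be read off the table with a few lines of arithmetic. The sketch below copies the scores from the table above; the names `SCORES`, `METRICS`, and `score_gains` are illustrative and not part of the repository.

```python
METRICS = [
    "Overall Music Quality",
    "Music-Video Correspondence",
    "Harmonic Matching",
    "Rhythmic Matching",
    "Loudness Matching",
]

# Scores copied from the table above, in the order of METRICS.
SCORES = {
    "Music Transformer": [3.4905, 2.7476, 2.6333, 2.8476, 3.1286],
    "Video2Music": [4.2095, 3.6667, 3.4143, 3.8714, 3.8143],
}


def score_gains(baseline, model):
    """Return model minus baseline for each metric, rounded to 4 decimals."""
    return {
        metric: round(m - b, 4)
        for metric, b, m in zip(METRICS, baseline, model)
    }


gains = score_gains(SCORES["Music Transformer"], SCORES["Video2Music"])
for metric, gain in gains.items():
    print(f"{metric}: +{gain:.4f}")
```

For the table above, the largest gap is in Rhythmic Matching (about 1.02 points) and the smallest is in Loudness Matching (about 0.69 points).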

TODO

Add other instruments (e.g., drum) for live demo

Citation

If you find this resource useful, please cite the original work:

@article{KANG2024123640,
  title = {Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model},
  author = {Jaeyong Kang and Soujanya Poria and Dorien Herremans},
  journal = {Expert Systems with Applications},
  pages = {123640},
  year = {2024},
  issn = {0957-4174},
  doi = {https://doi.org/10.1016/j.eswa.2024.123640},
}

Kang, J., Poria, S. & Herremans, D. (2024). Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model, Expert Systems with Applications (in press).

Acknowledgements

Our code is based on Music Transformer.