LOLA
October 27, 2021 Β· View on GitHub
LOLA

Play with the project here
Welcome to my final project made to the context of CORE School.
LOLA is a tool for classifying audios to different flamenco styles (palo).
Table of contents
- Project and motivation π
- Dataset Used π
- Models used πͺ
- Local deployment π¨π§
- Future steps
- License
Project and motivation π
Flamenco is an art composed of guitar, singing, and dancing. Its origins come from the XVIII century.
Flamenco is divided into styles, called palos. Each of them has different guitar styles, rhythms, as well as lyrics.
Let me illustrate this splitting with an image.

Even this image does not contain all flamenco styles. If you take a look at this one:

There are some styles present only to one of them. All of this is just to illustrate how many different styles are to flamenco.
Even for experienced people, classifying specific audios is hard. That's why I decided to make an expert system for trying to solve this task.
I'm from Andalusia, where flamenco comes from, so contributing in any way to that culture it's a major motivation for me.
Dataset used π
The dataset is a self-built one. I manually selected, split and tagged audio files from three different palos:
- BulerΓas
- AlegrΓas
- Sevillanas
The total dataset is composed by 577 alegrias audios, 589 bulerias and 518 sevillanas.
I've built the largest flamenco dataset on the internet. You can download it from here
The dataset structure is:
.
βββ alegrias
β βββ 0.mp3
β βββ 1.mp3
β βββ ...
β βββ n.mp3
βββ bulerias
β βββ 0.mp3
β βββ 1.mp3
β βββ ...
β βββ n.mp3
βββ sevillanas
βββ 0.mp3
βββ 1.mp3
βββ ...
βββ n.mp3
There is another flamenco dataset available, called TONAS. But there are only 72 acapella audio samples.
Models used πͺ
I've used two different approaches to this project. The first is a pure deep learning one, while the latter is a hybrid between a deep learning and a classical machine learning model.
Deep learning π

This approach takes an audio file as input. That audio is split into portions of 10-15 seconds each, with overlapping.
For each sample, a tempogram, spectrogram, and normalized tempogram are extracted. Each of those images is fed into a convolutional neural network. That network recognizes the images and outputs a prediction for each of them.
Finally, the label with more occurrences among all predictions is taken as the final one.
Machine learning hybrid π½

The second approach is a hybrid process involving a neural network called VGGish, which takes an audio, splits it into samples of 0.9 seconds, and, for each sample, generates a 128-lenght vector of numerical features.
Those vectors are fed into a random forest, which generates a prediction.
Local deployment π¨π§
Docker π³
To deploy this project using Docker, you only have to clone the repository and bring it up with docker-compose.
git clone https://github.com/alesanmed/lola.git
cd lola
docker-compose up
From source β²
If you want to take this path, I recommend:
- Clone the repository
- Open it to your favorite IDE
- Follow the steps to each part's README