
Audio Classification

We have all used streaming apps to listen to music. What is the app's logic for creating a personalized playlist of genres for us? One answer is a music genre classification system. The aim of this project is to show how to handle audio files in Python, create audio features from them, and run a deep learning algorithm on them to see the results.
Furthermore, I wanted to create a machine learning model that classifies music samples into different genres, predicting the genre from an audio signal given as input. The objective of automating music classification is to make the selection of songs quick and less cumbersome. If one had to classify songs manually, one would have to listen to every song in full and then assign a genre, which is not only time-consuming but also difficult.


Data content

The dataset contains 1000 audio tracks, each 30 seconds long, covering 10 genres with 100 tracks per genre. The 10 genres are:

  • Blues

  • Classical

  • Country

  • Disco

  • Hip-hop

  • Jazz

  • Metal

  • Pop

  • Reggae

  • Rock


Data wrangling and visualization


The challenge with this data was that sound is a continuous signal, while computers can only store discrete data, so the audio had to be digitized in a way that preserves the continuous waveform as faithfully as possible. The solution is sampling: storing the signal's amplitude at fixed, very short intervals. I used a Python library called librosa to sample the data at fixed short intervals, on the order of 0.02 seconds. Using the Python os library with numpy and pandas, I generated a dataframe referencing the audio files. The data consisted of 1000 audios of 30 seconds each, and since more data is generally better, I used the pylab library to split each audio file into ten clips of 3 seconds each. Librosa was also used to view the waveform of the sound, with time on the x-axis and amplitude on the y-axis. Finally, using the mfcc method of the librosa library, I extracted Mel-frequency cepstral coefficient (MFCC) features from the audio. These features are numerical and hence allowed me to build a model from them.
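
Below is a minimal sketch of that pipeline. The directory layout (GTZAN-style genre folders), the choice of 20 MFCC coefficients, and averaging each coefficient over time are illustrative assumptions on my part, not necessarily the exact settings used here.

import os

import librosa
import pandas as pd

# Assumed GTZAN-style layout: genres_original/<genre>/<track>.wav
DATA_DIR = "genres_original"
SEGMENT_SECONDS = 3   # each 30 s track becomes ten 3 s clips
N_MFCC = 20           # number of MFCC coefficients (illustrative choice)

rows = []
for genre in sorted(os.listdir(DATA_DIR)):
    genre_dir = os.path.join(DATA_DIR, genre)
    for fname in sorted(os.listdir(genre_dir)):
        path = os.path.join(genre_dir, fname)
        # librosa.load resamples to 22050 Hz by default, turning the
        # continuous waveform into a discrete numpy array
        signal, sr = librosa.load(path)
        samples_per_segment = SEGMENT_SECONDS * sr
        for i in range(10):
            segment = signal[i * samples_per_segment:(i + 1) * samples_per_segment]
            if len(segment) < samples_per_segment:
                break  # skip a trailing partial clip
            # MFCCs have shape (n_mfcc, n_frames); averaging over the
            # time axis gives one fixed-length feature vector per clip
            mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=N_MFCC)
            rows.append([path, genre, *mfcc.mean(axis=1)])

columns = ["file", "genre"] + [f"mfcc{i}" for i in range(N_MFCC)]
df = pd.DataFrame(rows, columns=columns)

On any loaded clip, the waveform view mentioned above can be reproduced with librosa.display.waveshow(signal, sr=sr) in recent versions of librosa.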


Creating the machine learning model


The extracted features were converted to numpy arrays, which were in turn converted to tensors. These tensors were separated into an input column (the features) and a target column (the genre labels), and then divided into a training set, a validation set, and a test set in the ratio 60:20:20, each wrapped in a tensor dataset. Data loaders were created so that, rather than presenting all the data to the model at once, a few batches are loaded at a time, which makes the learning process more efficient. The model was a convolutional neural network with four convolutional blocks, two residual blocks, and a classifier with two hidden layers. It was trained for 30 epochs with torch.optim.Adam as the optimizer and reached a final accuracy of 85%.
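
The sketch below illustrates that training setup in PyTorch. The layer sizes, batch size, and learning rate are illustrative guesses, and the placeholder tensors X and y stand in for the real MFCC features and genre labels, treated here as one-channel 2-D inputs so the convolutional blocks have an image-like input to work on.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholders for the real features and labels: 10 000 clips
# (1000 tracks x 10 segments), each a 20 x 130 MFCC matrix with
# one channel, and integer genre labels 0-9
X = torch.randn(10000, 1, 20, 130)
y = torch.randint(0, 10, (10000,))

def conv_block(in_ch, out_ch, pool=False):
    # convolution -> batch norm -> ReLU, optionally halving the feature map
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.BatchNorm2d(out_ch),
              nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class GenreCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        # four convolutional blocks ...
        self.conv1 = conv_block(1, 32)
        self.conv2 = conv_block(32, 64, pool=True)
        self.res1 = nn.Sequential(conv_block(64, 64), conv_block(64, 64))
        self.conv3 = conv_block(64, 128, pool=True)
        self.conv4 = conv_block(128, 256, pool=True)
        self.res2 = nn.Sequential(conv_block(256, 256), conv_block(256, 256))
        # ... and a classifier with two hidden layers
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 64), nn.ReLU(inplace=True),
            nn.Linear(64, n_classes))

    def forward(self, x):
        x = self.conv2(self.conv1(x))
        x = self.res1(x) + x      # first residual block
        x = self.conv4(self.conv3(x))
        x = self.res2(x) + x      # second residual block
        return self.classifier(x)

# 60:20:20 split into training, validation and test sets
dataset = TensorDataset(X, y)
n_train, n_val = int(0.6 * len(dataset)), int(0.2 * len(dataset))
train_ds, val_ds, test_ds = random_split(
    dataset, [n_train, n_val, len(dataset) - n_train - n_val])

# batches of 64 so the model sees a few examples at a time
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=64)

model = GenreCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(30):
    model.train()
    for xb, yb in train_dl:
        loss = loss_fn(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Adding a block's input back to its output (x = self.res1(x) + x) is what makes res1 and res2 residual blocks: each block only has to learn a correction to its input, which tends to stabilize training as the network gets deeper.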
