Music Genre Classification
Definition
Project Overview:
Automatic music classification/recognition is an area that is widely used in many commercial applications such as Shazam, Google Play Music and Sony TrackID. All these applications have one thing in common: they aim to understand the semantics of the music rather than just curating its metadata. To develop an advanced and intelligent music player, there is a large semantic gap to bridge between audio signal processing and listener preference. Most cloud-based music providers use collaborative filtering and sound metadata to recommend the next song to the listener, but these cannot close the gap to the listener's preferences, i.e. genre, mood, lyrics, instrumentation, rhythm, recording time, etc.
Thus, we focus on a major determinant of a listener's preference: the genre of the music. Music genre classification is a problem in MIR (Music Information Retrieval) that has been addressed by many signal processing techniques combined with standard machine learning algorithms. However, traditional signal processing techniques for extracting features do not add much advantage for typical clustering and classification problems, because we may miss many important features during feature extraction and selection. Deep learning has proven its importance in image processing by learning features directly from image pixels; here, we aim to adopt a similar approach for audio processing.
Applications
1. Music tag recommenders (genre-based tagging)
2. Personalized song recommender systems (e.g. Spotify)
3. Music teaching applications that help students understand what genres are all about
4. Automatic meta-tagging of songs
5. Segmenting genres in music players or on local PCs for organized storage
6. Song indexing systems that use genre as the key of a hash table or a node of a B-tree
Problem Statement:
The goal of this project is to develop an algorithm that takes a song and assigns a genre to it. Music can be categorized into many different genres; we aim to recognize ten major genres automatically from the input song.
Since there is a discrete number of classes, this is a classification problem with 10 genre classes. As a multiclass supervised learning classification problem, we could start with fairly simple machine learning algorithms, but because there are many features we cannot engineer well without a good understanding of audio signal processing, deep learning has been chosen to build the classifier.
After training the classifier, we aim to build a system that can classify the genre when someone inputs a raw music file. Several things need to be done to build such a system: converting the audio into *.au or *.wav format, resampling it, and extracting the features that are used both to train the model and to predict on new inputs.
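A minimal sketch of this loading/resampling step, assuming librosa is used (the report does not name a library; the file path is hypothetical and the 22050 Hz target matches the training data described below):

```python
# Pre-processing sketch: load any supported audio file (*.au, *.wav, *.mp3),
# downmix to mono and resample to the rate used by the training data.
import librosa

def load_audio(path, target_sr=22050):
    # librosa resamples to target_sr and downmixes to mono on load
    y, sr = librosa.load(path, sr=target_sr, mono=True)
    return y, sr

y, sr = load_audio("some_song.wav")   # hypothetical input file
print(len(y) / sr, "seconds at", sr, "Hz")
```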
Analysis
Data Exploration:
The dataset used for training is the GTZAN dataset, introduced in "Musical Genre Classification of Audio Signals" by G. Tzanetakis and P. Cook [6]. The files were collected from different sources under varied recording conditions, for better generalization.
The dataset consists of 100 songs for each of 10 genres, namely:
1. Blues
2. Classical
3. Country
4. Disco
5. Hiphop
6. Jazz
7. Metal
8. Pop
9. Reggae
10. Rock
Each song is 30 seconds long, 22050 Hz mono 16-bit, in *.au format.
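As a concrete illustration, a sketch of iterating over the dataset, assuming the common genres/&lt;genre&gt;/&lt;genre&gt;.000XX.au directory layout (an assumption; adjust the root path to wherever the dataset is unpacked):

```python
# Iterate the GTZAN files genre by genre, yielding (signal, sample rate, label).
import os
import librosa

GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

def iter_dataset(root="genres"):
    for label, genre in enumerate(GENRES):
        genre_dir = os.path.join(root, genre)
        for fname in sorted(os.listdir(genre_dir)):
            if fname.endswith(".au"):
                y, sr = librosa.load(os.path.join(genre_dir, fname), sr=22050)
                yield y, sr, label
```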
Exploratory Visualization
Audio and songs can be represented as time series of frequencies. To explore the data, we visualize several representations:
1. Raw features: many audio features can be computed and visualized directly from the raw audio signal. The simplest are features of the pulse-code-modulated (PCM) digital signal itself.
2. Spectrogram: a spectrogram is a representation of the signal's frequencies along the time axis.
3. Log power spectrogram: the power spectral density distribution of the signal along the time axis, on a logarithmic scale.
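As an illustration of these plots, a minimal sketch using librosa and matplotlib (an assumption; the report does not show its plotting code, and the file path is hypothetical):

```python
# Plot the raw waveform and the log power spectrogram of one song.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("genres/classical/classical.00000.au", sr=22050)

plt.figure(figsize=(10, 6))
plt.subplot(2, 1, 1)
librosa.display.waveshow(y, sr=sr)   # raw PCM waveform (waveplot in older librosa)
plt.title("Wave Plot")

plt.subplot(2, 1, 2)
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log")
plt.title("Log Power Spectrogram")
plt.tight_layout()
plt.show()
```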
In the following figures we can clearly see that the waveforms, discrete Fourier transforms (DFT) and other signal features differ by genre. In this project we aim to learn exactly such patterns with machine learning algorithms. Even by eyeballing, different patterns can be observed in the signals, so it is evidently possible to learn patterns that discriminate between the genres.
Figure 1, Genre (Classical): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 2, Genre (Country): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 3, Genre (Disco): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 4, Genre (HipHop): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 5, Genre (Jazz): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 6, Genre (Metal): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 7, Genre (Pop): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 8, Genre (Reggae): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 9, Genre (Rock): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 10, Genre (Blues): Wave Plot, ZCR, FFT, Mel Power Spectrum
Comparing a few genres on the FFT alone, we see that for classical music there is not much fluctuation in the frequencies throughout the song, whereas other genres show time-varying frequency fluctuations. The Mel spectrum of metal shows very high energy intensity, and its ZCR reflects the shrillness of the sound, with zero crossings occurring very frequently.
Metrics
While solving any problem, choosing the evaluation technique is an equally important task. Since genre classification is a classification problem and the accuracy of the system matters most to us, the metrics are defined as follows.
A clean and unambiguous way to present the prediction results of a classifier is a confusion matrix. For a binary classification problem, the table has 2 rows and 2 columns: across the top are the observed class labels and down the side are the predicted class labels. Each cell contains the number of predictions made by the classifier that fall into that cell.
Considering the following values of the confusion matrix for binary classification:
a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.
The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the following equation:

$$AC = \frac{a + d}{a + b + c + d}$$
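A minimal sketch of this metric with scikit-learn (an assumption; any library or hand-rolled count gives the same numbers). Note that scikit-learn puts observed labels on the rows and predictions on the columns, the transpose of the layout described above; accuracy is unaffected:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 1, 1, 0, 1]   # hypothetical observed binary labels
y_pred = [0, 1, 0, 0, 1]   # hypothetical predictions

# For binary labels, ravel() yields the a, b, c, d of the text:
# a correct negatives, b incorrect positives, c incorrect negatives, d correct positives
a, b, c, d = confusion_matrix(y_true, y_pred).ravel()
print("AC =", (a + d) / (a + b + c + d))        # the equation above
print("AC =", accuracy_score(y_true, y_pred))   # same value
```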
Algorithms and Techniques
This project takes a pattern-recognition approach to music classification using deep learning techniques. Logistic regression, a feed-forward neural network and a convolutional neural network have been evaluated against each other. The first step of such recognition systems is to extract features from the audio data, in order to work with more meaningful information and to reduce further processing. Indeed, once significant features are extracted, any classification scheme may be used. In the case of audio signals, features may relate to the main dimensions of music, including melody, harmony, rhythm, timbre and spatial location.
Feature Extraction
1. Zero-Crossing Rate (ZCR): defined as the number of zero crossings of the signal in the time domain; it is a measure of the noisiness of the signal and is correlated with pitch. Its computation is simple and has low computational complexity.
2. FFT coefficients: the feature vector is simply the vector of FFT coefficients.
a. The Fast Fourier Transform (FFT) is a fast algorithm for computing the discrete Fourier transform (DFT). Computed directly, the DFT requires O(N^2) operations: there are N outputs X_k, and each output requires a sum of N terms. An FFT is any method that computes the same results in O(N log N) operations.
DFT: let x_0, ..., x_{N-1} be complex numbers; then

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \ldots, N-1.$$
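A small numerical check of the two approaches, contrasting the direct O(N^2) DFT sum with numpy's O(N log N) FFT (an illustrative sketch; the signal frame is random):

```python
import numpy as np

x = np.random.randn(1024)      # a hypothetical signal frame
N = len(x)

# Direct DFT: X_k = sum_n x_n * exp(-2*pi*i*k*n/N), O(N^2)
n = np.arange(N)
k = n.reshape(-1, 1)
X_dft = (x * np.exp(-2j * np.pi * k * n / N)).sum(axis=1)

X_fft = np.fft.fft(x)          # fast algorithm, same result in O(N log N)
assert np.allclose(X_dft, X_fft)

magnitudes = np.abs(X_fft[:N // 2])   # FFT-magnitude feature vector
```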
3. Mel-Frequency Cepstrum Coefficients (MFCCs): perceptually motivated features obtained by taking the log-amplitude of the magnitude spectrum, warping the spectrum onto a perceptual frequency scale (the Mel frequency scale), and applying a discrete cosine transform to the Mel coefficients to decorrelate the resulting feature vector. Thirteen coefficients are commonly used for speech recognition.
4. Spectral Centroid: describes the centre of gravity of the power spectrum. It is a compact description of the shape of the power spectrum: it indicates whether the spectrum is dominated by low or high frequencies, and moreover it is correlated with an important perceptual dimension of timbre, i.e. sharpness. It is calculated as the weighted mean of the frequencies present in the signal, determined using a Fourier transform, with their magnitudes as the weights:
$$\mathrm{Centroid} = \frac{\sum_{n=0}^{N-1} f(n)\, x(n)}{\sum_{n=0}^{N-1} x(n)}$$
where x(n) represents the weighted frequency value, or magnitude, of bin number n, and f(n)
represents the centre frequency of that bin.
5. Spectral Roll-Off: a measure of spectral shape, defined as the frequency below which most of the power spectrum is concentrated (typically 85% of the spectral power).
6. Spectrum Spread: describes the second moment of the power spectrum. It is a compact descriptor of the shape of the power spectrum, indicating whether it is concentrated near its centroid or spread out over the spectrum; it allows distinguishing between tone-like and noise-like sounds.
7. Chroma: chroma features are an interesting and powerful representation of music audio in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave. Since, in music, notes exactly one octave apart are perceived as particularly similar, knowing the distribution of chroma even without the absolute frequency (i.e. the original octave) can give useful musical information about the audio, and may even reveal perceived musical similarity that is not apparent in the original spectra.
8. Tonnetz: Tonnetz can be used to show traditional harmonic relationships.
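A minimal sketch of extracting the features listed above with librosa (a plausible implementation, not the report's exact code; spectrum spread is computed here as librosa's spectral bandwidth):

```python
import numpy as np
import librosa

def extract_features(path):
    y, sr = librosa.load(path, sr=22050)
    stft = np.abs(librosa.stft(y))
    zcr      = librosa.feature.zero_crossing_rate(y)                 # feature 1
    mfccs    = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # feature 3
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)         # feature 4
    rolloff  = librosa.feature.spectral_rolloff(y=y, sr=sr)          # feature 5
    spread   = librosa.feature.spectral_bandwidth(y=y, sr=sr)        # feature 6
    chroma   = librosa.feature.chroma_stft(S=stft, sr=sr)            # feature 7
    tonnetz  = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)  # feature 8
    # Average each feature over time to obtain one fixed-length vector per song.
    return np.hstack([f.mean(axis=1) for f in
                      (zcr, mfccs, centroid, rolloff, spread, chroma, tonnetz)])
```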
Data Modelling
Figure 11, Schematic Logistic Regression classifier, Python Machine Learning by Sebastian Raschka
Neural networks are closely related to logistic regression: we can think of logistic regression as a one-layer neural network. It is very common to use logistic sigmoid functions as activation functions in the hidden layer of a neural network, as in the schematic above but without the threshold function. Therefore, we started building our classifier with logistic regression with a softmax layer, then tried deeper feed-forward neural networks, and finally built a deeper network with convolution layers, activation layers and fully connected layers to model more complex relationships, while keeping the model from overfitting by introducing regularization such as dropout.
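A minimal sketch of this softmax logistic-regression baseline in Keras (the framework is an assumption, as the report does not name one; the 193-dimensional input matches the first feed-forward entry in the results table below):

```python
from tensorflow import keras

n_features, n_classes = 193, 10

model = keras.Sequential([
    # One dense layer with softmax is exactly multiclass logistic regression.
    keras.layers.Dense(n_classes, activation="softmax",
                       input_shape=(n_features,)),
])
model.compile(optimizer="sgd",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```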
Benchmark
In Tao Feng's paper on deep learning for music genre classification [1], a 4-class classification model was implemented with a training accuracy of 94% and a test accuracy of 61%, using a restricted Boltzmann machine. We aim to develop a full 10-class classification system with some improvements.
That paper's deep neural network approach appears to overfit the 4-class task, since the training accuracy is good but the test accuracy is not. Here we compare a few other architectures with different sets of features, testing on all ten classes.
Methodology
Implementation
The following diagram explains the complete training process, from reading the audio files through feature extraction to feeding the Convolutional Neural Network for model training. A few steps are involved in building a song classifier:
1. Pre-processing: convert the audio wave into a vector of acoustic coefficients, extracting a new vector about every few milliseconds.
2. Acoustic feature extraction: extract FFT, MFCCs, spectral centroids, and other features.
3. Data preparation: stack all the features together into an array, applying zero padding wherever necessary to make all feature vectors equal in dimension, and one-hot encode the labels corresponding to each song.
4. Split the data into train and test datasets (a sketch of steps 3 and 4 follows this list).
5. Data modelling: fit the model that best describes the genre from the acoustic data.
a. Training the classifier using:
i. Simple logistic regression
ii. Feed-forward networks with different hidden layers
iii. A convolutional neural network
b. Model evaluation using accuracy
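A minimal sketch of steps 3 and 4 above (the split ratio and the dummy feature vectors are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def pad_to(v, length):
    # Zero-pad a 1-D feature vector up to a common length (step 3).
    return np.pad(v, (0, length - len(v)))

rng = np.random.default_rng(0)
feature_vectors = [rng.standard_normal(rng.integers(180, 194)) for _ in range(100)]
labels = rng.integers(0, 10, size=100)        # integer genre ids 0..9

max_len = max(len(v) for v in feature_vectors)
X = np.vstack([pad_to(v, max_len) for v in feature_vectors])
Y = np.eye(10)[labels]                        # one-hot encode the 10 genres

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)     # step 4
```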
Figure 12, Training pipeline: Input Song (*.mp3/*.au) -> Read File -> Features (DTFT/STFT, Mel Frequencies, MFCCs, Chromas, Tonnetz) -> Neural Network
Data Modelling
As this is a supervised learning classification problem, we initially tried a fairly simple model, multiclass logistic regression, then a feed-forward neural network with pairs of different acoustic features, and then implemented convolutions, with the notion of weight sharing across the different convolutions of the audio features, along with activation layers and fully connected layers.
Three modelling techniques have been tried:
1. Logistic Regression
2. Feed-Forward Neural Network
3. Convolutional Neural Network: rather than being limited to the known and well-researched audio features, there can be other levels of information/features that are difficult to find by hand, and a convolutional neural network plays an important role in finding such features. Because of resource constraints, we could not go very deep with the architecture (a sketch of the architecture appears below).
After training and evaluating a few models, the final model was chosen for use in the application to predict the genre of a song that a user uploads.
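A minimal Keras sketch of the convolutional architecture from row 6 of the results table below; reading "30x1x20" as a 1-D convolution with kernel size 30 and 20 filters, and the input length, are assumptions:

```python
from tensorflow import keras

n_frames = 128   # hypothetical length of the log mel-spectrogram time axis

model = keras.Sequential([
    keras.layers.Input(shape=(n_frames, 1)),
    keras.layers.Conv1D(filters=20, kernel_size=30, activation="relu"),
    keras.layers.Dropout(0.2),                     # regularization (see Refinements)
    keras.layers.Flatten(),
    keras.layers.Dense(280, activation="relu"),    # 280-node fully connected layer
    keras.layers.Dense(10, activation="softmax"),  # one output node per genre
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```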
Refinements:
While training the networks, there were instances where the model was overfitting. Initially, with the feed-forward neural network, we found that as we kept increasing the number of hidden layers, beyond a certain point the model starts overfitting; therefore, the number of hidden layers was limited to two.
A similar problem also arose with the convolutional neural network, so we introduced a regularization technique, dropout with a keep probability of 0.2, which keeps the model from overfitting.
When the same feature set (tonnetz, chroma, STFT, MFCC) that was used to train the feed-forward network was used to train the convolutional network, the model was underfitting; after experimenting with features, only the log power amplitude of the Mel spectrum is used with the CNN.
Results
Model Evaluation and Validation
Accuracy has been calculated on the testing set for each model, and the best model is chosen based on test-set accuracy alone.
Classification Accuracy:
#    Model                                                                     Test Accuracy
1.   Logistic Regression                                                       0.11
2.   Input layer -> 193-node fully connected layer -> 10 output nodes          0.64
3.   Input layer -> 280-node fully connected layer -> 10 output nodes          0.62
4.   Input layer -> 280-node fully connected layer -> 300-node hidden
     layer -> 10 output nodes                                                  0.67
5.   Input layer -> 280-node fully connected layer -> 300-node fully
     connected layer -> 300-node fully connected layer -> 10 output nodes      0.615
6.   Input layer -> Convolution (30x1x20) -> Dropout (0.2) -> 280-node
     fully connected layer -> 10 output nodes                                  0.72
The following graph shows that as we increase the number of training iterations, the loss keeps decreasing, and all the architectures follow the same pattern. We had to restrict training to 5000 iterations due to hardware constraints, although for a few architectures the loss may take more than 5000 iterations to saturate. The graph shows that the training error keeps decreasing with increasing model complexity, where we treat the number of iterations as one of the model complexity parameters.
Figure 13, Training Loss vs Training Iterations
Compared to Tao Feng's paper, whose four-genre classifier overfit (94% training accuracy against 61% test accuracy), we evaluated a 10-genre classifier that achieves a test accuracy of 72%, which is a better estimate of generalization.
Conclusions
In this project, many signal processing techniques have been used to extract useful features from signals, which were then used to model the classifier with deep neural network architectures. Convolutional neural networks showed a significant improvement, as expected, and generalize better than the approach in Tao Feng's paper. There were some challenges: initially the feed-forward neural networks were not giving good results, even after increasing the nodes and the hidden layers, but convnets solved these issues.
Convolutional neural networks have once again proved their strength over feed-forward neural networks in such high-dimensional classification problems, so their use in such areas can be explored even further. We conclude that the same neural network concepts can be applied across domains and are not restricted to images. Moreover, using many audio features is better than using just the MFCCs or just the chroma; using all the extracted features collectively provided good accuracy.
Comparing the different models shows the contrast between the architectures, CNN vs feed-forward NN. As it is difficult to engineer all the features by hand, convolutional neural networks help us learn features as we go deeper into the network, though the training time increases drastically.
Thus, instead of extracting different combinations of acoustic features and feeding them to a feed-forward neural network, a convolutional neural network with just the log amplitude of the Mel spectrum gave better performance. This supports the notion that using the right convolutions helps extract and select the features relevant to the problem, rather than using all the features.
Future Scope
Some more architectures and techniques can be explored:
1. Cross-validation techniques such as k-fold, which we could not use because of resource constraints (see the sketch after this list).
2. Hyperparameter tuning, which will definitely help to find the best classifier.
3. Using more features, and creating more data points by changing the pitch of the songs and other genre-independent properties, to make the model more generic.
4. Further regularization techniques, beyond the dropout used here, to avoid overfitting.
5. Recurrent neural networks with Long Short-Term Memory (LSTM) units and Gated Recurrent Units (GRUs), trained with a Connectionist Temporal Classification (CTC) loss.
6. Stacking convolutional neural networks with recurrent neural networks can be very useful: the convolutional network helps extract useful features, and the recurrent network models the temporal aspects of the signal.
7. Recently, attentional encoders have been used for text and images; similar architectures could be designed for audio.
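As an illustration of item 1, a minimal k-fold cross-validation sketch with scikit-learn; X, Y and build_model() are placeholders for the data and any of the Keras models sketched above:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, Y, build_model, k=5, epochs=50):
    scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True,
                                    random_state=42).split(X):
        model = build_model()                     # fresh model per fold
        model.fit(X[train_idx], Y[train_idx], epochs=epochs, verbose=0)
        _, acc = model.evaluate(X[val_idx], Y[val_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores)), float(np.std(scores))
```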
References
[1]. Feng, Tao. "Deep Learning for Music Genre Classification."
[2]. The Echo Nest - Powering Intelligent Digital Music Applications. The Intelligent Music Application Platform. Web. 07 May 2012. http://the.echonest.com/
[3]. Last.fm. Web. 07 May 2012. http://www.last.fm/
[4]. Lee, Rachel, Ryan Walker, Lisa Meeden, and James Marshall. "Category-Based Intrinsic Motivation." Proceedings of the Ninth International Conference on Epigenetic Robotics, 2009.
[5]. Fritzke, Bernd. "A Growing Neural Gas Learns Topologies." Advances in Neural Information Processing Systems 7, 1995.
[6]. Tzanetakis, George, and Perry Cook. "Musical Genre Classification of Audio Signals." IEEE Transactions on Speech and Audio Processing 10(5), July 2002.
[7]. Gjerdingen, Robert O., and David Perrot. "Scanning the Dial: The Rapid Recognition of Music Genres." Journal of New Music Research 37(2), 2008, pp. 93-100.
[8]. MathWorks Forum. https://www.mathworks.com/matlabcentral/newsreader/view_thread/18880?requestedDomain=www.mathworks.com