Music Genre Classification
Definition
Project Overview:
Automatic music classification/recognition is an area that is widely used in many commercial applications such as Shazam, Google Play Music and Sony TrackID. All these applications have one thing in common: they aim to understand the semantics of the music rather than just curating its metadata. To develop an advanced and intelligent music player, there is a large semantic gap to bridge between audio signal processing and listener preference. Most cloud-based music providers use collaborative filtering and sound metadata to recommend the next song to the listener, but these cannot close the gap to the listener's preferences, i.e. genre, mood, lyrics, instrumentation, rhythm, recording time, etc.
Thus, we focus on a major determinant of a listener's preference: the genre of the music. Music genre classification is a problem in MIR (Music Information Retrieval) that has been addressed by many signal processing techniques combined with standard machine learning algorithms. However, traditional signal processing techniques for extracting features do not add much advantage for typical clustering and classification problems, because we may miss many important features during feature extraction and selection. Deep learning has proven its importance in image processing by learning features directly from image pixels; here, we aim to adopt a similar approach for audio processing.
Applications
1. Music tag recommenders (genre-based tagging)
2. Personalized song recommender systems (e.g. Spotify)
3. Music teaching applications that help students understand what genres are all about
4. Automatic meta-tagging of songs
5. Segmenting genres in music players or on local PCs for organized storage
6. Song indexing systems that use genre as the key of a hash table or a node of a B-tree
Problem Statement:
The goal of this project is to develop an algorithm that takes a song and assigns a genre to it. Music can be categorized into many different genres; we aim to recognize ten major genres automatically from the input song.
Since there is a discrete number of classes, this is a classification problem with 10 genre classes. As a multiclass supervised learning classification problem, we could start with fairly simple machine learning algorithms, but because there are many features we cannot engineer well without a good understanding of audio signal processing, deep learning has been chosen to build the classifier.
After training the classifier, we aim to build a system that can classify the genre when someone inputs a raw music file. Several things need to be done to build such a system: converting the audio into *.au or *.wav format, resampling it, and extracting the features that are used both to train the model and to predict on new inputs.
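A minimal sketch of this loading/resampling step, assuming librosa is used (the report does not name a library; the file path is hypothetical and the 22050 Hz target matches the training data described below):

```python
# Pre-processing sketch: load any supported audio file (*.au, *.wav, *.mp3),
# downmix to mono and resample to the rate used by the training data.
import librosa

def load_audio(path, target_sr=22050):
    # librosa resamples to target_sr and downmixes to mono on load
    y, sr = librosa.load(path, sr=target_sr, mono=True)
    return y, sr

y, sr = load_audio("some_song.wav")   # hypothetical input file
print(len(y) / sr, "seconds at", sr, "Hz")
```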
Analysis
Data Exploration:
The dataset used for training is the GTZAN dataset, introduced in "Musical Genre Classification of Audio Signals" by G. Tzanetakis and P. Cook [6]. The files were collected from different sources under varied recording conditions, for better generalization.
The dataset consists of 100 songs for each of 10 genres, namely:
1. Blues
2. Classical
3. Country
4. Disco
5. Hiphop
6. Jazz
7. Metal
8. Pop
9. Reggae
10. Rock
Each song is 30 seconds long, 22050 Hz mono 16-bit, in *.au format.
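As a concrete illustration, a sketch of iterating over the dataset, assuming the common genres/&lt;genre&gt;/&lt;genre&gt;.000XX.au directory layout (an assumption; adjust the root path to wherever the dataset is unpacked):

```python
# Iterate the GTZAN files genre by genre, yielding (signal, sample rate, label).
import os
import librosa

GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

def iter_dataset(root="genres"):
    for label, genre in enumerate(GENRES):
        genre_dir = os.path.join(root, genre)
        for fname in sorted(os.listdir(genre_dir)):
            if fname.endswith(".au"):
                y, sr = librosa.load(os.path.join(genre_dir, fname), sr=22050)
                yield y, sr, label
```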
Exploratory Visualization
Audio and songs can be represented as time series of frequencies. To explore the data, we visualize several representations:
1. Raw features: many audio features can be computed and visualized directly from the raw audio signal. The simplest are features of the pulse-code-modulated (PCM) digital signal itself.
2. Spectrogram: a spectrogram is a representation of the signal's frequencies along the time axis.
3. Log power spectrogram: the power spectral density distribution of the signal along the time axis, on a logarithmic scale.
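As an illustration of these plots, a minimal sketch using librosa and matplotlib (an assumption; the report does not show its plotting code, and the file path is hypothetical):

```python
# Plot the raw waveform and the log power spectrogram of one song.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("genres/classical/classical.00000.au", sr=22050)

plt.figure(figsize=(10, 6))
plt.subplot(2, 1, 1)
librosa.display.waveshow(y, sr=sr)   # raw PCM waveform (waveplot in older librosa)
plt.title("Wave Plot")

plt.subplot(2, 1, 2)
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log")
plt.title("Log Power Spectrogram")
plt.tight_layout()
plt.show()
```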
In the following figures we can clearly see that the waveforms, discrete Fourier transforms (DFT) and other signal features differ by genre. In this project we aim to learn exactly such patterns with machine learning algorithms. Even by eyeballing, different patterns can be observed in the signals, so it is evidently possible to learn patterns that discriminate between the genres.
Figure 1, Genre (Classical): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 2, Genre (Country): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 3, Genre (Disco): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 4, Genre (HipHop): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 5, Genre (Jazz): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 6, Genre (Metal): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 7, Genre (Pop): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 8, Genre (Reggae): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 9, Genre (Rock): Wave Plot, ZCR, FFT, Mel Power Spectrum
Figure 10, Genre (Blues): Wave Plot, ZCR, FFT, Mel Power Spectrum
Comparing a few genres on the FFT alone, we see that for classical music there is not much fluctuation in the frequencies throughout the song, whereas other genres show time-varying frequency fluctuations. The Mel spectrum of metal shows very high energy intensity, and its ZCR reflects the shrillness of the sound, with zero crossings occurring very frequently.
Metrics
While solving any problem, choosing the evaluation technique is an equally important task. Since genre classification is a classification problem and the accuracy of the system matters most to us, the metrics are defined as follows.
A clean and unambiguous way to present the prediction results of a classifier is a confusion matrix. For a binary classification problem, the table has 2 rows and 2 columns: across the top are the observed class labels and down the side are the predicted class labels. Each cell contains the number of predictions made by the classifier that fall into that cell.
Considering the following values of the confusion matrix for binary classification:
a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.
The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the following equation:

$$AC = \frac{a + d}{a + b + c + d}$$
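A minimal sketch of this metric with scikit-learn (an assumption; any library or hand-rolled count gives the same numbers). Note that scikit-learn puts observed labels on the rows and predictions on the columns, the transpose of the layout described above; accuracy is unaffected:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 1, 1, 0, 1]   # hypothetical observed binary labels
y_pred = [0, 1, 0, 0, 1]   # hypothetical predictions

# For binary labels, ravel() yields the a, b, c, d of the text:
# a correct negatives, b incorrect positives, c incorrect negatives, d correct positives
a, b, c, d = confusion_matrix(y_true, y_pred).ravel()
print("AC =", (a + d) / (a + b + c + d))        # the equation above
print("AC =", accuracy_score(y_true, y_pred))   # same value
```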
Algorithms and Techniques
This project takes a pattern-recognition approach to music classification using deep learning techniques. Logistic regression, a feed-forward neural network and a convolutional neural network have been evaluated against each other. The first step of such recognition systems is to extract features from the audio data, in order to work with more meaningful information and to reduce further processing. Indeed, once significant features are extracted, any classification scheme may be used. In the case of audio signals, features may relate to the main dimensions of music, including melody, harmony, rhythm, timbre and spatial location.
Feature Extraction
1. Zero-Crossing Rate (ZCR): defined as the number of zero crossings of the signal in the time domain; it is a measure of the noisiness of the signal and is correlated with pitch. Its computation is simple and has low computational complexity.
2. FFT coefficients: the feature vector is simply the vector of FFT coefficients.
a. The Fast Fourier Transform (FFT) is a fast algorithm for computing the discrete Fourier transform (DFT). Computed directly, the DFT requires O(N^2) operations: there are N outputs X_k, and each output requires a sum of N terms. An FFT is any method that computes the same results in O(N log N) operations.
DFT: let x_0, ..., x_{N-1} be complex numbers; then

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \ldots, N-1.$$
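A small numerical check of the two approaches, contrasting the direct O(N^2) DFT sum with numpy's O(N log N) FFT (an illustrative sketch; the signal frame is random):

```python
import numpy as np

x = np.random.randn(1024)      # a hypothetical signal frame
N = len(x)

# Direct DFT: X_k = sum_n x_n * exp(-2*pi*i*k*n/N), O(N^2)
n = np.arange(N)
k = n.reshape(-1, 1)
X_dft = (x * np.exp(-2j * np.pi * k * n / N)).sum(axis=1)

X_fft = np.fft.fft(x)          # fast algorithm, same result in O(N log N)
assert np.allclose(X_dft, X_fft)

magnitudes = np.abs(X_fft[:N // 2])   # FFT-magnitude feature vector
```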
3. Mel-Frequency Cepstrum Coefficients (MFCCs): perceptually motivated features obtained by taking the log-amplitude of the magnitude spectrum, warping the spectrum onto a perceptual frequency scale (the Mel frequency scale), and applying a discrete cosine transform to the Mel coefficients to decorrelate the resulting feature vector. Thirteen coefficients are commonly used for speech recognition.
4. Spectral Centroid: describes the centre of gravity of the power spectrum. It is a compact description of the shape of the power spectrum: it indicates whether the spectrum is dominated by low or high frequencies, and moreover it is correlated with an important perceptual dimension of timbre, i.e. sharpness. It is calculated as the weighted mean of the frequencies present in the signal, determined using a Fourier transform, with their magnitudes as the weights:
$$\mathrm{Centroid} = \frac{\sum_{n=0}^{N-1} f(n)\, x(n)}{\sum_{n=0}^{N-1} x(n)}$$
where x(n) represents the weighted frequency value, or magnitude, of bin number n, and f(n)
represents the centre frequency of that bin.
5. Spectral Roll-Off: a measure of spectral shape, defined as the frequency below which most of the power spectrum is concentrated (typically 85% of the spectral power).
6. Spectrum Spread: describes the second moment of the power spectrum. It is a compact descriptor of the shape of the power spectrum, indicating whether it is concentrated near its centroid or spread out over the spectrum; it allows distinguishing between tone-like and noise-like sounds.
7. Chroma: chroma features are an interesting and powerful representation of music audio in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave. Since, in music, notes exactly one octave apart are perceived as particularly similar, knowing the distribution of chroma even without the absolute frequency (i.e. the original octave) can give useful musical information about the audio, and may even reveal perceived musical similarity that is not apparent in the original spectra.
8. Tonnetz: Tonnetz can be used to show traditional harmonic relationships.
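A minimal sketch of extracting the features listed above with librosa (a plausible implementation, not the report's exact code; spectrum spread is computed here as librosa's spectral bandwidth):

```python
import numpy as np
import librosa

def extract_features(path):
    y, sr = librosa.load(path, sr=22050)
    stft = np.abs(librosa.stft(y))
    zcr      = librosa.feature.zero_crossing_rate(y)                 # feature 1
    mfccs    = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # feature 3
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)         # feature 4
    rolloff  = librosa.feature.spectral_rolloff(y=y, sr=sr)          # feature 5
    spread   = librosa.feature.spectral_bandwidth(y=y, sr=sr)        # feature 6
    chroma   = librosa.feature.chroma_stft(S=stft, sr=sr)            # feature 7
    tonnetz  = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)  # feature 8
    # Average each feature over time to obtain one fixed-length vector per song.
    return np.hstack([f.mean(axis=1) for f in
                      (zcr, mfccs, centroid, rolloff, spread, chroma, tonnetz)])
```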
Data Modelling
Figure 11, Schematic Logistic Regression classifier, Python Machine Learning by Sebastian Raschka
Neural networks are closely related to logistic regression: we can think of logistic regression as a one-layer neural network. It is very common to use logistic sigmoid functions as activation functions in the hidden layer of a neural network, as in the schematic above but without the threshold function. Therefore, we started building our classifier with logistic regression with a softmax layer, then tried deeper feed-forward neural networks, and finally built a deeper network with convolution layers, activation layers and fully connected layers to model more complex relationships, while keeping the model from overfitting by introducing regularization such as dropout.
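A minimal sketch of this softmax logistic-regression baseline in Keras (the framework is an assumption, as the report does not name one; the 193-dimensional input matches the first feed-forward entry in the results table below):

```python
from tensorflow import keras

n_features, n_classes = 193, 10

model = keras.Sequential([
    # One dense layer with softmax is exactly multiclass logistic regression.
    keras.layers.Dense(n_classes, activation="softmax",
                       input_shape=(n_features,)),
])
model.compile(optimizer="sgd",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```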
Benchmark
In Tao Feng's paper on deep learning for music genre classification [1], a 4-class classification model was implemented with a training accuracy of 94% and a test accuracy of 61%, using a restricted Boltzmann machine. We aim to develop a full 10-class classification system with some improvements.
That paper's deep neural network approach appears to overfit the 4-class task, since the training accuracy is good but the test accuracy is not. Here we compare a few other architectures with different sets of features, testing on all ten classes.
Methodology
Implementation
The following diagram explains the complete training process, from reading the audio files through feature extraction to feeding the Convolutional Neural Network for model training. A few steps are involved in building a song classifier:
1. Pre-processing: convert the audio wave into a vector of acoustic coefficients, extracting a new vector about every few milliseconds.
2. Acoustic feature extraction: extract FFT, MFCCs, spectral centroids, and other features.
3. Data preparation: stack all the features together into an array, applying zero padding wherever necessary to make all feature vectors equal in dimension, and one-hot encode the labels corresponding to each song.
4. Split the data into train and test datasets (a sketch of steps 3 and 4 follows this list).
5. Data modelling: fit the model that best describes the genre from the acoustic data.
a. Training the classifier using:
i. Simple logistic regression
ii. Feed-forward networks with different hidden layers
iii. A convolutional neural network
b. Model evaluation using accuracy
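A minimal sketch of steps 3 and 4 above (the split ratio and the dummy feature vectors are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def pad_to(v, length):
    # Zero-pad a 1-D feature vector up to a common length (step 3).
    return np.pad(v, (0, length - len(v)))

rng = np.random.default_rng(0)
feature_vectors = [rng.standard_normal(rng.integers(180, 194)) for _ in range(100)]
labels = rng.integers(0, 10, size=100)        # integer genre ids 0..9

max_len = max(len(v) for v in feature_vectors)
X = np.vstack([pad_to(v, max_len) for v in feature_vectors])
Y = np.eye(10)[labels]                        # one-hot encode the 10 genres

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)     # step 4
```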
Figure 12, Training pipeline: Input Song (*.mp3/*.au) -> Read File -> Features (DTFT/STFT, Mel Frequencies, MFCCs, Chromas, Tonnetz) -> Neural Network
Data Modelling
As this is a supervised learning classification problem, we initially tried a fairly simple model, multiclass logistic regression, then a feed-forward neural network with pairs of different acoustic features, and then implemented convolutions, with the notion of weight sharing across the different convolutions of the audio features, along with activation layers and fully connected layers.
Three modelling techniques have been tried:
1. Logistic Regression
2. Feed-Forward Neural Network
3. Convolutional Neural Network: rather than being limited to the known and well-researched audio features, there can be other levels of information/features that are difficult to find by hand, and a convolutional neural network plays an important role in finding such features. Because of resource constraints, we could not go very deep with the architecture (a sketch of the architecture appears below).
After training and evaluating a few models, the final model was chosen for use in the application to predict the genre of a song that a user uploads.
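A minimal Keras sketch of the convolutional architecture from row 6 of the results table below; reading "30x1x20" as a 1-D convolution with kernel size 30 and 20 filters, and the input length, are assumptions:

```python
from tensorflow import keras

n_frames = 128   # hypothetical length of the log mel-spectrogram time axis

model = keras.Sequential([
    keras.layers.Input(shape=(n_frames, 1)),
    keras.layers.Conv1D(filters=20, kernel_size=30, activation="relu"),
    keras.layers.Dropout(0.2),                     # regularization (see Refinements)
    keras.layers.Flatten(),
    keras.layers.Dense(280, activation="relu"),    # 280-node fully connected layer
    keras.layers.Dense(10, activation="softmax"),  # one output node per genre
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```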
Refinements:
While training the networks, there were instances where the model was overfitting. Initially, with the feed-forward neural network, we found that as we kept increasing the number of hidden layers, beyond a certain point the model starts overfitting; therefore, the number of hidden layers was limited to two.
A similar problem also arose with the convolutional neural network, so we introduced a regularization technique, dropout with a keep probability of 0.2, which keeps the model from overfitting.
When the same feature set (tonnetz, chroma, STFT, MFCC) that was used to train the feed-forward network was used to train the convolutional network, the model was underfitting; after experimenting with features, only the log power amplitude of the Mel spectrum is used with the CNN.
Results
Model Evaluation and Validation
Accuracy has been calculated on the testing set for each model, and the best model is chosen based on test-set accuracy alone.
Classification Accuracy:
#    Model                                                                     Test Accuracy
1.   Logistic Regression                                                       0.11
2.   Input layer -> 193-node fully connected layer -> 10 output nodes          0.64
3.   Input layer -> 280-node fully connected layer -> 10 output nodes          0.62
4.   Input layer -> 280-node fully connected layer -> 300-node hidden
     layer -> 10 output nodes                                                  0.67
5.   Input layer -> 280-node fully connected layer -> 300-node fully
     connected layer -> 300-node fully connected layer -> 10 output nodes      0.615
6.   Input layer -> Convolution (30x1x20) -> Dropout (0.2) -> 280-node
     fully connected layer -> 10 output nodes                                  0.72
The following graph shows that as we increase the number of training iterations, the loss keeps decreasing, and all the architectures follow the same pattern. We had to restrict training to 5000 iterations due to hardware constraints, although for a few architectures the loss may take more than 5000 iterations to saturate. The graph shows that the training error keeps decreasing with increasing model complexity, where we treat the number of iterations as one of the model complexity parameters.
Figure 13, Training Loss vs Training Iterations
Compared to Tao Feng's paper, whose four-genre classifier overfit (94% training accuracy against 61% test accuracy), we evaluated a 10-genre classifier that achieves a test accuracy of 72%, which is a better estimate of generalization.
Conclusions
In this project, many signal processing techniques have been used to extract useful features from signals, which were then used to model the classifier with deep neural network architectures. Convolutional neural networks showed a significant improvement, as expected, and generalize better than the approach in Tao Feng's paper. There were some challenges: initially the feed-forward neural networks were not giving good results, even after increasing the nodes and the hidden layers, but convnets solved these issues.
Convolutional neural networks have once again proved their strength over feed-forward neural networks in such high-dimensional classification problems, so their use in such areas can be explored even further. We conclude that the same neural network concepts can be applied across domains and are not restricted to images. Moreover, using many audio features is better than using just the MFCCs or just the chroma; using all the extracted features collectively provided good accuracy.
Comparing the different models shows the contrast between the architectures, CNN vs feed-forward NN. As it is difficult to engineer all the features by hand, convolutional neural networks help us learn features as we go deeper into the network, though the training time increases drastically.
Thus, instead of extracting different combinations of acoustic features and feeding them to a feed-forward neural network, a convolutional neural network with just the log amplitude of the Mel spectrum gave better performance. This supports the notion that using the right convolutions helps extract and select the features relevant to the problem, rather than using all the features.
Future Scope
Some more architectures and techniques can be explored:
1. Cross-validation techniques such as k-fold, which we could not use because of resource constraints (see the sketch after this list).
2. Hyperparameter tuning, which will definitely help to find the best classifier.
3. Using more features, and creating more data points by changing the pitch of the songs and other genre-independent properties, to make the model more generic.
4. Further regularization techniques, beyond the dropout used here, to avoid overfitting.
5. Recurrent neural networks with Long Short-Term Memory (LSTM) units and Gated Recurrent Units (GRUs), trained with a Connectionist Temporal Classification (CTC) loss.
6. Stacking convolutional neural networks with recurrent neural networks can be very useful: the convolutional network helps extract useful features, and the recurrent network models the temporal aspects of the signal.
7. Recently, attentional encoders have been used for text and images; similar architectures could be designed for audio.
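As an illustration of item 1, a minimal k-fold cross-validation sketch with scikit-learn; X, Y and build_model() are placeholders for the data and any of the Keras models sketched above:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, Y, build_model, k=5, epochs=50):
    scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True,
                                    random_state=42).split(X):
        model = build_model()                     # fresh model per fold
        model.fit(X[train_idx], Y[train_idx], epochs=epochs, verbose=0)
        _, acc = model.evaluate(X[val_idx], Y[val_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores)), float(np.std(scores))
```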
References
[1]. Feng, Tao. "Deep Learning for Music Genre Classification."
[2]. The Echo Nest - Powering Intelligent Digital Music Applications. The Intelligent Music Application Platform. Web. 07 May 2012. http://the.echonest.com/
[3]. Last.fm. Web. 07 May 2012. http://www.last.fm/
[4]. Lee, Rachel, Ryan Walker, Lisa Meeden, and James Marshall. "Category-Based Intrinsic Motivation." Proceedings of the Ninth International Conference on Epigenetic Robotics, 2009.
[5]. Fritzke, Bernd. "A Growing Neural Gas Learns Topologies." Advances in Neural Information Processing Systems 7, 1995.
[6]. Tzanetakis, George, and Perry Cook. "Musical Genre Classification of Audio Signals." IEEE Transactions on Speech and Audio Processing 10(5), July 2002.
[7]. Gjerdingen, Robert O., and David Perrot. "Scanning the Dial: The Rapid Recognition of Music Genres." Journal of New Music Research 37(2), 2008, pp. 93-100.
[8]. MathWorks Forum. https://www.mathworks.com/matlabcentral/newsreader/view_thread/18880?requestedDomain=www.mathworks.com