Music Genre Classification using Machine Learning Techniques
Hareesh Bahuleyan
University of Waterloo, ON, Canada
hpallika@uwaterloo.ca
Abstract
Categorizing music files according to their genre is a challenging task in the area of music information retrieval (MIR). In this study, we compare the performance of two classes of models. The first is a deep learning approach wherein a CNN model is trained end-to-end, to predict the genre label of an audio signal, solely using its spectrogram. The second approach utilizes hand-crafted features, both from the time domain and the frequency domain. We train four traditional machine learning classifiers with these features and compare their performance. The features that contribute the most towards this classification task are identified. The experiments are conducted on the Audio Set dataset and we report an AUC value of 0.894 for an ensemble classifier which combines the two proposed approaches.
1 Introduction
With the growth of online music databases and easy access to music content, people find it increasingly hard to manage the songs that they listen to. One way to categorize and organize songs is based on the genre, which is identified by some characteristics of the music such as rhythmic structure, harmonic content and instrumentation (Tzanetakis and Cook, 2002). Being able to automatically classify and provide tags to the music present in a user's library, based on genre, would be beneficial for audio streaming services such as Spotify and iTunes. This study explores the application of machine learning (ML) algorithms to identify and classify the genre of a given audio file. The code has been open-sourced and is available at https://github.com/HareeshBahuleyan/music-genre-classification. The first model described in this paper uses convolutional neural networks (Krizhevsky et al., 2012), trained end-to-end on the MEL spectrogram of the audio signal. In the second part of the study, we extract features both in the time domain and the frequency domain of the audio signal. These features are then fed to conventional machine learning models, namely Logistic Regression, Random Forests (Breiman, 2001), Gradient Boosting (Friedman, 2001) and Support Vector Machines, which are trained to classify the given audio file. The models are evaluated on the Audio Set dataset (Gemmeke et al., 2017). We compare the proposed models and also study the relative importance of different features.
The rest of this paper is organized as follows. Section 2 describes the existing methods in the literature for the task of music genre classification. Section 3 is an overview of the dataset used in this study and how it was obtained. The proposed models and the implementation details are discussed in Section 4. The results are reported in Section 5.2, followed by the conclusions from this study in Section 6.
2 Literature Review
Music genre classification has been a widely studied area of research since the early days of the Internet. Tzanetakis and Cook (2002) addressed this problem with supervised machine learning approaches such as Gaussian Mixture Models and k-nearest neighbour classifiers. They introduced 3 sets of features for this task, categorized as timbral structure, rhythmic content and pitch content. Hidden Markov Models (HMMs), which have been extensively used for speech recognition tasks, have also been explored for music genre classification (Scaringella and Zoia, 2005; Soltau et al., 1998). Support vector machines (SVMs) with different distance metrics are studied and compared in Mandel and Ellis (2005) for classifying genre.
In Lidy and Rauber (2005), the authors discuss the contribution of psycho-acoustic features for recognizing music genre, especially the importance of STFT taken on the Bark Scale (Zwicker and Fastl, 1999). Mel-frequency cepstral coefficients (MFCCs), spectral contrast and spectral roll-off were some of the features used by Tzanetakis and Cook (2002). A combination of visual and acoustic features is used to train SVM and AdaBoost classifiers in Nanni et al. (2016).
With the recent success of deep neural networks, a number of studies apply these techniques to speech and other forms of audio data (Abdel-Hamid et al., 2014; Gemmeke et al., 2017). Representing audio in the time domain for input to neural networks is not very straightforward because of the high sampling rate of audio signals. However, it has been addressed in Van Den Oord et al. (2016) for audio generation tasks. A common alternative representation is the spectrogram of a signal, which captures both time and frequency information. Spectrograms can be considered as images and used to train convolutional neural networks (CNNs) (Wyse, 2017). A CNN was developed to predict the music genre using the raw MFCC matrix as input in Li et al. (2010). In Lidy and Schindler (2016), a constant Q-transform (CQT) spectrogram was provided as input to the CNN to achieve the same task.
This work aims to provide a comparative study between 1) the deep learning based models which only require the spectrogram as input and 2) the traditional machine learning classifiers that need to be trained with hand-crafted features. We also investigate the relative importance of different features.
3 Dataset
In this work, we make use of Audio Set, which is a large-scale human-annotated database of sounds (Gemmeke et al., 2017). The dataset was created by extracting 10-second sound clips from a total of 2.1 million YouTube videos. The audio files have been annotated on the basis of an ontology which covers 527 classes of sounds, including musical instruments, speech, vehicle sounds, animal sounds and so on (https://research.google.com/audioset/ontology/index.html). This study requires only the audio files that belong to the music category, specifically those having one of the seven genre tags shown in Table 1.
Table 1: Number of instances in each genre class
   Genre            Count
1  Pop Music         8100
2  Rock Music        7990
3  Hip Hop Music     6958
4  Techno            6885
5  Rhythm Blues      4247
6  Vocal             3363
7  Reggae Music      2997
   Total            40540
The number of audio clips in each category has also been tabulated. The raw audio clips of these sounds have not been provided in the Audio Set data release. However, the data provides the YouTube ID of the corresponding videos, along with the start and end times. Hence, the first task is to retrieve these audio files. For the purpose of audio retrieval from YouTube, the following steps were carried out:
1. A command line program called youtube-dl (Gonzalez, 2006) was utilized to download the video in the mp4 format.
2. The mp4 files are converted into the desired wav format using an audio converter named ffmpeg (Tomar, 2006), a command line tool.
Each wav file is about 880 KB in size, which means that the total data used in this study is approximately 34 GB.
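For illustration, the two retrieval steps can be scripted together in Python. The following is a minimal sketch and an assumption about how this could look; the command-line flags, the trimming of the labelled segment and the output naming are illustrative and may differ from the released code.

```python
import subprocess

def download_clip(youtube_id, start_sec, end_sec, out_wav):
    """Fetch one Audio Set clip and convert it to wav (illustrative sketch)."""
    mp4_path = f"{youtube_id}.mp4"
    url = f"https://www.youtube.com/watch?v={youtube_id}"
    # Step 1: download the video as mp4 with youtube-dl
    subprocess.run(["youtube-dl", "-f", "mp4", "-o", mp4_path, url], check=True)
    # Step 2: trim to the labelled 10-second segment and convert to wav with ffmpeg
    subprocess.run(
        ["ffmpeg", "-y", "-i", mp4_path,
         "-ss", str(start_sec), "-to", str(end_sec),
         out_wav],
        check=True)

# Example call for a hypothetical Audio Set row (YouTube ID, start time, end time):
# download_clip("SOME_VIDEO_ID", 30.0, 40.0, "clip.wav")
```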
4 Methodology
This section provides the details of the data pre-
processing steps followed by the description of
the two proposed approaches to this classification
problem.
Figure 1: Sample spectrograms for 1 audio signal from each music genre
Figure 2: Convolutional neural network architecture (Image Source: Hvass Tensorflow Tutorials)
4.1 Data Pre-processing
In order to improve the Signal-to-Noise Ratio (SNR) of the signal, a pre-emphasis filter, given by Equation 1, is applied to the original audio signal:

y(t) = x(t) - \alpha \, x(t-1)    (1)

where x(t) refers to the original signal, y(t) refers to the filtered signal and α is set to 0.97. Such a pre-emphasis filter is useful to boost amplitudes at high frequencies (Kim and Stern, 2012).
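As an illustration, the filter in Equation 1 can be applied in a few lines of numpy. This is a minimal sketch, assuming the audio has already been loaded with librosa at the sampling rate used later in the paper; the file path is a placeholder.

```python
import numpy as np
import librosa

def pre_emphasis(signal, alpha=0.97):
    """Apply the pre-emphasis filter y(t) = x(t) - alpha * x(t-1)."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# Example usage (path is illustrative)
y, sr = librosa.load("clip.wav", sr=22050)
y_filtered = pre_emphasis(y, alpha=0.97)
```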
4.2 Deep Neural Networks
Using deep learning, we can achieve the task of music genre classification without the need for hand-crafted features. Convolutional neural networks (CNNs) have been widely used for the task of image classification (Krizhevsky et al., 2012). The 3-channel (RGB) matrix representation of an image is fed into a CNN, which is trained to predict the image class. In this study, the sound wave can be represented as a spectrogram, which in turn can be treated as an image (Nanni et al., 2016; Lidy and Schindler, 2016). The task of the CNN is to use the spectrogram to predict the genre label (one of seven classes).
4.2.1 Spectrogram Generation
A spectrogram is a 2D representation of a signal, having time on the x-axis and frequency on the y-axis. A colormap is used to quantify the magnitude of a given frequency within a given time window. In this study, each audio signal was converted into a MEL spectrogram (having MEL frequency bins on the y-axis). The parameters used to generate the power spectrogram using STFT are listed below:
Sampling rate (sr) = 22050
Frame/window size (n_fft) = 2048
Time advance between frames (hop_size) = 512 (resulting in 75% overlap)
Window function: Hann window
Frequency scale: MEL
Number of MEL bins: 96
Highest frequency (f_max) = sr/2
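A sketch of how such a spectrogram could be computed with librosa using the parameters above is shown below. The conversion to decibels at the end is an assumption made here to obtain an image-like dynamic range and is not stated in the paper.

```python
import numpy as np
import librosa

sr = 22050
n_fft = 2048
hop_length = 512
n_mels = 96

y, _ = librosa.load("clip.wav", sr=sr)   # path is illustrative
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
    window="hann", n_mels=n_mels, fmax=sr / 2, power=2.0)
# Convert the power spectrogram to decibels before treating it as an image
mel_db = librosa.power_to_db(mel, ref=np.max)
```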
4.2.2 Convolutional Neural Networks
From Figure 1, one can see that there exist characteristic patterns in the spectrograms of the audio signals belonging to different classes. Hence, spectrograms can be considered as 'images' and provided as input to a CNN, which has shown good performance on image classification tasks. Each block in a CNN consists of the following operations (see https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/ for an accessible overview):
Convolution: This step involves sliding a matrix filter (say of size 3x3) over the input image, which is of dimension image_width x image_height. The filter is first placed on the image matrix and then we compute an element-wise multiplication between the filter and the overlapping portion of the image, followed by a summation to give a feature value. We use many such filters, the values of which are 'learned' during the training of the neural network via backpropagation.
Pooling: This is a way to reduce the dimension of the feature map obtained from the convolution step, formally known as downsampling. For example, by max pooling with a 2x2 window size, we only retain the element with the maximum value among the 4 elements of the feature map that are covered in this window. We keep moving this window across the feature map with a pre-defined stride.
Non-linear Activation: The convolution operation is linear, and in order to make the neural network more powerful, we need to introduce some non-linearity. For this purpose, we can apply an activation function such as ReLU (https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) on each element of the feature map.
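As a minimal sketch, one such block (convolution, ReLU activation, max pooling) can be written in tf.keras as below. The filter count is an illustrative assumption and this is not the VGG-16 configuration.

```python
import tensorflow as tf

# One convolutional block: convolution -> ReLU activation -> max pooling.
conv_block = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding="same",
                           activation="relu", input_shape=(216, 216, 3)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2),
])
conv_block.summary()
```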
In this study, a CNN architecture known as VGG-16, which was the top performing model in the ImageNet Challenge 2014 (classification + localization task), was used (Simonyan and Zisserman, 2014). The model consists of 5 convolutional blocks (the conv base), followed by a set of densely connected layers, which outputs the probability that a given image belongs to each of the possible classes.
For the task of music genre classification using spectrograms, we download the model architecture with pre-trained weights and extract the conv base. The output of the conv base is then sent to a new feed-forward neural network, which in turn predicts the genre of the music, as depicted in Figure 2.
There are two possible settings while implementing the pre-trained model:
1. Transfer learning: The weights in the conv base are kept fixed, but the weights in the feed-forward network (represented by the yellow box in Figure 2) are allowed to be tuned to predict the correct genre label.
2. Fine tuning: In this setting, we start with the pre-trained weights of VGG-16, but allow all the model weights to be tuned during the training process.
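A sketch of how the two settings could be expressed with the Keras VGG-16 application is given below. The head attached here is a simplified version; the exact layer layout of the released code is not assumed.

```python
import tensorflow as tf

# Load the VGG-16 conv base with ImageNet weights, without the dense top layers.
conv_base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(216, 216, 3))

# Setting 1 (transfer learning): freeze the conv base, train only the new head.
conv_base.trainable = False
# Setting 2 (fine tuning): start from the same weights but leave everything trainable.
# conv_base.trainable = True

model = tf.keras.Sequential([
    conv_base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(7, activation="softmax"),   # seven genre classes
])
```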
The final layer of the neural network outputs the class probabilities (using the softmax activation function) for each of the seven possible class labels. Next, the cross-entropy loss is computed as follows:

L = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (2)

where M is the number of classes; y_{o,c} is a binary indicator whose value is 1 if observation o belongs to class c and 0 otherwise; p_{o,c} is the model's predicted probability that observation o belongs to class c. This loss is used to backpropagate the error, compute the gradients and thereby update the weights of the network. This iterative process continues until the loss converges to a minimum value.
4.2.3 Implementation Details
Figure 3: Learning curves, (a) accuracy and (b) loss, used for model selection; epoch 4 has the minimum validation loss and the highest validation accuracy

The spectrogram images have a dimension of 216 x 216. For the feed-forward network connected to the conv base, a 512-unit hidden layer is implemented. Over-fitting is a common issue in neural networks. In order to prevent this, two strategies are adopted:
1. L2 Regularization (Ng, 2004): The term (λ/2) Σ_i w_i^2 is added to the loss function of the neural network, where w refers to the weights in the neural network. This method is used to penalize excessively high weights. We would like the weights to be diffused across all model parameters, and not concentrated among just a few parameters. Also, intuitively, smaller weights correspond to a less complex model, thereby avoiding over-fitting. λ is set to a value of 0.001 in this study.
2. Dropout (Srivastava et al., 2014): This is a regularization mechanism in which we randomly switch off some of the neurons (set their outputs to zero) during training. In each iteration, we thereby use a different combination of neurons to predict the final output. This makes the model generalize without any heavy dependence on a subset of the neurons. A dropout rate of 0.3 is used, which means that a given neuron is dropped during an iteration with a probability of 0.3.
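Putting the two strategies together, the following is a hedged sketch of the classification head described above; the conv-base output shape used for the Flatten layer is an assumption for a 216 x 216 input, and the compilation settings mirror the training description later in this section.

```python
import tensorflow as tf

# Classification head with the two regularization strategies described above:
# an L2 weight penalty (lambda = 0.001) on the dense layer and dropout with rate 0.3.
head = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(6, 6, 512)),   # conv-base output shape is an assumption
    tf.keras.layers.Dense(512, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(7, activation="softmax"),
])
head.compile(optimizer=tf.keras.optimizers.Adam(),
             loss="categorical_crossentropy",
             metrics=["accuracy"])
```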
The dataset is randomly split into train (90%), validation (5%) and test (5%) sets. The same split is used for all experiments to ensure a fair comparison of the proposed models.
The neural networks are implemented in Python using TensorFlow (http://tensorflow.org/); an NVIDIA Titan X GPU was utilized for faster processing. All models were trained for 10 epochs with a batch size of 32, using the ADAM optimizer (Kingma and Ba, 2014). One epoch refers to one iteration over the entire training dataset.
Figure 3 shows the learning curves: the loss (which is being optimized) keeps decreasing as the training progresses. Although the training accuracy keeps increasing, the validation accuracy first increases and, after a certain number of epochs, starts to decrease. This shows the model's tendency to overfit on the training data. The model that is selected for evaluation purposes is the one that has the highest accuracy and lowest loss on the validation set (epoch 4 in Figure 3).
4.2.4 Baseline Feed-forward Neural Network
To assess the performance improvement that can be achieved by the CNNs, we also train a baseline feed-forward neural network that takes as input the same spectrogram image. The image, which is a 2-dimensional array of pixel values, is unwrapped or flattened into a 1-dimensional vector. Using this vector, a simple 2-layer neural network is trained to predict the genre of the audio signal. The first hidden layer consists of 512 units and the second layer has 32 units, followed by the output layer. The activation function used is ReLU and the same regularization techniques described in Section 4.2.3 are adopted.
4.3 Manually Extracted Features
In this section, we describe the second category of proposed models, namely the ones that require hand-crafted features to be fed into a machine learning classifier. Features can be broadly classified as time domain and frequency domain features. The feature extraction was done using librosa (https://librosa.github.io/), a Python library.
4.3.1 Time Domain Features
These are features which were extracted from the raw audio signal.
1. Central moments: This consists of the mean, standard deviation, skewness and kurtosis of the amplitude of the signal.
2. Zero Crossing Rate (ZCR): A zero crossing point refers to one where the signal changes sign from positive to negative (Gouyon et al., 2000). The entire 10-second signal is divided into smaller frames, and the number of zero-crossings present in each frame is determined. The frame length is chosen to be 2048 points with a hop size of 512 points. Note that these frame parameters have been used consistently across all features discussed in this section. Finally, the average and standard deviation of the ZCR across all frames are chosen as representative features.
3. Root Mean Square Energy (RMSE): The energy in a signal is calculated as:

\sum_{n=1}^{N} |x(n)|^2    (3)

Further, the root mean square value can be computed as:

\sqrt{\frac{1}{N} \sum_{n=1}^{N} |x(n)|^2}    (4)

RMSE is calculated frame by frame and then we take the average and standard deviation across all frames.
4. Tempo: In general terms, tempo refers to how fast or slow a piece of music is; it is expressed in terms of Beats Per Minute (BPM). Intuitively, different kinds of music would have different tempos. Since the tempo of the audio piece can vary with time, we aggregate it by computing the mean across several frames. The functionality in librosa first computes a tempogram following Grosche et al. (2010) and then estimates a single value for tempo.
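A sketch of how these time domain features might be extracted with librosa and scipy is given below. Function names follow recent librosa versions (older releases call the RMS function rmse, and the tempo estimator has since moved modules), and the aggregation choices mirror the description above rather than the released code.

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

frame_length, hop_length = 2048, 512
y, sr = librosa.load("clip.wav", sr=22050)     # path is illustrative

# 1. Central moments of the raw amplitude
moments = [np.mean(y), np.std(y), skew(y), kurtosis(y)]

# 2. Zero crossing rate per frame, aggregated by mean and standard deviation
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                         hop_length=hop_length)[0]
zcr_stats = [np.mean(zcr), np.std(zcr)]

# 3. Root mean square energy per frame, aggregated the same way
rms = librosa.feature.rms(y=y, frame_length=frame_length,
                          hop_length=hop_length)[0]
rms_stats = [np.mean(rms), np.std(rms)]

# 4. A single tempo estimate in beats per minute
tempo = librosa.beat.tempo(y=y, sr=sr, hop_length=hop_length)[0]

time_domain_features = np.hstack([moments, zcr_stats, rms_stats, tempo])
```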
4.3.2 Frequency Domain Features
The audio signal can be transformed into the frequency domain by using the Fourier Transform. We then extract the following features.
1. Mel-Frequency Cepstral Coefficients (MFCC): Introduced in the early 1990s by Davis and Mermelstein, MFCCs have been very useful features for tasks such as speech recognition (Davis and Mermelstein, 1990). First, the Short-Time Fourier Transform (STFT) of the signal is taken with n_fft = 2048, hop_size = 512 and a Hann window. Next, we compute the power spectrum and then apply the triangular MEL filter bank, which mimics the human perception of sound. This is followed by taking the discrete cosine transform of the logarithm of all filterbank energies, thereby obtaining the MFCCs. The parameter n_mels, which corresponds to the number of filter banks, was set to 20 in this study.
2. Chroma Features: This is a vector which corresponds to the total energy of the signal in each of the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, B) (Ellis, 2007). The chroma vectors are then aggregated across the frames to obtain a representative mean and standard deviation.
3. Spectral Centroid: For each frame, this corresponds to the frequency around which most of the energy is centered (Tjoa, 2017). It is a magnitude-weighted frequency calculated as:

f_c = \frac{\sum_k S(k) f(k)}{\sum_k S(k)}    (5)

where S(k) is the spectral magnitude of frequency bin k and f(k) is the frequency corresponding to bin k.
4. Spectral Bandwidth: The p-th order spectral bandwidth corresponds to the p-th order moment about the spectral centroid (Tjoa, 2017) and is calculated as:

\left[ \sum_k S(k) \, (f(k) - f_c)^p \right]^{1/p}    (6)

For example, p = 2 is analogous to a weighted standard deviation.
5. Spectral Contrast: Each frame is divided into a pre-specified number of frequency bands, and within each frequency band, the spectral contrast is calculated as the difference between the maximum and minimum magnitudes (Jiang et al., 2002).
6. Spectral Roll-off: This feature corresponds to the value of the frequency below which 85% (this threshold can be defined by the user) of the total energy in the spectrum lies (Tjoa, 2017).
For each of the spectral features described above, the mean and standard deviation of the values taken across frames are considered as the representative final features that are fed to the model, as in the sketch below.
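The following is a sketch of the frequency domain feature extraction and the mean/standard-deviation aggregation described above; the helper function and the exact concatenation order are illustrative assumptions.

```python
import numpy as np
import librosa

def agg(feature_matrix):
    """Mean and standard deviation of a per-frame feature, taken across frames."""
    return np.hstack([np.mean(feature_matrix, axis=1), np.std(feature_matrix, axis=1)])

y, sr = librosa.load("clip.wav", sr=22050)       # path is illustrative
kwargs = dict(n_fft=2048, hop_length=512)

frequency_domain_features = np.hstack([
    agg(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, **kwargs)),
    agg(librosa.feature.chroma_stft(y=y, sr=sr, **kwargs)),
    agg(librosa.feature.spectral_centroid(y=y, sr=sr, **kwargs)),
    agg(librosa.feature.spectral_bandwidth(y=y, sr=sr, **kwargs)),
    agg(librosa.feature.spectral_contrast(y=y, sr=sr, **kwargs)),
    agg(librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85, **kwargs)),
])
```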
The features described in this section are used to train the machine learning algorithms (refer to Section 4.4). The features that contribute the most in achieving a good classification performance will be identified and reported.
4.4 Classifiers
This section provides a brief overview of the four machine learning classifiers adopted in this study; a sketch of how they might be instantiated follows the list.
1. Logistic Regression (LR): This linear classifier is generally used for binary classification tasks. For this multi-class classification task, the LR is implemented as a one-vs-rest method. That is, 7 separate binary classifiers are trained. During test time, the class with the highest probability from among the 7 classifiers is chosen as the predicted class.
2. Random Forest (RF): Random Forest is an ensemble learner that combines the predictions from a pre-specified number of decision trees. It works on the integration of two main principles: 1) each decision tree is trained with only a subset of the training samples, which is known as bootstrap aggregation (or bagging) (Breiman, 1996); 2) each decision tree is required to make its prediction using only a random subset of the features (Amit and Geman, 1997). The final predicted class of the RF is determined based on the majority vote from the individual classifiers.
3. Gradient Boosting (XGB): Boosting is another ensemble classifier that is obtained by combining a number of weak learners (such as decision trees). However, unlike RFs, boosting algorithms are trained in a sequential manner using forward stagewise additive modelling (Hastie et al., 2001). During the early iterations, the decision trees learnt are fairly simple. As training progresses, the classifier becomes more powerful because it is made to focus on the instances where the previous learners made errors. At the end of training, the final prediction is a weighted linear combination of the outputs from the individual learners. XGB refers to eXtreme Gradient Boosting, which is an implementation of boosting that supports training the model in a fast and parallelized manner.
4. Support Vector Machines (SVM): SVMs transform the original input data into a high dimensional space using a kernel trick (Cortes and Vapnik, 1995). The transformed data can be linearly separated using a hyperplane, and the optimal hyperplane maximizes the margin. In this study, a radial basis function (RBF) kernel is used to train the SVM, because such a kernel is required to address this non-linear problem. Similar to the logistic regression setting discussed above, the SVM is also implemented as a one-vs-rest classifier.
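The sketch below shows one plausible way to instantiate the four classifiers with scikit-learn and xgboost; the hyperparameters are illustrative assumptions, not the settings used in the paper.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# X_train / y_train would hold the hand-crafted feature vectors and genre labels.
classifiers = {
    "LR": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=200),
    "SVM": OneVsRestClassifier(SVC(kernel="rbf", probability=True)),
    "XGB": XGBClassifier(),
}

# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)
#     probs = clf.predict_proba(X_test)   # class probabilities for evaluation and ensembling
```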
5 Evaluation
5.1 Metrics
In order to evaluate the performance of the models
described in Section 4, the following metrics will
be used.
Accuracy: Refers to the percentage of correctly classified test samples.
F-score: Based on the confusion matrix, it is possible to calculate the precision and recall. The F-score (https://en.wikipedia.org/wiki/F1_score) is then computed as the harmonic mean of precision and recall.
AUC: The area under the receiver operating characteristic (ROC) curve is a common way to judge the performance of a multi-class classification system. The ROC is a graph of the true positive rate against the false positive rate. A baseline model which randomly predicts each class label with equal probability would have an AUC of 0.5; hence, the system being designed is expected to have an AUC higher than 0.5.

Table 2: Comparison of performance of the models on the test set

Model                              Accuracy  F-score  AUC
Spectrogram-based models
  VGG-16 CNN Transfer Learning     0.63      0.61     0.891
  VGG-16 CNN Fine Tuning           0.64      0.61     0.889
  Feed-forward NN baseline         0.43      0.33     0.759
Feature Engineering based models
  Logistic Regression (LR)         0.53      0.47     0.822
  Random Forest (RF)               0.54      0.48     0.840
  Support Vector Machines (SVM)    0.57      0.52     0.856
  Extreme Gradient Boosting (XGB)  0.59      0.55     0.865
Ensemble Classifiers
  VGG-16 CNN + XGB                 0.65      0.62     0.894
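For reference, these three metrics can be computed with scikit-learn as sketched below; the macro averaging and the one-vs-rest AUC variant are assumptions, since the paper does not state which variants were used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_prob):
    """y_true: integer genre labels; y_prob: (n_samples, 7) predicted class probabilities."""
    y_pred = np.argmax(y_prob, axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f_score": f1_score(y_true, y_pred, average="macro"),
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
    }
```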
5.2 Results and Discussion
In this section, the different modelling approaches discussed in Section 4 are evaluated based on the metrics described in Section 5.1. The values are reported in Table 2.
The best performance in terms of all metrics is observed for the convolutional neural network model based on VGG-16 that uses only the spectrogram to predict the music genre. It was expected that the fine tuning setting, which additionally allows the convolutional base to be trainable, would enhance the CNN model when compared to the transfer learning setting. However, as shown in Table 2, the experimental results show that there is no significant difference between transfer learning and fine tuning. The baseline feed-forward neural network that uses the unrolled pixel values from the spectrogram performs poorly on the test set. This shows that CNNs can significantly improve the scores on such an image classification task.
Among the models that use manually crafted features, the one with the weakest performance is the logistic regression model. This is expected, since logistic regression is a linear classifier. SVMs outperform random forests in terms of accuracy. However, the XGB implementation of the gradient boosting algorithm performs the best among the feature-engineered methods.
5.2.1 Most Important Features
In this section, we investigate which features contribute the most during prediction in this classification task. To carry out this experiment, we chose the XGB model, based on the results discussed in the previous section. We rank the top 20 most useful features based on a scoring metric (Figure 4). The metric is calculated as the number of times a given feature is used as a decision node among the individual decision trees that form the gradient boosting predictor.
As can be observed from Figure 4, Mel-Frequency Cepstral Coefficients (MFCCs) appear the most among the important features. Previous studies have reported MFCCs to improve the performance of speech recognition systems (Ittichaichareon et al., 2012). Our experiments show that MFCCs contribute significantly to this task of music genre classification as well. The mean and standard deviation of the spectral contrasts at different frequency bands are also important features. The music tempo, calculated in terms of beats per minute, also appears in the top 20 useful features.
Next, we study how much performance, in terms of AUC and accuracy, can be obtained by using only the top N features while training the model. From Table 3 it can be seen that with only the top 10 features, the model performance is surprisingly good. In comparison to the full model, which has 97 features, the model with the top 30 features has only a marginally lower performance (2 points on the AUC metric and 4 points on the accuracy metric).

Figure 4: Relative importance of features in the XGBoost model; the top 20 most contributing features are displayed
Table 3: Ablation study: comparing XGB performance keeping only the top N features

N    AUC    Accuracy
10   0.803  0.47
20   0.837  0.52
30   0.845  0.55
97   0.865  0.59
The final experiment in this section is a comparison of the time domain and frequency domain features listed in Section 4.3. Two XGB models were trained, one with only time domain features and the other with only frequency domain features. Table 4 compares the results in terms of AUC and accuracy. This experiment further confirms that frequency domain features are considerably more informative than time domain features when it comes to modelling audio for machine learning tasks.

Table 4: Comparison of time domain features and frequency domain features

Model                   AUC    Accuracy
Time domain only        0.731  0.40
Frequency domain only   0.857  0.57
Both                    0.865  0.59
5.2.2 Confusion Matrix
A confusion matrix is a tabular representation which enables us to further understand the strengths and weaknesses of our model. Element a_ij in the matrix refers to the number of test instances of class i that the model predicted as class j. The diagonal elements a_ii correspond to the correct predictions. Figure 5 compares the confusion matrices of the best performing CNN model and XGB, the best model among the feature-engineered classifiers. Both models seem to be good at predicting the class 'Rock'. However, many instances of class 'Hip Hop' are often confused with class 'Pop' and vice-versa. Such behaviour is expected when the genres of music are very close. Some songs may fall into multiple genres, so much so that it may be difficult even for humans to recognize the exact genre.

Figure 5: Confusion matrices of the best performing models: (a) VGG-16 CNN transfer learning, (b) extreme gradient boosting, (c) ensemble model
5.2.3 Ensemble Classifier
Ensembling is a commonly adopted practice in machine learning wherein the results from different classifiers are combined. This is done either by majority voting or by averaging scores/probabilities. Such an ensembling scheme, which combines the prediction powers of different classifiers, makes the overall system more robust. In our case, each classifier outputs a prediction probability for each of the class labels. Hence, averaging the predicted probabilities from the different classifiers is a straightforward way to do ensemble learning.
The methodologies described in Sections 4.2 and 4.4 use very different sources of input, the spectrograms and the hand-crafted features respectively. Hence, it makes sense to combine the models via ensembling. In this study, the best CNN model, namely VGG-16 transfer learning, is ensembled with XGBoost, the best feature-engineered model, by averaging the predicted probabilities. As shown in Table 2, this ensembling is beneficial and is observed to outperform all individual classifiers. The ROC curve for the ensemble model is above that of VGG-16 fine tuning and XGBoost, as illustrated in Figure 6.

Figure 6: ROC curves for the best performing models and their ensemble
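The probability-averaging step itself is simple; the following is a minimal sketch, assuming each model exposes per-class probabilities for the test set as arrays of shape (n_samples, 7).

```python
import numpy as np

def ensemble_predict(cnn_probs, xgb_probs):
    """Average the class probabilities of the two models and pick the most likely class."""
    avg_probs = (cnn_probs + xgb_probs) / 2.0
    return avg_probs, np.argmax(avg_probs, axis=1)
```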
6 Conclusion
In this work, the task of music genre classification is studied using the Audio Set data. We propose two different approaches to solving this problem. The first involves generating a spectrogram of the audio signal and treating it as an image; a CNN-based image classifier, namely VGG-16, is trained on these images to predict the music genre solely from the spectrogram. The second approach consists of extracting time domain and frequency domain features from the audio signals, followed by training traditional machine learning classifiers on these features. XGBoost was determined to be the best feature-based classifier, and the most important features were also reported. The CNN-based deep learning models were shown to outperform the feature-engineered models. We also show that ensembling the CNN and XGBoost models proved to be beneficial. It is to be noted that the dataset used in this study consists of audio clips from YouTube videos, which are in general very noisy. Future studies can identify ways to pre-process this noisy data before feeding it into a machine learning model, in order to achieve better performance.
References
Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui
Jiang, Li Deng, Gerald Penn, and Dong Yu. 2014.
Convolutional neural networks for speech recogni-
tion. IEEE/ACM Transactions on audio, speech, and
language processing 22(10):1533–1545.
Yali Amit and Donald Geman. 1997. Shape quantiza-
tion and recognition with randomized trees. Neural
computation 9(7):1545–1588.
Leo Breiman. 1996. Bagging predictors. Machine
learning 24(2):123–140.
Leo Breiman. 2001. Random forests. Machine learn-
ing 45(1):5–32.
Corinna Cortes and Vladimir Vapnik. 1995. Support-
vector networks. Machine learning 20(3):273–297.
Steven B Davis and Paul Mermelstein. 1990. Compar-
ison of parametric representations for monosyllabic
word recognition in continuously spoken sentences.
In Readings in speech recognition, Elsevier, pages
65–74.
Dan Ellis. 2007. Chroma feature analysis and synthe-
sis. Resources of Laboratory for the Recognition
and Organization of Speech and Audio-LabROSA .
Jerome H Friedman. 2001. Greedy function approx-
imation: a gradient boosting machine. Annals of
statistics pages 1189–1232.
Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman,
Aren Jansen, Wade Lawrence, R Channing Moore,
Manoj Plakal, and Marvin Ritter. 2017. Audio set:
An ontology and human-labeled dataset for audio
events. In Acoustics, Speech and Signal Processing
(ICASSP), 2017 IEEE International Conference on.
IEEE, pages 776–780.
Ricardo Garcia Gonzalez. 2006. Youtube-dl: Download videos from youtube.com.
Fabien Gouyon, François Pachet, Olivier Delerue, et al.
2000. On the use of zero-crossing rate for an ap-
plication of classification of percussive sounds. In
Proceedings of the COST G-6 conference on Digital
Audio Effects (DAFX-00), Verona, Italy.
Peter Grosche, Meinard Müller, and Frank Kurth. 2010.
Cyclic tempogram: a mid-level tempo representation
for music signals. In Acoustics Speech and Sig-
nal Processing (ICASSP), 2010 IEEE International
Conference on. IEEE, pages 5522–5525.
Trevor Hastie, Robert Tibshirani, and Jerome Fried-
man. 2001. The elements of statistical learning.
Chadawan Ittichaichareon, Siwat Suksri, and
Thaweesak Yingthawornsuk. 2012. Speech
recognition using mfcc. In International Con-
ference on Computer Graphics, Simulation and
Modeling (ICGSM’2012) July. pages 28–29.
Dan-Ning Jiang, Lie Lu, Hong-Jiang Zhang, Jian-Hua
Tao, and Lian-Hong Cai. 2002. Music type classi-
fication by spectral contrast feature. In Multimedia
and Expo, 2002. ICME’02. Proceedings. 2002 IEEE
International Conference on. IEEE, volume 1, pages
113–116.
Chanwoo Kim and Richard M Stern. 2012. Power-
normalized cepstral coefficients (pncc) for robust
speech recognition. In Acoustics, Speech and Sig-
nal Processing (ICASSP), 2012 IEEE International
Conference on. IEEE, pages 4101–4104.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980 .
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hin-
ton. 2012. Imagenet classification with deep con-
volutional neural networks. In Advances in neural
information processing systems. pages 1097–1105.
Tom LH Li, Antoni B Chan, and A Chun. 2010. Auto-
matic musical pattern feature extraction using con-
volutional neural network. In Proc. Int. Conf. Data
Mining and Applications.
Thomas Lidy and Andreas Rauber. 2005. Evaluation
of feature extractors and psycho-acoustic transfor-
mations for music genre classification. In ISMIR.
pages 34–41.
Thomas Lidy and Alexander Schindler. 2016. Parallel
convolutional neural networks for music genre and
mood classification. MIREX2016 .
Michael I Mandel and Dan Ellis. 2005. Song-level fea-
tures and support vector machines for music classi-
fication. In ISMIR. volume 2005, pages 594–599.
Loris Nanni, Yandre MG Costa, Alessandra Lumini,
Moo Young Kim, and Seung Ryul Baek. 2016.
Combining visual and acoustic features for music
genre classification. Expert Systems with Applica-
tions 45:108–117.
Andrew Y Ng. 2004. Feature selection, l 1 vs. l 2 regu-
larization, and rotational invariance. In Proceedings
of the twenty-first international conference on Ma-
chine learning. ACM, page 78.
Nicolas Scaringella and Giorgio Zoia. 2005. On the
modeling of time information for automatic genre
recognition systems in audio signals. In ISMIR.
pages 666–671.
Karen Simonyan and Andrew Zisserman. 2014. Very
deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556 .
Hagen Soltau, Tanja Schultz, Martin Westphal, and
Alex Waibel. 1998. Recognition of music types. In
Acoustics, Speech and Signal Processing, 1998. Pro-
ceedings of the 1998 IEEE International Conference
on. IEEE, volume 2, pages 1137–1140.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,
Ilya Sutskever, and Ruslan Salakhutdinov. 2014.
Dropout: A simple way to prevent neural networks
from overfitting. The Journal of Machine Learning
Research 15(1):1929–1958.
Steve Tjoa. 2017. Music information retrieval.
https://musicinformationretrieval.com/spectral_features.html. Accessed: 2018-02-20.
Suramya Tomar. 2006. Converting video formats with
ffmpeg. Linux Journal 2006(146):10.
George Tzanetakis and Perry Cook. 2002. Musical
genre classification of audio signals. IEEE Trans-
actions on speech and audio processing 10(5):293–
302.
Aaron Van Den Oord, Sander Dieleman, Heiga Zen,
Karen Simonyan, Oriol Vinyals, Alex Graves,
Nal Kalchbrenner, Andrew Senior, and Koray
Kavukcuoglu. 2016. Wavenet: A generative model
for raw audio. arXiv preprint arXiv:1609.03499 .
Lonce Wyse. 2017. Audio spectrogram representations
for processing with convolutional neural networks.
arXiv preprint arXiv:1706.09559 .
E Zwicker and H Fastl. 1999. Psychoacoustics: Facts
and models.