Music Genre Classification using Machine Learning Techniques
Hareesh Bahuleyan
University of Waterloo, ON, Canada
hpallika@uwaterloo.ca
Abstract
Categorizing music files according to their genre is a challenging task in the area of music information retrieval (MIR). In this study, we compare the performance of two classes of models. The first is a deep learning approach wherein a CNN model is trained end-to-end, to predict the genre label of an audio signal, solely using its spectrogram. The second approach utilizes hand-crafted features, both from the time domain and the frequency domain. We train four traditional machine learning classifiers with these features and compare their performance. The features that contribute the most towards this classification task are identified. The experiments are conducted on the Audio Set dataset and we report an AUC value of 0.894 for an ensemble classifier which combines the two proposed approaches.
1 Introduction
With the growth of online music databases and easy access to music content, people find it increasingly hard to manage the songs that they listen to. One way to categorize and organize songs is based on the genre, which is identified by some characteristics of the music such as rhythmic structure, harmonic content and instrumentation (Tzanetakis and Cook, 2002). Being able to automatically classify and provide tags to the music present in a user's library, based on genre, would be beneficial for audio streaming services such as Spotify and iTunes. This study explores the application of machine learning (ML) algorithms to identify and classify the genre of a given audio file. The code has been open-sourced and is available at https://github.com/HareeshBahuleyan/music-genre-classification. The first model described in this paper uses convolutional neural networks (Krizhevsky et al., 2012), trained end-to-end on the MEL spectrogram of the audio signal. In the second part of the study, we extract features both in the time domain and the frequency domain of the audio signal. These features are then fed to conventional machine learning models, namely Logistic Regression, Random Forests (Breiman, 2001), Gradient Boosting (Friedman, 2001) and Support Vector Machines, which are trained to classify the given audio file. The models are evaluated on the Audio Set dataset (Gemmeke et al., 2017). We compare the proposed models and also study the relative importance of different features.
The rest of this paper is organized as follows. Section 2 describes the existing methods in the literature for the task of music genre classification. Section 3 is an overview of the dataset used in this study and how it was obtained. The proposed models and the implementation details are discussed in Section 4. The results are reported in Section 5.2, followed by the conclusions from this study in Section 6.
2 Literature Review
Music genre classification has been a widely studied area of research since the early days of the Internet. Tzanetakis and Cook (2002) addressed this problem with supervised machine learning approaches such as Gaussian Mixture Models and k-nearest neighbour classifiers. They introduced 3 sets of features for this task, categorized as timbral structure, rhythmic content and pitch content. Hidden Markov Models (HMMs), which have been extensively used for speech recognition tasks, have also been explored for music genre classification (Scaringella and Zoia, 2005; Soltau et al., 1998). Support vector machines (SVMs) with different distance metrics are studied and compared in Mandel and Ellis (2005) for classifying genre.
In Lidy and Rauber (2005), the authors discuss the contribution of psycho-acoustic features for recognizing music genre, especially the importance of STFT taken on the Bark Scale (Zwicker and Fastl, 1999). Mel-frequency cepstral coefficients (MFCCs), spectral contrast and spectral roll-off were some of the features used by Tzanetakis and Cook (2002). A combination of visual and acoustic features is used to train SVM and AdaBoost classifiers in Nanni et al. (2016).
With the recent success of deep neural networks, a number of studies apply these techniques to speech and other forms of audio data (Abdel-Hamid et al., 2014; Gemmeke et al., 2017). Representing audio in the time domain for input to neural networks is not very straightforward because of the high sampling rate of audio signals. However, it has been addressed in Van Den Oord et al. (2016) for audio generation tasks. A common alternative representation is the spectrogram of a signal, which captures both time and frequency information. Spectrograms can be considered as images and used to train convolutional neural networks (CNNs) (Wyse, 2017). A CNN was developed to predict the music genre using the raw MFCC matrix as input in Li et al. (2010). In Lidy and Schindler (2016), a constant Q-transform (CQT) spectrogram was provided as input to the CNN to achieve the same task.
This work aims to provide a comparative study between 1) the deep learning based models which only require the spectrogram as input and 2) the traditional machine learning classifiers that need to be trained with hand-crafted features. We also investigate the relative importance of different features.
3 Dataset
In this work, we make use of Audio Set, which is a large-scale human-annotated database of sounds (Gemmeke et al., 2017). The dataset was created by extracting 10-second sound clips from a total of 2.1 million YouTube videos. The audio files have been annotated on the basis of an ontology which covers 527 classes of sounds, including musical instruments, speech, vehicle sounds, animal sounds and so on (https://research.google.com/audioset/ontology/index.html). This study requires only the audio files that belong to the music category, specifically those having one of the seven genre tags shown in Table 1.
Table 1: Number of instances in each genre class
   Genre            Count
1  Pop Music         8100
2  Rock Music        7990
3  Hip Hop Music     6958
4  Techno            6885
5  Rhythm Blues      4247
6  Vocal             3363
7  Reggae Music      2997
   Total            40540
The number of audio clips in each category has also been tabulated. The raw audio clips of these sounds have not been provided in the Audio Set data release. However, the data provides the YouTube ID of the corresponding videos, along with the start and end times. Hence, the first task is to retrieve these audio files. For the purpose of audio retrieval from YouTube, the following steps were carried out:
1. A command line program called youtube-dl (Gonzalez, 2006) was utilized to download the video in the mp4 format.
2. The mp4 files are converted into the desired wav format using an audio converter named ffmpeg (Tomar, 2006), a command line tool.
Each wav file is about 880 KB in size, which means that the total data used in this study is approximately 34 GB.
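For illustration, the two retrieval steps can be scripted together in Python. The following is a minimal sketch and an assumption about how this could look; the command-line flags, the trimming of the labelled segment and the output naming are illustrative and may differ from the released code.

```python
import subprocess

def download_clip(youtube_id, start_sec, end_sec, out_wav):
    """Fetch one Audio Set clip and convert it to wav (illustrative sketch)."""
    mp4_path = f"{youtube_id}.mp4"
    url = f"https://www.youtube.com/watch?v={youtube_id}"
    # Step 1: download the video as mp4 with youtube-dl
    subprocess.run(["youtube-dl", "-f", "mp4", "-o", mp4_path, url], check=True)
    # Step 2: trim to the labelled 10-second segment and convert to wav with ffmpeg
    subprocess.run(
        ["ffmpeg", "-y", "-i", mp4_path,
         "-ss", str(start_sec), "-to", str(end_sec),
         out_wav],
        check=True)

# Example call for a hypothetical Audio Set row (YouTube ID, start time, end time):
# download_clip("SOME_VIDEO_ID", 30.0, 40.0, "clip.wav")
```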
4 Methodology
This section provides the details of the data pre-
processing steps followed by the description of
the two proposed approaches to this classification
problem.
Figure 1: Sample spectrograms for 1 audio signal from each music genre
Figure 2: Convolutional neural network architecture (Image Source: Hvass Tensorflow Tutorials)
4.1 Data Pre-processing
In order to improve the Signal-to-Noise Ratio (SNR) of the signal, a pre-emphasis filter, given by Equation 1, is applied to the original audio signal:

y(t) = x(t) - \alpha \, x(t-1)    (1)

where x(t) refers to the original signal, y(t) refers to the filtered signal and α is set to 0.97. Such a pre-emphasis filter is useful to boost amplitudes at high frequencies (Kim and Stern, 2012).
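As an illustration, the filter in Equation 1 can be applied in a few lines of numpy. This is a minimal sketch, assuming the audio has already been loaded with librosa at the sampling rate used later in the paper; the file path is a placeholder.

```python
import numpy as np
import librosa

def pre_emphasis(signal, alpha=0.97):
    """Apply the pre-emphasis filter y(t) = x(t) - alpha * x(t-1)."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# Example usage (path is illustrative)
y, sr = librosa.load("clip.wav", sr=22050)
y_filtered = pre_emphasis(y, alpha=0.97)
```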
4.2 Deep Neural Networks
Using deep learning, we can achieve the task of music genre classification without the need for hand-crafted features. Convolutional neural networks (CNNs) have been widely used for the task of image classification (Krizhevsky et al., 2012). The 3-channel (RGB) matrix representation of an image is fed into a CNN, which is trained to predict the image class. In this study, the sound wave can be represented as a spectrogram, which in turn can be treated as an image (Nanni et al., 2016; Lidy and Schindler, 2016). The task of the CNN is to use the spectrogram to predict the genre label (one of seven classes).
4.2.1 Spectrogram Generation
A spectrogram is a 2D representation of a signal, having time on the x-axis and frequency on the y-axis. A colormap is used to quantify the magnitude of a given frequency within a given time window. In this study, each audio signal was converted into a MEL spectrogram (having MEL frequency bins on the y-axis). The parameters used to generate the power spectrogram using STFT are listed below:
Sampling rate (sr) = 22050
Frame/window size (n_fft) = 2048
Time advance between frames (hop_size) = 512 (resulting in 75% overlap)
Window function: Hann window
Frequency scale: MEL
Number of MEL bins: 96
Highest frequency (f_max) = sr/2
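A sketch of how such a spectrogram could be computed with librosa using the parameters above is shown below. The conversion to decibels at the end is an assumption made here to obtain an image-like dynamic range and is not stated in the paper.

```python
import numpy as np
import librosa

sr = 22050
n_fft = 2048
hop_length = 512
n_mels = 96

y, _ = librosa.load("clip.wav", sr=sr)   # path is illustrative
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
    window="hann", n_mels=n_mels, fmax=sr / 2, power=2.0)
# Convert the power spectrogram to decibels before treating it as an image
mel_db = librosa.power_to_db(mel, ref=np.max)
```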
4.2.2 Convolutional Neural Networks
From Figure 1, one can see that there exist characteristic patterns in the spectrograms of the audio signals belonging to different classes. Hence, spectrograms can be considered as 'images' and provided as input to a CNN, which has shown good performance on image classification tasks. Each block in a CNN consists of the following operations (see https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/ for an accessible overview):
Convolution: This step involves sliding a matrix filter (say of size 3x3) over the input image, which is of dimension image_width x image_height. The filter is first placed on the image matrix and then we compute an element-wise multiplication between the filter and the overlapping portion of the image, followed by a summation to give a feature value. We use many such filters, the values of which are 'learned' during the training of the neural network via backpropagation.
Pooling: This is a way to reduce the dimension of the feature map obtained from the convolution step, formally known as downsampling. For example, by max pooling with a 2x2 window size, we only retain the element with the maximum value among the 4 elements of the feature map that are covered in this window. We keep moving this window across the feature map with a pre-defined stride.
Non-linear Activation: The convolution operation is linear, and in order to make the neural network more powerful, we need to introduce some non-linearity. For this purpose, we can apply an activation function such as ReLU (https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) on each element of the feature map.
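As a minimal sketch, one such block (convolution, ReLU activation, max pooling) can be written in tf.keras as below. The filter count is an illustrative assumption and this is not the VGG-16 configuration.

```python
import tensorflow as tf

# One convolutional block: convolution -> ReLU activation -> max pooling.
conv_block = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding="same",
                           activation="relu", input_shape=(216, 216, 3)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2),
])
conv_block.summary()
```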
In this study, a CNN architecture known as VGG-16, which was the top performing model in the ImageNet Challenge 2014 (classification + localization task), was used (Simonyan and Zisserman, 2014). The model consists of 5 convolutional blocks (the conv base), followed by a set of densely connected layers, which outputs the probability that a given image belongs to each of the possible classes.
For the task of music genre classification using spectrograms, we download the model architecture with pre-trained weights and extract the conv base. The output of the conv base is then sent to a new feed-forward neural network, which in turn predicts the genre of the music, as depicted in Figure 2.
There are two possible settings while implementing the pre-trained model:
1. Transfer learning: The weights in the conv base are kept fixed, but the weights in the feed-forward network (represented by the yellow box in Figure 2) are allowed to be tuned to predict the correct genre label.
2. Fine tuning: In this setting, we start with the pre-trained weights of VGG-16, but allow all the model weights to be tuned during the training process.
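A sketch of how the two settings could be expressed with the Keras VGG-16 application is given below. The head attached here is a simplified version; the exact layer layout of the released code is not assumed.

```python
import tensorflow as tf

# Load the VGG-16 conv base with ImageNet weights, without the dense top layers.
conv_base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(216, 216, 3))

# Setting 1 (transfer learning): freeze the conv base, train only the new head.
conv_base.trainable = False
# Setting 2 (fine tuning): start from the same weights but leave everything trainable.
# conv_base.trainable = True

model = tf.keras.Sequential([
    conv_base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(7, activation="softmax"),   # seven genre classes
])
```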
The final layer of the neural network outputs the class probabilities (using the softmax activation function) for each of the seven possible class labels. Next, the cross-entropy loss is computed as follows:

L = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (2)

where M is the number of classes; y_{o,c} is a binary indicator whose value is 1 if observation o belongs to class c and 0 otherwise; p_{o,c} is the model's predicted probability that observation o belongs to class c. This loss is used to backpropagate the error, compute the gradients and thereby update the weights of the network. This iterative process continues until the loss converges to a minimum value.
4.2.3 Implementation Details
Figure 3: Learning curves, (a) accuracy and (b) loss, used for model selection; epoch 4 has the minimum validation loss and the highest validation accuracy

The spectrogram images have a dimension of 216 x 216. For the feed-forward network connected to the conv base, a 512-unit hidden layer is implemented. Over-fitting is a common issue in neural networks. In order to prevent this, two strategies are adopted:
1. L2 Regularization (Ng, 2004): The term (λ/2) Σ_i w_i^2 is added to the loss function of the neural network, where w refers to the weights in the neural network. This method is used to penalize excessively high weights. We would like the weights to be diffused across all model parameters, and not concentrated among just a few parameters. Also, intuitively, smaller weights correspond to a less complex model, thereby avoiding over-fitting. λ is set to a value of 0.001 in this study.
2. Dropout (Srivastava et al., 2014): This is a regularization mechanism in which we randomly switch off some of the neurons (set their outputs to zero) during training. In each iteration, we thereby use a different combination of neurons to predict the final output. This makes the model generalize without any heavy dependence on a subset of the neurons. A dropout rate of 0.3 is used, which means that a given neuron is dropped during an iteration with a probability of 0.3.
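Putting the two strategies together, the following is a hedged sketch of the classification head described above; the conv-base output shape used for the Flatten layer is an assumption for a 216 x 216 input, and the compilation settings mirror the training description later in this section.

```python
import tensorflow as tf

# Classification head with the two regularization strategies described above:
# an L2 weight penalty (lambda = 0.001) on the dense layer and dropout with rate 0.3.
head = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(6, 6, 512)),   # conv-base output shape is an assumption
    tf.keras.layers.Dense(512, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(7, activation="softmax"),
])
head.compile(optimizer=tf.keras.optimizers.Adam(),
             loss="categorical_crossentropy",
             metrics=["accuracy"])
```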
The dataset is randomly split into train (90%), validation (5%) and test (5%) sets. The same split is used for all experiments to ensure a fair comparison of the proposed models.
The neural networks are implemented in Python using TensorFlow (http://tensorflow.org/); an NVIDIA Titan X GPU was utilized for faster processing. All models were trained for 10 epochs with a batch size of 32, using the ADAM optimizer (Kingma and Ba, 2014). One epoch refers to one iteration over the entire training dataset.
Figure 3 shows the learning curves: the loss (which is being optimized) keeps decreasing as the training progresses. Although the training accuracy keeps increasing, the validation accuracy first increases and, after a certain number of epochs, starts to decrease. This shows the model's tendency to overfit on the training data. The model that is selected for evaluation purposes is the one that has the highest accuracy and lowest loss on the validation set (epoch 4 in Figure 3).
4.2.4 Baseline Feed-forward Neural Network
To assess the performance improvement that can be achieved by the CNNs, we also train a baseline feed-forward neural network that takes as input the same spectrogram image. The image, which is a 2-dimensional array of pixel values, is unwrapped or flattened into a 1-dimensional vector. Using this vector, a simple 2-layer neural network is trained to predict the genre of the audio signal. The first hidden layer consists of 512 units and the second layer has 32 units, followed by the output layer. The activation function used is ReLU and the same regularization techniques described in Section 4.2.3 are adopted.
4.3 Manually Extracted Features
In this section, we describe the second category of proposed models, namely the ones that require hand-crafted features to be fed into a machine learning classifier. Features can be broadly classified as time domain and frequency domain features. The feature extraction was done using librosa (https://librosa.github.io/), a Python library.
4.3.1 Time Domain Features
These are features which were extracted from the raw audio signal.
1. Central moments: This consists of the mean, standard deviation, skewness and kurtosis of the amplitude of the signal.
2. Zero Crossing Rate (ZCR): A zero crossing point refers to one where the signal changes sign from positive to negative (Gouyon et al., 2000). The entire 10-second signal is divided into smaller frames, and the number of zero-crossings present in each frame is determined. The frame length is chosen to be 2048 points with a hop size of 512 points. Note that these frame parameters have been used consistently across all features discussed in this section. Finally, the average and standard deviation of the ZCR across all frames are chosen as representative features.
3. Root Mean Square Energy (RMSE): The energy in a signal is calculated as:

\sum_{n=1}^{N} |x(n)|^2    (3)

Further, the root mean square value can be computed as:

\sqrt{\frac{1}{N} \sum_{n=1}^{N} |x(n)|^2}    (4)

RMSE is calculated frame by frame and then we take the average and standard deviation across all frames.
4. Tempo: In general terms, tempo refers to how fast or slow a piece of music is; it is expressed in terms of Beats Per Minute (BPM). Intuitively, different kinds of music would have different tempos. Since the tempo of the audio piece can vary with time, we aggregate it by computing the mean across several frames. The functionality in librosa first computes a tempogram following Grosche et al. (2010) and then estimates a single value for tempo.
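A sketch of how these time domain features might be extracted with librosa and scipy is given below. Function names follow recent librosa versions (older releases call the RMS function rmse, and the tempo estimator has since moved modules), and the aggregation choices mirror the description above rather than the released code.

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

frame_length, hop_length = 2048, 512
y, sr = librosa.load("clip.wav", sr=22050)     # path is illustrative

# 1. Central moments of the raw amplitude
moments = [np.mean(y), np.std(y), skew(y), kurtosis(y)]

# 2. Zero crossing rate per frame, aggregated by mean and standard deviation
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                         hop_length=hop_length)[0]
zcr_stats = [np.mean(zcr), np.std(zcr)]

# 3. Root mean square energy per frame, aggregated the same way
rms = librosa.feature.rms(y=y, frame_length=frame_length,
                          hop_length=hop_length)[0]
rms_stats = [np.mean(rms), np.std(rms)]

# 4. A single tempo estimate in beats per minute
tempo = librosa.beat.tempo(y=y, sr=sr, hop_length=hop_length)[0]

time_domain_features = np.hstack([moments, zcr_stats, rms_stats, tempo])
```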
4.3.2 Frequency Domain Features
The audio signal can be transformed into the frequency domain by using the Fourier Transform. We then extract the following features.
1. Mel-Frequency Cepstral Coefficients (MFCC): Introduced in the early 1990s by Davis and Mermelstein, MFCCs have been very useful features for tasks such as speech recognition (Davis and Mermelstein, 1990). First, the Short-Time Fourier Transform (STFT) of the signal is taken with n_fft = 2048, hop_size = 512 and a Hann window. Next, we compute the power spectrum and then apply the triangular MEL filter bank, which mimics the human perception of sound. This is followed by taking the discrete cosine transform of the logarithm of all filterbank energies, thereby obtaining the MFCCs. The parameter n_mels, which corresponds to the number of filter banks, was set to 20 in this study.
2. Chroma Features: This is a vector which corresponds to the total energy of the signal in each of the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, B) (Ellis, 2007). The chroma vectors are then aggregated across the frames to obtain a representative mean and standard deviation.
3. Spectral Centroid: For each frame, this corresponds to the frequency around which most of the energy is centered (Tjoa, 2017). It is a magnitude-weighted frequency calculated as:

f_c = \frac{\sum_k S(k) f(k)}{\sum_k S(k)}    (5)

where S(k) is the spectral magnitude of frequency bin k and f(k) is the frequency corresponding to bin k.
4. Spectral Bandwidth: The p-th order spectral bandwidth corresponds to the p-th order moment about the spectral centroid (Tjoa, 2017) and is calculated as:

\left[ \sum_k S(k) \, (f(k) - f_c)^p \right]^{1/p}    (6)

For example, p = 2 is analogous to a weighted standard deviation.
5. Spectral Contrast: Each frame is divided into a pre-specified number of frequency bands, and within each frequency band, the spectral contrast is calculated as the difference between the maximum and minimum magnitudes (Jiang et al., 2002).
6. Spectral Roll-off: This feature corresponds to the value of the frequency below which 85% (this threshold can be defined by the user) of the total energy in the spectrum lies (Tjoa, 2017).
For each of the spectral features described above, the mean and standard deviation of the values taken across frames are considered as the representative final features that are fed to the model, as in the sketch below.
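The following is a sketch of the frequency domain feature extraction and the mean/standard-deviation aggregation described above; the helper function and the exact concatenation order are illustrative assumptions.

```python
import numpy as np
import librosa

def agg(feature_matrix):
    """Mean and standard deviation of a per-frame feature, taken across frames."""
    return np.hstack([np.mean(feature_matrix, axis=1), np.std(feature_matrix, axis=1)])

y, sr = librosa.load("clip.wav", sr=22050)       # path is illustrative
kwargs = dict(n_fft=2048, hop_length=512)

frequency_domain_features = np.hstack([
    agg(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, **kwargs)),
    agg(librosa.feature.chroma_stft(y=y, sr=sr, **kwargs)),
    agg(librosa.feature.spectral_centroid(y=y, sr=sr, **kwargs)),
    agg(librosa.feature.spectral_bandwidth(y=y, sr=sr, **kwargs)),
    agg(librosa.feature.spectral_contrast(y=y, sr=sr, **kwargs)),
    agg(librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85, **kwargs)),
])
```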
The features described in this section are used to train the machine learning algorithms (refer to Section 4.4). The features that contribute the most in achieving a good classification performance will be identified and reported.
4.4 Classifiers
This section provides a brief overview of the four machine learning classifiers adopted in this study; a sketch of how they might be instantiated follows the list.
1. Logistic Regression (LR): This linear classifier is generally used for binary classification tasks. For this multi-class classification task, the LR is implemented as a one-vs-rest method. That is, 7 separate binary classifiers are trained. During test time, the class with the highest probability from among the 7 classifiers is chosen as the predicted class.
2. Random Forest (RF): Random Forest is an ensemble learner that combines the predictions from a pre-specified number of decision trees. It works on the integration of two main principles: 1) each decision tree is trained with only a subset of the training samples, which is known as bootstrap aggregation (or bagging) (Breiman, 1996); 2) each decision tree is required to make its prediction using only a random subset of the features (Amit and Geman, 1997). The final predicted class of the RF is determined based on the majority vote from the individual classifiers.
3. Gradient Boosting (XGB): Boosting is another ensemble classifier that is obtained by combining a number of weak learners (such as decision trees). However, unlike RFs, boosting algorithms are trained in a sequential manner using forward stagewise additive modelling (Hastie et al., 2001). During the early iterations, the decision trees learnt are fairly simple. As training progresses, the classifier becomes more powerful because it is made to focus on the instances where the previous learners made errors. At the end of training, the final prediction is a weighted linear combination of the outputs from the individual learners. XGB refers to eXtreme Gradient Boosting, which is an implementation of boosting that supports training the model in a fast and parallelized manner.
4. Support Vector Machines (SVM): SVMs transform the original input data into a high dimensional space using a kernel trick (Cortes and Vapnik, 1995). The transformed data can be linearly separated using a hyperplane, and the optimal hyperplane maximizes the margin. In this study, a radial basis function (RBF) kernel is used to train the SVM, because such a kernel is required to address this non-linear problem. Similar to the logistic regression setting discussed above, the SVM is also implemented as a one-vs-rest classifier.
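The sketch below shows one plausible way to instantiate the four classifiers with scikit-learn and xgboost; the hyperparameters are illustrative assumptions, not the settings used in the paper.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# X_train / y_train would hold the hand-crafted feature vectors and genre labels.
classifiers = {
    "LR": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=200),
    "SVM": OneVsRestClassifier(SVC(kernel="rbf", probability=True)),
    "XGB": XGBClassifier(),
}

# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)
#     probs = clf.predict_proba(X_test)   # class probabilities for evaluation and ensembling
```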
5 Evaluation
5.1 Metrics
In order to evaluate the performance of the models
described in Section 4, the following metrics will
be used.
Accuracy: Refers to the percentage of correctly classified test samples.
F-score: Based on the confusion matrix, it is possible to calculate the precision and recall. The F-score (https://en.wikipedia.org/wiki/F1_score) is then computed as the harmonic mean of precision and recall.
AUC: The area under the receiver operating characteristic (ROC) curve is a common way to judge the performance of a multi-class classification system. The ROC is a graph of the true positive rate against the false positive rate. A baseline model which randomly predicts each class label with equal probability would have an AUC of 0.5; hence, the system being designed is expected to have an AUC higher than 0.5.

Table 2: Comparison of performance of the models on the test set

Model                              Accuracy  F-score  AUC
Spectrogram-based models
  VGG-16 CNN Transfer Learning     0.63      0.61     0.891
  VGG-16 CNN Fine Tuning           0.64      0.61     0.889
  Feed-forward NN baseline         0.43      0.33     0.759
Feature Engineering based models
  Logistic Regression (LR)         0.53      0.47     0.822
  Random Forest (RF)               0.54      0.48     0.840
  Support Vector Machines (SVM)    0.57      0.52     0.856
  Extreme Gradient Boosting (XGB)  0.59      0.55     0.865
Ensemble Classifiers
  VGG-16 CNN + XGB                 0.65      0.62     0.894
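For reference, these three metrics can be computed with scikit-learn as sketched below; the macro averaging and the one-vs-rest AUC variant are assumptions, since the paper does not state which variants were used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_prob):
    """y_true: integer genre labels; y_prob: (n_samples, 7) predicted class probabilities."""
    y_pred = np.argmax(y_prob, axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f_score": f1_score(y_true, y_pred, average="macro"),
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
    }
```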
5.2 Results and Discussion
In this section, the different modelling approaches discussed in Section 4 are evaluated based on the metrics described in Section 5.1. The values are reported in Table 2.
The best performance in terms of all metrics is observed for the convolutional neural network model based on VGG-16 that uses only the spectrogram to predict the music genre. It was expected that the fine tuning setting, which additionally allows the convolutional base to be trainable, would enhance the CNN model when compared to the transfer learning setting. However, as shown in Table 2, the experimental results show that there is no significant difference between transfer learning and fine tuning. The baseline feed-forward neural network that uses the unrolled pixel values from the spectrogram performs poorly on the test set. This shows that CNNs can significantly improve the scores on such an image classification task.
Among the models that use manually crafted features, the one with the weakest performance is the logistic regression model. This is expected, since logistic regression is a linear classifier. SVMs outperform random forests in terms of accuracy. However, the XGB implementation of the gradient boosting algorithm performs the best among the feature-engineered methods.
5.2.1 Most Important Features
In this section, we investigate which features contribute the most during prediction in this classification task. To carry out this experiment, we chose the XGB model, based on the results discussed in the previous section. We rank the top 20 most useful features based on a scoring metric (Figure 4). The metric is calculated as the number of times a given feature is used as a decision node among the individual decision trees that form the gradient boosting predictor.
As can be observed from Figure 4, Mel-Frequency Cepstral Coefficients (MFCCs) appear the most among the important features. Previous studies have reported MFCCs to improve the performance of speech recognition systems (Ittichaichareon et al., 2012). Our experiments show that MFCCs contribute significantly to this task of music genre classification as well. The mean and standard deviation of the spectral contrasts at different frequency bands are also important features. The music tempo, calculated in terms of beats per minute, also appears in the top 20 useful features.
Next, we study how much performance, in terms of AUC and accuracy, can be obtained by using only the top N features while training the model. From Table 3 it can be seen that with only the top 10 features, the model performance is surprisingly good. In comparison to the full model, which has 97 features, the model with the top 30 features has only a marginally lower performance (2 points on the AUC metric and 4 points on the accuracy metric).

Figure 4: Relative importance of features in the XGBoost model; the top 20 most contributing features are displayed
Table 3: Ablation study: comparing XGB performance keeping only the top N features

N    AUC    Accuracy
10   0.803  0.47
20   0.837  0.52
30   0.845  0.55
97   0.865  0.59
The final experiment in this section is a comparison of the time domain and frequency domain features listed in Section 4.3. Two XGB models were trained, one with only time domain features and the other with only frequency domain features. Table 4 compares the results in terms of AUC and accuracy. This experiment further confirms that frequency domain features are considerably more informative than time domain features when it comes to modelling audio for machine learning tasks.

Table 4: Comparison of time domain features and frequency domain features

Model                   AUC    Accuracy
Time domain only        0.731  0.40
Frequency domain only   0.857  0.57
Both                    0.865  0.59
5.2.2 Confusion Matrix
A confusion matrix is a tabular representation which enables us to further understand the strengths and weaknesses of our model. Element a_ij in the matrix refers to the number of test instances of class i that the model predicted as class j. The diagonal elements a_ii correspond to the correct predictions. Figure 5 compares the confusion matrices of the best performing CNN model and XGB, the best model among the feature-engineered classifiers. Both models seem to be good at predicting the class 'Rock'. However, many instances of class 'Hip Hop' are often confused with class 'Pop' and vice-versa. Such behaviour is expected when the genres of music are very close. Some songs may fall into multiple genres, so much so that it may be difficult even for humans to recognize the exact genre.

Figure 5: Confusion matrices of the best performing models: (a) VGG-16 CNN transfer learning, (b) extreme gradient boosting, (c) ensemble model
5.2.3 Ensemble Classifier
Ensembling is a commonly adopted practice in machine learning wherein the results from different classifiers are combined. This is done either by majority voting or by averaging scores/probabilities. Such an ensembling scheme, which combines the prediction powers of different classifiers, makes the overall system more robust. In our case, each classifier outputs a prediction probability for each of the class labels. Hence, averaging the predicted probabilities from the different classifiers is a straightforward way to do ensemble learning.
The methodologies described in Sections 4.2 and 4.4 use very different sources of input, the spectrograms and the hand-crafted features respectively. Hence, it makes sense to combine the models via ensembling. In this study, the best CNN model, namely VGG-16 transfer learning, is ensembled with XGBoost, the best feature-engineered model, by averaging the predicted probabilities. As shown in Table 2, this ensembling is beneficial and is observed to outperform all individual classifiers. The ROC curve for the ensemble model is above that of VGG-16 fine tuning and XGBoost, as illustrated in Figure 6.

Figure 6: ROC curves for the best performing models and their ensemble
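The probability-averaging step itself is simple; the following is a minimal sketch, assuming each model exposes per-class probabilities for the test set as arrays of shape (n_samples, 7).

```python
import numpy as np

def ensemble_predict(cnn_probs, xgb_probs):
    """Average the class probabilities of the two models and pick the most likely class."""
    avg_probs = (cnn_probs + xgb_probs) / 2.0
    return avg_probs, np.argmax(avg_probs, axis=1)
```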
6 Conclusion
In this work, the task of music genre classification is studied using the Audio Set data. We propose two different approaches to solving this problem. The first involves generating a spectrogram of the audio signal and treating it as an image; a CNN-based image classifier, namely VGG-16, is trained on these images to predict the music genre solely from the spectrogram. The second approach consists of extracting time domain and frequency domain features from the audio signals, followed by training traditional machine learning classifiers on these features. XGBoost was determined to be the best feature-based classifier, and the most important features were also reported. The CNN-based deep learning models were shown to outperform the feature-engineered models. We also show that ensembling the CNN and XGBoost models proved to be beneficial. It is to be noted that the dataset used in this study consists of audio clips from YouTube videos, which are in general very noisy. Future studies can identify ways to pre-process this noisy data before feeding it into a machine learning model, in order to achieve better performance.
References
Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui
Jiang, Li Deng, Gerald Penn, and Dong Yu. 2014.
Convolutional neural networks for speech recogni-
tion. IEEE/ACM Transactions on audio, speech, and
language processing 22(10):1533–1545.
Yali Amit and Donald Geman. 1997. Shape quantiza-
tion and recognition with randomized trees. Neural
computation 9(7):1545–1588.
Leo Breiman. 1996. Bagging predictors. Machine
learning 24(2):123–140.
Leo Breiman. 2001. Random forests. Machine learn-
ing 45(1):5–32.
Corinna Cortes and Vladimir Vapnik. 1995. Support-
vector networks. Machine learning 20(3):273–297.
Steven B Davis and Paul Mermelstein. 1990. Compar-
ison of parametric representations for monosyllabic
word recognition in continuously spoken sentences.
In Readings in speech recognition, Elsevier, pages
65–74.
Dan Ellis. 2007. Chroma feature analysis and synthe-
sis. Resources of Laboratory for the Recognition
and Organization of Speech and Audio-LabROSA .
Jerome H Friedman. 2001. Greedy function approx-
imation: a gradient boosting machine. Annals of
statistics pages 1189–1232.
Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman,
Aren Jansen, Wade Lawrence, R Channing Moore,
Manoj Plakal, and Marvin Ritter. 2017. Audio set:
An ontology and human-labeled dataset for audio
events. In Acoustics, Speech and Signal Processing
(ICASSP), 2017 IEEE International Conference on.
IEEE, pages 776–780.
Ricardo Garcia Gonzalez. 2006. Youtube-dl: Download videos from youtube.com.
Fabien Gouyon, François Pachet, Olivier Delerue, et al.
2000. On the use of zero-crossing rate for an ap-
plication of classification of percussive sounds. In
Proceedings of the COST G-6 conference on Digital
Audio Effects (DAFX-00), Verona, Italy.
Peter Grosche, Meinard Müller, and Frank Kurth. 2010.
Cyclic tempogram: a mid-level tempo representation
for music signals. In Acoustics Speech and Sig-
nal Processing (ICASSP), 2010 IEEE International
Conference on. IEEE, pages 5522–5525.
Trevor Hastie, Robert Tibshirani, and Jerome Fried-
man. 2001. The elements of statistical learning.
Chadawan Ittichaichareon, Siwat Suksri, and
Thaweesak Yingthawornsuk. 2012. Speech
recognition using mfcc. In International Con-
ference on Computer Graphics, Simulation and
Modeling (ICGSM’2012) July. pages 28–29.
Dan-Ning Jiang, Lie Lu, Hong-Jiang Zhang, Jian-Hua
Tao, and Lian-Hong Cai. 2002. Music type classi-
fication by spectral contrast feature. In Multimedia
and Expo, 2002. ICME’02. Proceedings. 2002 IEEE
International Conference on. IEEE, volume 1, pages
113–116.
Chanwoo Kim and Richard M Stern. 2012. Power-
normalized cepstral coefficients (pncc) for robust
speech recognition. In Acoustics, Speech and Sig-
nal Processing (ICASSP), 2012 IEEE International
Conference on. IEEE, pages 4101–4104.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980 .
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hin-
ton. 2012. Imagenet classification with deep con-
volutional neural networks. In Advances in neural
information processing systems. pages 1097–1105.
Tom LH Li, Antoni B Chan, and A Chun. 2010. Auto-
matic musical pattern feature extraction using con-
volutional neural network. In Proc. Int. Conf. Data
Mining and Applications.
Thomas Lidy and Andreas Rauber. 2005. Evaluation
of feature extractors and psycho-acoustic transfor-
mations for music genre classification. In ISMIR.
pages 34–41.
Thomas Lidy and Alexander Schindler. 2016. Parallel
convolutional neural networks for music genre and
mood classification. MIREX2016 .
Michael I Mandel and Dan Ellis. 2005. Song-level fea-
tures and support vector machines for music classi-
fication. In ISMIR. volume 2005, pages 594–599.
Loris Nanni, Yandre MG Costa, Alessandra Lumini,
Moo Young Kim, and Seung Ryul Baek. 2016.
Combining visual and acoustic features for music
genre classification. Expert Systems with Applica-
tions 45:108–117.
Andrew Y Ng. 2004. Feature selection, l 1 vs. l 2 regu-
larization, and rotational invariance. In Proceedings
of the twenty-first international conference on Ma-
chine learning. ACM, page 78.
Nicolas Scaringella and Giorgio Zoia. 2005. On the
modeling of time information for automatic genre
recognition systems in audio signals. In ISMIR.
pages 666–671.
Karen Simonyan and Andrew Zisserman. 2014. Very
deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556 .
Hagen Soltau, Tanja Schultz, Martin Westphal, and
Alex Waibel. 1998. Recognition of music types. In
Acoustics, Speech and Signal Processing, 1998. Pro-
ceedings of the 1998 IEEE International Conference
on. IEEE, volume 2, pages 1137–1140.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,
Ilya Sutskever, and Ruslan Salakhutdinov. 2014.
Dropout: A simple way to prevent neural networks
from overfitting. The Journal of Machine Learning
Research 15(1):1929–1958.
Steve Tjoa. 2017. Music information retrieval.
https://musicinformationretrieval.com/spectral_features.html. Accessed: 2018-02-20.
Suramya Tomar. 2006. Converting video formats with
ffmpeg. Linux Journal 2006(146):10.
George Tzanetakis and Perry Cook. 2002. Musical
genre classification of audio signals. IEEE Trans-
actions on speech and audio processing 10(5):293–
302.
Aaron Van Den Oord, Sander Dieleman, Heiga Zen,
Karen Simonyan, Oriol Vinyals, Alex Graves,
Nal Kalchbrenner, Andrew Senior, and Koray
Kavukcuoglu. 2016. Wavenet: A generative model
for raw audio. arXiv preprint arXiv:1609.03499 .
Lonce Wyse. 2017. Audio spectrogram representations
for processing with convolutional neural networks.
arXiv preprint arXiv:1706.09559 .
E Zwicker and H Fastl. 1999. Psychoacoustics: Facts
and models.