IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 5, JULY 2002 293
Musical Genre Classification of Audio Signals
George Tzanetakis, Student Member, IEEE, and Perry Cook, Member, IEEE
Abstract—Musical genres are categorical labels created by hu-
mans to characterize pieces of music. A musical genre is char-
acterized by the common characteristics shared by its members.
These characteristics typically are related to the instrumentation,
rhythmic structure, and harmonic content of the music. Genre hi-
erarchies are commonly used to structure the large collections of
music available on the Web. Currently musical genre annotation
is performed manually. Automatic musical genre classification can
assist or replace the human user in this process and would be a
valuable addition to music information retrieval systems. In ad-
dition, automatic musical genre classification provides a frame-
work for developing and evaluating features for any type of con-
tent-based analysis of musical signals.
In this paper, the automatic classification of audio signals into
a hierarchy of musical genres is explored. More specifically,
three feature sets for representing timbral texture, rhythmic
content and pitch content are proposed. The performance and
relative importance of the proposed features is investigated by
training statistical pattern recognition classifiers using real-world
audio collections. Both whole file and real-time frame-based
classification schemes are described. Using the proposed feature
sets, classification of 61% for ten musical genres is achieved. This
result is comparable to results reported for human musical genre
classification.
Index Terms—Audio classification, beat analysis, feature extrac-
tion, musical genre classification, wavelets.
I. INTRODUCTION
MUSICAL genres are labels created and used by humans
for categorizing and describing the vast universe of
music. Musical genres have no strict definitions and boundaries
as they arise through a complex interaction between the public,
marketing, historical, and cultural factors. This observation
has led some researchers to suggest the definition of a new
genre classification scheme purely for the purposes of music
information retrieval [1]. However even with current musical
genres, it is clear that the members of a particular genre share
certain characteristics typically related to the instrumentation,
rhythmic structure, and pitch content of the music.
Automatically extracting music information is gaining im-
portance as a way to structure and organize the increasingly
large numbers of music files available digitally on the Web. It is
very likely that in the near future all recorded music in human
Manuscript received November 28, 2001; revised April 11, 2002. This work
was supported by the NSF under Grant 9984087, the State of New Jersey Com-
mission on Science and Technology under Grant 01-2042-007-22, Intel, and
the Arial Foundation. The associate editor coordinating the review of this man-
uscript and approving it for publication was Prof. C.-C. Jay Kuo.
G. Tzanetakis is with the Computer Science Department, Princeton Univer-
sity, Princeton, NJ 08544 USA (e-mail: gtzan@cs.princeton.edu).
P. Cook is with the Computer Science and Music Departments, Princeton
University, Princeton, NJ 08544 USA (e-mail: prc@cs.princeton.edu).
Publisher Item Identifier 10.1109/TSA.2002.800560.
history will be available on the Web. Automatic music analysis
will be one of the services that music content distribution ven-
dors will use to attract customers. Another indication of the in-
creasing importance of digital music distribution is the legal at-
tention that companies like Napster have recently received.
Genre hierarchies, typically created manually by human ex-
perts, are currently one of the ways used to structure music con-
tent on the Web. Automatic musical genre classification can po-
tentially automate this process and provide an important com-
ponent for a complete music information retrieval system for
audio signals. In addition it provides a framework for devel-
oping and evaluating features for describing musical content.
Such features can be used for similarity retrieval, classification,
segmentation, and audio thumbnailing and form the foundation
of most proposed audio analysis techniques for music.
In this paper, the problem of automatically classifying audio
signals into an hierarchy of musical genres is addressed. More
specifically, three sets of features for representing timbral tex-
ture, rhythmic content and pitch content are proposed. Although
there has been significant work in the development of features
for speech recognition and music–speech discrimination there
has been relatively little work in the development of features
specifically designed for music signals. Although the timbral
texture feature set is based on features used for speech and gen-
eral sound classification, the other two feature sets (rhythmic
and pitch content) are new and specifically designed to rep-
resent aspects of musical content (rhythm and harmony). The
performance and relative importance of the proposed feature
sets is evaluated by training statistical pattern recognition clas-
sifiers using audio collections collected from compact disks,
radio, and the Web. Audio signals can be classified into a hier-
archy of music genres, augmented with speech categories. The
speech categories are useful for radio and television broadcasts.
Both whole-file classification and real-time frame classification
schemes are proposed.
The paper is structured as follows. A review of related work
is provided in Section II. Feature extraction and the three spe-
cific feature sets for describing timbral texture, rhythmic struc-
ture, and pitch content of musical signals are described in Sec-
tion III. Section IV deals with the automatic classification and
evaluation of the proposed features and Section V with conclu-
sions and future directions.
II. RELATED WORK
The basis of any type of automatic audio analysis system is
the extraction of feature vectors. A large number of different
feature sets, mainly originating from the area of speech recog-
nition, have been proposed to represent audio signals. Typically
they are based on some form of time-frequency representation.
Although a complete overview of audio feature extraction is be-
yond the scope of this paper, some relevant representative audio
feature extraction references are provided.
Automatic classification of audio also has a long history, orig-
inating in speech recognition. Mel-frequency cepstral coef-
ficients (MFCC) [2] are a set of perceptually motivated fea-
tures that have been widely used in speech recognition. They
provide a compact representation of the spectral envelope, such
that most of the signal energy is concentrated in the first coeffi-
cients.
More recently, audio classification techniques that include
nonspeech signals have been proposed. Most of these systems
target the classification of broadcast news and video in broad
categories like music, speech, and environmental sounds. The
problem of discrimination between music and speech has re-
ceived considerable attention from the early work of Saunders
[3] where simple thresholding of the average zero-crossing rate
and energy features is used, to the work of Scheirer and Slaney
[4] where multiple features and statistical pattern recognition
classifiers are carefully evaluated. In [5], audio signals are
segmented and classified into “music,” “speech,” “laughter,”
and nonspeech sounds using cepstral coefficients and a hidden
Markov model (HMM). A heuristic rule-based system for the
segmentation and classification of audio signals from movies
or TV programs based on the time-varying properties of simple
features is proposed in [6]. Signals are classified into two broad
groups of music and nonmusic which are further subdivided
into (music) harmonic environmental sound, pure music, song,
speech with music, environmental sound with music, and
(non-music) pure speech and nonharmonic environmental
sound. Berenzweig and Ellis [7] deal with the more difficult
problem of locating singing voice segments in musical signals.
In their system, the phoneme activation output of an automatic
speech recognition system is used as the feature vector for
classifying singing segments.
Another type of nonspeech audio classification system in-
volves isolated musical instrument sounds and sound effects.
In the pioneering work of Wold et al. [8] automatic retrieval,
classification and clustering of musical instruments, sound ef-
fects, and environmental sounds using automatically extracted
features is explored. The features used in their system are statis-
tics (mean, variance, autocorrelation) over the whole sound file
of short time features such as pitch, amplitude, brightness, and
bandwidth. Using the same dataset various other retrieval and
classification approaches have been proposed. Foote [9] pro-
poses the use of MFCC coefficients to construct a learning tree
vector quantizer. Histograms of the relative frequencies of fea-
ture vectors in each quantization bin are subsequently used for
retrieval. The same dataset is also used in [10] to evaluate a fea-
ture extraction and indexing scheme based on statistics of the
discrete wavelet transform (DWT) coefficients. Li [11] used the
same dataset to compare various classification methods and fea-
ture sets and proposed the use of the nearest feature line pattern
classification method.
In the previously cited systems, the proposed acoustic fea-
tures do not directly attempt to model musical signals and there-
fore are not adequate for automatic musical genre classification.
For example, no information regarding the rhythmic structure
of the music is utilized. Research in the areas of automatic beat
detection and multiple pitch analysis can provide ideas for the
development of novel features specifically targeted to the anal-
ysis of music signals.
Scheirer [12] describes a real-time beat tracking system for
audio signals with music. In this system, a filterbank is coupled
with a network of comb filters that track the signal periodicities to
provide an estimate of the main beat and its strength. A real-time
beat tracking system based on a multiple agent architecture that
tracks several beat hypotheses in parallel is described in [13].
More recently,computationally simpler methods based on onset
detection at specific frequencies have been proposed in [14]
and [15]. The beat spectrum, described in [16], is a more global
representation of rhythm than just the main beat and its strength.
To the best of our knowledge, there has been little research
in feature extraction and classification with the explicit goal of
classifying musical genre. Reference [17] contains some early
work and preliminary results in automatic musical genre classi-
fication.
III. FEATURE EXTRACTION
Feature extraction is the process of computing a compact nu-
merical representation that can be used to characterize a seg-
ment of audio. The design of descriptive features for a specific
application is the main challenge in building pattern recogni-
tion systems. Once the features are extracted standard machine
learning techniques which are independent of the specific appli-
cation area can be used.
A. Timbral Texture Features
The features used to represent timbral texture are based on
standard features proposed for music-speech discrimination [4]
and speech recognition [2]. The calculated features are based
on the short time Fourier transform (STFT) and are calculated
for every short-time frame of sound. More details regarding
the STFT algorithm and the Mel-frequency cepstral coefficients
(MFCC) can be found in [18]. The use of MFCCs to separate
music and speech has been explored in [19]. The following spe-
cific features are used to represent timbral texture in our system.
1) Spectral Centroid: The spectral centroid is defined as the
center of gravity of the magnitude spectrum of the STFT

$$C_t = \frac{\sum_{n=1}^{N} M_t[n]\, n}{\sum_{n=1}^{N} M_t[n]} \qquad (1)$$

where $M_t[n]$ is the magnitude of the Fourier transform at frame
$t$ and frequency bin $n$. The centroid is a measure of spectral
shape and higher centroid values correspond to “brighter” tex-
tures with more high frequencies.
2) Spectral Rolloff: The spectral rolloff $R_t$ is defined as the fre-
quency below which 85% of the magnitude distribution is
concentrated

$$\sum_{n=1}^{R_t} M_t[n] = 0.85 \sum_{n=1}^{N} M_t[n] \qquad (2)$$
The rolloff is another measure of spectral shape.
3) Spectral Flux: The spectral flux is defined as the squared
difference between the normalized magnitudes of successive
spectral distributions

$$F_t = \sum_{n=1}^{N} \left( N_t[n] - N_{t-1}[n] \right)^2 \qquad (3)$$

where $N_t[n]$ and $N_{t-1}[n]$ are the normalized magnitudes of the
Fourier transform at the current frame $t$ and the previous frame
$t-1$, respectively. The spectral flux is a measure of the amount
of local spectral change.
4) Time Domain Zero Crossings:

$$Z_t = \frac{1}{2} \sum_{n=1}^{N} \bigl| \operatorname{sign}(x[n]) - \operatorname{sign}(x[n-1]) \bigr| \qquad (4)$$

where the sign function is 1 for positive arguments and 0 for
negative arguments and $x[n]$ is the time domain signal for frame
$t$. Time domain zero crossings provide a measure of the noisi-
ness of the signal.
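As an illustration of how these four STFT-based features can be computed per analysis window, the following is a minimal NumPy sketch; the function name, the Hann window, and the small constant guarding against division by zero are assumptions of this sketch, not details taken from the original implementation.

```python
import numpy as np

def timbral_frame_features(frame, prev_mag, rolloff_pct=0.85):
    """Centroid, rolloff, flux, and zero crossings for one analysis window."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))      # M_t[n]
    bins = np.arange(1, len(mag) + 1)

    centroid = np.sum(mag * bins) / (np.sum(mag) + 1e-12)          # eq. (1)

    cumulative = np.cumsum(mag)
    rolloff = np.searchsorted(cumulative, rolloff_pct * cumulative[-1])  # eq. (2)

    norm = mag / (np.sum(mag) + 1e-12)                             # N_t[n]
    prev_norm = prev_mag / (np.sum(prev_mag) + 1e-12)
    flux = np.sum((norm - prev_norm) ** 2)                         # eq. (3)

    # eq. (4), using np.sign in place of the 0/1 sign convention of the text
    zero_crossings = 0.5 * np.sum(np.abs(np.diff(np.sign(frame))))

    return centroid, rolloff, flux, zero_crossings, mag
```

The magnitude spectrum is returned so that it can serve as prev_mag for the next frame when computing the flux.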
5) Mel-Frequency Cepstral Coefficients: Mel-frequency
cepstral coefficients (MFCC) are perceptually motivated
features that are also based on the STFT. After taking the
log-amplitude of the magnitude spectrum, the FFT bins are
grouped and smoothed according to the perceptually motivated
Mel-frequency scaling. Finally, in order to decorrelate the
resulting feature vectors a discrete cosine transform is per-
formed. Although typically 13 coefficients are used for speech
representation, we have found that the first five coefficients
provide the best genre classification performance.
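A hedged sketch of the MFCC computation using librosa (not the toolkit used in the paper); the file name is a placeholder, and dropping coefficient 0 before keeping the next five reflects the exclusion of the DC-related coefficient mentioned later in Section III-B.

```python
import librosa

# 22 050 Hz mono audio, ~23 ms (512-sample) non-overlapping analysis windows
y, sr = librosa.load("track.wav", sr=22050, mono=True)   # "track.wav" is a placeholder
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=6, n_fft=512, hop_length=512)
mfcc_five = mfcc[1:6, :]   # keep the first five coefficients, excluding coefficient 0
```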
6) Analysis and Texture Window: In short-time audio
analysis, the signal is broken into small, possibly overlapping,
segments in time and each segment is processed separately.
These segments are called analysis windows and have to
be small enough so that the frequency characteristics of the
magnitude spectrum are relatively stable (i.e., assume that the
signal for that short amount of time is stationary). However, the
sensation of a sound “texture” arises as the result of multiple
short-time spectrums with different characteristics following
some pattern in time. For example, speech contains vowel
and consonant sections which have very different spectral
characteristics.
Therefore, in order to capture the long term nature of sound
“texture,” the actual features computed in our system are the
running means and variances of the extracted features described
in the previous section over a number of analysis windows. The
term texture window is used in this paper to describe this larger
window and ideally should correspond to the minimum time
amount of sound that is necessary to identify a particular sound
or music “texture.” Essentially, rather than using the feature
values directly, the parameters of a running multidimensional
Gaussian distribution are estimated. More specifically, these pa-
rameters (means, variances) are calculated based on the texture
window which consists of the current feature vector in addition
to a specific number of feature vectors from the past. Another
way to think of the texture window is as a memory of the past.
For efficient implementation a circular buffer holding previous
feature vectors can be used. In our system, an analysis window
of 23 ms (512 samples at 22 050 Hz sampling rate) and a texture
window of 1 s (43 analysis windows) are used.
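The texture-window statistics can be sketched as a running mean and variance over the last 43 feature vectors; the function below is illustrative (a real-time system would use the circular buffer mentioned above).

```python
import numpy as np

def texture_statistics(frame_features, texture_size=43):
    """frame_features: array of shape (n_frames, n_features)."""
    means, variances = [], []
    for t in range(len(frame_features)):
        start = max(0, t - texture_size + 1)        # current vector plus past vectors
        window = frame_features[start:t + 1]
        means.append(window.mean(axis=0))
        variances.append(window.var(axis=0))
    return np.array(means), np.array(variances)
```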
7) Low-Energy Feature: Low energy is the only feature that
is based on the texture window rather than the analysis window.
It is defined as the percentage of analysis windows that have less
RMS energy than the average RMS energy across the texture
window. As an example, vocal music with silences will have
large low-energy value while continuous strings will have small
low-energy value.
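A minimal sketch of the low-energy feature over one texture window (names are illustrative):

```python
import numpy as np

def low_energy(frames):
    """frames: array of shape (n_analysis_windows, samples_per_window)."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return np.mean(rms < rms.mean())   # fraction of windows below the average RMS
```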
B. Timbral Texture Feature Vector
To summarize, the feature vector for describing timbral tex-
ture consists of the following features: means and variances of
spectral centroid, rolloff, flux, zero crossings over the texture
window (8), low energy (1), and means and variances of the first
five MFCC coefficients over the texture window (excluding the
coefficient corresponding to the DC component) resulting in a
19-dimensional feature vector.
C. Rhythmic Content Features
Most automatic beat detection systems provide a running es-
timate of the main beat and an estimate of its strength. In ad-
dition to these features in order to characterize musical genres
more information about the rhythmic content of a piece can be
utilized. The regularity of the rhythm, the relation of the main
beat to the subbeats, and the relative strength of subbeats to the
main beat are some examples of characteristics we would like
to represent through feature vectors.
One of the common automatic beat detector structures con-
sists of a filterbank decomposition, followed by an envelope ex-
traction step and finally a periodicity detection algorithm which
is used to detect the lag at which the signal’s envelope is most
similar to itself. The process of automatic beat detection resem-
bles pitch detection with larger periods (approximately 0.5 s to
1.5 s for beat compared to 2 ms to 50 ms for pitch).
The calculation of features for representing the rhythmic
structure of music is based on the wavelet transform (WT)
which is a technique for analyzing signals that was developed
as an alternative to the STFT to overcome its resolution
problems. More specifically, unlike the STFT which provides
uniform time resolution for all frequencies, the WT provides
high time resolution and low-frequency resolution for high
frequencies, and low time and high-frequency resolution for
low frequencies. The discrete wavelet transform (DWT) is a
special case of the WT that provides a compact representation
of the signal in time and frequency that can be computed
efficiently using a fast, pyramidal algorithm related to multirate
filterbanks. More information about the WT and DWT can
be found in [20]. For the purposes of this work, the DWT
can be viewed as a computationally efficient way to calculate
an octave decomposition of the signal in frequency. More
specifically, the DWT can be viewed as a constant-Q (constant ratio of
center frequency to bandwidth) filterbank with octave spacing between
the centers of the filters.
In the pyramidal algorithm, the signal is analyzed at different
frequency bands with different resolutions for each band. This is
achieved by successively decomposing the signal into a coarse
approximation and detail information. The coarse approxima-
tion is then further decomposed using the same wavelet decom-
position step. This decomposition step is achieved by successive
Fig. 1. Beat histogram calculation flow diagram.
highpass and lowpass filtering of the time domain signal and is
defined by the following equations:

$$y_{\mathrm{high}}[k] = \sum_{n} x[n]\, g[2k - n] \qquad (5)$$

$$y_{\mathrm{low}}[k] = \sum_{n} x[n]\, h[2k - n] \qquad (6)$$

where $y_{\mathrm{high}}[k]$ and $y_{\mathrm{low}}[k]$ are the outputs of the highpass ($g$) and
lowpass ($h$) filters, respectively, after subsampling by two. The
DAUB4 filters proposed by Daubechies [21] are used.
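As a functional stand-in for this pyramidal decomposition, the sketch below uses PyWavelets with the Daubechies-4 ('db4') wavelet; the original system implements its own filterbank, so this is only an assumed equivalent.

```python
import pywt

def octave_bands(signal, levels=5):
    """Return the final approximation and the detail (highpass) subbands."""
    coeffs = pywt.wavedec(signal, wavelet="db4", level=levels)
    approximation, details = coeffs[0], coeffs[1:]
    return approximation, details
```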
The feature set for representing rhythm structure is based
on detecting the most salient periodicities of the signal. Fig. 1
shows the flow diagram of the beat analysis algorithm. The
signal is first decomposed into a number of octave frequency
bands using the DWT. Following this decomposition, the
time domain amplitude envelope of each band is extracted
separately. This is achieved by applying full-wave rectification,
low pass filtering, and downsampling to each octave frequency
band. After mean removal, the envelopes of each band are then
summed together and the autocorrelation of the resulting sum
envelope is computed. The dominant peaks of the autocorre-
lation function correspond to the various periodicities of the
signal’s envelope. These peaks are accumulated over the whole
sound file into a beat histogram where each bin corresponds to
the peak lag, i.e., the beat period in beats-per-minute (bpm).
Rather than adding one, the amplitude of each peak is added to
the beat histogram. That way, when the signal is very similar to
itself (strong beat) the histogram peaks will be higher.
The following building blocks are used for the beat analysis
feature extraction.
1) Full Wave Rectification:
$$y[n] = |x[n]| \qquad (7)$$
is applied in order to extract the temporal envelope of the signal
rather than the time domain signal itself.
2) Low-Pass Filtering:
$$y[n] = (1 - \alpha)\, x[n] + \alpha\, y[n-1] \qquad (8)$$
i.e., a one-pole filter with an alpha value of 0.99 which is used
to smooth the envelope. Full wave rectification followed by
low-pass filtering is a standard envelope extraction technique.
3) Downsampling:
$$y[n] = x[k n] \qquad (9)$$

where $k = 16$ in our implementation. Because of the large pe-
riodicities for beat analysis, downsampling the signal reduces
computation time for the autocorrelation computation without
affecting the performance of the algorithm.
4) Mean Removal:
$$y[n] = x[n] - E[x[n]] \qquad (10)$$
is applied in order to make the signal centered to zero for the
autocorrelation stage.
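Building blocks 1)-4) can be chained into a single envelope-extraction step per octave band, sketched here with SciPy's one-pole filter; the function name and the downsampling factor of 16 follow the reconstruction in (9) and are assumptions of this sketch.

```python
import numpy as np
from scipy.signal import lfilter

def band_envelope(band, alpha=0.99, k=16):
    rectified = np.abs(band)                                     # (7) full-wave rectification
    smoothed = lfilter([1.0 - alpha], [1.0, -alpha], rectified)  # (8) one-pole low-pass
    downsampled = smoothed[::k]                                  # (9) downsampling
    return downsampled - downsampled.mean()                      # (10) mean removal
```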
5) Enhanced Autocorrelation:
$$y[k] = \frac{1}{N} \sum_{n} x[n]\, x[n - k] \qquad (11)$$
the peaks of the autocorrelation function correspond to the time
lags where the signal is most similar to itself. The time lags of
peaks in the right time range for rhythm analysis correspond
to beat periodicities. The autocorrelation function is enhanced
using a similar method to the multipitch analysis model of
Tolonen and Karjalainen [22] in order to reduce the effect
of integer multiples of the basic periodicities. The original
autocorrelation function of the summary of the envelopes, is
clipped to positive values and then time-scaled by a factor of
two and subtracted from the original clipped function. The
same process is repeated with other integer factors such that
repetitive peaks at integer multiples are removed.
6) Peak Detection and Histogram Calculation: The first
three peaks of the enhanced autocorrelation function that are in
the appropriate range for beat detection are selected and added
to a beat histogram (BH). The bins of the histogram correspond
to beats-per-minute (bpm) from 40 to 200 bpm. For each peak
of the enhanced autocorrelation function the peak amplitude
is added to the histogram. That way peaks that have high
amplitude (where the signal is highly similar) are weighted
more strongly than weaker peaks in the histogram calculation.
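A sketch of steps 5) and 6): autocorrelation of the summed envelope, the enhancement that suppresses peaks at integer multiples of the basic periodicities, and accumulation of the strongest peaks into the beat histogram. The stretch factors, the envelope sampling rate parameter, and the use of the three largest values in the 40-200 bpm range as a stand-in for proper peak picking are assumptions of this sketch.

```python
import numpy as np

def enhanced_autocorrelation(envelope_sum, factors=(2, 3, 4)):
    n = len(envelope_sum)
    acf = np.correlate(envelope_sum, envelope_sum, mode="full")[n - 1:] / n  # eq. (11)
    clipped = np.clip(acf, 0.0, None)            # keep positive values only
    enhanced = clipped.copy()
    for f in factors:
        # time-stretch the clipped function by f and subtract it, so that
        # repetitive peaks at integer multiples of a periodicity are removed
        stretched = np.interp(np.arange(n), np.arange(n) * f, clipped)
        enhanced = np.clip(enhanced - stretched, 0.0, None)
    return enhanced

def accumulate_beat_histogram(enhanced, envelope_rate, histogram, n_peaks=3):
    # histogram is expected to have one bin per bpm from 40 to 200 (161 bins);
    # envelope_rate is the sampling rate of the summed envelope after downsampling
    lags = np.arange(1, len(enhanced))
    bpm = 60.0 * envelope_rate / lags
    candidates = lags[(bpm >= 40) & (bpm <= 200)]
    strongest = candidates[np.argsort(enhanced[candidates])[::-1][:n_peaks]]
    for lag in strongest:
        bin_index = int(round(60.0 * envelope_rate / lag)) - 40
        histogram[bin_index] += enhanced[lag]    # add the peak amplitude, not a count
    return histogram
```

A histogram initialized as np.zeros(161) would be updated once per analysis window and accumulated over the whole file.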
7) Beat Histogram Features: Fig. 2 shows a beat histogram
for a 30-s excerpt of the song “Come Together” by the Beatles.
The two main peaks of the BH correspond to the main beat at
approximately 80 bpm and its first harmonic (twice the speed) at
160 bpm. Fig. 3 shows four beat histograms of pieces from dif-
ferent musical genres. The upper left corner, labeled classical,
is the BH of an excerpt from “La Mer” by Claude Debussy. Be-
cause of the complexity of the multiple instruments of the or-
chestra there is no strong self-similarity and there is no clear
dominant peak in the histogram. More strong peaks can be seen
at the lower left corner, labeled jazz, which is an excerpt from a
live performance by Dee Dee Bridgewater. The two peaks cor-
respond to the beat of the song (70 and 140 bpm). The BH of
Fig. 2 is shown on the upper right corner where the peaks are
more pronounced because of the stronger beat of rock music.
Fig. 2. Beat histogram example.
The highest peaks of the lower right corner indicate the strong
rhythmic structure of a HipHop song by Neneh Cherry.
A small-scale study (20 excerpts from various genres) con-
firmed that most of the time (18/20) the main beat corresponds
to the first or second BH peak. The results of this study and the
initial description of beat histograms can be found in [23]. Un-
like previous work in automatic beat detection which typically
aims to provide only an estimate of the main beat (or tempo) of
the song and possibly a measure of its strength, the BH repre-
sentation captures more detailed information about the rhythmic
content of the piece that can be used to intelligently guess the
musical genre of a song. Fig. 3 indicates that the BH of different
musical genres can be visually differentiated. Based on this ob-
servation a set of features based on the BH are calculated in
order to represent rhythmic content and are shown to be useful
for automatic musical genre classification. These are:
A0, A1: relative amplitude (divided by the sum of ampli-
tudes) of the first, and second histogram peak;
RA: ratio of the amplitude of the second peak divided by
the amplitude of the first peak;
P1, P2: period of the first, second peak in bpm;
SUM: overall sum of the histogram (indication of beat
strength).
For the BH calculation, the DWT is applied in a window of
65 536 samples at 22050 Hz sampling rate which corresponds
to approximately 3 s. This window is advanced by a hop size of
32 768 samples. This larger window is necessary to capture the
signal repetitions at the beat and subbeat levels.
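Once the histogram has been accumulated, the six features listed above can be read off directly; the sketch below simplifies peak selection to the two largest bins.

```python
import numpy as np

def beat_histogram_features(histogram, first_bpm=40):
    order = np.argsort(histogram)[::-1]
    first, second = order[0], order[1]
    total = histogram.sum()
    a0 = histogram[first] / total              # A0: relative amplitude of first peak
    a1 = histogram[second] / total             # A1: relative amplitude of second peak
    ra = histogram[second] / histogram[first]  # RA: ratio of second to first peak
    p1 = first + first_bpm                     # P1: period of first peak in bpm
    p2 = second + first_bpm                    # P2: period of second peak in bpm
    return a0, a1, ra, p1, p2, total           # SUM: overall histogram sum
```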
D. Pitch Content Features
The pitch content feature set is based on multiple pitch detec-
tion techniques. More specifically, the multipitch detection al-
gorithm described by Tolonen and Karjalainen [22] is utilized.
In this algorithm, the signal is decomposed into two frequency
bands (below and above 1000 Hz) and amplitude envelopes are
extracted for each frequency band. The envelope extraction is
performed by applying half-wave rectification and low-pass fil-
tering. The envelopes are summed and an enhanced autocorrela-
tion function is computed so that the effect of integer multiples
of the peak frequencies to multiple pitch detection is reduced.
The prominent peaks of this summary enhanced autocorre-
lation function (SACF) correspond to the main pitches for that
short segment of sound. This method is similar to the beat de-
tection structure for the shorter periods corresponding to pitch
perception. The three dominant peaks of the SACF are accumu-
lated into a pitch histogram (PH) over the whole sound file. For the computation
of the PH, a pitch analysis window of 512 samples at 22050 Hz
sampling rate (approximately 23 ms) is used.
The frequencies corresponding to each histogram peak are
converted to musical pitches such that each bin of the PH corre-
sponds to a musical note with a specific pitch (for example A4
440 Hz). The musical notes are labeled using the MIDI note
numbering scheme. The conversion from frequency to MIDI
note number can be performed using
$$n = 12 \log_2\!\left( \frac{f}{440\ \mathrm{Hz}} \right) + 69 \qquad (12)$$

where $f$ is the frequency in Hertz and $n$ is the histogram bin
(MIDI note number).
Two versions of the PH are created: a folded (FPH) and un-
folded histogram (UPH). The unfolded version is created using
the above equation without any further modifications. In the
folded case, all notes are mapped to a single octave using
$$c = n \bmod 12 \qquad (13)$$

where $c$ is the folded histogram bin (pitch class or chroma
value) and $n$ is the unfolded histogram bin (or MIDI note
number). The folded version contains information regarding
the pitch classes or harmonic content of the music whereas the
unfolded version contains information about the pitch range of
the piece. The FPH is similar in concept to the chroma-based
representations used in [24] for audio-thumbnailing. More
information regarding the chroma and height dimension of
musical pitch can be found in [25]. The relation of musical
scales to frequency is discussed in more detail in [26].
Finally, the FPH is mapped to a circle of fifths histogram so
that adjacent histogram bins are spaced a fifth apart rather than
a semitone. This mapping is achieved by
$$c' = (7 \times c) \bmod 12 \qquad (14)$$

where $c'$ is the new folded histogram bin after the mapping and
$c$ is the original folded histogram bin. The number seven corre-
sponds to seven semitones or the music interval of a fifth. That
way, the distances between adjacent bins after the mapping are
better suited for expressing tonal music relations (tonic-dom-
inant) and the extracted features result in better classification
accuracy.
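The three mappings of (12)-(14) can be sketched directly; the A4 = 440 Hz = MIDI note 69 convention follows the example given above, and the function names are illustrative.

```python
import numpy as np

def frequency_to_midi(f_hz):
    return int(round(12.0 * np.log2(f_hz / 440.0) + 69.0))   # eq. (12)

def fold_to_pitch_class(midi_note):
    return midi_note % 12                                     # eq. (13)

def circle_of_fifths_bin(pitch_class):
    return (7 * pitch_class) % 12                             # eq. (14)
```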
Although musical genres by no means can be characterized
fully by their pitch content, there are certain tendencies that
can lead to useful feature vectors. For example jazz or classical
music tend to have a higher degree of pitch change than rock
or pop music. As a consequence, pop or rock music pitch his-
tograms will have fewer and more pronounced peaks than the
histograms of jazz or classical music.
Based on these observations the following features are com-
puted from the UPH and FPH in order to represent pitch content.
FA0: Amplitude of maximum peak of the folded his-
togram. This corresponds to the most dominant pitch
class of the song. For tonal music this peak will typically
Fig. 3. Beat histogram examples.
correspond to the tonic or dominant chord. This peak
will be higher for songs that do not have many harmonic
changes.
UP0: Period of the maximum peak of the unfolded his-
togram. This corresponds to the octave range of the dom-
inant musical pitch of the song.
FP0: Period of the maximum peak of the folded his-
togram. This corresponds to the main pitch class of the
song.
IPO1: Pitch interval between the two most prominent
peaks of the folded histogram. This corresponds to the
main tonal interval relation. For pieces with simple
harmonic structure this feature will have value 1 or -1,
corresponding to a fifth or fourth interval (tonic-dominant).
SUM: The overall sum of the histogram. This feature is
a measure of the strength of the pitch detection.
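As with the beat histogram, these pitch-content features can be read from the unfolded (UPH) and folded, fifths-mapped (FPH) histograms; peak selection is again simplified to the largest bins in this sketch.

```python
import numpy as np

def pitch_content_features(uph, fph):
    fa0 = fph.max()                          # FA0: amplitude of the strongest pitch class
    up0 = int(np.argmax(uph))                # UP0: bin of the strongest unfolded peak
    fp0 = int(np.argmax(fph))                # FP0: bin of the strongest folded peak
    order = np.argsort(fph)[::-1]
    ipo1 = int(order[0]) - int(order[1])     # IPO1: interval between the two strongest peaks
    return fa0, up0, fp0, ipo1, fph.sum()    # SUM: overall histogram sum
```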
E. Whole File and Real-Time Features
In this work, both the rhythmic and pitch content feature
set are computed over the whole file. This approach poses no
problem if the file is relatively homogeneous but is not appro-
priate if the file contains regions of different musical texture.
Automatic segmentation algorithms [27], [28] can be used to
segment the file into regions and apply classification to each
region separately. If real-time performance is desired, only the
timbral texture feature set can be used. It might be possible to com-
pute the rhythmic and pitch features in real-time using only
short-time information but we have not explored this possibility.
IV. EVALUATION
In order to evaluate the proposed feature sets, standard sta-
tistical pattern recognition classifiers were trained using real-
world data collected from a variety of different sources.
A. Classification
For classification purposes, a number of standard statistical
pattern recognition (SPR) classifiers were used. The basic idea
behind SPR is to estimate the probability density function (pdf)
for the feature vectors of each class. In supervised learning a la-
beled training set is used to estimate the pdf for each class. In
the simple Gaussian (GS) classifier, each pdf is assumed to be
a multidimensional Gaussian distribution whose parameters are
estimated using the training set. In the Gaussian mixture model
(GMM) classifier, each class pdf is assumed to consist of a mix-
ture of a specific number of multidimensional Gaussian dis-
tributions. The iterative EM algorithm can be used to estimate
the parameters of each Gaussian component and the mixture
weights. In this work GMM classifiers with diagonal covariance
matrices are used and their initialization is performed using the
K-means algorithm with multiple random starting points. Fi-
nally, the K-nearest neighbor (K-NN) classifier is an example
Fig. 4. Audio classification hierarchy.
TABLE I
CLASSIFICATION ACCURACY MEAN AND STANDARD DEVIATION
of a nonparametric classifier where each sample is labeled ac-
cording to the majority of its nearest neighbors. That way, no
functional form for the pdf is assumed and it is approximated
locally using the training set. More information about statistical
pattern recognition can be found in [29].
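A hedged scikit-learn sketch of the three classifier families follows; this is not the original implementation, QDA serves as a stand-in for the single Gaussian per class, and the per-class mixture models reproduce the diagonal-covariance, K-means-initialized setup described above.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsClassifier

gs = QuadraticDiscriminantAnalysis()          # one Gaussian per class (GS)
knn = KNeighborsClassifier(n_neighbors=3)     # nonparametric K-NN classifier

def fit_gmm_per_class(X, y, n_components=3):
    """One diagonal-covariance mixture per class, K-means initialized (GMM)."""
    models = {}
    for label in np.unique(y):
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              init_params="kmeans", n_init=5)
        gmm.fit(X[y == label])
        models[label] = gmm
    return models

def predict_gmm(models, X):
    """Classify by the highest per-class log-likelihood."""
    labels = sorted(models)
    scores = np.column_stack([models[l].score_samples(X) for l in labels])
    return np.array(labels)[np.argmax(scores, axis=1)]
```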
B. Datasets
Fig. 4 shows the hierarchy of musical genres used for evalu-
ation augmented by a few (three) speech-related categories. In
addition, a music/speech classifier similar to [4] has been im-
plemented. For each of the 20 musical genres and three speech
genres, 100 representative excerpts were used for training. Each
excerpt was 30 s long resulting in (23 × 100 × 30 s ≈ 19 h)
of training audio data. To ensure variety of different recording
qualities the excerpts were taken from radio, compact disks, and
MP3 compressed audio files. The files were stored as 22 050 Hz,
16-bit, mono audio files. An effort was made to ensure that
the training sets are representative of the corresponding musical
genres. The Genres dataset has the following classes: classical,
country, disco, hiphop, jazz, rock, blues, reggae, pop, metal.
The classical dataset has the following classes: choir, orchestra,
piano, string quartet. The jazz dataset has the following classes:
bigband, cool, fusion, piano, quartet, swing.
C. Results
Table I shows the classification accuracy percentage results of
different classifiers and musical genre datasets. With the excep-
tion of the RT GS row, these results have been computed using a
single-vector to represent the whole audio file. The vector con-
sists of the timbral texture features [9 (FFT) + 10 (MFCC) =
19 dimensions], the rhythmic content features (6 dimensions),
Fig. 5. Classification accuracy percentages (RND = random, RT = real time, WF = whole file).
and the pitch content features (five dimensions) resulting in a
30-dimensional feature vector. In order to compute a single tim-
bral-texture vector for the whole file the mean feature vector
over the whole file is used.
The row RT GS shows classification accuracy percentage re-
sults for real-time classification per frame using only the tim-
bral texture feature set (19 dimensions). In this case, each file
is represented by a time series of feature vectors, one for each
analysis window. Frames from the same audio file are never split
between training and testing data in order to avoid false higher
accuracy due to the similarity of feature vectors from the same
file. A comparison of random classification, real-time features,
and whole-file features is shown in Fig. 5. The data for creating
this bar graph corresponds to the random, RT GS, and GMM(3)
rows of Table I.
The classification results are calculated using a ten-fold cross-
validation evaluation where the dataset to be evaluated is ran-
domly partitioned so that 10% is used for testing and 90% is
used for training. The process is iterated with different random
partitions and the results are averaged (for Table I, 100 iterations
were performed). This ensures that the calculated accuracy will
not be biased because of a particular partitioning of training and
testing. If the datasets are representative of the corresponding
musical genres then these results are also indicative of the clas-
sification performance with real-world unknown signals. The
± part shows the standard deviation of classification accuracy for
the iterations. The row labeled random corresponds to the clas-
sification accuracy of a chance guess.
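The evaluation protocol can be sketched as repeated random 90/10 splits; GroupShuffleSplit is used here so that, for the per-frame experiments, feature vectors from the same audio file never appear in both training and testing, and the names are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def evaluate(classifier, X, y, file_ids, iterations=100):
    splitter = GroupShuffleSplit(n_splits=iterations, test_size=0.1)
    accuracies = []
    for train_idx, test_idx in splitter.split(X, y, groups=file_ids):
        classifier.fit(X[train_idx], y[train_idx])
        accuracies.append(np.mean(classifier.predict(X[test_idx]) == y[test_idx]))
    return np.mean(accuracies), np.std(accuracies)   # reported as mean ± std
```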
The additional music/speech classification has 86% (random
would be 50%) accuracy and the speech classification (male,
female, sports announcing) has 74% (random 33%). Sports
announcing refers to any type of speech over a very noisy
background. The STFT-based feature set is used for the
music/speech classification and the MFCC-based feature set is
used for the speech classification.
1) Confusion Matrices: Table II shows more detailed infor-
mation about the musical genre classifier performance in the
form of a confusion matrix. In a confusion matrix, the columns
correspond to the actual genre and the rows to the predicted
genre. For example, the cell of row 5, column 1 with value 26
means that 26% of the classical music (column 1) was wrongly
TABLE II
GENRE CONFUSION MATRIX
TABLE III
JAZZ CONFUSION MATRIX
TABLE IV
CLASSICAL CONFUSION MATRIX
classified as jazz music (row 5). The percentages of correct clas-
sification lie in the diagonal of the confusion matrix. The confu-
sion matrix shows that the misclassifications of the system are
similar to what a human would do. For example, classical music
is misclassified as jazz music for pieces with strong rhythm from
composers like Leonard Bernstein and George Gershwin. Rock
music has the worst classification accuracy and is easily con-
fused with other genres which is expected because of its broad
nature.
Tables III and IV show the confusion matrices for the classical
and jazz genre datasets. In the classical genre dataset, orchestral
music is mostly misclassified as string quartet. As can be seen
from the confusion matrix (Table III), jazz genres are mostly
misclassified as fusion. This is due to the fact that fusion is a
broad category that exhibits large variability of feature values.
Jazz quartet seems to be a particularly difficult genre to correctly
classify using the proposed features (it is mostly misclassified
as cool and fusion).
2) Importance of Texture Window Size: Fig. 6 shows how
changing the size of the texture window affects the classification
performance. It can be seen that the use of a texture window
increases the classification accuracy significantly. The value of
zero analysis windows corresponds to directly using the features
computed from the analysis window. After approximately 40
analysis windows (1 s) subsequent increases in texture window
size do not improve classification as they do not provide any
additional statistical information. Based on this plot, the value of
Fig. 6. Effect of texture window size to classification accuracy.
TABLE V
INDIVIDUAL FEATURE SET IMPORTANCE
40 analysis windows was chosen as the texture window size. The
timbral-texture feature set (STFT and MFCC) for the whole file
and a single Gaussian classifier (GS) were used for the creation
of Fig. 6.
3) Importance of Individual Feature Sets: Table V shows
the individual importance of the proposed feature sets for the
task of automatic musical genre classification. As can be seen,
the nontimbral texture features, i.e., the pitch histogram features (PHF)
and beat histogram features (BHF), perform worse than the tim-
bral-texture features (STFT, MFCC) in all cases. However, in
all cases, the proposed feature sets perform better than random
classification and therefore provide some information about musical
genre, and hence about musical content in general. The last row of
Table V corresponds to the full combined feature set and the
first row corresponds to random classification. The number in
parentheses beside each feature set denotes the number of in-
dividual features for that particular feature set. The results of
Table V were calculated using a single Gaussian classifier (GS)
using the whole-file approach.
The classification accuracy of the combined feature set, in
some cases, is not significantly increased compared to the in-
dividual feature set classification accuracies. This fact does not
necessarily imply that the features are correlated or do not con-
tain useful information because it can be the case that a specific
file is correctly classified by two different feature sets that con-
tain different and uncorrelated feature information. In addition,
although certain individual features are correlated, the addition
of each specific feature improves classification accuracy. The
rhythmic and pitch content feature sets seem to play a less im-
portant role in the classical and jazz dataset classification com-
pared to the Genre dataset. This is an indication that it is possible
TABLE VI
BEST INDIVIDUAL FEATURES
that genre-specific feature sets need to be designed for more de-
tailed subgenre classification.
Table VI shows the best individual features for each feature
set. These are the sum of the beat histogram (BHF.SUM), the
period of the first peak of the folded pitch histogram (PHF.FP0),
the variance of the spectral centroid over the texture window
(STFT.FPO) and the mean of the first MFCC coefficient over
the texture window (MFCC.MMFCC1).
D. Human Performance for Genre Classification
The performance of humans in classifying musical genre has
been investigated in [30]. Using a ten-way forced-choice para-
digm, college students were able to accurately judge (53% cor-
rect) after listening to only 250-ms samples and (70% correct)
after listening to 3 s (chance would be 10%). Listening to more
than 3 s did not improve their performance. The subjects were
trained using representative samples from each genre. The ten
genres used in this study were: blues, country, classical, dance,
jazz, latin, pop, R&B, rap, and rock. Although direct compar-
ison of these results with the automatic musical genre classifica-
tion results is not possible due to different genres and datasets, it
is clear that the automatic performance is not far away from the
human performance. Moreover, these results indicate the fuzzy
nature of musical genre boundaries.
V. CONCLUSIONS AND FUTURE WORK
Despite the fuzzy nature of genre boundaries, musical genre
classification can be performed automatically with results sig-
nificantly better than chance, and performance comparable to
human genre classification. Three feature sets for representing
timbral texture, rhythmic content and pitch content of music
signals were proposed and evaluated using statistical pattern
recognition classifiers trained with large real-world audio
collections. Using the proposed feature sets, classification accuracies of
61% (non-real time) and 44% (real time) have been achieved on a
dataset consisting of ten musical genres. The success of the pro-
posed features for musical genre classification testifies to their
potential as the basis for other types of automatic techniques
for music signals such as similarity retrieval, segmentation and
audio thumbnailing which are based on extracting features to
describe musical content.
An obvious direction for future research is expanding the
genre hierarchy both in width and depth. Other semantic de-
scriptions such as emotion or voice style will be investigated
as possible classification categories. More exploration of the
pitch content feature set could possibly lead to better perfor-
mance. Alternative multiple pitch detection algorithms, for ex-
ample based on cochlear models, could be used to create the
pitch histograms. For the calculation of the beat histogram we
plan to explore other filterbank front-ends as well as onset based
periodicity detection as in [14] and [15]. We are also planning
to investigate real-time running versions of the rhythmic struc-
ture and harmonic content feature sets. Another interesting pos-
sibility is the extraction of similar features directly from MPEG
audio compressed data as in [31] and [32]. We are also plan-
ning to use the proposed feature sets with alternative classi-
fication and clustering methods such as artificial neural net-
works. Finally, we are planning to use the proposed feature set
for query-by-example similarity retrieval of music signals and
audio thumbnailing. By having separate feature sets to repre-
sent timbre, rhythm, and harmony, different types of similarity
retrieval are possible. Two other possible sources of informa-
tion about musical genre content are melody and singer voice.
Although melody extraction is a hard problem that is not solved
for general audio it might be possible to obtain some statistical
information even from imperfect melody extraction algorithms.
Singing voice extraction and analysis is another interesting di-
rection for future research.
The software used for this paper is available as part
of MARSYAS [33], a free software framework for rapid
development and evaluation of computer audition appli-
cations. The framework follows a client–server architec-
ture. The C++ server contains all the pattern recognition,
signal processing, and numerical computations and is con-
trolled by a client graphical user interface written in Java.
MARSYAS is available under the GNU Public License at
http://www.cs.princeton.edu/~gtzan/marsyas.html.
ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for
their careful reading of the paper and suggestions for improve-
ment. D. Turnbull helped with the implementation of the Genre-
Gram user interface and G. Tourtellot implemented the multiple
pitch analysis algorithm. Many thanks to G. Essl for discussions
and help with the beat histogram calculation.
REFERENCES
[1] F. Pachet and D. Cazaly, “A classification of musical genre,” in Proc.
RIAO Content-Based Multimedia Information Access Conf., Paris,
France, Mar. 2000.
[2] S. Davis and P. Mermelstein, “Experiments in syllable-based recognition
of continuous speech,” IEEE Trans. Acoust., Speech, Signal Processing,
vol. 28, pp. 357–366, Aug. 1980.
[3] J. Saunders, “Real time discrimination of broadcast speech/music,” in
Proc. Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), 1996,
pp. 993–996.
[4] E. Scheirer and M. Slaney, “Construction and evaluation of a robust
multifeature speech/music discriminator,” in Proc. Int. Conf. Acoustics,
Speech, Signal Processing (ICASSP), 1997, pp. 1331–1334.
[5] D. Kimber and L. Wilcox, “Acoustic segmentation for audio browsers,”
in Proc. Interface Conf., Sydney, Australia, July 1996.
[6] T. Zhang and J. Kuo, “Audio content analysis for online audiovisual data
segmentation and classification,” Trans. Speech Audio Processing, vol.
9, pp. 441–457, May 2001.
[7] A. L. Berenzweig and D. P. Ellis, “Locating singing voice segments
within musical signals,” in Proc. Int. Workshop on Applications of Signal
Processing to Audio and Acoustics (WASPAA) Mohonk, NY, 2001, pp.
119–123.
[8] E. Wold, T. Blum, D. Keislar, and J. Wheaton, “Content-based classifi-
cation, search, and retrieval of audio,” IEEE Multimedia, vol. 3, no. 2,
1996.
[9] J. Foote, “Content-based retrieval of music and audio,” Multimed.
Storage Archiv. Syst. II, pp. 138–147, 1997.
[10] G. Li and A. Khokar, “Content-based indexing and retrieval of audio
data using wavelets,” in Proc. Int. Conf. Multimedia Expo II, 2000, pp.
885–888.
[11] S. Li, “Content-based classification and retrieval of audio using the
nearest feature line method,” IEEE Trans. Speech Audio Processing,
vol. 8, pp. 619–625, Sept. 2000.
[12] E. Scheirer, “Tempo and beat analysis of acoustic musical signals,” J.
Acoust. Soc. Amer., vol. 103, no. 1, pp. 588–601, Jan. 1998.
[13] M. Goto and Y. Muraoka, “Music understanding at the beat level:
Real-time beat tracking of audio signals,” in Computational Auditory
Scene Analysis, D. Rosenthal and H. Okuno, Eds. Mahwah, NJ:
Lawrence Erlbaum, 1998, pp. 157–176.
[14] J. Laroche, “Estimating tempo, swing and beat locations in audio record-
ings,” in Proc. Int. Workshop on Applications of Signal Processing to
Audio and Acoustics WASPAA, Mohonk, NY, 2001, pp. 135–139.
[15] J. Seppänen, “Quantum grid analysis of musical signals,” in Proc. Int.
Workshop on Applications of Signal Processing to Audio and Acoustics
(WASPAA) Mohonk, NY, 2001, pp. 131–135.
[16] J. Foote and S. Uchihashi, “The beat spectrum: A new approach to
rhythmic analysis,” in Proc. Int. Conf. Multimedia Expo., 2001.
[17] G. Tzanetakis, G. Essl, and P. Cook, “Automatic musical genre classifi-
cation of audio signals,” in Proc. Int. Symp. Music Information Retrieval
(ISMIR), Oct. 2001.
[18] L. Rabiner and B. H. Juang, Fundamentals of Speech Recogni-
tion. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[19] B. Logan, “Mel frequency cepstral coefficients for music modeling,” in
Proc. Int. Symp. Music Information Retrieval (ISMIR), 2000.
[20] S. G. Mallat, A Wavelet Tour of Signal Processing. New York: Aca-
demic, 1999.
[21] I. Daubechies, “Orthonormal bases of compactly supported wavelets,”
Commun. Pure Appl. Math, vol. 41, pp. 909–996, 1988.
[22] T. Tolonen and M. Karjalainen, “A computationally efficient multip-
itch analysis model,” IEEE Trans. Speech Audio Processing, vol. 8, pp.
708–716, Nov. 2000.
[23] G. Tzanetakis, G. Essl, and P. Cook, “Audio analysis using the discrete
wavelet transform,” in Proc. Conf. Acoustics and Music Theory Appli-
cations, Sept. 2001.
[24] M. A. Bartsch and G. H. Wakefield, “To catch a chorus: Using chroma-
based representation for audio thumbnailing,” in Proc. Int. Workshop on
Applications of Signal Processing to Audio and Acoustics Mohonk,
NY, 2001, pp. 15–19.
[25] R. N. Shepard, “Circularity in judgments of relative pitch,” J. Acoust.
Soc. Amer., vol. 35, pp. 2346–2353, 1964.
[26] J. Pierce, “Consonance and scales,” in Music Cognition and Comput-
erized Sound, P. Cook, Ed. Cambridge, MA: MIT Press, 1999, pp.
167–185.
[27] J.-J. Aucouturier and M. Sandler, “Segmentation of musical signals
using hidden Markov models,” in Proc. 110th Audio Engineering
Society Convention, Amsterdam, The Netherlands, May 2001.
[28] G. Tzanetakis and P. Cook, “Multifeature audio segmentation for
browsing and annotation,” in Proc. Workshop Applications of Signal
Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 1999.
[29] R. Duda, P. Hart, and D. Stork, Pattern Classification. New York:
Wiley, 2000.
[30] D. Perrot and R. Gjerdigen, “Scanning the dial: An exploration of fac-
tors in identification of musical style,” in Proc. Soc. Music Perception
Cognition, 1999, p. 88, (abstract).
[31] D. Pye, “Content-based methods for the management of digital music,”
in Proc. Int. Conf Acoustics, Speech, Signal Processing (ICASSP), 2000.
[32] G. Tzanetakis and P. Cook, “Sound analysis using MPEG compressed
audio,” in Proc. Int. Conf. Acoustics, Speech, Signal Processing
(ICASSP), Istanbul, Turkey, 2000.
[33] G. Tzanetakis and P. Cook, “Marsyas: A framework for audio analysis,” Organized Sound,
vol. 4, no. 3, 2000.
George Tzanetakis (S’98) received the B.Sc. degree
in computer science from the University of Crete,
Greece, and the M.A. degree in computer science
from Princeton University, Princeton, NJ, where he
is currently pursuing the Ph.D. degree.
His research interests are in the areas of signal
processing, machine learning, and graphical user
interfaces for audio content analysis with emphasis
on music information retrieval.
Perry Cook (S’84–M’90) received the B.A. degree
in music from the University of Missouri at Kansas
City (UMKC) Conservatory of Music, the B.S.E.E.
degree from UMKC Engineering School, and the
M.S. and Ph.D. degrees in electrical engineering
from Stanford University, Stanford, CA.
He is Associate Professor of computer science,
with a joint appointment in music, at Princeton
University, Princeton, NJ. He served as Technical
Director for Stanford’s Center for Computer
Research in Music and Acoustics and has consulted
and worked in the areas of DSP, image compression, music synthesis, and
speech processing for NeXT, Media Vision, and other companies. His research
interests include physically based sound synthesis, human–computer interfaces
for the control of sound, audio analysis, auditory display, and immersive sound
environments.
... This study listed the music features used to describe a music piece and how it affects its processing. Other studies the properties of music genres and proposes a set of features to represent texture, rhythm structure, form and strength to use in the proposed genre identification algorithm and statistical pattern recognition classifier [8]. ...
... Various approaches have been employed to classify Western music genres for music identification and classification, such as the Naïve-Bayes approach [8], Decision Trees [9], Support Vector Machines (SVMs) [10], Nearest-Neighbour (NN) classifiers [11], Gaussian Mixture Models [12], Linear Discriminant Analysis (LDA) [13], Hidden Markov Models (HMM) [14,15], Multi-layer Perceptron Neural Nets [16], and self-organising maps neural networks [17]. Also, combinations of the different algorithms were attempted to classify musical instruments, such as Gaussian Mixture Models and support vector machines [18]. ...
Article
Music Information Retrieval (MIR) is one data science application crucial for different tasks such as recommendation systems, genre identification, fingerprinting, and novelty assessment. Different Machine Learning techniques are utilised to analyse digital music records, such as clustering, classification, similarity scoring, and identifying various properties for the different tasks. Music is represented digitally using diverse transformations and is clustered and classified successfully for Western Music. However, Eastern Music poses a challenge, and some techniques have achieved success in clustering and classifying Turkish and Persian Music. This research presents an evaluation of machine learning algorithms' performance on pre-labelled Arabic Music with their Arabic genre (Maqam). The study introduced new data representations of the Arabic music dataset and identified the most suitable machine-learning methods and future enhancements.
... To address the lack of adult speech and music in these datasets while fine-tuning our pretrained model, we used speech and music data from a small LibriSpeech corpus (libriTTS) [23] and GTZAN [24] to balance the training/finetuning dataset. We collected 800 adult speech and 800 music samples for fine-tuning; each sample was 4 seconds long. ...
Preprint
Full-text available
Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or young children in the environment or have data collected from only a single family where noise from the fixed sound source can be moderate at the infant's position or vice versa. Thus, despite the recent success of large pre-trained models for noise event detection, the performance of these models on infant-centric noise soundscapes in the home is yet to be explored. To bridge this gap, we have collected and labeled noises in home soundscapes from 22 families in an unobtrusive manner, where the data are collected through an infant-worn recording device. In this paper, we explore the performance of a large pre-trained model (Audio Spectrogram Transformer [AST]) on our noise-conditioned infant-centric environmental data as well as publicly available home environmental datasets. Utilizing different training strategies such as resampling, utilizing public datasets, mixing public and infant-centric training sets, and data augmentation using noise and masking, we evaluate the performance of a large pre-trained model on sparse and imbalanced infant-centric data. Our results show that fine-tuning the large pre-trained model by combining our collected dataset with public datasets increases the F1-score from 0.11 (public datasets) and 0.76 (collected datasets) to 0.84 (combined datasets) and Cohen's Kappa from 0.013 (public datasets) and 0.77 (collected datasets) to 0.83 (combined datasets) compared to only training with public or collected datasets, respectively.
... Machine learning models have also been applied in music classification tasks, where models like CNNs and SVMs are utilized to categorize music into genres, moods, or other attributes based on audio features extracted from the music [11][12][13][14][15][16][17][18]. Furthermore, models such as Deep Belief Networks (DBNs) and Recurrent Neural Networks (RNNs) have been employed in music transcription, where the goal is to convert audio signals into musical notation [19][20][21][22][23][24]. ...
Article
Full-text available
The combination of machine learning with music composition and production is proving viable for innovative applications, enabling the creation of novel musical experiences that were once the exclusive domain of human composers. This paper explores the transformative role of machine learning in music, particularly focusing on emotion-based music generation and style modeling. Through the development and application of models including DNNs, GANs, and Autoencoders, this study delves into how machine learning is being harnessed not only to generate music that embodies specific emotional contexts but also to transfer distinct musical styles onto new compositions. This research discusses the principles of these models and their operational mechanisms, and evaluates their effectiveness through metrics such as accuracy, precision, and creative authenticity. The outcomes illustrate that these technologies not only enhance the creative possibilities in music but also democratize music production, making it more accessible to non-experts. The implications of these advancements suggest a significant shift in the music industry, where machine learning could become a central component of creative processes. These results pave a path toward understanding the potential and limitations of machine learning in music and forecast future trends in this evolving landscape.
... By using voice interactions, digital assistants are changing how various tasks can be performed, including personal information management tasks (e.g., creating and reviewing calendar entries with verbal commands). A voice interface for music retrieval typically allows the user to query a database with artist names, song titles, genres, or keywords (Bainbridge et al., 2003), and might further add data like the user's playback history to choose the correct action or optimise the results (Tzanetakis & Cook, 2002). But it is not yet clear what the outcome of these changes will be nor whether they genuinely address user needs (Khaokaew et al., 2022). ...
Article
Full-text available
Introduction. Music streaming services have changed how music is played and perceived, but also how it is managed by individuals. Voice interfaces to such services are becoming increasingly common, for example through voice assistants on mobile and smart devices, and have the potential to further change personal music management by introducing new beneficial features and new challenges. Method. To explore the implications of voice assistants for personal music listening and management we surveyed 248 participants online and in a lab setting to investigate (a) in which situations people use voice assistants to play music, (b) how the situations compare to established activities common during non-voice assistant music listening, and (c) what kinds of commands they use. Analysis. We categorised 653 situations of voice assistant use, which reflect differences to non-voice assistant music listening, and established 11 command types, which mostly reflect finding or refinding activities but also indicate keeping and organisation activities. Results. Voice assistants have some benefits for music listening and personal music management, but also a notable lack of support for traditional personal information management activities, like browsing, that are common when managing music. Conclusion. Having characterised the use of voice assistants to play music, we consider their role in personal music management and make suggestions for improved design and future research.
... Boashash [17] discusses the problem of estimating the instantaneous frequency of a signal, which is a crucial component of spectrograms. Tzanetakis and Cook [18] discuss classifying musical genres from audio signals, a typical application of spectrograms. ...
Article
Full-text available
This paper presents a model for sound classification in construction that leverages a unique combination of Mel spectrograms and Mel-Frequency Cepstral Coefficient (MFCC) values. The model combines deep neural networks such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks to create CNN-LSTM and MFCCs-LSTM architectures, enabling the extraction of spectral and temporal features from audio data. The audio data, generated from construction activities in a real-time closed environment, is used to evaluate the proposed model and results in an overall Precision, Recall, and F1-score of 91%, 89%, and 91%, respectively. This performance surpasses other established models, including Deep Neural Networks (DNN), CNN, and Recurrent Neural Networks (RNN), as well as combinations of these models such as CNN-DNN, CNN-RNN, and CNN-LSTM. These results underscore the potential of combining Mel spectrograms and MFCC values to provide a more informative representation of sound data, thereby enhancing sound classification in noisy environments.
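To make the feature pipeline concrete, here is a sketch in the spirit of the combination described above: log-Mel spectrogram and MFCC extraction with librosa, feeding a small CNN-LSTM written in PyTorch. The file name, layer sizes, and class count are illustrative assumptions, not the authors' configuration.

import librosa
import torch
import torch.nn as nn

# Spectral features from a (hypothetical) recording of a construction activity.
y, sr = librosa.load("hammering.wav", sr=22050)                  # placeholder file
logmel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

class CNNLSTM(nn.Module):
    """Convolutional front-end over the (freq, time) plane, LSTM over time."""
    def __init__(self, n_mels=64, n_classes=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                             # halves both axes
        )
        self.lstm = nn.LSTM(16 * (n_mels // 2), 64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                        # x: (batch, 1, n_mels, time)
        z = self.conv(x)                         # (batch, 16, n_mels//2, time//2)
        z = z.permute(0, 3, 1, 2).flatten(2)     # (batch, time//2, 16 * n_mels//2)
        out, _ = self.lstm(z)
        return self.fc(out[:, -1])               # classify from the last time step

model = CNNLSTM()
x = torch.tensor(logmel, dtype=torch.float32)[None, None]        # (1, 1, 64, time)
logits = model(x)

The convolutional stage summarises local spectral patterns, while the LSTM models how they evolve over time, which is the division between spectral and temporal features the abstract describes.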
Article
Full-text available
In this paper, we present a segmentation algorithm for acoustic musical signals, using a hidden Markov model. Through unsupervised learning, we discover regions in the music that present steady statistical properties: textures. We investigate different front-ends for the system, and compare their performances. We then show that the obtained segmentation often translates a structure explained by musicology: chorus and verse, different instrumental sections, etc. Finally, we discuss the necessity of the HMM and conclude that an efficient segmentation of music is more than a static clustering and should make use of the dynamics of the data.
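A compact sketch of this kind of unsupervised texture segmentation, using MFCC frames as one possible front-end and hmmlearn's GaussianHMM; the file name, number of states, and hop size are illustrative choices rather than those of the paper.

import librosa
import numpy as np
from hmmlearn import hmm

# Observation sequence: one MFCC vector per analysis frame.
y, sr = librosa.load("song.wav", sr=22050)                              # placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=512).T    # (frames, 13)

# Fit the HMM by unsupervised EM; each hidden state models one "texture".
model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=100)
model.fit(mfcc)
states = model.predict(mfcc)          # one texture label per frame

# State changes are candidate section boundaries (chorus/verse, new instrumentation),
# reported here in seconds.
changes = np.flatnonzero(np.diff(states)) + 1
print(librosa.frames_to_time(changes, sr=sr, hop_length=512))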
Conference Paper
Full-text available
The Discrete Wavelet Transform (DWT) is a transformation that can be used to analyze the temporal and spectral properties of non-stationary signals like audio. In this paper we describe some applications of the DWT to the problem of extracting information from non-speech audio. More specifically, automatic classification of various types of audio using the DWT is described and compared with other traditional feature extractors proposed in the literature. In addition, a technique for detecting the beat attributes of music is presented. Both synthetic and real-world stimuli were used to evaluate the performance of the beat detection algorithm.
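The sketch below, using PyWavelets, illustrates the two ideas: DWT subband statistics as classification features and a crude periodicity estimate from the envelope of a coarse subband for beat analysis. The wavelet, decomposition depth, and chosen statistics are assumptions for illustration and are not the exact features of the paper.

import numpy as np
import pywt

def dwt_features(frame, wavelet="db4", levels=5):
    """Mean absolute value and standard deviation of each DWT subband."""
    feats = []
    for c in pywt.wavedec(frame, wavelet, level=levels):
        feats.extend([np.mean(np.abs(c)), np.std(c)])
    return np.array(feats)

def beat_autocorrelation(signal, wavelet="db4", levels=5):
    """Rough beat-strength curve: autocorrelation of the rectified, mean-removed
    envelope of the coarsest approximation subband."""
    approx = pywt.wavedec(signal, wavelet, level=levels)[0]
    env = np.abs(approx) - np.mean(np.abs(approx))
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    return ac / (ac[0] + 1e-12)   # peaks away from lag 0 suggest periodicities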
Article
A digital computer was used to synthesize a scale of tones with fundamentals 1/8 oct apart; each tone had nonharmonic partials separated by 1/8 oct or multiples thereof. When sounded together, two tones separated by an even number of 1/8-oct intervals were more consonant than two tones separated by an odd number of 1/8-oct intervals. The scale synthesized is one example of many possible unconventional scales that can exhibit consonance and dissonance.
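A short NumPy sketch of the kind of stimuli described above: tones whose partials lie on a 1/8-octave grid above the fundamental. Only the 1/8-octave spacing comes from the abstract; the partial recipe, fundamental, sample rate, and duration are hypothetical.

import numpy as np

SR = 44100  # sample rate (Hz), assumed

def grid_tone(f0, partial_steps, dur=1.0):
    """Tone whose partials sit at f0 * 2**(k/8) for each k in partial_steps,
    i.e. on a 1/8-octave grid above the fundamental."""
    t = np.arange(int(SR * dur)) / SR
    partials = [np.sin(2 * np.pi * f0 * 2 ** (k / 8) * t) for k in partial_steps]
    return np.sum(partials, axis=0) / len(partials)

recipe = (0, 8, 13, 16, 19, 22)   # hypothetical nonharmonic partial recipe
f0 = 220.0
dyad_even = grid_tone(f0, recipe) + grid_tone(f0 * 2 ** (4 / 8), recipe)  # 4 grid steps apart
dyad_odd = grid_tone(f0, recipe) + grid_tone(f0 * 2 ** (5 / 8), recipe)   # 5 grid steps apart
# Per the abstract, the even-step dyad is heard as more consonant than the odd-step one.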