Closed-set Speaker Identification Using VQ and
GMM Based Models
Bidhan Barai · Tapas Chakraborty · Nibaran Das · Subhadip Basu · Mita Nasipuri
Abstract An array of features and methods have been developed over the past six decades for Speaker Identification (SI) and Speaker Verification (SV), jointly known as Speaker Recognition (SR). Mel Frequency Cepstral Coefficients (MFCC) are generally used as feature vectors in most cases because they give higher accuracy than other features. The present paper focuses on a comparative study of state-of-the-art SR techniques along with their design challenges, robustness issues and performance evaluation methods. Rigorous experiments have been performed using the Gaussian Mixture Model (GMM) with variations such as the Universal Background Model (UBM), Vector Quantization (VQ) and VQ based UBM-GMM (VQ-UBM-GMM), with detailed discussion. Other popular methods, namely Linear Discriminant Analysis (LDA), Probabilistic LDA (PLDA), Gaussian PLDA (GPLDA), Multi-condition GPLDA (MGPLDA) and the Identity Vector (i-vector), are included for comparison only. Three popular audio data-sets have been used in the experiments, namely IITG-MV SR, Hyke-2011 and ELSDSR. Hyke-2011 and ELSDSR contain clean speech, while IITG-MV SR contains noisy audio data with variations in channel (device), environment and spoken style. We propose a new data mixing approach for SR to make the system independent of recording device, spoken style and environment. The accuracies obtained with VQ and GMM based methods on the Hyke-2011 and ELSDSR databases vary from 99.6% to 100%, whereas the accuracy for IITG-MV SR is up to 98%. Indeed, in some cases the accuracies degrade drastically due to mismatch between training and testing data as well as the singularity problem of the GMM. The experimental results serve as a benchmark for VQ/GMM/UBM based methods for the IITG-MV SR database.

Jadavpur University
Department of Computer Science & Engineering
E-mail: bidhanbarai.rs@jadavpuruniversity.in
E-mail: ju.tapas@gmail.com
E-mail: nibaranju@gmail.com
E-mail: subhadip.basu@jadavpuruniversity.in
E-mail: mitanasipuri@gmail.com
Keywords MFCC · VQ · GMM · i-Vector · PLDA
1 Introduction
SR is a branch of biometric recognition in which the speaker specific psycho-physiological characteristics of the speech waveform are analysed to uniquely recognise an individual speaker from his/her voice signal [1,2]. These characteristics include both vocal tract characteristics (spectral features) and voice source characteristics (supra-segmental features) of speech. Features are the attributes by which individual entities (speakers) are identified uniquely. A set of features together forms a feature vector (generally, a vector is an array of numbers). The process (or steps) of computing feature vector(s) is known as feature extraction. SR is an example of a typical Pattern Recognition (PR) problem.
Any conventional PR method consists of two basic steps, Feature Extrac-
tion/Selection and Modelling/Classification [3,4]. In SR, speaker specific fea-
tures are extracted first from each of the voice signals available in the database,
and then a model is built for each class (for SR, each class represents a speaker)
in the database. This process is known as Training/Enrolment. When the voice
sample of an unknown speaker is available for SR, the same set of features is extracted in a similar manner. This set of features of the unknown speaker (test data) is then compared with every model of the known voice samples (for identification), and a statistical distance or a score for the voice sample of the unknown speaker is computed with respect to all the known speakers' models. The minimum distance or maximum score (any one measure or a combination of measures) identifies (classifies) the unknown speaker as the speaker corresponding to that model. This process is known as Testing. In this step we use all the speaker models for classification. For example, in GMM based SR using the MFCC feature, we first compute MFCC feature vectors (13 MFC coefficients) from the speech signals of all the speakers to train the GMMs of all speakers (this is done from the training data); next, from the test speech signal the MFCCs are extracted in a similar fashion to compute scores (or similarity measures) with respect to every enrolled speaker (or trained speaker model) for identification purposes.
In a typical SR experiment, each enrolled speaker provides a single score, and the optimum score determines the classified speaker (optimum because, if we use a distance measure, the minimum distance gives the classified speaker, whereas if we use a probability measure, the maximum probability gives the classified speaker). Indeed, SR using GMM was introduced before 1992, and later many modifications were made to this approach. In this paper, we study
the model based SR using short-time spectral features with the help of the ap-
proaches mentioned above and also provide some features and methods from
speech recognition which are as well useful for SR because SR and speech
recognition share some common characteristics, features and methodologies.
SR using Super Vector is an example of modification of GMM where a super
vector is formed by concatenating the means of GMM. Here each speaker is
represented by a high dimensional super vector, called high level feature, rather
than a set of MFCC vectors.
1.1 Classification of SR
SR is classified into three groups, namely (a) Speaker Identification (SI) and Speaker Verification (SV), (b) Text-dependent and Text-independent, and (c) Closed-set and Open-set. SI is the type of SR where we are required to determine the identity of an unknown speaker, i.e., which speaker among the enrolled speakers is speaking. In contrast to SI, SV is the task of authenticating an unknown speaker's identity, i.e., we are required to verify whether the claim of an unknown speaker will be accepted or rejected by the SR system. Among these two types of SR, SV is the more popular one because of
its application in access control and security. In text-dependent SR the text
(content of speech) is fixed (or same) for training and testing speech data
whereas in text-independent SR the text of training and testing speaker is
not fixed (or the same) [13]. Finally, in closed-set SR it is known that the unknown speaker is one of the enrolled speakers, but we do not know which one, whereas in open-set SR the unknown speaker may or may not be present among the enrolled speakers. Among these types of SR, open-set text-independent SI (OSTI-SI) is known to be the most challenging class. In OSTI-SI, the score of the unknown speaker is compared with the scores of all the enrolled speakers using a decision function to determine 1) whether the unknown speaker is one of the enrolled speakers, and 2) if yes, which of the enrolled speakers he/she is. Tasks 1) and 2) are accomplished simultaneously by adding a complementary model, which is built from the speech data of all speakers except the enrolled speakers. This also means that an OSTI-SI problem can be solved only if we have a comparatively large amount of speech data from speakers outside the enrolled set. The nature of the OSTI-SI problem is quite different from SV; indeed, SV is always an open-set SR problem [5,13].
1.2 Challenges in SR
SR is expanding day by day with a broad range of applications, yet deploying an SR system of high accuracy for real time applications is still challenging. The performance of an SR system degrades considerably due to mismatches between training and testing conditions. The factors that play crucial roles in a high performance SR system are discussed as follows [2]:
Noisy Environment: Speech signals acquired from different speakers for designing an SR system may be contaminated with various types of noise, namely convolutional noise, additive noise, reverberation noise (speech containing echoes), random noise, impulse noise, white noise and so on. The details of these noises are found in [6,7,8,9,13].
Environmental Mismatch: It is extremely difficult to accumulate the speech
signal from the different speakers in the same environment for training and
testing. Accuracy of SR system is highly dependent upon the mismatch be-
tween training and testing environment [4,13].
Channel Mismatch: The recognition accuracy of SR degrades drastically when there is a recording device (also known as channel) mismatch between training and testing data, as will be observed in Section 5 of the present paper.
Spoken Style Mismatch: Spoken style is also a very important issue in de-
signing an SR system because it has significant effect on the performance of
the SR system. In the experiments we have used two spoken styles: reading and conversation. In the IITG-MV SR database, in the case of conversation two speakers speak with each other; the recorded speech signal is then processed to separate the speakers, and the individual speaker's segments are combined to create the complete speech of each of the two speakers [10]. We shall see that the performance of an SR system degrades significantly if there is a mismatch of spoken style between training and testing utterances.
Language of Utterance Mismatch: A language mismatch between training and testing data also has a significant effect on the recognition accuracy of SR, but it does not affect accuracy as greatly as device and environment mismatch [2].
Short Utterance: Acquiring speech signals of sufficient duration for training and testing is very difficult when designing an SR system, so sometimes we are bound to design a system with limited data (3 to 5 seconds or less). The length of the speech, or duration of the utterance, plays an important role in SR: a very short utterance degrades the recognition accuracy considerably. Mandasari et al. [11] examined the effect of short utterances
and proposed a calibration strategy to model the calibration parameters using
Quality Measure Functions (QMFs) for SR to improve the recognition accu-
racy [16].
Long Utterance: If the available data for the design is very large, then the data must be reduced using data reduction techniques, which may lead to the loss of significant information. The accumulation and annotation of a large amount of data are also very difficult.
The SR field has a long research history of about six decades, but due to the above difficulties (challenges) performance still degrades, and SR research continues to motivate researchers around the world to increase the performance of SR systems. SR researchers have developed an array of features, feature extraction techniques, methodologies and scoring techniques to combat the above difficulties. These methods and techniques involve several steps, which lead to an increase in the number of system components. Hence, SR systems are becoming more complex as the research progresses, which leads to other difficulties such as the time complexity of the SR system. One example of such a technique is the i-vector [12].
The advancement of Deep Learning (DL) techniques [29] has had an immense impact on SR research; in particular, they improve the recognition performance of SR systems. Deep learning techniques such as the Convolutional Neural Network (CNN) [32,29] require a large volume of balanced data to train a model properly; otherwise the approaches may not perform well. Generally, the spectrogram [17], which is generated from the speech signal, is used as input to a CNN. In [17], Chakraborty et al. used CNN methods for SR on the IITG-MV SR database along with others like VQ and GMM/UBM-GMM. On that data-set, VQ and GMM/UBM-GMM performed better than the CNN because some speakers in the database have very short utterances (such as 22 seconds) compared to other speakers (the average speech duration is about 5 minutes), which makes the data unbalanced; short utterances can decrease the performance of both GMM/UBM-GMM and CNN based SR, but CNNs are more susceptible to short utterances than GMM/UBM-GMM [18]. Hence, we restrict our study to VQ and GMM based methods; although, very recently, to address the problem of short utterances researchers have proposed different CNN and hybrid CNN based models [18,19,25,29], we have not discussed the impact of these CNN based models on the IITG-MV SR database in the present paper.
Another aspect of this paper is that, although a rich number of references is given, the paper has been written in such a way that a reader can easily implement an SR system using the MFCC feature and a VQ-UBM-GMM based classifier. We can observe that there is a broad range of application areas of SR. Some examples of applications of SR are authentication, forensic SR [22], multi-speaker tracking [23], singer identification [14], security and surveillance, personalized user interfaces and access control [24].
1.3 Contributions
In the previous sub-section, we have listed the major research challenges in-
volved in text-independent closed-set SR. In this paper, however, we have
attempted to address some of those aspects specifically related to, 1) record-
ing device independence, 2) spoken style independence, 3) environment/session
independence. Major contributions of this paper may be highlighted as follows:
– A comprehensive survey of different text-independent closed-set SR techniques has been presented.
– Extensive SR experimentation is performed over various GMM based classifiers.
– The IITG-MV SR database is used to evaluate benchmark performances.
– The singularity problem of the GMM has been studied.
– We propose a new data mixing approach for SR and examine its accuracy under various mismatch/dependent cases, involving variability in recording devices, spoken styles, and environments/sessions.
The rest of the paper is organized in the following way. Section 2 is dedicated to the overall system design with state-of-the-art SR and the evaluation strategies to measure the performance of SR systems. Modelling and classification are not described in separate subsections because classification depends on the modelling method; hence modelling and classification are described in a single subsection 2.2. In section 3 we describe various state-of-the-art methods to combat difficulties in designing an SR system. The conceptual comparison of features and methods is presented in section 4. In section 5 the performances, with diagnostic analysis, of some existing SR systems are reported along with our experiments on three databases, namely the Hyke-2011 [20], ELSDSR [21] and IITG-MV SR [10] databases. Finally we conclude the presented paper.
2 Overall Design and State-of-the-art SR
SR is basically a PR problem, and every PR problem has two basic steps: feature extraction/selection and modelling/classification. In the training phase,
speaker specific information is extracted using various digital signal processing
(DSP) techniques and algorithms from speech signal of every speaker. Use of
DSP tools helps to transform each speaker’s speech data (i.e. feature vector or
set of feature vectors) in such a way that each speaker is uniquely identifiable.
Then a model (for example GMM) is built over the feature vectors in feature
space. Now, every speaker is represented by the speaker model which is char-
acterized by model parameters. In the testing phase, the feature vectors are
extracted in the similar fashion from the test/unknown speaker’s speech sig-
nal for classification. During classification, speaker specific models were built
initially (training). Then feature vectors of unknown speaker was compared
(using score or similarity measure) with respect to all speakers for decision.
In SI, we make the decision that who is the unknown speaker among all the
enrolled speakers and in SV, we decide whether the test/unknown speaker who
claimed her/his identity is accepted or rejected by the SR system.
2.1 State-of-the-art Feature Description
In feature extraction step the speaker’s raw data (speech waveform) is mapped
into a measurable space, known as feature space with the help of Digital Signal
Processing (DSP) techniques, where every speaker is uniquely distinguishable
from each other. Mathematically, feature extraction is a mapping $f : \mathbb{R} \mapsto \mathbb{R}^D$, i.e., $\mathcal{Y} = f(\mathcal{X})$, by which raw data $X$ in a space $\mathcal{X}$ is transformed into the $D$-dimensional feature space $\mathcal{Y}$ [26], producing a feature vector or a set of feature vectors $Y$ in the feature space. Here, $f$ can be viewed as the process of obtaining a feature vector $\mathbf{y} : \mathbf{y} \in \mathcal{Y}$. Indeed, a speaker in a feature space is
represented by a single feature vector or by a set of feature vectors. Short-Time
Signal Analysis (STSA)/Short-Time Fourier Transform(STFT) of segmented
speech waveform is the most popular and powerful DSP tool used to extract
features for SR. Most acoustic features used for SR are extracted with the help of the STSA technique.

Fig. 1: Block Diagram of Speaker Identification (SI)

Fig. 2: Block Diagram of Speaker Verification (SV)

Examples of such features, other than MFCC and
Gammatone Filter Cepstral Coefficients(GFCC) are Linear frequency cepstral
coefficient(LFCC) [27,28], Linear Predictive Cepstral Coefficients(LPCC) [30,
28], Perceptual Linear Predictive Cepstral Coefficients(PLPCC). Among these
features MFCC, LFCC, GFCC [31,33] are based on filterbank analysis and
LPCC, PLPCC are based on LP analysis [34,35,36], however, they all are the
types of STFT that use Fast Fourier Transform (FFT). In STFT the complete
speech signal of a speaker (say a 10 second long speech) is segmented into small frames (say 25 millisecond long speech frames).

Fig. 3: Complete Block Diagram of MFCC Computation

Generally, each frame provides
a feature vector in feature space. Hence we shall get a set of feature vectors
from the complete speech signal. Among state-of-the-art features, MFCC is the
most popular and useful feature for SI. A complete description of the computation of the MFCC feature is found in [2,3,4]. Here, we provide a brief description of the MFCC vector computation as a block diagram in Fig. 3.
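As an illustration of the pipeline in Fig. 3, a minimal sketch of frame-wise MFCC extraction is given below, assuming the librosa library is available; the file name, frame sizes and 13-coefficient setting are illustrative choices rather than values prescribed by this paper.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Return a (T, n_mfcc) array of frame-wise MFCC vectors for one utterance."""
    # Load the waveform and resample it to a common sampling rate.
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms frames with a 10 ms hop, 26 mel filters, 13 cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr),
                                n_mels=26)
    # librosa returns (n_mfcc, T); transpose so that each row is one frame.
    return mfcc.T

# Example: X has one 13-dimensional row vector per 25 ms frame.
# X = extract_mfcc("speaker01_utt01.wav")
```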
Generally, in SI and SV, after feature extraction every speaker is represented by a sequence (or set) of $D$-dimensional feature vectors $X = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T)$, where $T$ is the total number of feature vectors and each $\mathbf{x}_\tau$, $\tau \in \{1, 2, \ldots, T\}$, is represented by a $D$-dimensional column vector to meet conventional mathematical notation. Indeed, after feature extraction we get a $D$-dimensional row vector from every 25 ms frame (an array of real numbers), and for mathematical notation we consider every row vector as a column vector. It is important to note that the computation of GFCC is similar to MFCC; the only difference is that in GFCC a Gammatone filter bank is used rather than the mel filter bank. The equation of a gammatone filter in the time domain is as follows:
$$g_m(t) = a\,t^{\,n-1} e^{-2\pi b_m t} \cos(2\pi f_c t + \phi) \qquad (1)$$

where $a$ is a constant (usually $a = 1$), $n$ is the filter order (usually $n = 4$), $\phi$ is the phase shift, $f_c$ is the centre frequency and $b_m$ is the attenuation factor of the $m$th filter. The complete description of GFCC computation is found in [46,47,48,49].
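As a reference for eqn. (1), the gammatone impulse response can be sampled directly with a few lines of NumPy. This is only a sketch of the filter shape under assumed parameter values ($a = 1$, $n = 4$, and an illustrative centre frequency and bandwidth), not a full GFCC front end.

```python
import numpy as np

def gammatone_ir(fc, b, fs=16000, dur=0.025, a=1.0, n=4, phi=0.0):
    """Sample the gammatone impulse response g_m(t) of eqn. (1)."""
    t = np.arange(int(dur * fs)) / fs
    return a * t**(n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)

# Example: a filter centred at 1 kHz with an assumed bandwidth parameter of 125 Hz.
# g = gammatone_ir(fc=1000.0, b=125.0)
```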
In SR, the feature extraction process converts the complete speech waveform into a set of feature vectors which can distinguish different speakers. The LFCC, GFCC and MFCC are computed directly from the Fast Fourier Transform (FFT) power spectrum, but LPCC and PLPCC are obtained using an all-pole model to represent the smooth spectrum. The above STSA based features use only the magnitude spectrum (the power spectrum is the squared magnitude spectrum) and the phase spectrum is discarded. Indeed, phase spectrum based features, like the Group Delay Function (GDF), Modified GDF (MODGDF) [37], Instantaneous Frequency (IF) [38,39] and Instantaneous Frequency Deviation (IFD) [40], are often used to extract complementary speaker information [41,42,43]. The extraction of a complementary feature set is given in [30]. Indeed, MFCC features are used not only in SR but also in speaker gender and age classification, speech recognition and language recognition [44,45].
2.2 Modelling, State-of-the-art Models and Classification
In 2.1 we described how the raw speech waveform of a speaker is transformed into the measurable MFCC feature space of dimension $D$. In this space every training speaker is represented by a set of feature vectors $X$, known as a template (voice print). For $S$ speakers, we get $S$ sets $X_{train}(\zeta)$ where $\zeta = 1, 2, \ldots, S$. If an unknown speaker's speech signal is presented to the SR system for recognition, the system computes the feature set $X_{test}$, i.e., the set of feature vectors of the unknown speaker. Indeed, in this paper we are interested in closed-set SR. The recognition can be carried out by a template matching strategy, where $X_{test}$ is compared with $X_{train}(\zeta)$ for $\zeta = 1, 2, \ldots, S$ and some distance is calculated using frame by frame (vector by vector) comparison for each $\zeta$ without using speaker modelling; the minimum of the $S$ numerical distances (or the maximum of the $S$ numerical scores) corresponding to an enrolled speaker's feature set gives the class of the unknown speaker. But if the sizes of the sets $X_{train}(\zeta)$ and $X_{test}$ are large, this strategy becomes cumbersome and the SR system cannot be used in real time. This problem is solved using statistical modelling, statistical distances and scoring techniques. VQ and GMM are examples of such approaches.
2.2.1 VQ Based SR
In VQ the vectors in the feature space $\mathcal{X}$ for a training speaker are grouped into $K$ distinct regions, where $K \ll T$, and a reconstruction vector is defined to represent each region. Therefore we get a set of $K$ reconstruction vectors. This collection of reconstruction vectors $\mathcal{K}_c$ is known as the codebook. VQ is basically a data condensation technique [26]. Mathematically speaking, in VQ for the $\zeta$th speaker the complete training data set (MFCCs for training) $X_{train}(\zeta)$ is mapped/grouped into $K$ clusters, where each cluster is represented by its centroid and the set of $K$ centroids is the codebook $\mathcal{K}_c(\zeta)$. For identification, the MFCCs of the unknown speaker (testing data set) $X_{test}$ are compared with $\mathcal{K}_c(\zeta)$ of all speakers, $\zeta \in [1, S]$, to compute distances (for example, Manhattan, Euclidean, etc.) or scores (for example, reciprocals of distances). The minimum distance or maximum score provides the identified speaker.
Let us assume the feature space $\mathcal{X}$ of a training speaker contains a total of $T$ feature vectors $X_{train} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}$. A vector quantizer which provides minimum distortion is known as a Voronoi or Nearest-Neighbour (NN) quantizer. There are plenty of algorithms, like Linde-Buzo-Gray (LBG) [50], Self-Organizing Map (SOM) [51] and Principal Component Analysis (PCA) based LBG [52], available to compute the codebook efficiently. The LBG algorithm is very similar to the k-means clustering algorithm in that it takes a set of vectors $X_{train} = \{\mathbf{x}_i \in \mathbb{R}^D : i = 1, 2, \ldots, T\}$ as input and generates a set of reconstruction vectors $\mathcal{C} = \{\mathbf{c}_j \in \mathbb{R}^D : j = 1, 2, \ldots, C\}$, with a user-defined $C \ll T$, as output according to the similarity measure. To construct a vector quantizer we generally take $D = 13$ (the dimension of the MFCC vector) and $C = 256$, $512$ or $1024$. Indeed, the convergence of the LBG algorithm [50] depends on the initialization of the codebook $\mathcal{C}$, a distortion measure and a threshold used during the implementation. A sufficient number of iterations is required to guarantee convergence of the algorithm. In this way we compute the codebooks of all $S$ speakers, represented by $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_S$.
For the classification of an unknown speaker the SR system first maps/transforms the unknown speaker's speech waveform into the MFCC feature space $\mathcal{X}$. Let $X_{test} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_P\}$ be the set of $P$ MFCC feature vectors of the unknown/test speaker. Therefore, we now have one test set $X_{test}$ and the $S$ codebook models of all the speakers, $\mathcal{C}_i$ for $i = 1, 2, \ldots, S$. We compute similarity measures $\hat{D}_i$ between $X_{test}$ and $\mathcal{C}_i$ for $i = 1, 2, \ldots, S$. Generally, the similarity measure is computed with the help of the Euclidean distance. In this method we first compare $X_{test}$ and the 1st speaker's codebook $\mathcal{C}_1$: we find the Euclidean distances between a test vector $\mathbf{x}_1$ and all the code vectors of $\mathcal{C}_1$ and take the minimum distance. Similarly, we compute the distances between $\mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_P$ and all the code vectors in $\mathcal{C}_1$, each time taking the minimum distance. Hence we get $P$ minimum distances, which we sum up to obtain a single distance $\hat{D}_1$ between the codebook of the 1st speaker, $\mathcal{C}_1$, and the test speaker's set of vectors $X_{test}$. Similarly, we compute the distances between $X_{test}$ and the other speakers' codebooks $\mathcal{C}_2, \mathcal{C}_3, \ldots, \mathcal{C}_S$, represented by $\hat{D}_2, \hat{D}_3, \ldots, \hat{D}_S$ respectively. Hence, mathematically, we can define a set membership function $D_i(\cdot)$ as follows:

$$D_i(X_{test}, \mathcal{C}_i) = \hat{D}_i = \frac{1}{P} \sum_{j=1}^{P} \min_{\mathbf{c}_k \in \mathcal{C}_i} \|\mathbf{x}_j - \mathbf{c}_k\|_2, \quad \mathbf{x}_j \in X_{test} \qquad (2)$$

for $k \in [1, C]$, where $i = 1, 2, \ldots, S$ and $\|\cdot\|_2$ represents the 2-norm (Euclidean distance). Among all the $\hat{D}_i$'s, the minimum value provides the identified speaker. Therefore, we can write mathematically

$$\text{Test speaker identified as } \hat{S} = \arg\min_{i \in [1, S]} \hat{D}_i \qquad (3)$$

Hence the test/unknown speaker is identified as the $\hat{S}$th speaker among the total of $S$ orderly arranged training speakers.
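A minimal sketch of the VQ enrolment and scoring of eqns. (2)-(3) is given below. It uses k-means from scikit-learn as a stand-in for the LBG algorithm (the text notes the two are very similar); the codebook size and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(X_train, C=256, seed=0):
    """Cluster a speaker's (T, D) MFCC matrix into a (C, D) codebook."""
    km = KMeans(n_clusters=C, n_init=5, random_state=seed).fit(X_train)
    return km.cluster_centers_

def vq_distance(X_test, codebook):
    """Eqn. (2): average minimum Euclidean distance over the P test vectors."""
    # Pairwise distances between the P test vectors and the C code vectors.
    d = np.linalg.norm(X_test[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(X_test, codebooks):
    """Eqn. (3): the speaker whose codebook gives the smallest distance."""
    scores = [vq_distance(X_test, cb) for cb in codebooks]
    return int(np.argmin(scores))

# codebooks = [train_codebook(X) for X in training_mfccs]   # one per speaker
# speaker_id = identify(test_mfcc, codebooks)
```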
2.2.2 GMM Based SR
In GMM the vectors in the feature space $\mathcal{X}$ for a training speaker are fit to a GMM which is characterized by model parameters. This means that using the MFCCs of a training speaker we build a GMM, denoted by $p(\mathbf{x}|\Phi)$, which is the sum of $M$ weighted multivariate Gaussian components and is defined/characterized by weights $\omega_h$, mean vectors $\boldsymbol{\mu}_h$ and covariance matrices $\Sigma_h$ for $h \in [1, M]$. Mathematically, the $M$-component GMM for the $s$th speaker is

$$p(\mathbf{x}|\Phi_s) = \sum_{h=1}^{M} \omega_h f_h(\mathbf{x}; \phi_h) \qquad (4)$$

where the weights $\omega_h$ represent the fractions of data points belonging to the $h$th component and sum to 1 ($\sum_{h=1}^{M} \omega_h = 1$), the functions $f_h(\cdot)$, $h = 1, 2, \ldots, M$, are the component density functions and $\Phi$ is the set of parameters $\Phi = \{\omega_h, \phi_h : h = 1, 2, \ldots, M\}$ [26]. For GMM, the $f_h(\cdot)$ are the multivariate Gaussian probability density functions (pdfs) $\mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_h, \Sigma_h)$ and $\phi_h = \{\boldsymbol{\mu}_h, \Sigma_h\}$ for $h = 1, 2, \ldots, M$. During identification, using the MFCCs of the unknown speaker and the GMM of the $\zeta$th speaker we compute the $\zeta$th score $S_\zeta$, where $\zeta = 1, 2, \ldots, S$. We compute the scores for all $S$ speakers and the maximum score provides the identified speaker. Here

$$\mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_h, \Sigma_h) = \frac{1}{(2\pi)^{D/2} |\Sigma_h|^{1/2}} e^{-\frac{1}{2}(\mathbf{x}_t - \boldsymbol{\mu}_h)' \Sigma_h^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_h)} \qquad (5)$$

where $(\mathbf{x}_t - \boldsymbol{\mu}_h)'$ represents the transpose of the column vector $(\mathbf{x}_t - \boldsymbol{\mu}_h)$.
The GMM based classifier for SR is a popular state-of-the-art approach that is used extensively in SI and SV systems. Let us assume that the feature space $\mathcal{X}$ of a training speaker contains a total of $T$ feature vectors $X_{train} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}$. In GMM the data (feature vectors) are fit to a sum of weighted multidimensional Gaussian curves (or distributions, where the Gaussian distributions differ by parameters only), just as in polynomial curve fitting. Generally, a mixture model (like GMM) approximates the data distribution by fitting $M$ component density functions (or pdfs) $f_h$, $h = 1, 2, \ldots, M$, to the data set $X_{train}$ having $T$ patterns (or feature vectors). Let the random vector $\mathbf{x} \in X_{train}$ be an arbitrary pattern; then the mixture model density function $p(\mathbf{x}|\Phi)$ evaluated at $\mathbf{x}$ is [26]:

$$p(\mathbf{x}|\Phi) = \sum_{h=1}^{M} \omega_h \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_h, \Sigma_h) \qquad (6)$$

Now we are in a position to estimate the set of model parameters $\Phi = \{\omega_h, \boldsymbol{\mu}_h, \Sigma_h : h \in [1, M]\}$. To do so we apply a very popular technique, Maximum Likelihood Parameter Estimation (MLE). In this technique our aim is to gradually maximize the following probability:

$$p(X_{train}|\Phi) = \prod_{t=1}^{T} p(\mathbf{x}_t|\Phi) \qquad (7)$$
with the help of Expectation Maximization (EM) iterations using the following equations:

$$\omega_h = \frac{1}{T} \sum_{t=1}^{T} p(h|\mathbf{x}_t, \Phi), \quad h \in [1, M] \qquad (8)$$

$$\boldsymbol{\mu}_h = \frac{\sum_{t=1}^{T} p(h|\mathbf{x}_t, \Phi)\, \mathbf{x}_t}{\sum_{t=1}^{T} p(h|\mathbf{x}_t, \Phi)}, \quad h \in [1, M] \qquad (9)$$

$$\Sigma_h = \frac{\sum_{t=1}^{T} p(h|\mathbf{x}_t, \Phi)\,(\mathbf{x}_t - \boldsymbol{\mu}_h)(\mathbf{x}_t - \boldsymbol{\mu}_h)'}{\sum_{t=1}^{T} p(h|\mathbf{x}_t, \Phi)}, \quad h \in [1, M] \qquad (10)$$

where $p(h|\mathbf{x}_t, \Phi)$ is a conditional probability found by Bayes' theorem as follows:

$$p(h|\mathbf{x}_t, \Phi) = \frac{\omega_h \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_h, \Sigma_h)}{\sum_{j=1}^{M} \omega_j \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_j, \Sigma_j)}, \quad h \in [1, M] \qquad (11)$$
Now we describe the EM iteration briefly. The complete block diagram of the EM iteration for MLE is shown in Fig. 4. The iteration starts with an initial guess of $\omega_h$, $\boldsymbol{\mu}_h$ and $\Sigma_h$ for $h \in [1, M]$. Let the initial values be $\omega_h^0$, $\boldsymbol{\mu}_h^0$ and $\Sigma_h^0$ for $h \in [1, M]$. Here $\omega_h^0 = \frac{1}{M}$ for all $h \in [1, M]$. The means $\boldsymbol{\mu}_h^0$ for all $h \in [1, M]$ are computed using the k-means algorithm, and corresponding to every $\boldsymbol{\mu}_h$ we compute $\Sigma_h$ in the usual way. In the first iteration, equations (8)-(10) are evaluated with the values $\omega_h^0$, $\boldsymbol{\mu}_h^0$, $\Sigma_h^0$ to compute new values, say $\omega_h^1$, $\boldsymbol{\mu}_h^1$, $\Sigma_h^1$, which are then taken as the initial values to compute new values $\omega_h^2$, $\boldsymbol{\mu}_h^2$, $\Sigma_h^2$ in the second iteration. Hence, in general, in the $i$th iteration the final values are $\omega_h^i$, $\boldsymbol{\mu}_h^i$, $\Sigma_h^i$. In the experiments we choose $i = 5$, determined experimentally, since after 5 iterations there is very little difference between the old values $\omega_h^{(i-1)}$, $\boldsymbol{\mu}_h^{(i-1)}$, $\Sigma_h^{(i-1)}$ and the new values $\omega_h^i$, $\boldsymbol{\mu}_h^i$, $\Sigma_h^i$. This technique is analogous to the solution of polynomial equations using the fixed point iterative method.

Fig. 4: Complete Block Diagram of EM Iteration for Model (GMM) Parameters ($\omega_i$, $\boldsymbol{\mu}_i$ and $\Sigma_i$) Estimation
So far we have described the steps to build the speakers' GMM models for enrolment. Now we describe the identification/classification method using the GMMs of the enrolled speakers. To do so, a popular technique called Maximum Log-Likelihood (MLL) scoring is available for SR, which uses the minimum error Bayes' decision rule. Let there be $S$ speakers, $\mathcal{S} = \{1, 2, \ldots, S\}$, whose corresponding enrolled models are $\Phi_1, \Phi_2, \ldots, \Phi_S$. Let the set of feature vectors of the test/unknown speaker be $X_{test} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_P\}$. The task is to find the speaker model with maximum posterior probability, i.e., the maximum MLL score, which in turn leads to the identified speaker (the index or ID of the maximum score is returned). Therefore, using the minimum error Bayes' decision rule, the task is carried out by the likelihood function and the score $S_\xi$ is defined by the following equation:

$$S_\xi = \frac{\mathcal{L}\big(p(X_{test}|\Phi_\xi)\big)\, Pr(\Phi_\xi)}{p(X_{test})} \qquad (12)$$
where $Pr(\Phi_\xi)$ is the prior probability of the $\xi$th speaker model and $p(X_{test})$ is the probability of the test speaker's data. Here we assume that the models and speakers are equally likely; then we are not required to calculate $Pr(\Phi_\xi)$ and $p(X_{test})$ because they remain the same for every test case. Hence we simply drop $\frac{Pr(\Phi_\xi)}{p(X_{test})}$ from eqn. (12), which leads to

$$S_\xi = \mathcal{L}\big(p(X_{test}|\Phi_\xi)\big) \qquad (13)$$

where $\mathcal{L}(\cdot)$ is the likelihood function defined by

$$\mathcal{L}\big(p(X_{test}|\Phi_\xi)\big) = \prod_{t=1}^{P} p(\mathbf{x}_t|\Phi_\xi) \qquad (14)$$

where $\mathbf{x}_t \in \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_P\}$. This equation signifies that the $\Phi_\xi$'s are given by the enrolled speaker models and we put the feature vectors of $X_{test}$ into eqn. (13). We evaluate eqn. (13) for all the $\Phi_\xi$, $\xi = 1, 2, \ldots, S$, keeping $X_{test}$ fixed, to get $S$ likelihood values. The speaker with the maximum likelihood value is the identified speaker. However, to simplify the computation we often take the logarithm of the likelihood values, as given by the following equation:

$$S_\xi^{log} = \log\Big(\mathcal{L}\big(p(X_{test}|\Phi_\xi)\big)\Big) = \log\Big(\prod_{t=1}^{P} p(\mathbf{x}_t|\Phi_\xi)\Big) \qquad (15)$$

This is known as MLL scoring. We know that a probability value is always $\leq 1$, so eqn. (14) has a drawback: if $P$ is very large then the product tends to 0. Taking the logarithm turns the product into a summation and eliminates the problem:

$$S_\xi^{log} = \sum_{t=1}^{P} \log\big(p(\mathbf{x}_t|\Phi_\xi)\big) \qquad (16)$$

The identified speaker is the one who has the maximum MLL score, given by

$$\hat{S} = \arg\max_{\xi \in \mathcal{S}} \big(S_\xi^{log}\big) \qquad (17)$$
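A compact sketch of GMM enrolment and MLL scoring (eqns. (6)-(17)) is shown below, using scikit-learn's EM implementation instead of the hand-written iterations described above; the number of components and the diagonal covariance choice are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(X_train, M=64, seed=0):
    """Fit an M-component GMM to a speaker's (T, D) MFCC matrix via EM."""
    gmm = GaussianMixture(n_components=M, covariance_type='diag',
                          max_iter=100, random_state=seed)
    return gmm.fit(X_train)

def mll_score(X_test, gmm):
    """Eqn. (16): sum of per-frame log-likelihoods log p(x_t | Phi)."""
    return gmm.score_samples(X_test).sum()

def identify(X_test, gmms):
    """Eqn. (17): the enrolled model with the maximum MLL score."""
    return int(np.argmax([mll_score(X_test, g) for g in gmms]))

# gmms = [train_gmm(X) for X in training_mfccs]   # one model per speaker
# speaker_id = identify(test_mfcc, gmms)
```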
Owing to the difficulties mentioned in 1.2, SR is still developing and remains a centre of interest among researchers. There are plenty of features, and several approaches have been applied to design and evaluate SR systems [10]. The i-vector technique has remained the state-of-the-art technique for SR over the last few years [12]. However, other features, like spectral features - Formant Frequencies ($F_1$, $F_2$, $F_3$), Pitch Contours [1], Phase Information [31,53]; features derived from short-time processing of the speech signal - static MFCC and dynamic MFCC (1st and 2nd order derivatives of static MFCC, denoted by $\Delta$MFCC and $\Delta^2$MFCC respectively) [54], Spectral-temporal Receptive Fields (STRF) and MFCC balanced features, Autocorrelation, Zero Crossing Rate (ZCR), Harmonic Features, Auditory-Based Features [55], Group Delay Features and Modified Group Delay Features (MODGDF) [37], Mel Filter Bank Energy-Based Slope Features [56]; and model based (model domain) features - the GMM Super Vector [57,58] and the Bottleneck Feature of DNN (BF-DNN) [59,60] - remained the state-of-the-art features for SR before the i-vector [12,61,62,63,64,65]. But among them MFCC and GFCC are still used besides the i-vector, and new features continue to be invented with the advancement of SR research [66]. Avci et al. [67] proposed a novel optimum feature extraction and classification using a Genetic-Wavelet Packet-Neural Network (GWPNN) for SR. Mary et al. [68] proposed a novel prosodic feature which is manifested in terms of measurable parameters such as fundamental frequency ($F_0$), duration and energy. Rama Murthy et al. [42] introduced Instantaneous Frequency (IF) and Analytic Phase features and showed the significance of the analytic phase
in SR. The modelling/classification methods that are used in SR are Vector Quantization [69], Support Vector Machine (SVM) [70], Least Squares SVM (LS-SVM), k-Nearest Neighbour (k-NN), GMM [71], GMM-Universal Background Model (GMM-UBM) [72], Hidden Markov Model (HMM), Fuzzy Sets [1,73], Artificial Neural Network (ANN) [74], Deep Neural Network (DNN) [12,75,76,77,78], Linear Discriminant Analysis (LDA), Probabilistic Linear Discriminant Analysis (PLDA) [45], Heavy-Tailed PLDA (PLDA-HT) [79], Discriminant Analysis via Support Vectors (SVDA) and Gaussian PLDA (G-PLDA). Sometimes a combination of multiple classifiers (a hybrid classifier), like SVM-GMM [80], GMM-VQ, VQ-GMM-UBM [81], SVM-HMM, ANN-HMM [74], Maximum a Posteriori Vector Quantization (VQ-MAP), VQ-MAP-LS-SVM [82], VQ-HMM, VQ-GMM-SVM or GMM-UBM-PLDA [79], is used for SR. Novoselov et al. [57] proposed an unconventional non-linear PLDA, for the i-vector space, which employs DNN-based sufficient statistics calculation and outperforms conventional GMM-based systems. Recent research on SR uses high level features (model domain features). In this approach some mapping or function is applied to the model parameters to get the final feature vector (composed of model parameters), and sometimes normalization is done over these feature vector(s).
For the real-time application of SR, robustness is a very critical issue because the speech signal may contain additive, multiplicative and convolutional noises and room reverberation, and there may be language, environment (train and bus stations, laboratory, office, classroom, etc.) and device (microphone) mismatch; these factors lead to a great degradation of the recognition accuracy (performance) of SR. Here, by device mismatch we mean that the recording devices of the training and testing speech signals are different. Similarly, we refer to differences of the language of utterance and of the environment between training and testing speech signals as language mismatch and environment mismatch respectively. Due to these factors, the same SR system gives varying accuracy in different conditions. To make SR robust, we must remove the effect of the mismatch conditions from the feature vectors in the feature and/or model and/or score domain with the help of transformations and/or normalizations in these domains before the final classification of the test speaker. Indeed, for robust SR, the GMM based approach, alone or along with other classifiers (hybrid classifiers), is the most useful technique. This happens because GMM provides multiple techniques for transformation and normalization in the model and/or score domains. Generally, a transformation modifies the data in such a way that inter-speaker variability (variability of training or testing data between two speakers) increases and intra-speaker variability (variability of training and testing data of the same speaker) decreases [86,109]. Srinivasan et al. [84] showed that Time-Frequency (T-F) masking before Gammatone Feature (GF) and GFCC feature extraction provides significant improvement in recognition accuracy in SR. Wang et al. [7] examined vocal source and vocal tract features ($\Delta$MFCC, $\Delta^2$MFCC and Linear Prediction (LP) residual features) and showed that they make an SR system robust. Ming et al. [83] proposed a novel multi-condition training data method to model various noises. Togneri et al. [86] studied the robustness of GMM and missing data approaches under various mismatch and noisy conditions. Garcia-Romero et al. [87] proposed a novel multi-conditioning GPLDA model of i-vectors for robust SR under noise and reverberation. Zhao et al. [88] studied and provided an analysis of the robustness of MFCC and GFCC under noisy conditions. Another study in [46] proposed a novel CASA-based speech processing for robust SR. Cooke et al. [89] proposed a novel approach for robust automatic speech recognition with missing and unreliable speech data using continuous-density HMM, which has been used in SR as well.
Since SR experiments are classified into two categories, SI and SV, there are two types of performance measures, one for SI and another for SV. For an SI system, the performance is measured by the average percentage of correctly identified speakers over more than one training and testing data pair. To do so, the training and testing data are divided into two or three (or more) groups for a single experiment; we find the percentage accuracy for each training and testing pair and take their average as the final accuracy. However, the performance measure for SV is quite different from that of an SI system. There are three measurement parameters for an SV system, namely the False Acceptance Rate (FAR), False Rejection Rate (FRR) and Equal Error Rate (EER). The performance measures are discussed broadly in the subsequent section.
2.2.3 VQ/GMM Based SR
So far we have discussed VQ and GMM based classification separately. In this section we discuss VQ/GMM based classification, in which both VQ and GMM are applied for modelling and classification. Conveniently, here VQ is applied as a data reduction (not dimension reduction) technique, where the number of feature vectors is reduced from a rich number to a considerably smaller number (the speech signal of each speaker being sufficiently large). Then the GMM is applied to the set of reduced feature vectors for modelling the speaker.
Suppose we have a speaker's speech data for modelling (enrolment). We first transform the raw speech data into the MFCC feature space of dimension $D$ to get a set of $T$ feature vectors $X_{train} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\} = \{\mathbf{x}_i \in \mathbb{R}^D : 1 \leq i \leq T\}$. Then we apply VQ to the set of feature vectors $X_{train}$. Let $X_{train}$ be transformed into a codebook $\mathcal{C} = \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_C\} = \{\mathbf{c}_i \in \mathbb{R}^D : 1 \leq i \leq C\}$ of $C$ code vectors, where $C \ll T$. Here VQ is viewed as a mapping $f : X_{train} \mapsto \mathcal{C}$ that reduces the $T$ feature vectors to $C$ code vectors with $C \ll T$. Thus we now have a codebook $\mathcal{C}$ of $C$ code vectors. For speaker modelling we build a GMM over the codebook $\mathcal{C} = \{\mathbf{c}_i : 1 \leq i \leq C\}$ to get an $M$-component GMM represented by the set of parameters $\Phi = \{(\omega_h, \boldsymbol{\mu}_h, \Sigma_h) : 1 \leq h \leq M\}$ of the codebook, as described in 2.2.1. The $i$th speaker thus has the GMM given by $\Phi_i = \{(\omega_h^i, \boldsymbol{\mu}_h^i, \Sigma_h^i) : 1 \leq h \leq M\}$ for $1 \leq i \leq S$.

Next, the speech waveform of the test/unknown speaker is mapped (or transformed) into MFCC feature vectors in the $D$-dimensional feature space to get the set of $P$ feature vectors $X_{test} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_P\}$. For the classification of this test speaker we use MLL scoring as described in 2.2.2.
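Combining the two previous sketches, a hypothetical VQ/GMM enrolment routine can be written as follows: the codebook replaces the full MFCC set before the GMM is fitted. The codebook and mixture sizes are illustrative choices, not values fixed by the paper.

```python
# A sketch of VQ/GMM enrolment: quantize first, then model the codebook.
# Reuses train_codebook(), train_gmm() and identify() from the earlier sketches.
def train_vq_gmm(X_train, C=512, M=32):
    codebook = train_codebook(X_train, C=C)   # T vectors -> C code vectors
    return train_gmm(codebook, M=M)           # GMM fitted on the codebook

# Testing is unchanged: MLL scoring of the raw test MFCCs against each model.
# vq_gmms = [train_vq_gmm(X) for X in training_mfccs]
# speaker_id = identify(test_mfcc, vq_gmms)
```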
2.2.4 Universal Background Model (UBM)/GMM Based SR
Generally, GMM-UBM is used for SV. However, this model can also be applied for SI with limited data (when the speech waveform is not of sufficient duration) [72]. In this method, we pool some amount of data from all speakers and build a GMM as a common model of all the speakers, so that it becomes speaker independent; that is why it is called the Universal Background Model (UBM). Generally, the UBM contains data (MFCC vectors) of all the enrolled speakers as well as of speakers other than the enrolled speakers. From another point of view, the UBM actually represents a model of the language, because we take a large number of speakers to build the UBM, which then represents a model of speech in a fixed language (assuming that all the speakers speak one fixed language and not multiple languages). Hence the UBM is nothing but a language model [2,71], and this model can also be used for language identification. In SV, this model is called the imposter model [86]. With the help of the training data of all speakers and maximum a posteriori (MAP) estimation, we build the GMM of every speaker from the UBM. Suppose we have $S$ speakers whose sets of feature vectors are $X_1, X_2, \ldots, X_S$. From these feature sets some amount of vectors (say 200 vectors from each set) are taken and a GMM is built as described in section 2.2.2; this is the UBM. Using the remaining training feature vectors of each speaker, we build the adapted GMM of every speaker by Bayesian learning or maximum a posteriori (MAP) estimation.
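A minimal sketch of the mean-only adaptation described above is given below. It follows the standard relevance-factor MAP formulation; the relevance factor value, the UBM size and the use of scikit-learn's GaussianMixture are assumptions for illustration rather than details specified in the text.

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_mfccs, M=256, seed=0):
    """Fit the UBM on MFCC vectors pooled from all speakers."""
    X = np.vstack(pooled_mfccs)
    return GaussianMixture(n_components=M, covariance_type='diag',
                           max_iter=100, random_state=seed).fit(X)

def map_adapt_means(ubm, X_speaker, relevance=16.0):
    """Adapt only the UBM means towards one speaker's data (mean-only MAP)."""
    gamma = ubm.predict_proba(X_speaker)                 # (T, M) responsibilities
    n_h = gamma.sum(axis=0)                              # soft counts per component
    E_h = gamma.T @ X_speaker / np.maximum(n_h[:, None], 1e-10)  # per-component data means
    alpha = n_h / (n_h + relevance)                      # adaptation coefficients
    adapted = copy.deepcopy(ubm)
    adapted.means_ = alpha[:, None] * E_h + (1.0 - alpha[:, None]) * ubm.means_
    return adapted                                       # weights/covariances kept

# ubm = train_ubm(training_mfccs)
# speaker_models = [map_adapt_means(ubm, X) for X in training_mfccs]
```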
2.2.5 i-Vector Based SR
The Identity Vector (i-vector) approach is, at present, a robust and popular technique for SR. It incorporates all the updates made during the adaptation of the UBM; for example, the mean vector $\boldsymbol{\mu}_{UBM}$ is formed by concatenating all the mean vectors of the UBM components, one below another. If $D$ is the vector dimension and $M$ is the number of Gaussian components of the UBM, then the dimension of the UBM mean super vector is $MD \times 1$ (a column vector). Indeed, $\boldsymbol{\mu}_{UBM}$ is speaker and channel independent, because the UBM is built by taking MFCCs from all devices (channels) of all the speakers. This vector is called the GMM super vector, and from this super vector the i-vector is extracted. Here all the information of the updates is modelled in a low dimensional space, called the total variability space. In this technique, the speaker's GMM super vector $\boldsymbol{\mu}_i$ is assumed to be generated from the UBM super vector $\boldsymbol{\mu}_{ubm}$ by the following equation:

$$\boldsymbol{\mu}_i = \boldsymbol{\mu}_{ubm} + T\mathbf{r} \qquad (18)$$

where $T$ is a rectangular matrix of low rank, called the Total Variability Matrix (TVM), and $\mathbf{r}$ is a random vector which follows a prior standard normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$ [64,90]. It is important to mention that adapting only the UBM mean vectors to form the super vector produces enough information for SR (i.e., the covariance matrices are not necessary). Similarly, the i-vector from the training data (the remaining vectors after formation of the UBM) is computed by adapting the UBM super vector. Here the i-vector is the MAP point estimate of the random vector $\mathbf{r}$ (also called the latent variable) obtained by adapting $\boldsymbol{\mu}_{ubm}$ using the training data (analogous to GMM-UBM adaptation). The i-vector serves as a high level feature because it is extracted from the model domain (the speaker model, which is at a higher level of SR, is formed after feature extraction). Hence, extraction of the i-vector is the feature extraction step in the model domain.
The i-vectors from all the speakers and the test/unknown speaker are the
input to the classifier for the score computation and decision [65]. For the i-
vector based classification, generally SVM, HMM, ANN, DNN classifiers are
used.
2.2.6 e-Vector based SR
The i-vector approach for text-independent SR is the recent (current) state-of-the-art technique. In Joint Factor Analysis (JFA) we are required to model speaker and inter-session variability separately. It is important to mention that for JFA based SR, every speaker's speech must be recorded in at least two different sessions [91,92]. The IITG-MV SR database is very fruitful for examining session, environmental and channel variability because it is a multi-session, multi-environment, multi-channel database [10]. In the i-vector approach, by contrast, all the variability is modelled in a single low-dimensional subspace, as described above. JFA computes a more relevant and more informative subspace than the total variability ($T$) i-vector subspace. Basically, the e-vector is a representation of the speech waveform similar to both JFA and the i-vector [61,63,94,95]. The e-vector is calculated in a similar way to the i-vector, with slight variation, but produces a more accurate feature (high level feature) subspace than JFA and the i-vector. Cumani et al. reported in [96] that replacing the i-vector with the e-vector improves the recognition rate by 10% on the NIST 2012 and 2010 SR Evaluations [97].
Since the e-vector incorporates both the i-vector and the JFA model for almost all kinds of variability, we are required to define the JFA model (sometimes called the Affine Linear Model) [92]. Basically, JFA overcomes the limitations of the i-vector based approach. In the JFA model, the speaker dependent GMM supervector (UBM-GMM supervector) is decomposed into speaker dependent and channel dependent vectors ($\mathbf{S}$ and $\mathbf{C}$ respectively), given by

$$\boldsymbol{\mu}_{jfa} = \mathbf{S} + \mathbf{C} \qquad (19)$$

where the speaker dependent and channel dependent components (vectors) are given by

$$\mathbf{S} = \boldsymbol{\mu}_0 + V\mathbf{y} + W\mathbf{z} \qquad (20)$$

$$\mathbf{C} = U\mathbf{x} \qquad (21)$$

where $\boldsymbol{\mu}_0$ is a speaker and session independent supervector (computed from a general UBM which is created from a mixture of MFCCs from all channels, all sessions and all speakers, including a large open set of speakers, plus the training MFCCs of the specific session and channel), $V$ is the low rank eigenvoice matrix, $W$ is a diagonal matrix of the residual variability not captured by the speakers' MFCC subspace, $\mathbf{y}$ and $\mathbf{z}$ are both independent random vectors having standard normal distributions $\mathcal{N}(\mathbf{0}, \mathbf{I})$, $U$ is the low rank channel variability matrix, whose columns are called eigenchannels, and $\mathbf{x}$ is a normally distributed channel factor vector like $\mathbf{y}$ and $\mathbf{z}$ [98].
Using the i-vector approach, channel compensation is performed in a comparatively low-dimensional subspace instead of the much larger GMM supervector space. Since the models (18) and (19) are very similar, the TVM training (i.e., computation of $T$) in (18) is performed similarly to the eigenvoice matrix ($V$) training in (19). However, there is one difference with respect to the $V$ matrix estimation: in the JFA model the segments of the speech waveform of the same speaker are considered as a single class, whereas in the i-vector model all the segments are considered as different classes in the $T$ matrix estimation. The eigenvectors forming the $T$ matrix span both the channel and the speaker subspaces; therefore, matrix $T$ does not model the speaker subspace as well as the eigenvoice matrix $V$ does. For this reason, Cumani et al. [96] proposed a modelling technique, called the e-vector, that takes advantage of the best of both the JFA and i-vector techniques. Due to the similarity, the i-vector framework is kept but a different $T$ matrix is estimated, which represents the speaker space more accurately. The procedure for estimating $V$ and $T$ is found in [64]. The e-vector model is very similar to the i-vector model:

$$\boldsymbol{\mu}_i = \boldsymbol{\mu}_{ubm} + E\mathbf{r} \qquad (22)$$

where $\boldsymbol{\mu}_i \in \mathbb{R}^D$ and $\boldsymbol{\mu}_{ubm} \in \mathbb{R}^D$ are the GMM super vector and the UBM mean super vector respectively, and $\mathbf{r}$ is a random vector which obeys the prior distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, where $\mathbf{0}$ is the zero vector and $\mathbf{I}$ is the identity matrix. Here the new matrix $E \in \mathbb{R}^{D \times D}$ plays a role similar to the TVM in i-vector extraction in equation (18). The complete estimation of the e-vector matrix $E$ is found in [96].
After the extraction of the e-vector, SVM, ANN, DNN and HMM classifiers or hybrid classifiers, which are very common in the literature, are generally used for classification of the unknown/test speaker; scoring techniques like the Cosine Kernel and Cosine Distance Scoring are found in [64]. In [99] very useful i-vector based scoring techniques, called practical PLDA scoring variants, are discussed.
3 Combating Difficulties
In 1.2 we observed that we may face many challenges during the training and/or testing stages, so we may need to remove adverse effects as well as unwanted interference. The adverse effects of noise (additive, multiplicative and convolutional, reverberation or echo), environmental mismatch, language mismatch and channel mismatch (recording device mismatch and telephone network or transmission channel mismatch) are very common in real time SR. Due to these adverse effects, the SR accuracy degrades substantially, to the point that SR becomes unusable in real time applications. From the discussion of SR so far, we can view SR as having three domains: (i) the feature domain, (ii) the model domain, and (iii) the score/classification domain. These adverse effects are generally removed in any one, any two or all three of these domains.
3.1 Feature Domain Compensation
Many compensation techniques, depending on the type of adversity, are available in the literature. The following techniques are generally applied for SR.
3.1.1 Velocity and Acceleration Feature Concatenation
If the number of speakers in the database is large, then dynamic features are required along with the static features to improve accuracy. Sometimes an energy feature is also included for every frame. The MFCCs represent the static features, but dynamic features are also required if the number of speakers is large. Hence, dynamic features are optional and not required for a small database (roughly fewer than 200 speakers). There are two types of dynamic MFCC, known as velocity and acceleration coefficients, represented by $\Delta$ and $\Delta^2$MFCC respectively. Conveniently, these two features provide robustness in the feature space. The $\Delta$MFCC vector is computed by

$$\Delta c_n = \frac{\sum_{r=1}^{Q} r\,(c_{n+r} - c_{n-r})}{2\sum_{r=1}^{Q} r^2} \qquad (23)$$

Here we must take the number of static MFC coefficients slightly greater than $n = 13$, depending on the value of $Q$. A typical value of $Q$ is 2 ($Q = 1$ is also possible). For $Q = 2$, $n = 19$ is fair enough for the $\Delta$ and $\Delta^2$MFCC vector computation. The $\Delta^2$MFCC vector is computed by applying eqn. (23) to $\Delta c_n$. We concatenate the 13 static MFCC, 13 $\Delta$MFCC and 13 $\Delta^2$MFCC coefficients to form the complete feature vector $\mathbf{x} = \{c_n, \Delta c_n, \Delta^2 c_n\}$ of dimension $D = 39$.
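A small sketch of the regression formula in eqn. (23) is given below, stacking static, $\Delta$ and $\Delta^2$ coefficients into the 39-dimensional vector described above. Note that the sketch applies the formula along the frame (time) axis, the common convention for velocity/acceleration features; the indexing over the coefficient index used in the text can be obtained by changing the axis argument. The end-padding is an implementation choice not specified in the text.

```python
import numpy as np

def delta(c, Q=2):
    """Eqn. (23): regression-based delta of a 1-D coefficient sequence c."""
    denom = 2.0 * sum(r * r for r in range(1, Q + 1))
    padded = np.pad(c, Q, mode='edge')              # replicate the end values
    out = np.zeros_like(c, dtype=float)
    for r in range(1, Q + 1):
        out += r * (padded[Q + r:len(padded) - Q + r] -
                    padded[Q - r:len(padded) - Q - r])
    return out / denom

def add_dynamics(static):
    """Stack static, delta and delta-delta coefficients per frame."""
    d1 = np.apply_along_axis(delta, 0, static)      # deltas along the frame axis
    d2 = np.apply_along_axis(delta, 0, d1)
    return np.hstack([static, d1, d2])              # (T, 3*D) feature matrix

# X39 = add_dynamics(X13)   # X13 is a (T, 13) static MFCC matrix
```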
3.1.2 Cepstral Mean Subtraction (CMS)
If the speech signal is distorted by convolutional noise, then the noise component of the speech signal is removed by CMS [100]. In CMS we first compute the mean vector ($\boldsymbol{\mu}$) and then subtract $\boldsymbol{\mu}$ from each feature vector ($\mathbf{x}_t$) to get the new feature vector ($\hat{\mathbf{x}}_t$) as follows:

$$\hat{\mathbf{x}}_t = \mathbf{x}_t - \boldsymbol{\mu}, \quad 1 \leq t \leq T \qquad (24)$$
3.1.3 Cepstral Mean and Variance Normalization (CMVN)
Let $\mathbf{x}_t$ be the $D$-dimensional $t$th feature vector (the MFCC vector of the $t$th frame), with element $x_t(i)$ in the $i$th dimension (the $i$th MFC coefficient), and let $X = [\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_T]$ be the set of $T$ MFCC vectors which represents a speaker. In CMVN each feature vector is normalized (or compensated) according to the following equations:

$$\mu(i) = \frac{1}{T} \sum_{t=1}^{T} x_t(i), \quad 1 \leq i \leq D \qquad (25)$$

$$\sigma(i) = \sqrt{\frac{1}{T-1} \sum_{t=1}^{T} \big(x_t(i) - \mu(i)\big)^2}, \quad 1 \leq i \leq D \qquad (26)$$

Let the mean and variance normalized version of $\mathbf{x}_t$ be $\hat{\mathbf{x}}_t$. The CMVN feature vector $\hat{\mathbf{x}}_t$ is computed as follows:

$$\hat{x}_t(i) = \frac{x_t(i) - \mu(i)}{\sigma(i)}, \quad 1 \leq t \leq T \text{ and } 1 \leq i \leq D \qquad (27)$$

where $t$ is the index of the vector (frame) and $i$ is the index of the dimension of the vector. Here $\hat{\mathbf{x}}_t$ has element $\hat{x}_t(i)$ in the $i$th dimension, i.e., $\hat{\mathbf{x}}_t = \{\hat{x}_t(i)\}$ for $i = 1, 2, \ldots, D$ and $t = 1, 2, \ldots, T$. This normalization is done for both the training and testing sets of feature vectors. Then the GMM is built on the normalized set of training vectors $X_{train}$, and the MLL is computed using the normalized set of test vectors $X_{test}$ and the GMM $\lambda_{train}$ of $X_{train}$ for the identification.
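Both CMS (eqn. (24)) and CMVN (eqns. (25)-(27)) are one-liners over a (T, D) feature matrix; a sketch follows, with a small floor on the standard deviation added as a safeguard that the text does not mention.

```python
import numpy as np

def cms(X):
    """Cepstral Mean Subtraction, eqn. (24): remove the per-dimension mean."""
    return X - X.mean(axis=0)

def cmvn(X, eps=1e-10):
    """Cepstral Mean and Variance Normalization, eqns. (25)-(27)."""
    mu = X.mean(axis=0)                        # eqn. (25)
    sigma = X.std(axis=0, ddof=1)              # eqn. (26)
    return (X - mu) / np.maximum(sigma, eps)   # eqn. (27)

# X_norm = cmvn(extract_mfcc("speaker01_utt01.wav"))
```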
3.1.4 Cepstral Liftering
The value of the cepstral coefficient $C_n$ decreases as $n$ increases. Hence, to rescale the value of $C_n$, a lifter function $G(n)$ is multiplied with $C_n$ [101]. A few lifter functions [102,103] are defined as follows:

– Linear Lifter: $G(n) = n$
– Statistical Lifter: $G(n) = \frac{1}{\hat{\sigma}_n}$, where $\hat{\sigma}_n$ is the standard deviation of the $n$th cepstral coefficient calculated from the training data.
– Sinusoidal Lifter: $G(n) = 1 + \frac{J}{2}\sin\big(\frac{\pi n}{J}\big)$, where $J$ is the dimension of the vector.
– Exponential Lifter: $G(n) = n^s e^{-\frac{1}{2}\left(\frac{n}{\tau}\right)^2}$, where $\tau$ and $s$ are constants; typically $\tau = 5$ and $s = 1.5$.

Hence, after cepstral liftering we get the feature vector $\mathbf{c} = \{c_n\}$ for $n = 1, 2, \ldots, J$, given by

$$c_n = G(n)\,C_n, \quad n = 1, 2, \ldots, J \qquad (28)$$

Note the difference between the lowercase $c_n$ and the capital $C_n$.
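As an example, the sinusoidal lifter from the list above can be applied to each frame's cepstral coefficients as follows; the choice of the sinusoidal variant is purely illustrative.

```python
import numpy as np

def sinusoidal_lifter(C, J=None):
    """Apply G(n) = 1 + (J/2) sin(pi*n/J) to a (T, J) cepstral matrix (eqn. (28))."""
    J = C.shape[1] if J is None else J
    n = np.arange(1, J + 1)
    G = 1.0 + (J / 2.0) * np.sin(np.pi * n / J)
    return C * G                       # broadcast the lifter over all frames

# X_lifted = sinusoidal_lifter(X13)
```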
3.1.5 Frequency Warping Normalization (FWN) in Frequency Domain
FWN is a frequency domain signal processing technique where the frequencies are mapped into a standard range (within the Nyquist range), i.e., $[0, \frac{F_s}{2}]$. The governing equation for this operation is given by

$$f' = \frac{f - f_{min}}{f_{max} - f_{min}}\,\pi \qquad (29)$$

where $f'$ is the mapped frequency of $f$ and the frequencies are redistributed on the interval $[0, \pi]$. Strictly, FWN should be discussed in subsection 2.1, but this step is optional and is required only when $f_{min}$ is different from 0 Hz. That is, if $f_{min} = 300$ Hz or any value other than $f_{min} = 0$, then we apply FWN.
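A one-line sketch of eqn. (29), assuming the band edges $f_{min}$ and $f_{max}$ are known:

```python
import numpy as np

def warp_frequencies(f, f_min=300.0, f_max=8000.0):
    """Eqn. (29): map frequencies in [f_min, f_max] onto [0, pi]."""
    return (np.asarray(f) - f_min) / (f_max - f_min) * np.pi

# warp_frequencies([300.0, 4150.0, 8000.0])  ->  array([0., pi/2, pi])
```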
3.2 Model Domain Compensation
Model domain compensation is the most popular and useful approach for SR, and methods for compensating adverse effects in this domain are found in abundance in the literature. GMM based SR using MAP adaptation is the most popular, state-of-the-art technique for text-independent SR in adverse environments. In this technique, the speaker models (during training or enrolment) are derived from a speaker-independent common GMM, known as the UBM, using MAP adaptation. Here the UBM is built, before the training session, from the clean speech of all enrolled speakers together with additional speakers. Normally only the mean vectors are adapted while the weights and covariance matrices are left unchanged, as described in (2.2.4). Similarly, the UBM super vector is formed by the concatenation of the UBM mean vectors. After this operation the GMM super vector is formed for channel factor compensation, i.e., removal of the channel factor. The channel factor adaptation of the $i$-th utterance and $j$-th GMM super vector is computed in the super vector domain as
$$\boldsymbol{\mu}_{ij} = \boldsymbol{\mu}_j + \mathbf{U}\,\mathbf{x}_{ij} \qquad (30)$$
where $\boldsymbol{\mu}_j$ is the original super vector of the $j$-th GMM and $\boldsymbol{\mu}_{ij}$ is the $i$-th adapted super vector. $\mathbf{U}$ is the low rank matrix which projects the channel factor subspace into the super vector domain, and the vector $\mathbf{x}_{ij}$ contains the channel factors of the $i$-th utterance with respect to the $j$-th GMM super vector. We apply eqn. (30) during the testing step only, and not during the training step; $\boldsymbol{\mu}_j$ is adapted using MAP during training. The score is computed as the MLL of the test utterance using the compensated super vector [104]. The channel factor subspace, modelled by the low rank matrix $\mathbf{U}$, captures the distortion due to intersession variability; $\mathbf{U}$ is computed using the EM algorithm as described in [93].
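The compensation of eq. (30) is a single matrix-vector operation once the supervectors and the low-rank matrix are available. The NumPy sketch below uses randomly generated placeholders for quantities that are estimated elsewhere in the pipeline (the MAP-adapted supervector, the EM-trained matrix U, and the channel factors); the sizes and variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes: M mixture components, D-dimensional MFCCs, R channel factors
M, D, R = 64, 13, 10
rng = np.random.default_rng(0)

mu_j = rng.normal(size=M * D)      # supervector of speaker j (stacked MAP-adapted means)
U = rng.normal(size=(M * D, R))    # low-rank channel subspace matrix (trained with EM in practice)
x_ij = rng.normal(size=R)          # channel factors of test utterance i against speaker j

# Eq. (30): channel-compensated supervector used when scoring the test utterance
mu_ij = mu_j + U @ x_ij
print(mu_ij.shape)                 # (832,) = M * D
```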
3.3 Score Domain Compensation
In SR, score normalization is very important because the scores of a test speaker depend strongly on the data, which can influence the scores in different ways. To make the scores scale independent (as well as test-trial independent), normalization of scores is essential in SV. Another reason for score normalization in SV is that the decision threshold ($\theta$) is generally chosen at the point where the EER holds (i.e., $FAR = FRR$) on the Detection Error Trade-off (DET) curve, computed with the help of multiple test trials. For SI, the additional advantage of score normalization is that it makes the score independent of background noise, channel (device) and environment in mismatched conditions. Thus, in SI, normalization is not as important for the matched condition; for SI and SV in mismatched conditions, however, score normalization is equally important. A score domain normalization technique maps the scores $\bar{S}_\xi$ of the test speaker, corresponding to the models $\lambda_\xi$, into a standard range of scores. The most popular normalization techniques are TNorm, ZNorm and HNorm [105,106,107]. Among them, the Test Normalization (TNorm) score is generally computed for SI. Let the TNorm scores of the test speaker corresponding to all the enrolled speakers be $\varphi_T(\bar{S}_\xi)$ and the original scores be $\bar{S}_\xi$ for $1 \le \xi \le S$. Let $\mu_s$ and $\sigma_s$ be the mean and standard deviation of the scores over all the $S$ speakers. Then we have
$$\mu_s = \frac{1}{S}\sum_{\xi=1}^{S} \bar{S}_\xi \qquad (31)$$
$$\sigma_s = \sqrt{\frac{1}{S-1}\sum_{\xi=1}^{S} \big(\bar{S}_\xi - \mu_s\big)^2} \qquad (32)$$
$$\varphi_T(\bar{S}_\xi) = \frac{\bar{S}_\xi - \mu_s}{\sigma_s}, \qquad 1 \le \xi \le S \qquad (33)$$
The other normalization techniques that are used in SV can be found in [105,106,107].
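Equations (31)-(33) amount to standardizing one test utterance's scores against all enrolled speaker models. The minimal NumPy sketch below follows that definition; the example raw scores are invented for illustration.

```python
import numpy as np

def tnorm(scores):
    """Apply eqs. (31)-(33): standardize a test utterance's scores over the S enrolled models."""
    s = np.asarray(scores, dtype=float)
    mu_s = s.mean()                 # eq. (31)
    sigma_s = s.std(ddof=1)         # eq. (32), with (S - 1) in the denominator
    return (s - mu_s) / sigma_s     # eq. (33)

raw = [-1520.3, -1498.7, -1510.2, -1535.9, -1502.4]  # raw log-likelihood scores, one per speaker model
print(tnorm(raw))                                    # the argmax (index 1 here) is unchanged by TNorm
```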
4 Conceptual Comparison of Approaches
SR has an array of methodologies along with features. The reason for developing so many methodologies and features is to increase the accuracy of the SR system. In this section we discuss a conceptual comparison among the features (or feature vectors) and methodologies. In the early days the frame-wise spectrogram was used as a feature, but the spectrogram contains several other factors along with the speaker specific information. In the classification step, all the spectrograms of the training speakers are compared with that of the test speaker for recognition. The accuracy was therefore not good enough, although clean speech shows a little improvement [108]. To combat these difficulties, background noise removal and filtering techniques are used to get rid of the noise and of factors that are not speaker specific, and to form a sparse representation of the spectrogram which is a more speaker specific feature. Other, now largely obsolete, features include pitch (F0), formants (F1, F2), zero-crossing rate (ZCR), frame energy and many more. All these features can be brought together and concatenated to form a feature vector for each frame, and SR based on such feature vectors shows a small improvement over spectrogram based SR. At present, feature vectors are extracted through frame-by-frame processing of the speech signal. Examples of such features are MFCC and GFCC; they are called low level features because they are extracted from a low level frequency domain representation of the signal (frame) using the Mel and Gammatone filter banks respectively. To extract these features, the power spectrum (computed from the FFT of a frame) is passed through the filters to form the filter-bank energies; at the same time this enhances the higher frequencies and attenuates some unwanted frequencies. The higher frequencies of the power spectrum contain more speaker specific information, which is why MFCC and GFCC have become state-of-the-art features for SR. Besides these, high level features like the super vector and the i-vector have also become state-of-the-art features. They are called high level (or model based) features because they are extracted from the GMM and UBM-GMM models of the MFCC feature vectors. With the advancement of machine learning and deep learning, classifiers like the Artificial Neural Network (ANN), Convolutional Neural Network (CNN) [32], Multi-Layer Perceptron (MLP) and Deep Neural Network (DNN) are used at present [12,14]. Indeed, in the case of a DNN, we do not have to compute hand crafted features (like MFCC): the network extracts features from the raw data automatically. However, we must still provide the input data (speech signal) in a well understood, mathematically computed numerical form to the classifier before recognizing the speaker. Examples of such raw data are the pre-processed frames of the complete speech signal, the power spectrum, and the spectrogram (time-frequency representation of the speech signal) of every frame [17]. If we use the spectrogram, then we can think of SR as an image processing problem, because the spectrogram is basically a plot (image) of the processed speech signal (frames). The high level features can also be used in ANN, CNN and DNN for SR; the difference is that we provide inputs such as the super vector or i-vector to the classifiers.
After the feature extraction (e.g., extraction of MFCCs) we have employed two different ways of classification: one without using a data model and the other using a data model. In the first case, after feature extraction we store the feature vectors of all speakers. When an unknown/test speaker arrives, the feature vectors of the test speaker are compared with the feature vectors of every speaker to generate distances (e.g., Euclidean distance) or scores (e.g., inverse of distance) with respect to every speaker. This method is called template matching; the minimum distance or maximum score yields the classified speaker. This method suffers from several drawbacks. If we have $m$ MFCC vectors for a training (known) speaker and $n$ MFCC vectors for the test speaker, then $m \times n$ comparisons are required to evaluate a single score (or distance). If the number of MFCC vectors is large, the SR system takes too much time to recognize a speaker, which makes it unusable in real-time applications. Besides this, we require a very large memory to store all the MFCCs of the known (training) speakers, even though the accuracy of this approach is not very impressive. Here comes the concept of model based classification. In this approach, the known speakers' MFCCs are not used directly for classification; instead, a model is built for every speaker. We do not store the MFCCs in memory; rather, for every speaker we store the model parameters which characterize the model of that speaker.
In this paper we used three models, namely VQ, GMM and UBM-GMM
and their combinations giving five classifiers VQ, GMM, VQ-GMM, UBM-
GMM and VQ-UBM-GMM. The performances of these classifiers are evaluated
over the speech recorded on five different recording devices, given in section 5.
In VQ based SR, the MFCC vectors of the train and test speakers are represented by a codebook $\mathcal{C}$ of size $C$ (the number of representative codewords), where $C$ is much smaller than the number of MFCC vectors of both the train and test speakers. Representing each speaker by a codebook, we store the codebooks (VQ models) of all the train speakers, thereby saving time and space, and there is also a small improvement in SR performance over template matching.
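For illustration, VQ-based identification scores a test utterance by the average distance of its MFCC vectors to the nearest codeword of each stored codebook. The sketch below is a NumPy illustration that assumes the codebooks have already been trained (e.g., with the LBG algorithm or k-means); the array shapes and function names are assumptions for the example.

```python
import numpy as np

def vq_distortion(test_mfcc, codebook):
    """Average Euclidean distance from each test MFCC vector to its nearest codeword."""
    # test_mfcc: (n, D) test vectors; codebook: (C, D) codewords of one enrolled speaker
    d = np.linalg.norm(test_mfcc[:, None, :] - codebook[None, :, :], axis=-1)  # (n, C) distances
    return d.min(axis=1).mean()

def identify_vq(test_mfcc, codebooks):
    """Closed-set identification: the speaker whose codebook gives the minimum distortion wins."""
    return int(np.argmin([vq_distortion(test_mfcc, cb) for cb in codebooks]))
```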
In GMM based SR, for every speaker we compute the parameters of an $M$-component GMM, namely the weights ($\omega$), means ($\mu$) and covariances ($\Sigma$), using the training MFCCs $X_{train}$ of that speaker. Thus for the $s$-th speaker the set of parameters is $\Phi_s = \{\omega_h, \mu_h, \Sigma_h\}$ with $h \in [1, M]$, and since $M$ is much smaller than the number of MFCC vectors, both memory and computational time are saved in GMM based SR. We shall observe that the performance is very stable (meaning that it does not differ much across the GMM based classifiers for the five recording devices) and improves significantly; this is why GMM based classifiers are considered very reliable. The computational time for estimating the GMM parameters is somewhat large, but it is much less than the computational time for building the codebook in VQ based SR; in both GMM and VQ the computational time depends on the number of MFCC vectors of the train speakers. Indeed, in GMM the model parameters $\Phi$ depend on the initial guess of $\omega$, $\mu$ and $\Sigma$, and this initialization is random. Hence different speakers are initialized differently, the final GMM is highly dependent on these initial values, and the final parameters may be biased towards values that are not speaker specific. This random (and possibly biased) initialization is eliminated in UBM-GMM based SR, where every speaker is initialized with the same parameter values. Since these come from a UBM trained on the pooled MFCCs of all speakers, they convey speaker-independent common properties, and the final GMM then depends on the training MFCCs of the speaker rather than on a random initial guess.
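At test time, each speaker's GMM is scored with the maximum log-likelihood (MLL) rule over the test MFCCs. The following NumPy sketch assumes diagonal covariances (a common simplification, assumed here for the example) and illustrative names; it is not the authors' implementation.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Total log-likelihood of test MFCCs X (T, D) under an M-component diagonal-covariance GMM."""
    X, w = np.asarray(X, dtype=float), np.asarray(weights, dtype=float)
    mu, var = np.asarray(means, dtype=float), np.asarray(variances, dtype=float)
    log_comp = (np.log(w)[None, :]                                          # mixture weights
                - 0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)[None, :]  # Gaussian normalisers
                - 0.5 * np.sum((X[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :], axis=2))
    return np.logaddexp.reduce(log_comp, axis=1).sum()  # log-sum over components, summed over frames

def identify_gmm(X, speaker_models):
    """MLL rule: pick the speaker whose GMM gives the highest log-likelihood for the test utterance."""
    scores = [gmm_log_likelihood(X, *model) for model in speaker_models]  # model = (weights, means, variances)
    return int(np.argmax(scores))
```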
In UBM-GMM based SR, a speaker independent model, itself a GMM, is built first; we call this model the Universal Background Model (UBM). To build the UBM, we collect a small amount of MFCCs from a very large number of speakers; every train speaker of the database must be represented in this pool, and the UBM also contains some MFCCs from speakers outside the database. From this mixed pool of MFCCs a common GMM is built in the usual way, which is then called the UBM, with a set of parameters $\Phi_0 = \{\omega_0, \mu_0, \Sigma_0\}$. We can think of the UBM as a speech model, because it is built from the speech of all kinds of speakers; in effect, we obtain a model of speech for a specific language. Another important point is that, since we collect MFCCs from a large number of speakers, this method expands the region covered by the MFCC feature vectors for both the UBM and the UBM-GMM, and ensures that the GMM of each train speaker lies within the region of the UBM's MFCCs, i.e., the final GMM of a train speaker, called the UBM-GMM, is bounded within the region of the UBM. To compute the UBM-GMM of the train speakers we initialize every speaker's model with the UBM parameters $\Phi_0$, i.e., all speakers are initialized with the same parameter values $\Phi_0$. Then we apply EM iterations to compute the final parameters $\Phi_s$, for $s = 1, 2, \ldots, S$, of the UBM-GMMs of all train speakers; with each iteration the UBM gradually takes the form of the training MFCCs of the particular speaker. We shall observe that UBM-GMM provides much better and more stable performance for all devices, i.e., the SR system shows robustness.
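Where only the means are adapted from the UBM (as in the model-domain compensation of section 3.2), a standard mean-only MAP update can be sketched as follows. This is a NumPy illustration in the spirit of the widely used Reynolds-style adaptation; the relevance factor r, the diagonal-covariance assumption and all array names are assumptions for the example, not the paper's exact implementation.

```python
import numpy as np

def map_adapt_means(ubm_weights, ubm_means, ubm_vars, X, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM to one speaker's MFCCs X (T, D)."""
    M, D = ubm_means.shape
    # Log-posterior responsibility of each UBM component for each frame
    diff = X[:, None, :] - ubm_means[None, :, :]                              # (T, M, D)
    log_post = (np.log(ubm_weights)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * ubm_vars), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / ubm_vars[None, :, :], axis=2))     # (T, M)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)

    n = post.sum(axis=0)                       # zeroth-order (soft count) statistics, shape (M,)
    first = post.T @ X                         # first-order statistics, shape (M, D)
    alpha = (n / (n + r))[:, None]             # data-dependent adaptation coefficients
    ml_means = first / np.maximum(n[:, None], 1e-10)
    return alpha * ml_means + (1.0 - alpha) * ubm_means   # adapted means; weights/covariances kept
```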
In VQ-GMM based SR, we first compute the codebooks of the training and testing speakers from their MFCCs. Since the size of the codebooks is much smaller than the number of computed MFCC vectors, the computation of the GMM speeds up. However, we pay for this with the computational time of VQ, since computing the codebooks is itself quite time consuming. In the experiments, although the performance of SR improves in some cases with VQ-GMM based SR, this classifier is not as robust as UBM-GMM based SR.
In VQ-UBM-GMM based SR, MFCCs are accumulated from many speakers (possibly including speakers outside the database) and a codebook is computed to reduce this large collection of MFCC vectors. The speaker independent UBM is then built from the codebook. This UBM is adapted using the MAP adaptation technique described earlier to build the UBM-GMM of every speaker, which is stored in the back-end. The classification/identification is similar to that of the VQ-GMM classifier.
5 Experimental Results and Discussions
A comprehensive analysis of the performance of the SR system is undertaken here. The various performance measures used for evaluating an SR system are described, and some reported performance analyses are also discussed. Then the performance evaluation of an SR system based on the MFCC feature and GMM is carried out on three databases, namely Hyke-2011, ELSDSR and IITG-MV SR.
5.1 Performance Measure for SR systems
The performance metrics of an SR system differ for SI and SV. The performance of an SV system is measured by the False Acceptance Rate (FAR) and the False Rejection Rate (FRR). These two rates are measured as follows:
$$FAR = \frac{\#\ \text{accepted impostors}}{\text{Total}\ \#\ \text{speakers}} \times 100\%, \qquad FRR = \frac{\#\ \text{rejected true speakers}}{\text{Total}\ \#\ \text{speakers}} \times 100\% \qquad (34)$$
Here the symbol '#' denotes 'number of'. For evaluating the performance of SV systems, researchers often use the Detection Error Trade-off (DET) curve, which is the plot of FAR vs FRR. The decision for accepting or rejecting a speaker is based on a threshold value chosen by inspecting the DET curve. By changing the threshold value, different pairs of (FAR, FRR) are generated, and the point on the DET curve where FAR and FRR become equal is called the Equal Error Rate (EER); the threshold value is chosen at this point.
The performance of an SI system is measured by the percentage of correct identifications, which is a single value, unlike in the SV case. Hence the accuracy ($\eta$) is evaluated by the following equation:
$$\eta = \frac{\#\ \text{speakers correctly classified}}{\text{Total}\ \#\ \text{speakers}} \times 100\% \qquad (35)$$
The performance of closed-set SI is measured by equation (35), whereas the performance of open-set SI can be measured by either of equations (34) and (35), or both. This is because open-set SI is very similar to SV, in the sense that in both cases a verification is performed to decide whether the claimed identity is accepted or whether the speaker is present in the database.
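Equations (34) and (35) translate directly into code; the snippet below is a small Python illustration with made-up counts used only as an example.

```python
def far_frr(accepted_impostors, rejected_true_speakers, total_speakers):
    """Eq. (34): verification error rates in percent."""
    far = 100.0 * accepted_impostors / total_speakers
    frr = 100.0 * rejected_true_speakers / total_speakers
    return far, frr

def identification_accuracy(correctly_classified, total_speakers):
    """Eq. (35): closed-set identification accuracy in percent."""
    return 100.0 * correctly_classified / total_speakers

print(far_frr(3, 5, 100))                 # -> (3.0, 5.0)
print(identification_accuracy(98, 100))   # -> 98.0
```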
5.2 Some Reported Performance of SR systems
Togneri et al. [86] conducted experiments using GMM, GMM-UBM and GMM-SVM, and a comparative discussion was made in [86]. In that work, GMM-SVM achieves superior recognition over the GMM-UBM system by around 3%. The experiments were conducted for both the original feature set, i.e., 13 MFCC + 13 $\Delta$MFCC + 13 $\Delta^2$MFCC, resulting in a vector dimension $D = 39$, and a reduced feature set (temporal derivatives excluded; only a 13-dimensional vector is taken, including the $C_1$ coefficient). These MFCC vectors are extracted from 25 ms frames generated every 10 ms (i.e., the frame shift is 10 ms) using a Hamming window and a pre-emphasis factor $\alpha = 0.97$, and CMN is applied for enhancement of the speech signal. The authors showed that the best performance of the GMM classifier on the TIMIT database (every speaker has 10 utterances, of which 8 are used for training and 2 for testing) is 99.2% with 32 Gaussian mixtures, and that the performance deteriorates as the number of Gaussian mixtures increases. If the training data is insufficient (i.e., 3 utterances are used for training and 2 for testing), the best result for the GMM system is only 79.7% with 16 mixtures, but the performance of the GMM-UBM and GMM-SVM classifiers improves with the number of Gaussian mixtures and the best system performance is achieved with 128 Gaussian mixtures; in this case GMM-SVM performs better than GMM-UBM [86]. In [15] the authors proposed an adaptive variational mode decomposition approach to enhance the speech signal and provided a performance analysis; the best accuracies of GMM-UBM and GMM-SVM are 96% and 93% respectively. When additive noise is induced in the TIMIT database for every speaker to create a noise mismatch condition, a significant degradation in accuracy is observed: in the mismatch condition between training and testing data, the accuracy degrades by around 20%. Besides the MFCC feature, GFCC is also very popular, and in some cases GFCCs perform better than MFCCs. It is also observed that GFCC is more robust in some adverse (mismatch) conditions, and cepstral liftering (detailed in 3.1.4) of GFCC improves the accuracy over MFCC [88,85]. Here we choose MFCC because its computation is simpler than that of GFCC, and when the noise is very high MFCC performs better than GFCC. It is worth mentioning that our database contains speech signals contaminated with noise.
The dimension of feature vector plays a crucial role in computational cost
for training and testing stages as well as computation of MFCC. The number of
vectors is also an important factor in computational cost. The main advantage
of using VQ-GMM system is that it reduces the number of vectors considerably
without significant loss in recognition accuracy if enough training and testing
data are considered. In our experiment with the three databases, we reduced
the feature vectors to 1024 vectors using VQ technique and then the GMM
is built over the 1024 vectors. Even the testing procedure is carried out using
1024 cluster centroids.
5.3 Performance Analysis of Presented SR System for IITG-MV SR,
ELSDSR and Hyke-2011
The SR experiment is carried out extensively over three databases: 1) the IITG Multi-Variability Speaker Recognition database (IITG-MV SR Phase I & II) in both matched and mismatched conditions, and 2) ELSDSR and 3) Hyke-2011 in the matched condition. Databases (2) and (3) have no mismatch condition and contain clean speech. The speech languages of IITG-MV SR are English and Indian regional languages (like Bengali, Hindi, Tamil, etc.) [10]. The IITG-MV SR Phase I and II data contain speech recorded with five devices, namely a digital recorder (D01), a headset (H01) and a tablet PC (T01) in both phases, and a Nokia 5130c mobile (M01) and a Sony Ericsson W350i mobile (M02) in Phase I. In our experiment we have used only the Phase I data, because it satisfies all three conditions considered in this work (text, spoken style and channel independent SR). Phase I is recorded in a noisy office environment and Phase II in noisy multi-environment conditions (other than office, such as laboratory and hostel rooms). For Phase I, each recording device has two sets of speech signals for every speaker, namely session 1 and session 2. Session 1 contains the speech of two languages, English and Indian regional languages, of 100 speakers in two modes (reading style and conversational style). We use the reading-style speech signal (in .wav format) as training data and the conversational-style speech (in .wav format) as testing data. In Phase III(a), there are 200 speakers in truly conversational mode (there is no post-processing to separate the speakers), recorded with a mobile phone handset at a sampling frequency of 8 kHz for the "single speaker recognition" experiment, and in Phase III(b) there are 198 speakers (99 speaker pairs) in truly conversational mode, recorded with a mobile phone handset at 8 kHz for the "two speaker recognition" experiment. Hence the Phase III(b) database can be used for "speaker diarization (who spoke when?)". The Phase IV database contains 144 speakers collected with a mobile phone handset at 8 kHz to facilitate UBM-GMM based speaker recognition, and it contains a large number of impostor speakers. The complete description of the database is found in [20]. ELSDSR and Hyke-2011, on the other hand, contain clean speech, i.e., the noise level is very low, and the speech is recorded with the same microphone, so there is no device mismatch between training and testing.
Hyke-2011 contains speech of the digits 0 to 9 only (no text), while ELSDSR contains read text [21]. For the IITG-MV SR database, the sampling frequency for D01, H01 and T01 is 16 kHz and for M01, M02, M03 and M04 it is 8 kHz, whereas that for ELSDSR and Hyke-2011 is 8 kHz. We chose a frame size of about 25 ms and an overlap of about 17 ms, i.e., a frame shift of (25 − 17) = 8 ms, for the 16 kHz speech signals, and a 50 ms frame size with about 34 ms overlap, i.e., a frame shift of (50 − 34) = 16 ms, for the 8 kHz speech signals. The pre-emphasis factor $\alpha$ is set to 0.97 and we have used a 1024-point FFT. For the mel scale frequency conversion, the minimum and maximum linear frequencies are chosen as $f_{min} = 0$ or 340 Hz and $f_{max} = 4500$ Hz. The number of triangular filters in the filter bank is $B = 26$, which produces 26 MFC coefficients; among them the first 13, excluding $C_1$, are chosen to create an MFCC feature vector of dimension $D = 13$. The accuracy rates for the mentioned databases are reported in [3] for the GMM and VQ-GMM based classifiers. In VQ we consider 1024 clusters, to reduce the large number of vectors, upon which the GMM is built using 5 EM iterations. In Barai et al. [2] it is shown that the accuracy on Hyke-2011 and ELSDSR is 100% for VQ, GMM and VQ-GMM based SR, thanks to the clean speech. We have also examined the results for Hyke-2011 and ELSDSR using the other classifiers mentioned in section 4, but we have not provided those accuracy figures in the tables because in those cases too we obtained high accuracy (from 99.6% to 100%) for all the classifiers [3,4]. We have also carried out spoken style, text and channel (recording device) independent SR experiments to provide benchmark accuracy for the IITG-MV database using the five classifiers presented in this paper.
For the experiment with spoken style variation, we use the reading style speech for training and the conversational style speech for testing, so there is a spoken style mismatch between the training and testing data. The experimental results are given in table 1 for all the devices D01, H01, T01, M01 and M02. In table 1 it is clearly observed that the accuracy varies between 43% and 96%. All these results are channel dependent, which means there is no channel mismatch between training and testing data, i.e., the recording device is the same for the training and testing data. An interesting result can be observed for devices D01 and T01 with the classifiers VQ and VQ-GMM respectively, where the accuracies are 60% and 43%. The cause of this drastic degradation is the singularity problem of the covariance matrix. We know that the covariance matrix is positive semi-definite, in other words $\Sigma_h \succeq 0$. Now if $\Sigma_h$ becomes singular, or nearly so, then equation (5) becomes undefined, because $\Sigma_h^{-1}$ cannot exist when the determinant $|\Sigma_h| = 0$ or $|\Sigma_h| \to 0$; also the term $\frac{1}{(2\pi)^{D/2}|\Sigma_h|^{1/2}} \to \infty$, which makes equation (5) inconsistent. The singularity problem is indicated by a '*' mark in the superscript position of the accuracy in all the tables presented in this paper. Indeed, the singularity problem is found only in very rare cases, and it may not occur at all: for example, in the databases Hyke-2011 and ELSDSR it does not occur, while for the IITG-MV SR database it occurs in very few cases.
We cannot say with certainty whether the covariance matrix $\Sigma_h$ will be singular or not before building the GMM based classifiers. If singularity occurs, then only the VQ based classifier outperforms the GMM based classifiers. The GMM based classifiers, i.e., GMM, UBM-GMM, VQ-GMM and VQ-UBM-GMM, may or may not suffer from the singularity problem, but for the VQ classifier singularity can never occur, because we measure the Euclidean distance between the trained codebooks and the testing MFCCs, and the minimum distance (which always yields a definite value) provides the classified speaker. If we neglect singularity, we can see that the accuracy for D01, H01 and T01 is better than for M01 and M02. The reason is the sampling frequency ($f_s$) of the speech signal: the sampling frequency of D01, H01 and T01 is $f_s = 16$ kHz while that of M01 and M02 is $f_s = 8$ kHz. Hence the mel filter bank covers a bandwidth of $f_s/2 = 8$ kHz for D01, H01 and T01, but only $f_s/2 = 4$ kHz for M01 and M02, which is much less than for the other three devices. D01, H01 and T01 can therefore cover more bandwidth (and hence more speaker specific information) than M01 and M02, which leads to better accuracy.
In table 1, the various data conditions are given. The first column displays the name of the classifier. The second column, "UBM with VQ", carries three types of tags, namely –, X and ×: '–' means that a UBM is not required (classifiers GMM, VQ and VQ-GMM), '×' means that vector quantization is not carried out before building the UBM, and 'X' means that it is. Generally, VQ is done before modelling in order to compute the codebook. The third and fourth columns, "VQ on Train MFCC" and "VQ on Test MFCC", carry two types of tags, X and ×: 'X' means that VQ is performed and '×' means that VQ is not performed before the UBM and GMM modelling. The same tag marks are used with similar meaning in the other tables of this paper.
Table 1: The accuracy in percentage (%) of the SR system for five devices using the five classifiers presented in the paper, with various combinations of training and testing data. Here there is no channel mismatch. The '*' mark indicates the singularity problem of the covariance matrix.

Classifier     UBM with VQ  VQ on Train MFCC  VQ on Test MFCC  D01  H01  T01  M01  M02
GMM            –            ×                 ×                89   90   91   81   76
UBM-GMM        ×            ×                 ×                88   88   86   80   87
VQ             –            X                 ×                89   94   96   80   76
VQ             –            X                 X                60   94   74   81   84
VQ             –            ×                 X                92   88   96   76   83
VQ-GMM         –            X                 X                89   71   91   78   76
VQ-GMM         –            X                 ×                89   50*  90   75   78
VQ-GMM         –            ×                 X                90   71   43*  79   94
VQ-UBM-GMM     X            X                 X                87   90   90   73   82
VQ-UBM-GMM     X            X                 ×                86   88   88   83   82
VQ-UBM-GMM     X            ×                 X                86   87   87   97   83
VQ-UBM-GMM     ×            X                 X                79   87   81   82   73
VQ-UBM-GMM     ×            X                 ×                80   87   87   80   73
VQ-UBM-GMM     ×            ×                 X                89   87   87   81   88
Table 1 is not spoken style independent, because here we use reading style speech for training and conversational style speech for testing. It is observed that in this spoken style dependent SR the range of accuracy across devices is rather large, 43% to 97%, i.e., a spread of 54%. Hence it initially seems that the classifier and/or feature are not robust, which is not true: this large spread occurs due to either a singular covariance matrix and/or the spoken style mismatch, and it can be noted that the VQ-UBM-GMM classifier performs better than the others. In the literature it has been established that the MFCC feature is a robust feature. What, then, about the robustness of the classifiers?
Table 2: The accuracy in percentage (%) of the ASR system for five devices using the five classifiers presented in this paper, for the spoken style independent experiment with cross validation. Here there is no channel mismatch.

Classifier     UBM with VQ  VQ on Train MFCC  VQ on Test MFCC  D01  H01  T01  M01  M02
GMM            –            ×                 ×                98   98   98   98   98
UBM-GMM        ×            ×                 ×                97   97   97   97   98
VQ             –            X                 ×                97   97   97   97   98
VQ             –            X                 X                97   97   97   97   98
VQ             –            ×                 X                98   96   97   97   98
VQ-GMM         –            X                 X                86   84   80   83   82
VQ-GMM         –            X                 ×                83   81   76   80   81
VQ-GMM         –            ×                 X                97   97   97   97   98
VQ-UBM-GMM     X            X                 X                97   97   97   97   98
VQ-UBM-GMM     X            X                 ×                97   97   97   97   98
VQ-UBM-GMM     X            ×                 X                97   97   97   97   98
VQ-UBM-GMM     ×            X                 X                98   98   98   98   98
VQ-UBM-GMM     ×            X                 ×                98   98   98   98   98
VQ-UBM-GMM     ×            ×                 X                98   98   98   98   98
The robustness of the classifiers is checked by a cross validation method; in the experiment we apply 3-fold cross validation. To do so, the feature vectors (MFCCs) of the training data (speech in reading style) and the testing data (speech in conversational style) are computed for every speaker and then randomized. Then, for every speaker, we divide the MFCCs randomly into three groups; any two of the groups are mixed together and taken as the training MFCCs while the remaining group is taken as the testing MFCCs. There are three such combinations of training and testing MFCCs, hence each classifier produces three accuracies, and the average accuracy over the three combinations is reported in table 2. This cross validation can also be viewed as a spoken style independent experiment, because the MFCCs of the reading style and conversational style are mixed. The table clearly shows that the accuracy of all the classifiers varies between 76% and 98%, i.e., a spread of 22%, which is much less than the 54% spread of table 1, and again the VQ-UBM-GMM classifier performs better than the others. Fortunately, the singularity problem of the covariance matrix does not occur here. The accuracy in table 2 is better than that in table 1; the improvement is due to the expansion, in terms of variability, of the training data in the feature space, i.e., the MFCC feature space expands because the reading style and conversational style MFCCs are mixed. The difference (distance) between the $k$-fold training and testing data is therefore reduced, the similarity increases, and this leads to better performance. Also, we have created the three ($k = 3$) training and testing pairs randomly from the same fixed region of the feature space, where the training and testing data are mixed.
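The data mixing for the 3-fold cross validation described above can be sketched as follows; this is a NumPy illustration of the mixing and splitting only, and the function and variable names are ours rather than the paper's.

```python
import numpy as np

def three_fold_splits(reading_mfcc, conversational_mfcc, seed=0):
    """Pool one speaker's reading- and conversational-style MFCCs, shuffle them,
    and yield the three (train, test) partitions used for 3-fold cross validation."""
    X = np.vstack([reading_mfcc, conversational_mfcc])   # (T, D) pooled frames, style-independent
    rng = np.random.default_rng(seed)
    rng.shuffle(X)                                        # randomize frame order in place
    folds = np.array_split(X, 3)
    for k in range(3):
        test = folds[k]
        train = np.vstack([folds[j] for j in range(3) if j != k])
        yield train, test
```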
So far we have discussed the spoken style dependent and independent experiments, but both experiments are channel dependent: the training MFCCs and testing MFCCs are taken from the same device for each speaker. Now we examine the accuracy of the channel (device) independent experiment, with and without cross validation. The accuracy of the device dependent experiment can be found in [2,4]. The benchmark accuracy of the device independent experiment without cross validation is given in table 3.
To make the experiment device independent, we need to prepare the data for every speaker: we mix the reading style speech signals of each speaker recorded by all devices for training, and the conversational style speech of each speaker recorded by all devices for testing. The experiment thus becomes device independent but remains spoken style dependent. Here we can see that the accuracy of the VQ classifier varies in the range 74% to 96%, i.e., a spread of 22%; hence VQ is not a robust classifier. If we neglect the singularity problem, the accuracy of the GMM based classifiers varies in the range 86% to 91%, i.e., a spread of 5%. Hence we can say that the GMM based classifiers are more robust, provided we can make sure that the singularity problem does not occur. Next we examine the performance of the classifiers with cross validation, which makes the experiment both device independent and spoken style independent.
Initially, we mix the reading style MFCCs and conversational style MFCCs together for each speaker from each device.
Table 3: The accuracy in percentage (%) of the ASR system for the channel independent and spoken style dependent case. The '*' mark indicates the singularity problem of the covariance matrix.

Classifier     UBM with VQ  VQ on Train MFCC  VQ on Test MFCC  Accuracy (%)
GMM            –            ×                 ×                91
UBM-GMM        ×            ×                 ×                86
VQ             –            X                 ×                96
VQ             –            X                 X                74
VQ             –            ×                 X                96
VQ-GMM         –            X                 X                90
VQ-GMM         –            X                 ×                43*
VQ-GMM         –            ×                 X                91
VQ-UBM-GMM     X            X                 X                87
VQ-UBM-GMM     X            X                 ×                87
VQ-UBM-GMM     X            ×                 X                87
VQ-UBM-GMM     ×            X                 X                81
VQ-UBM-GMM     ×            X                 ×                88
VQ-UBM-GMM     ×            ×                 X                88
Then we mix the MFCCs of all devices for every speaker. Here we use the 3-fold cross validation described earlier. The accuracy is given in table 4.
Table 4: The accuracy in percentage (%) of the ASR system for the device independent, spoken style independent experiment with cross validation, using the five classifiers presented in this paper.

Classifier     UBM with VQ  VQ on Train MFCC  VQ on Test MFCC  Accuracy (%)
GMM            –            ×                 ×                98
UBM-GMM        ×            ×                 ×                97
VQ             –            X                 ×                98
VQ             –            X                 X                97
VQ             –            ×                 X                96
VQ-GMM         –            X                 X                97
VQ-GMM         –            X                 ×                97
VQ-GMM         –            ×                 X                96
VQ-UBM-GMM     X            X                 X                98
VQ-UBM-GMM     X            X                 ×                97
VQ-UBM-GMM     X            ×                 X                98
VQ-UBM-GMM     ×            X                 X                98
VQ-UBM-GMM     ×            X                 ×                97
VQ-UBM-GMM     ×            ×                 X                98
It is observed that the accuracy varies in the range 96% to 98%, i.e., a spread of only 2%. The reason for this high accuracy is the very large MFCC feature space, obtained because we mix the MFCCs of both spoken styles as well as the MFCCs of all the devices.
5.4 Development of SR System in MATLAB
The SR system is implemented in MATLAB R2015a with the help of two MATLAB toolboxes, namely VOICEBOX [110] and NETLAB 3 [111]. The digital signal processing for feature extraction is carried out using functions from VOICEBOX, and the modelling and classification tasks are carried out using functions from NETLAB 3. The VOICEBOX toolbox contains functions for the following purposes:
– Audio File Input/Output - Read and write WAV and other speech file formats;
– Frequency Scales - Convert between Hz, Mel, Erb and MIDI frequency scales;
– Fourier/DCT/Hartley Transforms - Various related transforms;
– Random Number and Probability Distributions - Generate random vectors and noise signals;
– Vector Distances - Calculate distances between vector lists;
– Speech Analysis - Active level estimation, spectrograms;
– LPC Analysis of Speech - Linear Predictive Coding routines;
– Speech Synthesis - Text-to-speech synthesis and glottal waveform models;
– Speech Enhancement - Spectral noise subtraction;
– Speech Coding - PCM coding, vector quantisation;
– Speech Recognition - Front-end processing for recognition;
– Signal Processing - Miscellaneous signal processing functions;
– Information Theory - Routines for entropy calculation and symbol codes;
– Computer Vision - Routines for 3D rotation;
– Printing and Display Functions - Utilities for printing and graphics;
– Voicebox Parameters and System Interface - Get or set VOICEBOX and WINDOWS system parameters;
– Utility Functions - Miscellaneous utility functions.
It can be seen that VOICEBOX provides MATLAB functions for feature extraction as well as some for modelling/classification, while NETLAB 3 provides MATLAB functions for modelling/classification and data visualisation. Together, they contain all the MATLAB functions required for the SR system and for its performance evaluation.
6 Conclusion
There are several variations of SR, based on the application area [10]: for example, SR in a noisy environment, SR in mismatched conditions, recognition of speakers from a single mixed speech signal of more than one speaker (a very hard task, called SR after source separation) [112], speaker segmentation and recognition using the speech segments of individual speakers during a conversation, and many more. In this paper, we consider the "text-independent closed-set speaker identification" experiment in various training and testing conditions, with a focus on spoken style and channel match/mismatch conditions and with 3-fold cross validation; analytical results are reported in this context as well.
In this paper, model based speaker recognition for the matched condition using time-frequency features is emphasized, and the channel (device) and spoken style dependencies are examined. The number of vectors is reduced using VQ techniques to lower the computation in parameter estimation for the GMM, and the other existing classification/modelling techniques and methods are mentioned. At present, researchers focus on recognition in noisy/mismatched and reverberant conditions; that is why we have mentioned the features, methods and techniques for speaker recognition in noisy/mismatched conditions. Also, sufficient references are given to help the reader find the state-of-the-art as well as novel methods and techniques for the very recent problems in this field.
Though the identification rate is very high for clean speech in the matched condition, it degrades drastically in noisy environments or in the mismatched condition (training and testing environments are different). Studies by researchers such as Douglas A. Reynolds, Roberto Togneri and Richard C. Rose revealed that the performance of SI and SV systems using the MFCC feature is better than that of SI and SV systems using other features like LPC, PLPC and spectral features [86]. But the performance of SI and SV systems with the MFCC feature degrades drastically in noisy or mismatched conditions; in that case, the GF and GFCC features give better results [85,88]. The gender dependency of SI and SV systems is also a relevant factor. Kenny et al. showed in [92,93,109] that the i-vector feature performs well with generative modelling (like GMM, LDA, PLDA, and so on). The computational cost of SI and SV systems is also very high, due to complex feature extraction techniques, the large dimension of the feature vectors and complex modelling techniques; in the GMM, the estimation of the parameters also suffers from a high computational cost. Dimension reduction techniques like Principal Component Analysis (PCA) and Kernel PCA can therefore be very useful to reduce the dimension of the feature vectors. In the classification/modelling part, we can also use VQ techniques to reduce the number of vectors so that the parameter estimation can be carried out with considerably fewer training vectors, which in turn reduces the computational cost. The accuracy of an SR system mainly depends on the number of cepstral coefficients taken to form the feature vector and on the number of Gaussian components taken in the GMM. If the number of speakers is large, the numbers of cepstral coefficients and Gaussian components should be increased to obtain a higher recognition rate. To increase the vector dimension, $\Delta$MFCC and $\Delta^2$MFCC are concatenated with the original MFCC. Cepstral coefficients beyond the 14th do not contain useful information, and if the original MFCC dimension is increased beyond 14 the performance of the SR system degrades; so it is better to choose fewer than 14 MFC coefficients. In our experiments it is observed that 13 MFC coefficients for every MFCC vector provide stable accuracy for all the classifiers.
6.1 Future Research Directions
The initialization of the GMM and the singularity of the covariance matrix are very crucial for SI and SV. It is important to note that the singularity of the covariance matrix depends on the initialization of the GMM; proper initialization, and ensuring that the covariance matrix will not become singular, still remain topics of research. Besides this, blind elimination of session and channel effects, reverberation and background noise, without heavy manipulation of the training and testing speech data, is still a difficult task in SR.
Another important observation is that deep learning approaches like CNN do not perform as well as the VQ and GMM based methods on the IITG-MV SR database, due to the lack of a balanced training data-set in which each class contributes equally to the overall loss estimation. Moreover, some speakers have an extremely short duration of audio (around one tenth of that of other speakers). Data augmentation techniques, along with some newly introduced hybrid CNN architectures designed to overcome the limitation of short utterances, may be used in the future to improve the performance on this database. Other experiments, for example text dependent/independent, channel dependent/independent, reading style dependent/independent, and session dependent/independent experiments using various approaches, still remain at the centre of attention among researchers.
Acknowledgment
This project is partially supported by the CMATER research laboratory of the
Computer Science and Engineering Department, Jadavpur University, India;
UPE-II project, Government of India and DBT project (No. BT/PR16356/BID/
7/596/2016 ), Ministry of Science and Technology, Government of India un-
der Dr. Subhadip Basu. Bidhan Barai is partially supported by the RGNF
Research Award (F1-17.1/2014-15/RGNF-2014-15-SC-WES-67459/(SA-III))
from UGC, Government of India.
References
1. Pal, S.K. and Majumder, D.D., Fuzzy sets and decision making approaches in vowel and
speaker recognition, IEEE Transactions on Systems, Man, and Cybernetics, 7(8), pp.625-
629 (1977).
2. Barai B., Das D., Das N., Basu S., Nasipuri M., VQ/GMM-Based Speaker Identification
with Emphasis on Language Dependency, Advanced Computing and Systems for Secu-
rity(ACSS), Advances in Intelligent Systems and Computing, vol 883. Springer, Singapore
(2019)
3. Barai, B., Das, D., Das, N., Basu, S. and Nasipuri, M., Closed-set text-independent auto-
matic speaker recognition system using VQ/GMM, In Intelligent Engineering Informatics
pp. 337-346. Springer, Singapore (2018).
4. Barai B., Das D., Das N., Basu S., and Nasipuri M., An ASR system using MFCC
and VQ/GMM with emphasis on environmental dependency, IEEE Calcutta Conference
(CALCON), Kolkata, pp. 362-366 (2017 ).
5. Fortuna, J., Sivakumaran, P., Ariyaeeinia, A. and Malegaonkar, A., Open-set speaker
identification using adapted Gaussian mixture models. In Ninth European Conference on
Speech Communication and Technology (2005).
6. D. Matrouf, W. Ben Kheder, P. Bousquet, M. Ajili and J. Bonastre, Dealing with additive
noise in speaker recognition systems based on i-vector approach, 23rd European Signal
Processing Conference (EUSIPCO), Nice, 2015, pp. 2092-2096 (2015).
7. Wang, N., Ching, P.C., Zheng, N.H. and Lee, T., Robust speaker recognition using both
vocal source and vocal tract features estimated from noisy input utterances. In 2007 IEEE
International Symposium on Signal Processing and Information Technology (pp. 772-777)
(2007).
8. Rao KS, Sarkar S., Robust speaker recognition in noisy environments. Cham: Springer
International Publishing; Jun 21 (2014 ).
9. Fujihara, H., Kitahara, T., Goto, M., Komatani, K., Ogata, T. and Okuno, H.G., Speaker
identification under noisy environments by using harmonic structure extraction and reli-
able frame weighting. In Ninth International Conference on Spoken Language Processing
( 2006).
10. Haris, B.C., Pradhan, G., Misra, A., Prasanna, S.R.M., Das, R.K. and Sinha, R., Multi-
variability speaker recognition database in Indian scenario. International Journal of Speech
Technology, 15(4), pp.441-453 (2012).
11. Mandasari, M.I., Saeidi, R., McLaren, M. and van Leeuwen, D.A., Quality measure
functions for calibration of speaker recognition systems in various duration conditions.
IEEE Transactions on Audio, Speech, and Language Processing, 21(11), pp.2425-2438
(2013).
12. Reyes-Díaz, F.J., Hernández-Sierra, G. and de Lara, J.R.C., 2021. DNN and i-vector
combined method for speaker recognition on multi-variability environments. International
Journal of Speech Technology, 24(2), pp.409-418.
13. Ganchev, T., Potamitis, I., Fakotakis, N. and Kokkinakis, G., 2004. Text-independent
speaker verification for real fast-varying noisy environments. International Journal of
Speech Technology, 7(4), pp.281-292.
14. Murthy, Y.S., Koolagudi, S.G. and Raja, T.J., 2021. Singer identification for Indian
singers using convolutional neural networks. International Journal of Speech Technology,
pp.1-16.
15. Ram, R. and Mohanty, M.N., 2018. Performance analysis of adaptive variational mode
decomposition approach for speech enhancement. International Journal of Speech Tech-
nology, 21(2), pp.369-381.
16. Mandasari, M.I., Saeidi, R. and van Leeuwen, D.A., Quality measures based calibration
with duration and noise dependency for speaker recognition. Speech Communication, 72,
pp.126-137 (2015).
17. Chakraborty, T., Barai, B., Chatterjee, B., Das, N., Basu, S., Nasipuri,M., Closed-
set device-independent speaker identification using cnn, in: International Conference on
Intelligent Computing and Communication (ICICC - 2019), Springer ( 2019).
18. Liu, Z., Wu, Z., Li, T., Li, J. and Shen, C., GMM and CNN hybrid method for short ut-
terance speaker recognition. IEEE Transactions on Industrial informatics, 14(7), pp.3244-
3252 (2018).
19. Anand, P., Singh, A.K., Srivastava, S. and Lall, B., Few Shot Speaker Recognition using
Deep Neural Networks. arXiv preprint arXiv:1904.08775 (2019).
20. Reda, A., Panjwani, S. and Cutrell, E., June. Hyke: a low-cost remote attendance track-
ing system for developing regions. In Proceedings of the 5th ACM workshop on Networked
systems for developing regions, pp.15-20, ACM (2011).
21. Feng, L. and Hansen, L.K., A new database for speaker recognition. IMM, Informatik
og Matematisk Modelling, DTU (2005).
22. Rose, P., Technical forensic speaker recognition: Evaluation, types and testing of evi-
dence. Computer Speech & Language, 20(2-3), pp.159-191 (2006).
23. Singh, N., Khan, R.A. and Shree, R., Applications of speaker recognition. Procedia
engineering, 38, pp.3122-3126 (2012).
24. Lleida, E. and Rodriguez-Fuentes, L.J., Speaker and language recognition and charac-
terization: Introduction to the CSL special issue (2018).
25. Abd El-Moneim, S., Sedik, A., Nassar, M.A., El-Fishawy, A.S., Sharshar, A.M., Hassan,
S.E., Mahmoud, A.Z., Dessouky, M.I., El-Banby, G.M., Abd El-Samie, F.E. and El-Rabaie,
E.S.M., 2021. Text-dependent and text-independent speaker recognition of reverberant
speech based on CNN. International Journal of Speech Technology, pp.1-14.
26. Pal, S.K. and Mitra, P., Pattern recognition algorithms for data mining. Chapman and
Hall/CRC, (2004).
27. Fan, X. and Hansen, J.H., April. Speaker identification with whispered speech based on
modified LFCC parameters and feature mapping. In 2009 IEEE International Conference
on Acoustics, Speech and Signal Processing (pp. 4553-4556) IEEE, (2009).
28. Lawson, A., Vabishchevich, P., Huggins, M., Ardis, P., Battles, B. and Stauffer, A.,
May. Survey and evaluation of acoustic features for speaker recognition. In 2011 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5444-
5447) IEEE (2011).
29. Hourri, S., Nikolov, N.S. and Kharroubi, J., 2020. A deep learning approach to inte-
grate convolutional neural networks in speaker recognition. International Journal of Speech
Technology, 23, pp.615-623.
30. Lerato, L. and Mashao, D.J., September. Enhancement of GMM speaker identifica-
tion performance using complementary feature sets. In 2004 IEEE Africon. 7th Africon
Conference in Africa (IEEE Cat. No. 04CH37590) (Vol. 1, pp. 257-261) IEEE (2004).
31. Nakagawa, S., Wang, L. and Ohtsuka, S., Speaker identification and verification by com-
bining MFCC and phase information. IEEE transactions on audio, speech, and language
processing, 20(4), pp.1085-1095 (2011).
32. Hourri, S., Nikolov, N.S. and Kharroubi, J., 2021. Convolutional neural network vectors
for speaker recognition. International Journal of Speech Technology, 24(2), pp.389-400.
33. Shahamiri, S.R. and Salim, S.S.B., Artificial neural networks as speech recognisers for
dysarthric speech: Identifying the best-performing set of MFCC parameters and studying
a speaker-independent approach. Advanced Engineering Informatics, 28(1), pp.102-110
(2014).
34. Furui, S., Digital speech processing: synthesis, and recognition, CRC Press (2018).
35. Rabiner, L.R. and Schafer, R.W., Theory and applications of digital speech processing
(Vol. 64). Upper Saddle River, NJ: Pearson ( 2011).
36. Nica, A., Caruntu, A., Toderean, G. and Buza, O., Analysis and synthesis of vowels
using Matlab. In 2006 IEEE International Conference on Automation, Quality and Testing,
Robotics (Vol. 2, pp. 371-374) IEEE (2006, May).
37. Hegde, R.M., Murthy, H.A. and Gadde, V.R.R., Significance of the modified group
delay feature in speech recognition. IEEE Transactions on Audio, Speech, and Language
Processing, 15(1), pp.190-202 (2006).
38. Grimaldi, M. and Cummins, F., Speaker identification using instantaneous frequen-
cies. IEEE Transactions on Audio, Speech, and Language Processing, 16(6), pp.1097-1111
(2008).
39. Tsiakoulis, P., Potamianos, A. and Dimitriadis, D., Instantaneous frequency and band-
width estimation using filterbank arrays. In 2013 IEEE International Conference on Acous-
tics, Speech and Signal Processing (pp. 8032-8036) IEEE (2013, May).
40. McCowan, I., Dean, D., McLaren, M., Vogt, R. and Sridharan, S., The delta-phase spec-
trum with application to voice activity detection and speaker recognition. IEEE Transac-
tions on Audio, Speech, and Language Processing, 19(7), pp.2026-2038 (2011).
41. Murty, K.S.R. and Yegnanarayana, B., Combining evidence from residual phase and
MFCC features for speaker recognition. IEEE signal processing letters, 13(1), pp.52-55
(2005).
42. Vijayan, K., Reddy, P.R. and Murty, K.S.R., Significance of analytic phase of speech
signals in speaker verification. Speech Communication, 81, pp.54-71 (2016).
43. Vijayan, K., Kumar, V. and Murty, K.S.R., Feature extraction from analytic phase of
speech signals for speaker verification. In Fifteenth Annual Conference of the International
Speech Communication Association (2014).
44. Qawaqneh, Z., Mallouh, A.A. and Barkana, B.D., Deep neural network framework and
transformed MFCCs for speaker’s age and gender classification. Knowledge-Based Sys-
tems, 115, pp.5-14 (2017).
45. Khosravani, A. and Homayounpour, M.M., A PLDA approach for language and text
independent speaker recognition. Computer Speech & Language, 45, pp.457-474 (2017).
46. Zhao, X., Shao, Y. and Wang, D., CASA-based robust speaker identification. IEEE
Transactions on Audio, Speech, and Language Processing, 20(5), pp.1608-1616 (2012).
47. Rouat, J., Computational auditory scene analysis: Principles, algorithms, and applica-
tions (wang, d. and brown, gj, eds.; 2006)[book review]. IEEE Transactions on Neural
Networks, 19(1), pp.199-199 (2008).
48. Shi, X., Yang, H. and Zhou, P., Robust speaker recognition based on improved GFCC.
In 2016 2nd IEEE International Conference on Computer and Communications (ICCC)
(pp. 1927-1931) IEEE(2016, October).
49. Zhang, Y. and Abdulla, W.H., Gammatone auditory filterbank and independent com-
ponent analysis for speaker identification. In Ninth International Conference on Spoken
Language Processing (2006).
50. Linde, Y., Buzo, A. and Gray, R., An algorithm for vector quantizer design. IEEE
Transactions on communications, 28(1), pp.84-95 (1980).
51. Kohonen, T., The self-organizing map. Proceedings of the IEEE, 78(9), pp.1464-1480
(1990).
52. Han, C.C., Chen, Y.N., Lo, C.C. and Wang, C.T., A novel approach for vector quanti-
zation using a neural network, mean shift, and principal component analysis-based seed
re-initialization. Signal Processing, 87(5), pp.799-810 (2007).
53. Wang, L., Minami, K., Yamamoto, K. and Nakagawa, S., Speaker recognition by combin-
ing MFCC and phase information in noisy conditions. IEICE transactions on information
and systems, 93(9), pp.2397-2406 (2010).
54. Tirumala, S.S., Shahamiri, S.R., Garhwal, A.S. and Wang, R., Speaker identification
features extraction methods: A systematic review. Expert Systems with Applications, 90,
pp.250-271 (2017).
55. Li, Q. and Huang, Y., Robust speaker identification using an auditory-based feature.
In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (pp.
4514-4517) IEEE(2010, March).
56. Madikeri, S.R. and Murthy, H.A., Mel filter bank energy-based slope feature and its ap-
plication to speaker recognition. In 2011 National Conference on Communications (NCC)
(pp. 1-4) IEEE(2011, January).
57. Novoselov, S., Pekhovsky, T., Kudashev, O., Mendelev, V.S. and Prudnikov, A., Non-
linear PLDA for i-vector speaker verification. In Sixteenth Annual Conference of the In-
ternational Speech Communication Association (2015).
58. Campbell, W.M., Sturim, D.E., Reynolds, D.A. and Solomonoff, A., SVM based speaker
verification using a GMM supervector kernel and NAP variability compensation. In 2006
IEEE International Conference on Acoustics Speech and Signal Processing Proceedings
(Vol. 1, pp. I-I) IEEE (2006, May).
59. Yaman, S., Pelecanos, J. and Sarikaya, R., Bottleneck features for speaker recognition.
In Odyssey 2012-The Speaker and Language Recognition Workshop (2012).
60. Lozano-Diez, A., Silnova, A., Matejka, P., Glembek, O., Plchot, O., Pesan, J., Burget, L.
and Gonzalez-Rodriguez, J., Analysis and Optimization of Bottleneck Features for Speaker
Recognition. In Odyssey (Vol. 2016, pp. 21-24) (2016).
61. Zeinali, H., Sameti, H. and Burget, L., HMM-based phrase-independent i-vector extrac-
tor for text-dependent speaker verification. IEEE/ACM Transactions on Audio, Speech,
and Language Processing, 25(7), pp.1421-1435 (2017).
62. Khosravani, A. and Homayounpour, M.M., Nonparametrically trained PLDA for short
duration i-vector speaker verification. Computer Speech & Language, 52, pp.105-122
(2018).
63. Li, M. and Narayanan, S., Simplified supervised i-vector modeling with application to
robust and efficient language identification and speaker verification. Computer speech &
language, 28(4), pp.940-958 (2014).
64. Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P. and Ouellet, P., Front-end factor
analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language
Processing, 19(4), pp.788-798 (2010).
65. Dehak, N., Plchot, O., Bahari, M.H., Burget, L. and Dehak, R., GMM weights adap-
tation based on subspace approaches for speaker verification. Proceedings Odyssey 2014,
pp.48-53 (2014).
66. Ghahabi, O. and Hernando, J., Restricted Boltzmann machines for vector representation
of speech in speaker recognition. Computer Speech & Language, 47, pp.16-29 (2018).
67. Avci, E., A new optimum feature extraction and classification method for speaker recog-
nition: GWPNN. Expert Systems with Applications, 32(2), pp.485-498 (2007).
68. Mary, L. and Yegnanarayana, B., Extraction and representation of prosodic features for
language and speaker recognition. Speech communication, 50(10), pp.782-796 (2008).
69. Djellali, H. and Laskri, M.T., Random vector quantisation modelling in automatic
speaker verification. International Journal of Biometrics, 5(3-4), pp.248-265 (2013).
70. Campbell, W.M., Sturim, D.E. and Reynolds, D.A., Support vector machines using
GMM supervectors for speaker verification. IEEE signal processing letters, 13(5), pp.308-
311 (2006).
71. Reynolds, D.A., Speaker identification and verification using Gaussian mixture speaker
models. Speech communication, 17(1-2), pp.91-108 (1995).
72. Markov, K. and Nakagawa, S., Frame level likelihood normalization for text-independent
speaker identification using Gaussian mixture models. In Proceeding of Fourth Inter-
national Conference on Spoken Language Processing. ICSLP’96 (Vol. 3, pp. 1764-1767)
IEEE(1996, October).
73. Susan, S. and Sharma, S., A fuzzy nearest neighbor classifier for speaker identification. In
2012 Fourth International Conference on Computational Intelligence and Communication
Networks (pp. 842-845) IEEE(2012, November).
74. Zeinali, H., Sameti, H. and Burget, L., Text-dependent speaker verification based on
i-vectors, Neural Networks and Hidden Markov Models. Computer Speech & Language,
46, pp.53-71 (2017).
75. McLaren, M., Castan, D., Ferrer, L. and Lawson, A., On the Issue of Calibration in
DNN-Based Speaker Recognition Systems. In INTERSPEECH (pp. 1825-1829) (2016,
September).
76. Matějka, P., Glembek, O., Novotný, O., Plchot, O., Grézl, F., Burget, L. and Černocký, J.H., Analysis of DNN approaches to speaker identification. In 2016 IEEE inter-
national conference on acoustics, speech and signal processing (ICASSP) (pp. 5100-5104)
IEEE(2016, March).
77. Richardson, F., Reynolds, D. and Dehak, N., Deep neural network approaches to speaker
and language recognition. IEEE signal processing letters, 22(10), pp.1671-1675 (2015).
78. Richardson, F., Reynolds, D. and Dehak, N., A unified deep neural network for speaker
and language recognition. arXiv preprint arXiv:1504.00923 (2015).
79. Matějka, P., Glembek, O., Castaldo, F., Alam, M.J., Plchot, O., Kenny, P., Burget, L. and Černocký, J., Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verifi-
cation. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP) (pp. 4828-4831) IEEE(2011, May).
80. You, C.H., Lee, K.A. and Li, H., GMM-SVM kernel with a Bhattacharyya-based dis-
tance for speaker recognition. IEEE Transactions on Audio, Speech, and Language Pro-
cessing, 18(6), pp.1300-1312 (2009).
81. Nguyen, V.X., Nguyen, V.P. and Pham, T.V., Robust speaker identification based on
hybrid model of VQ and GMM-UBM. In 2015 International Conference on Advanced
Technologies for Communications (ATC) (pp. 490-495) IEEE(2015, October).
82. Ling, Z. and Hong, Z., The improved VQ-MAP and its combination with LS-SVM for
speaker recognition. In IEEE Conference Anthology (pp. 1-4) IEEE(2013, January).
83. Ming, J., Stewart, D. and Vaseghi, S., 2005, March. Speaker identification in unknown
noisy conditions-a universal compensation approach. In Proceedings.(ICASSP’05). IEEE
International Conference on Acoustics, Speech, and Signal Processing, 2005. (Vol. 1, pp.
I-617). IEEE.
84. Shao, Y., Srinivasan, S. and Wang, D., 2007, April. Incorporating auditory feature
uncertainties in robust speaker identification. In 2007 IEEE International Conference on
Acoustics, Speech and Signal Processing-ICASSP’07 (Vol. 4, pp. IV-277). IEEE.
85. Shao, Y. and Wang, D., Robust speaker identification using auditory features and
computational auditory scene analysis. In 2008 IEEE International Conference on
Acoustics, Speech and Signal Processing (pp. 1589-1592) IEEE(2008, March).
86. Togneri, R. and Pullella, D., An overview of speaker identification: Accuracy and ro-
bustness issues. IEEE circuits and systems magazine, 11(2), pp.23-61 (2011).
87. Garcia-Romero, D., Zhou, X. and Espy-Wilson, C.Y., Multicondition training of Gaus-
sian PLDA models in i-vector space for noise and reverberation robust speaker recogni-
tion. In 2012 IEEE international conference on acoustics, speech and signal processing
(ICASSP) (pp. 4257-4260) IEEE(2012, March).
88. Zhao, X. and Wang, D., Analyzing noise robustness of MFCC and GFCC features in
speaker identification. In 2013 IEEE international conference on acoustics, speech and
signal processing (pp. 7204-7208) IEEE(2013, May).
89. Cooke, M., Green, P., Josifovski, L. and Vizinho, A., Robust automatic speech recogni-
tion with missing and unreliable acoustic data. Speech communication, 34(3), pp.267-285
(2001).
90. Kuhn, R., Nguyen, P., Junqua, J.C. and Boman, R., Panasonic Corp, Speaker verifica-
tion and speaker identification based on eigenvoices. U.S. Patent 6,141,644 (2000).
91. Vogt, R.J., Baker, B.J. and Sridharan, S., Modelling session variability in text indepen-
dent speaker verification (2005).
92. Kenny, P., Joint factor analysis of speaker and session variability: Theory and algo-
rithms. CRIM, Montreal,(Report) CRIM-06/08-13, 14, pp.28-29 (2005).
93. Kenny, P., Boulianne, G. and Dumouchel, P., Eigenvoice modeling with sparse
training data. IEEE transactions on speech and audio processing, 13(3), pp.345-354 (2005).
94. Kenny, P., Stafylakis, T., Ouellet, P. and Alam, M.J., JFA-based front ends for speaker
recognition. In 2014 IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP) (pp. 1705-1709) IEEE(2014, May).
95. Novoselov, S., Pekhovsky, T., Shulipa, A. and Sholokhov, A., Text-dependent GMM-JFA
system for password based speaker verification. In 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) (pp. 729-737) IEEE(2014, May).
96. Cumani, S. and Laface, P., Speaker recognition using e-vectors. IEEE/ACM Transac-
tions on Audio, Speech, and Language Processing, 26(4), pp.736-748 (2018).
97. Martin, A.F., Greenberg, C.S., Stanford, V.M., Howard, J.M., Doddington, G.R. and
Godfrey, J.J., Performance factor analysis for the 2012 NIST speaker recognition evalua-
tion. In Fifteenth Annual Conference of the International Speech Communication Associ-
ation (2014).
98. Kanagasundaram, A., Dean, D. and Sridharan, S., JFA based speaker recognition using
delta-phase and MFCC features. In SST 2012 14th Australasian International Conference
on Speech Science and Technology (2012, December).
99. Rajan, P., Afanasyev, A., Hautamäki, V. and Kinnunen, T., From single to multiple en-
rollment i-vectors: Practical PLDA scoring variants for speaker verification. Digital Signal
Processing, 31, pp.93-101 (2014).
100. Garcia, A.A. and Mammone, R.J., Channel-robust speaker identification using
modified-mean cepstral mean normalization with frequency warping. In 1999 IEEE Inter-
national Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99
(Cat. No. 99CH36258) (Vol. 1, pp. 325-328) IEEE(1999, March).
101. Juang, B.H., Rabiner, L. and Wilpon, J.G., On the use of bandpass liftering in speech
recognition. IEEE Transactions on acoustics, speech, and signal processing, 35(7), pp.947-
954 (1987).
102. Paliwal, K.K., Decorrelated and liftered filter-bank energies for robust speech recogni-
tion. In Sixth European Conference on Speech Communication and Technology (1999).
103. Chapaneri, S.V., Spoken digits recognition using weighted MFCC and improved fea-
tures for dynamic time warping. International Journal of Computer Applications, 40(3),
pp.6-12 (2012).
104. Colibro, D., Vair, C., Castaldo, F., Dalmasso, E. and Laface, P., Speaker recognition
using channel factors feature compensation. In 2006 14th European Signal Processing
Conference (pp. 1-5) IEEE(2006, September).
105. Aronowitz, H. and Aronowitz, V., Efficient score normalization for speaker recognition.
In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (pp.
4402-4405) IEEE(2010, March).
106. Büyük, O. and Arslan, M.L., Model selection and score normalization for text-
dependent single utterance speaker verification. Turkish Journal of Electrical Engineering
and Computer Science, 20(Sup. 2), pp.1277-1295 (2012).
107. Zheng, R., Zhang, S. and Xu, B., A comparative study of feature and score normal-
ization for speaker verification. In International Conference on Biometrics (pp. 531-538).
Springer, Berlin, Heidelberg (2006, January).
108. Bolt, R.H., Cooper, F.S., David, E.E., Denes, P.B., Pickett, J.M. and Stevens, K.N.,
Identification of a speaker by speech spectrograms. Science, 166(3903), pp.338-343 (1969).
109. Kenny, P., Ouellet, P., Dehak, N., Gupta, V. and Dumouchel, P., A study of
interspeaker variability in speaker verification. IEEE Transactions on Audio, Speech, and
Language Processing, 16(5), pp.980-988 (2008).
110. Brookes, M., Voicebox: Speech processing toolbox for matlab. Software, available
[Mar. 2011] from www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html, 47 (1997).
111. Nabney, I., NETLAB: algorithms for pattern recognition. Springer Science &
Business Media (2002).
112. Sawada, H., Mukai, R., Araki, S. and Makino, S., A robust and precise method
for solving the permutation problem of frequency-domain blind source separation. IEEE
transactions on speech and audio processing, 12(5), pp.530-538 (2004).