Closed-set Speaker Identification Using VQ and
GMM Based Models
Bidhan Barai · Tapas Chakraborty · Nibaran Das · Subhadip Basu · Mita Nasipuri
Abstract An array of features and methods have been developed over the past six decades for Speaker Identification (SI) and Speaker Verification (SV), jointly known as Speaker Recognition (SR). Mel Frequency Cepstral Coefficients (MFCC) are generally used as feature vectors in most cases because they give higher accuracy than other features. The present paper focuses on a comparative study of state-of-the-art SR techniques along with their design challenges, robustness issues and performance evaluation methods. Rigorous experiments have been performed using the Gaussian Mixture Model (GMM) with variations such as the Universal Background Model (UBM), Vector Quantization (VQ) and VQ based UBM-GMM (VQ-UBM-GMM), with detailed discussion. Other popular methods, namely Linear Discriminant Analysis (LDA), Probabilistic LDA (PLDA), Gaussian PLDA (GPLDA), Multi-condition GPLDA (MGPLDA) and the Identity Vector (i-vector), are included for comparison only. Three popular audio data-sets have been used in the experiments, namely IITG-MV SR, Hyke-2011 and ELSDSR. Hyke-2011 and ELSDSR contain clean speech, while IITG-MV SR contains noisy audio data with variations in channel (device), environment and spoken style. We propose a new data mixing approach for SR to make the system independent of recording device, spoken style and environment. The accuracies obtained with VQ and GMM based methods on the Hyke-2011 and ELSDSR databases vary from 99.6% to 100%, whereas the accuracy for IITG-MV SR is up to 98%. Indeed, in some cases the accuracies degrade drastically due to mismatch between training and testing data as well as the singularity problem of the GMM. The experimental results serve as a benchmark for VQ/GMM/UBM based methods for the IITG-MV SR database.

Jadavpur University
Department of Computer Science & Engineering
E-mail: bidhanbarai.rs@jadavpuruniversity.in
E-mail: ju.tapas@gmail.com
E-mail: nibaranju@gmail.com
E-mail: subhadip.basu@jadavpuruniversity.in
E-mail: mitanasipuri@gmail.com
Keywords MFCC · VQ · GMM · i-Vector · PLDA
1 Introduction
SR is a branch of biometric recognition in which the speaker specific psycho-physiological characteristics of the speech waveform are analysed to uniquely recognise an individual speaker from his/her voice signal [1,2]. These characteristics include both vocal tract characteristics (spectral features) and voice source characteristics (supra-segmental features) of speech. Features are the attributes by which individual entities (speakers) are identified uniquely. A set of features together forms a feature vector (generally, a vector is an array of numbers). The process (or steps) of computing feature vector(s) is known as feature extraction. SR is an example of a typical Pattern Recognition (PR) problem.
Any conventional PR method consists of two basic steps, Feature Extrac-
tion/Selection and Modelling/Classification [3,4]. In SR, speaker specific fea-
tures are extracted first from each of the voice signals available in the database,
and then a model is built for each class (for SR, each class represents a speaker)
in the database. This process is known as Training/Enrolment. When the voice
sample of an unknown speaker is available for SR, the same set of features is extracted in a similar manner. This set of features of the unknown speaker (test data) is then compared with every model of the known voice samples (for identification), and a statistical distance or a score for the voice sample of the unknown speaker is computed with respect to all the known speakers' models. The minimum distance or maximum score (any one measure or a combination of measures) identifies (classifies) the unknown speaker as the speaker corresponding to that model. This process is known as Testing. In this step we use all the speaker models for classification. For example, in GMM based SR using the MFCC feature, we first compute MFCC feature vectors (13 MFC coefficients) from the speech signals of all the speakers to train the GMMs of all speakers (this is done from the training data); next, from the test speech signal the MFCCs are extracted in a similar fashion to compute scores (or similarity measures) with respect to every enrolled speaker (or trained speaker model) for identification purposes.
In a typical SR experiment, each enrolled speaker provides a single score, and the optimum score determines the classified speaker (optimum because, if we use a distance measure, the minimum distance gives the classified speaker, whereas if we use a probability measure, the maximum probability gives the classified speaker). Indeed, SR using GMM was introduced before 1992, and later many modifications were made to this approach. In this paper, we study
the model based SR using short-time spectral features with the help of the ap-
proaches mentioned above and also provide some features and methods from
speech recognition which are as well useful for SR because SR and speech
recognition share some common characteristics, features and methodologies.
SR using Super Vector is an example of modification of GMM where a super
vector is formed by concatenating the means of GMM. Here each speaker is
represented by a high dimensional super vector, called high level feature, rather
than a set of MFCC vectors.
1.1 Classification of SR
SR is classified into three groups, namely (a) Speaker Identification (SI) and Speaker Verification (SV), (b) Text-dependent and Text-independent, and (c) Closed-set and Open-set. SI is the type of SR where we are required to determine the identity of an unknown speaker, i.e., which speaker among the enrolled speakers is speaking. In contrast to SI, SV is the task of authenticating an unknown speaker's identity, i.e., we are required to verify whether the claim of an unknown speaker will be accepted or rejected by the SR system. Among these two types of SR, SV is the more popular one because of
its application in access control and security. In text-dependent SR the text
(content of speech) is fixed (or same) for training and testing speech data
whereas in text-independent SR the text of training and testing speaker is
not fixed (or the same) [13]. Finally, in closed-set SR it is known that the unknown speaker is one of the enrolled speakers, but we do not know which one, whereas in open-set SR the unknown speaker may or may not be present among the enrolled speakers. Among these types of SR, open-set text-independent SI (OSTI-SI) is known to be the most challenging class. In OSTI-SI, the score of the unknown speaker is compared with the scores of all the enrolled speakers using a decision function to determine 1) whether the unknown speaker is one of the enrolled speakers, and 2) if yes, which of the enrolled speakers he/she is. Tasks 1) and 2) are accomplished simultaneously by adding a complementary model, which is built from the speech data of all speakers except the enrolled speakers. This also means that an OSTI-SI problem can be solved only if we have a comparatively large amount of speech data from speakers outside the enrolled set. The nature of the OSTI-SI problem is quite different from SV; indeed, SV is always an open-set SR problem [5,13].
1.2 Challenges in SR
SR is expanding day by day with a broad range of applications, yet deploying an SR system of high accuracy for real time applications is still challenging. The performance of an SR system degrades considerably due to mismatches between training and testing conditions. The factors that play crucial roles in a high performance SR system are discussed as follows [2]:
Noisy Environment: Speech signals acquired from different speakers for designing an SR system may be contaminated with various types of noise, namely convolutional noise, additive noise, reverberation noise (speech containing echoes), random noise, impulse noise, white noise and so on. The details of these noises are found in [6,7,8,9,13].
Environmental Mismatch: It is extremely difficult to accumulate the speech
signal from the different speakers in the same environment for training and
testing. Accuracy of SR system is highly dependent upon the mismatch be-
tween training and testing environment [4,13].
Channel Mismatch: The recognition accuracy of SR degrades drastically when there is a recording device (also known as channel) mismatch between training and testing data, as will be observed in Section 5 of the present paper.
Spoken Style Mismatch: Spoken style is also a very important issue in de-
signing an SR system because it has significant effect on the performance of
the SR system. In the experiments we have used two spoken styles: reading and conversation. In the IITG-MV SR database, in the case of conversation two speakers speak with each other; the recorded speech signal is then processed to separate the speakers, and the individual speaker's segments are combined to create the complete speech of each of the two speakers [10]. We shall see that the performance of an SR system degrades significantly if there is a mismatch of spoken style between training and testing utterances.
Language of Utterance Mismatch: A language mismatch between training and testing data also has a significant effect on the recognition accuracy of SR, but it does not affect accuracy as greatly as device and environment mismatch [2].
Short Utterance: Acquiring speech signals of sufficient duration for training and testing is very difficult when designing an SR system, so sometimes we are bound to design a system with limited data (3 to 5 seconds or less). The length of the speech, or duration of the utterance, plays an important role in SR: a very short utterance degrades the recognition accuracy considerably. Mandasari et al. [11] examined the effect of short utterances
and proposed a calibration strategy to model the calibration parameters using
Quality Measure Functions (QMFs) for SR to improve the recognition accu-
racy [16].
Long Utterance: If the available data for the design is very large, then the data must be reduced using data reduction techniques, which may lead to the loss of significant information. The accumulation and annotation of a large amount of data are also very difficult.
The SR field has a long research history of about six decades, but due to the above difficulties (challenges) performance still degrades, and SR research continues to motivate researchers around the world to increase the performance of SR systems. SR researchers have developed an array of features, feature extraction techniques, methodologies and scoring techniques to combat the above difficulties. These methods and techniques involve several steps, which lead to an increase in the number of system components. Hence, SR systems are becoming more complex as the research progresses, which leads to other difficulties such as the time complexity of the SR system. One example of such a technique is the i-vector [12].
The advancement of Deep Learning (DL) techniques [29] has had an immense impact on SR research; in particular, they improve the recognition performance of SR systems. Deep learning techniques such as the Convolutional Neural Network (CNN) [32,29] require a large volume of balanced data to train a model properly; otherwise the approaches may not perform well. Generally, the spectrogram [17], which is generated from the speech signal, is used as input to a CNN. In [17], Chakraborty et al. used CNN methods for SR on the IITG-MV SR database along with others like VQ and GMM/UBM-GMM. On that data-set, VQ and GMM/UBM-GMM performed better than the CNN because some speakers in the database have very short utterances (such as 22 seconds) compared to other speakers (the average speech duration is about 5 minutes), which makes the data unbalanced; short utterances can decrease the performance of both GMM/UBM-GMM and CNN based SR, but CNNs are more susceptible to short utterances than GMM/UBM-GMM [18]. Hence, we restrict our study to VQ and GMM based methods; although, very recently, to address the problem of short utterances researchers have proposed different CNN and hybrid CNN based models [18,19,25,29], we have not discussed the impact of these CNN based models on the IITG-MV SR database in the present paper.
Another aspect of this paper is that, although a rich number of references is given, the paper has been written in such a way that a reader can easily implement an SR system using the MFCC feature and a VQ-UBM-GMM based classifier. We can observe that there is a broad range of application areas of SR. Some examples of applications of SR are authentication, forensic SR [22], multi-speaker tracking [23], singer identification [14], security and surveillance, personalized user interfaces and access control [24].
1.3 Contributions
In the previous sub-section, we have listed the major research challenges in-
volved in text-independent closed-set SR. In this paper, however, we have
attempted to address some of those aspects specifically related to, 1) record-
ing device independence, 2) spoken style independence, 3) environment/session
independence. Major contributions of this paper may be highlighted as follows:
– A comprehensive survey of different text-independent closed-set SR techniques has been presented.
– Extensive SR experimentation is performed over various GMM based classifiers.
– The IITG-MV SR database is used to evaluate benchmark performances.
– The singularity problem of the GMM has been studied.
– We propose a new data mixing approach for SR and examine its accuracy under various mismatch/dependent cases, involving variability in recording devices, spoken styles, and environments/sessions.
The rest of the paper is organized in the following way. Section 2 is dedicated to the overall system design with state-of-the-art SR and the evaluation strategies to measure the performance of SR systems. Modelling and classification are not described in separate subsections because classification depends on the modelling method; hence modelling and classification are described in a single subsection 2.2. In section 3 we describe various state-of-the-art methods to combat difficulties in designing an SR system. The conceptual comparison of features and methods is presented in section 4. In section 5 the performances, with diagnostic analysis, of some existing SR systems are reported along with our experiments on three databases, namely the Hyke-2011 [20], ELSDSR [21] and IITG-MV SR [10] databases. Finally we conclude the presented paper.
2 Overall Design and State-of-the-art SR
SR is basically a PR problem, and every PR problem has two basic steps: feature extraction/selection and modelling/classification. In the training phase,
speaker specific information is extracted using various digital signal processing
(DSP) techniques and algorithms from speech signal of every speaker. Use of
DSP tools helps to transform each speaker’s speech data (i.e. feature vector or
set of feature vectors) in such a way that each speaker is uniquely identifiable.
Then a model (for example GMM) is built over the feature vectors in feature
space. Now, every speaker is represented by the speaker model which is char-
acterized by model parameters. In the testing phase, the feature vectors are
extracted in the similar fashion from the test/unknown speaker’s speech sig-
nal for classification. During classification, speaker specific models were built
initially (training). Then feature vectors of unknown speaker was compared
(using score or similarity measure) with respect to all speakers for decision.
In SI, we make the decision that who is the unknown speaker among all the
enrolled speakers and in SV, we decide whether the test/unknown speaker who
claimed her/his identity is accepted or rejected by the SR system.
2.1 State-of-the-art Feature Description
In feature extraction step the speaker’s raw data (speech waveform) is mapped
into a measurable space, known as feature space with the help of Digital Signal
Processing (DSP) techniques, where every speaker is uniquely distinguishable
from each other. Mathematically, feature extraction is a mapping $f : \mathbb{R} \mapsto \mathbb{R}^D$, i.e., $\mathcal{Y} = f(\mathcal{X})$, by which raw data $X$ in a space $\mathcal{X}$ is transformed into the $D$-dimensional feature space $\mathcal{Y}$ [26], producing a feature vector or a set of feature vectors $Y$ in the feature space. Here, $f$ can be viewed as the process of obtaining a feature vector $\mathbf{y} : \mathbf{y} \in \mathcal{Y}$. Indeed, a speaker in a feature space is
represented by a single feature vector or by a set of feature vectors. Short-Time
Signal Analysis (STSA)/Short-Time Fourier Transform(STFT) of segmented
speech waveform is the most popular and powerful DSP tool used to extract
features for SR. Most acoustic features used for SR are extracted with the help of the STSA technique.

Fig. 1: Block Diagram of Speaker Identification (SI)

Fig. 2: Block Diagram of Speaker Verification (SV)

Examples of such features, other than MFCC and
Gammatone Filter Cepstral Coefficients(GFCC) are Linear frequency cepstral
coefficient(LFCC) [27,28], Linear Predictive Cepstral Coefficients(LPCC) [30,
28], Perceptual Linear Predictive Cepstral Coefficients(PLPCC). Among these
features MFCC, LFCC, GFCC [31,33] are based on filterbank analysis and
LPCC, PLPCC are based on LP analysis [34,35,36], however, they all are the
types of STFT that use Fast Fourier Transform (FFT). In STFT the complete
speech signal of a speaker (say a 10 second long speech) is segmented into small frames (say 25 millisecond long speech frames).

Fig. 3: Complete Block Diagram of MFCC Computation

Generally, each frame provides
a feature vector in feature space. Hence we shall get a set of feature vectors
from the complete speech signal. Among state-of-the-art features, MFCC is the
most popular and useful feature for SI. A complete description of the computation of the MFCC feature is found in [2,3,4]. Here, we provide a brief description of the MFCC vector computation as a block diagram in Fig. 3.
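As an illustration of the pipeline in Fig. 3, a minimal sketch of frame-wise MFCC extraction is given below, assuming the librosa library is available; the file name, frame sizes and 13-coefficient setting are illustrative choices rather than values prescribed by this paper.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Return a (T, n_mfcc) array of frame-wise MFCC vectors for one utterance."""
    # Load the waveform and resample it to a common sampling rate.
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms frames with a 10 ms hop, 26 mel filters, 13 cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr),
                                n_mels=26)
    # librosa returns (n_mfcc, T); transpose so that each row is one frame.
    return mfcc.T

# Example: X has one 13-dimensional row vector per 25 ms frame.
# X = extract_mfcc("speaker01_utt01.wav")
```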
Generally, in SI and SV, after feature extraction every speaker is represented by a sequence (or set) of $D$-dimensional feature vectors $X = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T)$, where $T$ is the total number of feature vectors and each $\mathbf{x}_\tau$, $\tau \in \{1, 2, \ldots, T\}$, is represented by a $D$-dimensional column vector to meet conventional mathematical notation. Indeed, after feature extraction we get a $D$-dimensional row vector from every 25 ms frame (an array of real numbers), and for mathematical notation we consider every row vector as a column vector. It is important to note that the computation of GFCC is similar to MFCC; the only difference is that in GFCC a Gammatone filter bank is used rather than the mel filter bank. The equation of a gammatone filter in the time domain is as follows:
$$g_m(t) = a\,t^{\,n-1} e^{-2\pi b_m t} \cos(2\pi f_c t + \phi) \qquad (1)$$

where $a$ is a constant (usually $a = 1$), $n$ is the filter order (usually $n = 4$), $\phi$ is the phase shift, $f_c$ is the centre frequency and $b_m$ is the attenuation factor of the $m$th filter. The complete description of GFCC computation is found in [46,47,48,49].
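As a reference for eqn. (1), the gammatone impulse response can be sampled directly with a few lines of NumPy. This is only a sketch of the filter shape under assumed parameter values ($a = 1$, $n = 4$, and an illustrative centre frequency and bandwidth), not a full GFCC front end.

```python
import numpy as np

def gammatone_ir(fc, b, fs=16000, dur=0.025, a=1.0, n=4, phi=0.0):
    """Sample the gammatone impulse response g_m(t) of eqn. (1)."""
    t = np.arange(int(dur * fs)) / fs
    return a * t**(n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)

# Example: a filter centred at 1 kHz with an assumed bandwidth parameter of 125 Hz.
# g = gammatone_ir(fc=1000.0, b=125.0)
```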
In SR, the feature extraction process converts the complete speech waveform into a set of feature vectors which can distinguish different speakers. The LFCC, GFCC and MFCC are computed directly from the Fast Fourier Transform (FFT) power spectrum, but LPCC and PLPCC are obtained using an all-pole model to represent the smooth spectrum. The above STSA based features use only the magnitude spectrum (the power spectrum is the squared magnitude spectrum) and the phase spectrum is discarded. Indeed, phase spectrum based features, like the Group Delay Function (GDF), Modified GDF (MODGDF) [37], Instantaneous Frequency (IF) [38,39] and Instantaneous Frequency Deviation (IFD) [40], are often used to extract complementary speaker information [41,42,43]. The extraction of a complementary feature set is given in [30]. Indeed, MFCC features are used not only in SR but also in speaker gender and age classification, speech recognition and language recognition [44,45].
2.2 Modelling, State-of-the-art Models and Classification
In 2.1 we described how the raw speech waveform of a speaker is transformed into the measurable MFCC feature space of dimension $D$. In this space every training speaker is represented by a set of feature vectors $X$, known as a template (voice print). For $S$ speakers, we get $S$ sets $X_{train}(\zeta)$ where $\zeta = 1, 2, \ldots, S$. If an unknown speaker's speech signal is presented to the SR system for recognition, the system computes the feature set $X_{test}$, i.e., the set of feature vectors of the unknown speaker. Indeed, in this paper we are interested in closed-set SR. The recognition can be carried out by a template matching strategy, where $X_{test}$ is compared with $X_{train}(\zeta)$ for $\zeta = 1, 2, \ldots, S$ and some distance is calculated using frame by frame (vector by vector) comparison for each $\zeta$ without using speaker modelling; the minimum of the $S$ numerical distances (or the maximum of the $S$ numerical scores) corresponding to an enrolled speaker's feature set gives the class of the unknown speaker. But if the sizes of the sets $X_{train}(\zeta)$ and $X_{test}$ are large, this strategy becomes cumbersome and the SR system cannot be used in real time. This problem is solved using statistical modelling, statistical distances and scoring techniques. VQ and GMM are examples of such approaches.
2.2.1 VQ Based SR
In VQ the vectors in the feature space $\mathcal{X}$ for a training speaker are grouped into $K$ distinct regions, where $K \ll T$, and a reconstruction vector is defined to represent each region. Therefore we get a set of $K$ reconstruction vectors. This collection of reconstruction vectors $\mathcal{K}_c$ is known as the codebook. VQ is basically a data condensation technique [26]. Mathematically speaking, in VQ for the $\zeta$th speaker the complete training data set (MFCCs for training) $X_{train}(\zeta)$ is mapped/grouped into $K$ clusters, where each cluster is represented by its centroid and the set of $K$ centroids is the codebook $\mathcal{K}_c(\zeta)$. For identification, the MFCCs of the unknown speaker (testing data set) $X_{test}$ are compared with $\mathcal{K}_c(\zeta)$ of all speakers, $\zeta \in [1, S]$, to compute distances (for example, Manhattan, Euclidean, etc.) or scores (for example, reciprocals of distances). The minimum distance or maximum score provides the identified speaker.
Let us assume the feature space $\mathcal{X}$ of a training speaker contains a total of $T$ feature vectors $X_{train} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}$. A vector quantizer which provides minimum distortion is known as a Voronoi or Nearest-Neighbour (NN) quantizer. There are plenty of algorithms, like Linde-Buzo-Gray (LBG) [50], Self-Organizing Map (SOM) [51] and Principal Component Analysis (PCA) based LBG [52], available to compute the codebook efficiently. The LBG algorithm is very similar to the k-means clustering algorithm in that it takes a set of vectors $X_{train} = \{\mathbf{x}_i \in \mathbb{R}^D : i = 1, 2, \ldots, T\}$ as input and generates a set of reconstruction vectors $\mathcal{C} = \{\mathbf{c}_j \in \mathbb{R}^D : j = 1, 2, \ldots, C\}$, with a user-defined $C \ll T$, as output according to the similarity measure. To construct a vector quantizer we generally take $D = 13$ (the dimension of the MFCC vector) and $C = 256$, $512$ or $1024$. Indeed, the convergence of the LBG algorithm [50] depends on the initialization of the codebook $\mathcal{C}$, a distortion measure and a threshold used during the implementation. A sufficient number of iterations is required to guarantee convergence of the algorithm. In this way we compute the codebooks of all $S$ speakers, represented by $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_S$.
For the classification of an unknown speaker the SR system first maps/transforms the unknown speaker's speech waveform into the MFCC feature space $\mathcal{X}$. Let $X_{test} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_P\}$ be the set of $P$ MFCC feature vectors of the unknown/test speaker. Therefore, we now have one test set $X_{test}$ and the $S$ codebook models of all the speakers, $\mathcal{C}_i$ for $i = 1, 2, \ldots, S$. We compute similarity measures $\hat{D}_i$ between $X_{test}$ and $\mathcal{C}_i$ for $i = 1, 2, \ldots, S$. Generally, the similarity measure is computed with the help of the Euclidean distance. In this method we first compare $X_{test}$ and the 1st speaker's codebook $\mathcal{C}_1$: we find the Euclidean distances between a test vector $\mathbf{x}_1$ and all the code vectors of $\mathcal{C}_1$ and take the minimum distance. Similarly, we compute the distances between $\mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_P$ and all the code vectors in $\mathcal{C}_1$, each time taking the minimum distance. Hence we get $P$ minimum distances, which we sum up to obtain a single distance $\hat{D}_1$ between the codebook of the 1st speaker, $\mathcal{C}_1$, and the test speaker's set of vectors $X_{test}$. Similarly, we compute the distances between $X_{test}$ and the other speakers' codebooks $\mathcal{C}_2, \mathcal{C}_3, \ldots, \mathcal{C}_S$, represented by $\hat{D}_2, \hat{D}_3, \ldots, \hat{D}_S$ respectively. Hence, mathematically, we can define a set membership function $D_i(\cdot)$ as follows:

$$D_i(X_{test}, \mathcal{C}_i) = \hat{D}_i = \frac{1}{P} \sum_{j=1}^{P} \min_{\mathbf{c}_k \in \mathcal{C}_i} \|\mathbf{x}_j - \mathbf{c}_k\|_2, \quad \mathbf{x}_j \in X_{test} \qquad (2)$$

for $k \in [1, C]$, where $i = 1, 2, \ldots, S$ and $\|\cdot\|_2$ represents the 2-norm (Euclidean distance). Among all the $\hat{D}_i$'s, the minimum value provides the identified speaker. Therefore, we can write mathematically

$$\text{Test speaker identified as } \hat{S} = \arg\min_{i \in [1, S]} \hat{D}_i \qquad (3)$$

Hence the test/unknown speaker is identified as the $\hat{S}$th speaker among the total of $S$ orderly arranged training speakers.
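A minimal sketch of the VQ enrolment and scoring of eqns. (2)-(3) is given below. It uses k-means from scikit-learn as a stand-in for the LBG algorithm (the text notes the two are very similar); the codebook size and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(X_train, C=256, seed=0):
    """Cluster a speaker's (T, D) MFCC matrix into a (C, D) codebook."""
    km = KMeans(n_clusters=C, n_init=5, random_state=seed).fit(X_train)
    return km.cluster_centers_

def vq_distance(X_test, codebook):
    """Eqn. (2): average minimum Euclidean distance over the P test vectors."""
    # Pairwise distances between the P test vectors and the C code vectors.
    d = np.linalg.norm(X_test[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(X_test, codebooks):
    """Eqn. (3): the speaker whose codebook gives the smallest distance."""
    scores = [vq_distance(X_test, cb) for cb in codebooks]
    return int(np.argmin(scores))

# codebooks = [train_codebook(X) for X in training_mfccs]   # one per speaker
# speaker_id = identify(test_mfcc, codebooks)
```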
2.2.2 GMM Based SR
In GMM the vectors in the feature space $\mathcal{X}$ for a training speaker are fit to a GMM which is characterized by model parameters. This means that using the MFCCs of a training speaker we build a GMM, denoted by $p(\mathbf{x}|\Phi)$, which is the sum of $M$ weighted multivariate Gaussian components and is defined/characterized by weights $\omega_h$, mean vectors $\boldsymbol{\mu}_h$ and covariance matrices $\Sigma_h$ for $h \in [1, M]$. Mathematically, the $M$-component GMM for the $s$th speaker is

$$p(\mathbf{x}|\Phi_s) = \sum_{h=1}^{M} \omega_h f_h(\mathbf{x}; \phi_h) \qquad (4)$$

where the weights $\omega_h$ represent the fractions of data points belonging to the $h$th component and sum to 1 ($\sum_{h=1}^{M} \omega_h = 1$), the functions $f_h(\cdot)$, $h = 1, 2, \ldots, M$, are the component density functions and $\Phi$ is the set of parameters $\Phi = \{\omega_h, \phi_h : h = 1, 2, \ldots, M\}$ [26]. For GMM, the $f_h(\cdot)$ are the multivariate Gaussian probability density functions (pdfs) $\mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_h, \Sigma_h)$ and $\phi_h = \{\boldsymbol{\mu}_h, \Sigma_h\}$ for $h = 1, 2, \ldots, M$. During identification, using the MFCCs of the unknown speaker and the GMM of the $\zeta$th speaker we compute the $\zeta$th score $S_\zeta$, where $\zeta = 1, 2, \ldots, S$. We compute the scores for all $S$ speakers and the maximum score provides the identified speaker. Here

$$\mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_h, \Sigma_h) = \frac{1}{(2\pi)^{D/2} |\Sigma_h|^{1/2}} e^{-\frac{1}{2}(\mathbf{x}_t - \boldsymbol{\mu}_h)' \Sigma_h^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_h)} \qquad (5)$$

where $(\mathbf{x}_t - \boldsymbol{\mu}_h)'$ represents the transpose of the column vector $(\mathbf{x}_t - \boldsymbol{\mu}_h)$.
The GMM based classifier for SR is a popular state-of-the-art approach that is used extensively in SI and SV systems. Let us assume that the feature space $\mathcal{X}$ of a training speaker contains a total of $T$ feature vectors $X_{train} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}$. In GMM the data (feature vectors) are fit to a sum of weighted multidimensional Gaussian curves (or distributions, where the Gaussian distributions differ by parameters only), just as in polynomial curve fitting. Generally, a mixture model (like GMM) approximates the data distribution by fitting $M$ component density functions (or pdfs) $f_h$, $h = 1, 2, \ldots, M$, to the data set $X_{train}$ having $T$ patterns (or feature vectors). Let the random vector $\mathbf{x} \in X_{train}$ be an arbitrary pattern; then the mixture model density function $p(\mathbf{x}|\Phi)$ evaluated at $\mathbf{x}$ is [26]:

$$p(\mathbf{x}|\Phi) = \sum_{h=1}^{M} \omega_h \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_h, \Sigma_h) \qquad (6)$$

Now we are in a position to estimate the set of model parameters $\Phi = \{\omega_h, \boldsymbol{\mu}_h, \Sigma_h : h \in [1, M]\}$. To do so we apply a very popular technique, Maximum Likelihood Parameter Estimation (MLE). In this technique our aim is to gradually maximize the following probability:

$$p(X_{train}|\Phi) = \prod_{t=1}^{T} p(\mathbf{x}_t|\Phi) \qquad (7)$$
with the help of Expectation Maximization (EM) iterations using the following equations:

$$\omega_h = \frac{1}{T} \sum_{t=1}^{T} p(h|\mathbf{x}_t, \Phi), \quad h \in [1, M] \qquad (8)$$

$$\boldsymbol{\mu}_h = \frac{\sum_{t=1}^{T} p(h|\mathbf{x}_t, \Phi)\, \mathbf{x}_t}{\sum_{t=1}^{T} p(h|\mathbf{x}_t, \Phi)}, \quad h \in [1, M] \qquad (9)$$

$$\Sigma_h = \frac{\sum_{t=1}^{T} p(h|\mathbf{x}_t, \Phi)\,(\mathbf{x}_t - \boldsymbol{\mu}_h)(\mathbf{x}_t - \boldsymbol{\mu}_h)'}{\sum_{t=1}^{T} p(h|\mathbf{x}_t, \Phi)}, \quad h \in [1, M] \qquad (10)$$

where $p(h|\mathbf{x}_t, \Phi)$ is a conditional probability found by Bayes' theorem as follows:

$$p(h|\mathbf{x}_t, \Phi) = \frac{\omega_h \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_h, \Sigma_h)}{\sum_{j=1}^{M} \omega_j \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_j, \Sigma_j)}, \quad h \in [1, M] \qquad (11)$$
Now we describe the EM iteration briefly. The complete block diagram of the EM iteration for MLE is shown in Fig. 4. The iteration starts with an initial guess of $\omega_h$, $\boldsymbol{\mu}_h$ and $\Sigma_h$ for $h \in [1, M]$. Let the initial values be $\omega_h^0$, $\boldsymbol{\mu}_h^0$ and $\Sigma_h^0$ for $h \in [1, M]$. Here $\omega_h^0 = \frac{1}{M}$ for all $h \in [1, M]$. The means $\boldsymbol{\mu}_h^0$ for all $h \in [1, M]$ are computed using the k-means algorithm, and corresponding to every $\boldsymbol{\mu}_h$ we compute $\Sigma_h$ in the usual way. In the first iteration, equations (8)-(10) are evaluated with the values $\omega_h^0$, $\boldsymbol{\mu}_h^0$, $\Sigma_h^0$ to compute new values, say $\omega_h^1$, $\boldsymbol{\mu}_h^1$, $\Sigma_h^1$, which are then taken as the initial values to compute new values $\omega_h^2$, $\boldsymbol{\mu}_h^2$, $\Sigma_h^2$ in the second iteration. Hence, in general, in the $i$th iteration the final values are $\omega_h^i$, $\boldsymbol{\mu}_h^i$, $\Sigma_h^i$. In the experiments we choose $i = 5$, determined experimentally, since after 5 iterations there is very little difference between the old values $\omega_h^{(i-1)}$, $\boldsymbol{\mu}_h^{(i-1)}$, $\Sigma_h^{(i-1)}$ and the new values $\omega_h^i$, $\boldsymbol{\mu}_h^i$, $\Sigma_h^i$. This technique is analogous to the solution of polynomial equations using the fixed point iterative method.

Fig. 4: Complete Block Diagram of EM Iteration for Model (GMM) Parameters ($\omega_i$, $\boldsymbol{\mu}_i$ and $\Sigma_i$) Estimation
So far we have described the steps to build the speakers' GMM models for enrolment. Now we describe the identification/classification method using the GMMs of the enrolled speakers. To do so, a popular technique called Maximum Log-Likelihood (MLL) scoring is available for SR, which uses the minimum error Bayes' decision rule. Let there be $S$ speakers, $\mathcal{S} = \{1, 2, \ldots, S\}$, whose corresponding enrolled models are $\Phi_1, \Phi_2, \ldots, \Phi_S$. Let the set of feature vectors of the test/unknown speaker be $X_{test} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_P\}$. The task is to find the speaker model with maximum posterior probability, i.e., the maximum MLL score, which in turn leads to the identified speaker (the index or ID of the maximum score is returned). Therefore, using the minimum error Bayes' decision rule, the task is carried out by the likelihood function and the score $S_\xi$ is defined by the following equation:

$$S_\xi = \frac{\mathcal{L}\big(p(X_{test}|\Phi_\xi)\big)\, Pr(\Phi_\xi)}{p(X_{test})} \qquad (12)$$
where $Pr(\Phi_\xi)$ is the prior probability of the $\xi$th speaker model and $p(X_{test})$ is the probability of the test speaker's data. Here we assume that the models and speakers are equally likely; then we are not required to calculate $Pr(\Phi_\xi)$ and $p(X_{test})$ because they remain the same for every test case. Hence we simply drop $\frac{Pr(\Phi_\xi)}{p(X_{test})}$ from eqn. (12), which leads to

$$S_\xi = \mathcal{L}\big(p(X_{test}|\Phi_\xi)\big) \qquad (13)$$

where $\mathcal{L}(\cdot)$ is the likelihood function defined by

$$\mathcal{L}\big(p(X_{test}|\Phi_\xi)\big) = \prod_{t=1}^{P} p(\mathbf{x}_t|\Phi_\xi) \qquad (14)$$

where $\mathbf{x}_t \in \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_P\}$. This equation signifies that the $\Phi_\xi$'s are given by the enrolled speaker models and we put the feature vectors of $X_{test}$ into eqn. (13). We evaluate eqn. (13) for all the $\Phi_\xi$, $\xi = 1, 2, \ldots, S$, keeping $X_{test}$ fixed, to get $S$ likelihood values. The speaker with the maximum likelihood value is the identified speaker. However, to simplify the computation we often take the logarithm of the likelihood values, as given by the following equation:

$$S_\xi^{log} = \log\Big(\mathcal{L}\big(p(X_{test}|\Phi_\xi)\big)\Big) = \log\Big(\prod_{t=1}^{P} p(\mathbf{x}_t|\Phi_\xi)\Big) \qquad (15)$$

This is known as MLL scoring. We know that a probability value is always $\leq 1$, so eqn. (14) has a drawback: if $P$ is very large then the product tends to 0. Taking the logarithm turns the product into a summation and eliminates the problem:

$$S_\xi^{log} = \sum_{t=1}^{P} \log\big(p(\mathbf{x}_t|\Phi_\xi)\big) \qquad (16)$$

The identified speaker is the one who has the maximum MLL score, given by

$$\hat{S} = \arg\max_{\xi \in \mathcal{S}} \big(S_\xi^{log}\big) \qquad (17)$$
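A compact sketch of GMM enrolment and MLL scoring (eqns. (6)-(17)) is shown below, using scikit-learn's EM implementation instead of the hand-written iterations described above; the number of components and the diagonal covariance choice are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(X_train, M=64, seed=0):
    """Fit an M-component GMM to a speaker's (T, D) MFCC matrix via EM."""
    gmm = GaussianMixture(n_components=M, covariance_type='diag',
                          max_iter=100, random_state=seed)
    return gmm.fit(X_train)

def mll_score(X_test, gmm):
    """Eqn. (16): sum of per-frame log-likelihoods log p(x_t | Phi)."""
    return gmm.score_samples(X_test).sum()

def identify(X_test, gmms):
    """Eqn. (17): the enrolled model with the maximum MLL score."""
    return int(np.argmax([mll_score(X_test, g) for g in gmms]))

# gmms = [train_gmm(X) for X in training_mfccs]   # one model per speaker
# speaker_id = identify(test_mfcc, gmms)
```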
Owing to the difficulties mentioned in 1.2, SR is still developing and remains a centre of interest among researchers. There are plenty of features, and several approaches have been applied to design and evaluate SR systems [10]. The i-vector technique has remained the state-of-the-art technique for SR over the last few years [12]. However, other features, like spectral features - Formant Frequencies ($F_1$, $F_2$, $F_3$), Pitch Contours [1], Phase Information [31,53]; features derived from short-time processing of the speech signal - static MFCC and dynamic MFCC (1st and 2nd order derivatives of static MFCC, denoted by $\Delta$MFCC and $\Delta^2$MFCC respectively) [54], Spectral-temporal Receptive Fields (STRF) and MFCC balanced features, Autocorrelation, Zero Crossing Rate (ZCR), Harmonic Features, Auditory-Based Features [55], Group Delay Features and Modified Group Delay Features (MODGDF) [37], Mel Filter Bank Energy-Based Slope Features [56]; and model based (model domain) features - the GMM Super Vector [57,58] and the Bottleneck Feature of DNN (BF-DNN) [59,60] - remained the state-of-the-art features for SR before the i-vector [12,61,62,63,64,65]. But among them MFCC and GFCC are still used besides the i-vector, and new features continue to be invented with the advancement of SR research [66]. Avci et al. [67] proposed a novel optimum feature extraction and classification using a Genetic-Wavelet Packet-Neural Network (GWPNN) for SR. Mary et al. [68] proposed a novel prosodic feature which is manifested in terms of measurable parameters such as fundamental frequency ($F_0$), duration and energy. Rama Murthy et al. [42] introduced Instantaneous Frequency (IF) and Analytic Phase features and showed the significance of the analytic phase
in SR. The modelling/classification methods that are used in SR are Vector Quantization [69], Support Vector Machine (SVM) [70], Least Squares SVM (LS-SVM), k-Nearest Neighbour (k-NN), GMM [71], GMM-Universal Background Model (GMM-UBM) [72], Hidden Markov Model (HMM), Fuzzy Sets [1,73], Artificial Neural Network (ANN) [74], Deep Neural Network (DNN) [12,75,76,77,78], Linear Discriminant Analysis (LDA), Probabilistic Linear Discriminant Analysis (PLDA) [45], Heavy-Tailed PLDA (PLDA-HT) [79], Discriminant Analysis via Support Vectors (SVDA) and Gaussian PLDA (G-PLDA). Sometimes a combination of multiple classifiers (a hybrid classifier), like SVM-GMM [80], GMM-VQ, VQ-GMM-UBM [81], SVM-HMM, ANN-HMM [74], Maximum a Posteriori Vector Quantization (VQ-MAP), VQ-MAP-LS-SVM [82], VQ-HMM, VQ-GMM-SVM or GMM-UBM-PLDA [79], is used for SR. Novoselov et al. [57] proposed an unconventional non-linear PLDA, for the i-vector space, which employs DNN-based sufficient statistics calculation and outperforms conventional GMM-based systems. Recent research on SR uses high level features (model domain features). In this approach some mapping or function is applied to the model parameters to get the final feature vector (composed of model parameters), and sometimes normalization is done over these feature vector(s).
For the real-time application of SR, robustness is a very critical issue because the speech signal may contain additive, multiplicative and convolutional noises and room reverberation, and there may be language, environment (train and bus stations, laboratory, office, classroom, etc.) and device (microphone) mismatch; these factors lead to a great degradation of the recognition accuracy (performance) of SR. Here, by device mismatch we mean that the recording devices of the training and testing speech signals are different. Similarly, we refer to differences of the language of utterance and of the environment between training and testing speech signals as language mismatch and environment mismatch respectively. Due to these factors, the same SR system gives varying accuracy in different conditions. To make SR robust, we must remove the effect of the mismatch conditions from the feature vectors in the feature and/or model and/or score domain with the help of transformations and/or normalizations in these domains before the final classification of the test speaker. Indeed, for robust SR, the GMM based approach, alone or along with other classifiers (hybrid classifiers), is the most useful technique. This happens because GMM provides multiple techniques for transformation and normalization in the model and/or score domains. Generally, a transformation modifies the data in such a way that inter-speaker variability (variability of training or testing data between two speakers) increases and intra-speaker variability (variability of training and testing data of the same speaker) decreases [86,109]. Srinivasan et al. [84] showed that Time-Frequency (T-F) masking before Gammatone Feature (GF) and GFCC feature extraction provides significant improvement in recognition accuracy in SR. Wang et al. [7] examined vocal source and vocal tract features ($\Delta$MFCC, $\Delta^2$MFCC and Linear Prediction (LP) residual features) and showed that they make an SR system robust. Ming et al. [83] proposed a novel multi-condition training data method to model various noises. Togneri et al. [86] studied the robustness of GMM and missing data approaches under various mismatch and noisy conditions. Garcia-Romero et al. [87] proposed a novel multi-conditioning GPLDA model of i-vectors for robust SR under noise and reverberation. Zhao et al. [88] studied and provided an analysis of the robustness of MFCC and GFCC under noisy conditions. Another study in [46] proposed a novel CASA-based speech processing for robust SR. Cooke et al. [89] proposed a novel approach for robust automatic speech recognition with missing and unreliable speech data using continuous-density HMM, which has been used in SR as well.
Since SR experiments are classified into two categories, SI and SV, there are two types of performance measures, one for SI and another for SV. For an SI system, the performance is measured by the average percentage of correctly identified speakers over more than one training and testing data pair. To do so, the training and testing data are divided into two or three (or more) groups for a single experiment; we find the percentage accuracy for each training and testing pair and take their average as the final accuracy. However, the performance measure for SV is quite different from that of an SI system. There are three measurement parameters for an SV system, namely the False Acceptance Rate (FAR), False Rejection Rate (FRR) and Equal Error Rate (EER). The performance measures are discussed broadly in the subsequent section.
2.2.3 VQ/GMM Based SR
So far we have discussed VQ and GMM based classification separately. In this section we discuss VQ/GMM based classification, in which both VQ and GMM are applied for modelling and classification. Conveniently, here VQ is applied as a data reduction (not dimension reduction) technique, where the number of feature vectors is reduced from a rich number to a considerably smaller number (the speech signal of each speaker being sufficiently large). Then the GMM is applied to the set of reduced feature vectors for modelling the speaker.
Suppose we have a speaker's speech data for modelling (enrolment). We first transform the raw speech data into the MFCC feature space of dimension $D$ to get a set of $T$ feature vectors $X_{train} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\} = \{\mathbf{x}_i \in \mathbb{R}^D : 1 \leq i \leq T\}$. Then we apply VQ to the set of feature vectors $X_{train}$. Let $X_{train}$ be transformed into a codebook $\mathcal{C} = \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_C\} = \{\mathbf{c}_i \in \mathbb{R}^D : 1 \leq i \leq C\}$ of $C$ code vectors, where $C \ll T$. Here VQ is viewed as a mapping $f : X_{train} \mapsto \mathcal{C}$ that reduces the $T$ feature vectors to $C$ code vectors with $C \ll T$. Thus we now have a codebook $\mathcal{C}$ of $C$ code vectors. For speaker modelling we build a GMM over the codebook $\mathcal{C} = \{\mathbf{c}_i : 1 \leq i \leq C\}$ to get an $M$-component GMM represented by the set of parameters $\Phi = \{(\omega_h, \boldsymbol{\mu}_h, \Sigma_h) : 1 \leq h \leq M\}$ of the codebook, as described in 2.2.1. The $i$th speaker thus has the GMM given by $\Phi_i = \{(\omega_h^i, \boldsymbol{\mu}_h^i, \Sigma_h^i) : 1 \leq h \leq M\}$ for $1 \leq i \leq S$.

Next, the speech waveform of the test/unknown speaker is mapped (or transformed) into MFCC feature vectors in the $D$-dimensional feature space to get the set of $P$ feature vectors $X_{test} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_P\}$. For the classification of this test speaker we use MLL scoring as described in 2.2.2.
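Combining the two previous sketches, a hypothetical VQ/GMM enrolment routine can be written as follows: the codebook replaces the full MFCC set before the GMM is fitted. The codebook and mixture sizes are illustrative choices, not values fixed by the paper.

```python
# A sketch of VQ/GMM enrolment: quantize first, then model the codebook.
# Reuses train_codebook(), train_gmm() and identify() from the earlier sketches.
def train_vq_gmm(X_train, C=512, M=32):
    codebook = train_codebook(X_train, C=C)   # T vectors -> C code vectors
    return train_gmm(codebook, M=M)           # GMM fitted on the codebook

# Testing is unchanged: MLL scoring of the raw test MFCCs against each model.
# vq_gmms = [train_vq_gmm(X) for X in training_mfccs]
# speaker_id = identify(test_mfcc, vq_gmms)
```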
2.2.4 Universal Background Model (UBM)/GMM Based SR
Generally, GMM-UBM is used for SV. However, this model can also be applied for SI with limited data (when the speech waveform is not of sufficient duration) [72]. In this method, we pool some amount of data from all speakers and build a GMM as a common model of all the speakers, so that it becomes speaker independent; that is why it is called the Universal Background Model (UBM). Generally, the UBM contains data (MFCC vectors) of all the enrolled speakers as well as of speakers other than the enrolled speakers. From another point of view, the UBM actually represents a model of the language, because we take a large number of speakers to build the UBM, which then represents a model of speech in a fixed language (assuming that all the speakers speak one fixed language and not multiple languages). Hence the UBM is nothing but a language model [2,71], and this model can also be used for language identification. In SV, this model is called the imposter model [86]. With the help of the training data of all speakers and maximum a posteriori (MAP) estimation, we build the GMM of every speaker from the UBM. Suppose we have $S$ speakers whose sets of feature vectors are $X_1, X_2, \ldots, X_S$. From these feature sets some amount of vectors (say 200 vectors from each set) are taken and a GMM is built as described in section 2.2.2; this is the UBM. Using the remaining training feature vectors of each speaker, we build the adapted GMM of every speaker by Bayesian learning or maximum a posteriori (MAP) estimation.
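A minimal sketch of the mean-only adaptation described above is given below. It follows the standard relevance-factor MAP formulation; the relevance factor value, the UBM size and the use of scikit-learn's GaussianMixture are assumptions for illustration rather than details specified in the text.

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_mfccs, M=256, seed=0):
    """Fit the UBM on MFCC vectors pooled from all speakers."""
    X = np.vstack(pooled_mfccs)
    return GaussianMixture(n_components=M, covariance_type='diag',
                           max_iter=100, random_state=seed).fit(X)

def map_adapt_means(ubm, X_speaker, relevance=16.0):
    """Adapt only the UBM means towards one speaker's data (mean-only MAP)."""
    gamma = ubm.predict_proba(X_speaker)                 # (T, M) responsibilities
    n_h = gamma.sum(axis=0)                              # soft counts per component
    E_h = gamma.T @ X_speaker / np.maximum(n_h[:, None], 1e-10)  # per-component data means
    alpha = n_h / (n_h + relevance)                      # adaptation coefficients
    adapted = copy.deepcopy(ubm)
    adapted.means_ = alpha[:, None] * E_h + (1.0 - alpha[:, None]) * ubm.means_
    return adapted                                       # weights/covariances kept

# ubm = train_ubm(training_mfccs)
# speaker_models = [map_adapt_means(ubm, X) for X in training_mfccs]
```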
2.2.5 i-Vector Based SR
The Identity Vector (i-vector) approach is, at present, a robust and popular technique for SR. It incorporates all the updates made during the adaptation of the UBM; for example, the mean vector $\boldsymbol{\mu}_{UBM}$ is formed by concatenating all the mean vectors of the UBM components, one below another. If $D$ is the vector dimension and $M$ is the number of Gaussian components of the UBM, then the dimension of the UBM mean super vector is $MD \times 1$ (a column vector). Indeed, $\boldsymbol{\mu}_{UBM}$ is speaker and channel independent, because the UBM is built by taking MFCCs from all devices (channels) of all the speakers. This vector is called the GMM super vector, and from this super vector the i-vector is extracted. Here all the information of the updates is modelled in a low dimensional space, called the total variability space. In this technique, the speaker's GMM super vector $\boldsymbol{\mu}_i$ is assumed to be generated from the UBM super vector $\boldsymbol{\mu}_{ubm}$ by the following equation:

$$\boldsymbol{\mu}_i = \boldsymbol{\mu}_{ubm} + T\mathbf{r} \qquad (18)$$

where $T$ is a rectangular matrix of low rank, called the Total Variability Matrix (TVM), and $\mathbf{r}$ is a random vector which follows a prior standard normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$ [64,90]. It is important to mention that adapting only the UBM mean vectors to form the super vector produces enough information for SR (i.e., the covariance matrices are not necessary). Similarly, the i-vector from the training data (the remaining vectors after formation of the UBM) is computed by adapting the UBM super vector. Here the i-vector is the MAP point estimate of the random vector $\mathbf{r}$ (also called the latent variable) obtained by adapting $\boldsymbol{\mu}_{ubm}$ using the training data (analogous to GMM-UBM adaptation). The i-vector serves as a high level feature because it is extracted from the model domain (the speaker model, which is at a higher level of SR, is formed after feature extraction). Hence, extraction of the i-vector is the feature extraction step in the model domain.
The i-vectors from all the speakers and the test/unknown speaker are the
input to the classifier for the score computation and decision [65]. For the i-
vector based classification, generally SVM, HMM, ANN, DNN classifiers are
used.
2.2.6 e-Vector based SR
The i-vector approach for text-independent SR is the recent (current) state-of-the-art technique. In Joint Factor Analysis (JFA) we are required to model speaker and inter-session variability separately. It is important to mention that for JFA based SR, every speaker's speech must be recorded in at least two different sessions [91,92]. The IITG-MV SR database is very fruitful for examining session, environmental and channel variability because it is a multi-session, multi-environment, multi-channel database [10]. In the i-vector approach, by contrast, all the variability is modelled in a single low-dimensional subspace, as described above. JFA computes a more relevant and more informative subspace than the total variability ($T$) i-vector subspace. Basically, the e-vector is a representation of the speech waveform similar to both JFA and the i-vector [61,63,94,95]. The e-vector is calculated in a similar way to the i-vector, with slight variation, but produces a more accurate feature (high level feature) subspace than JFA and the i-vector. Cumani et al. reported in [96] that replacing the i-vector with the e-vector improves the recognition rate by 10% on the NIST 2012 and 2010 SR Evaluations [97].
Since the e-vector incorporates both the i-vector and the JFA model for almost all kinds of variability, we are required to define the JFA model (sometimes called the Affine Linear Model) [92]. Basically, JFA overcomes the limitations of the i-vector based approach. In the JFA model, the speaker dependent GMM supervector (UBM-GMM supervector) is decomposed into speaker dependent and channel dependent vectors ($\mathbf{S}$ and $\mathbf{C}$ respectively), given by

$$\boldsymbol{\mu}_{jfa} = \mathbf{S} + \mathbf{C} \qquad (19)$$

where the speaker dependent and channel dependent components (vectors) are given by

$$\mathbf{S} = \boldsymbol{\mu}_0 + V\mathbf{y} + W\mathbf{z} \qquad (20)$$

$$\mathbf{C} = U\mathbf{x} \qquad (21)$$

where $\boldsymbol{\mu}_0$ is a speaker and session independent supervector (computed from a general UBM which is created from a mixture of MFCCs from all channels, all sessions and all speakers, including a large open set of speakers, plus the training MFCCs of the specific session and channel), $V$ is the low rank eigenvoice matrix, $W$ is a diagonal matrix of the residual variability not captured by the speakers' MFCC subspace, $\mathbf{y}$ and $\mathbf{z}$ are both independent random vectors having standard normal distributions $\mathcal{N}(\mathbf{0}, \mathbf{I})$, $U$ is the low rank channel variability matrix, whose columns are called eigenchannels, and $\mathbf{x}$ is a normally distributed channel factor vector like $\mathbf{y}$ and $\mathbf{z}$ [98].
Using the i-vector approach, channel compensation is performed in a comparatively low-dimensional subspace instead of the much larger GMM supervector space. Since the models (18) and (19) are very similar, the TVM training (i.e., computation of $T$) in (18) is performed similarly to the eigenvoice matrix ($V$) training in (19). However, there is one difference with respect to the $V$ matrix estimation: in the JFA model the segments of the speech waveform of the same speaker are considered as a single class, whereas in the i-vector model all the segments are considered as different classes in the $T$ matrix estimation. The eigenvectors forming the $T$ matrix span both the channel and the speaker subspaces; therefore, matrix $T$ does not model the speaker subspace as well as the eigenvoice matrix $V$ does. For this reason, Cumani et al. [96] proposed a modelling technique, called the e-vector, that takes advantage of the best of both the JFA and i-vector techniques. Due to the similarity, the i-vector framework is kept but a different $T$ matrix is estimated, which represents the speaker space more accurately. The procedure for estimating $V$ and $T$ is found in [64]. The e-vector model is very similar to the i-vector model:

$$\boldsymbol{\mu}_i = \boldsymbol{\mu}_{ubm} + E\mathbf{r} \qquad (22)$$

where $\boldsymbol{\mu}_i \in \mathbb{R}^D$ and $\boldsymbol{\mu}_{ubm} \in \mathbb{R}^D$ are the GMM super vector and the UBM mean super vector respectively, and $\mathbf{r}$ is a random vector which obeys the prior distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, where $\mathbf{0}$ is the zero vector and $\mathbf{I}$ is the identity matrix. Here the new matrix $E \in \mathbb{R}^{D \times D}$ plays a role similar to the TVM in i-vector extraction in equation (18). The complete estimation of the e-vector matrix $E$ is found in [96].
After the extraction of the e-vector, SVM, ANN, DNN and HMM classifiers or hybrid classifiers, which are very common in the literature, are generally used for classification of the unknown/test speaker; scoring techniques like the Cosine Kernel and Cosine Distance Scoring are found in [64]. In [99] very useful i-vector based scoring techniques, called practical PLDA scoring variants, are discussed.
3 Combating Difficulties
In 1.2 we observed that we may face many challenges during the training and/or testing stages, so we may need to remove adverse effects as well as unwanted interference. The adverse effects of noise (additive, multiplicative and convolutional, reverberation or echo), environmental mismatch, language mismatch and channel mismatch (recording device mismatch and telephone network or transmission channel mismatch) are very common in real time SR. Due to these adverse effects, the SR accuracy degrades substantially, to the point that SR becomes unusable in real time applications. From the discussion of SR so far, we can view SR as having three domains: (i) the feature domain, (ii) the model domain, and (iii) the score/classification domain. These adverse effects are generally removed in any one, any two or all three of these domains.
3.1 Feature Domain Compensation
Many compensation techniques, depending on the type of adversity, are available in the literature. The following techniques are generally applied for SR.
3.1.1 Velocity and Acceleration Feature Concatenation
If the number of speakers in the database is large, then dynamic features are required along with the static features to improve accuracy. Sometimes an energy feature is also included for every frame. The MFCCs represent the static features, but dynamic features are also required if the number of speakers is large. Hence, dynamic features are optional and not required for a small database (roughly fewer than 200 speakers). There are two types of dynamic MFCC, known as velocity and acceleration coefficients, represented by $\Delta$ and $\Delta^2$MFCC respectively. Conveniently, these two features provide robustness in the feature space. The $\Delta$MFCC vector is computed by

$$\Delta c_n = \frac{\sum_{r=1}^{Q} r\,(c_{n+r} - c_{n-r})}{2\sum_{r=1}^{Q} r^2} \qquad (23)$$

Here we must take the number of static MFC coefficients slightly greater than $n = 13$, depending on the value of $Q$. A typical value of $Q$ is 2 ($Q = 1$ is also possible). For $Q = 2$, $n = 19$ is fair enough for the $\Delta$ and $\Delta^2$MFCC vector computation. The $\Delta^2$MFCC vector is computed by applying eqn. (23) to $\Delta c_n$. We concatenate the 13 static MFCC, 13 $\Delta$MFCC and 13 $\Delta^2$MFCC coefficients to form the complete feature vector $\mathbf{x} = \{c_n, \Delta c_n, \Delta^2 c_n\}$ of dimension $D = 39$.
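A small sketch of the regression formula in eqn. (23) is given below, stacking static, $\Delta$ and $\Delta^2$ coefficients into the 39-dimensional vector described above. Note that the sketch applies the formula along the frame (time) axis, the common convention for velocity/acceleration features; the indexing over the coefficient index used in the text can be obtained by changing the axis argument. The end-padding is an implementation choice not specified in the text.

```python
import numpy as np

def delta(c, Q=2):
    """Eqn. (23): regression-based delta of a 1-D coefficient sequence c."""
    denom = 2.0 * sum(r * r for r in range(1, Q + 1))
    padded = np.pad(c, Q, mode='edge')              # replicate the end values
    out = np.zeros_like(c, dtype=float)
    for r in range(1, Q + 1):
        out += r * (padded[Q + r:len(padded) - Q + r] -
                    padded[Q - r:len(padded) - Q - r])
    return out / denom

def add_dynamics(static):
    """Stack static, delta and delta-delta coefficients per frame."""
    d1 = np.apply_along_axis(delta, 0, static)      # deltas along the frame axis
    d2 = np.apply_along_axis(delta, 0, d1)
    return np.hstack([static, d1, d2])              # (T, 3*D) feature matrix

# X39 = add_dynamics(X13)   # X13 is a (T, 13) static MFCC matrix
```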
3.1.2 Cepstral Mean Subtraction (CMS)
If the speech signal is distorted by convolutional noise, then the noise component of the speech signal is removed by CMS [100]. In CMS we first compute the mean vector ($\boldsymbol{\mu}$) and then subtract $\boldsymbol{\mu}$ from each feature vector ($\mathbf{x}_t$) to get the new feature vector ($\hat{\mathbf{x}}_t$) as follows:

$$\hat{\mathbf{x}}_t = \mathbf{x}_t - \boldsymbol{\mu}, \quad 1 \leq t \leq T \qquad (24)$$
3.1.3 Cepstral Mean and Variance Normalization (CMVN)
Let $\mathbf{x}_t$ be the $D$-dimensional $t$th feature vector (the MFCC vector of the $t$th frame), with element $x_t(i)$ in the $i$th dimension (the $i$th MFC coefficient), and let $X = [\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_T]$ be the set of $T$ MFCC vectors which represents a speaker. In CMVN each feature vector is normalized (or compensated) according to the following equations:

$$\mu(i) = \frac{1}{T} \sum_{t=1}^{T} x_t(i), \quad 1 \leq i \leq D \qquad (25)$$

$$\sigma(i) = \sqrt{\frac{1}{T-1} \sum_{t=1}^{T} \big(x_t(i) - \mu(i)\big)^2}, \quad 1 \leq i \leq D \qquad (26)$$

Let the mean and variance normalized version of $\mathbf{x}_t$ be $\hat{\mathbf{x}}_t$. The CMVN feature vector $\hat{\mathbf{x}}_t$ is computed as follows:

$$\hat{x}_t(i) = \frac{x_t(i) - \mu(i)}{\sigma(i)}, \quad 1 \leq t \leq T \text{ and } 1 \leq i \leq D \qquad (27)$$

where $t$ is the index of the vector (frame) and $i$ is the index of the dimension of the vector. Here $\hat{\mathbf{x}}_t$ has element $\hat{x}_t(i)$ in the $i$th dimension, i.e., $\hat{\mathbf{x}}_t = \{\hat{x}_t(i)\}$ for $i = 1, 2, \ldots, D$ and $t = 1, 2, \ldots, T$. This normalization is done for both the training and testing sets of feature vectors. Then the GMM is built on the normalized set of training vectors $X_{train}$, and the MLL is computed using the normalized set of test vectors $X_{test}$ and the GMM $\lambda_{train}$ of $X_{train}$ for the identification.
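Both CMS (eqn. (24)) and CMVN (eqns. (25)-(27)) are one-liners over a (T, D) feature matrix; a sketch follows, with a small floor on the standard deviation added as a safeguard that the text does not mention.

```python
import numpy as np

def cms(X):
    """Cepstral Mean Subtraction, eqn. (24): remove the per-dimension mean."""
    return X - X.mean(axis=0)

def cmvn(X, eps=1e-10):
    """Cepstral Mean and Variance Normalization, eqns. (25)-(27)."""
    mu = X.mean(axis=0)                        # eqn. (25)
    sigma = X.std(axis=0, ddof=1)              # eqn. (26)
    return (X - mu) / np.maximum(sigma, eps)   # eqn. (27)

# X_norm = cmvn(extract_mfcc("speaker01_utt01.wav"))
```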
3.1.4 Cepstral Liftering
The value of the cepstral coefficient $C_n$ decreases as $n$ increases. Hence, to rescale the value of $C_n$, a lifter function $G(n)$ is multiplied with $C_n$ [101]. A few lifter functions [102,103] are defined as follows:

– Linear Lifter: $G(n) = n$
– Statistical Lifter: $G(n) = \frac{1}{\hat{\sigma}_n}$, where $\hat{\sigma}_n$ is the standard deviation of the $n$th cepstral coefficient calculated from the training data.
– Sinusoidal Lifter: $G(n) = 1 + \frac{J}{2}\sin\big(\frac{\pi n}{J}\big)$, where $J$ is the dimension of the vector.
– Exponential Lifter: $G(n) = n^s e^{-\frac{1}{2}\left(\frac{n}{\tau}\right)^2}$, where $\tau$ and $s$ are constants; typically $\tau = 5$ and $s = 1.5$.

Hence, after cepstral liftering we get the feature vector $\mathbf{c} = \{c_n\}$ for $n = 1, 2, \ldots, J$, given by

$$c_n = G(n)\,C_n, \quad n = 1, 2, \ldots, J \qquad (28)$$

Note the difference between the lowercase $c_n$ and the capital $C_n$.
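As an example, the sinusoidal lifter from the list above can be applied to each frame's cepstral coefficients as follows; the choice of the sinusoidal variant is purely illustrative.

```python
import numpy as np

def sinusoidal_lifter(C, J=None):
    """Apply G(n) = 1 + (J/2) sin(pi*n/J) to a (T, J) cepstral matrix (eqn. (28))."""
    J = C.shape[1] if J is None else J
    n = np.arange(1, J + 1)
    G = 1.0 + (J / 2.0) * np.sin(np.pi * n / J)
    return C * G                       # broadcast the lifter over all frames

# X_lifted = sinusoidal_lifter(X13)
```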
3.1.5 Frequency Warping Normalization (FWN) in Frequency Domain
FWN is a frequency domain signal processing technique where the frequencies are mapped into a standard range (within the Nyquist range), i.e., $[0, \frac{F_s}{2}]$. The governing equation for this operation is given by

$$f' = \frac{f - f_{min}}{f_{max} - f_{min}}\,\pi \qquad (29)$$

where $f'$ is the mapped frequency of $f$ and the frequencies are redistributed on the interval $[0, \pi]$. Strictly, FWN should be discussed in subsection 2.1, but this step is optional and is required only when $f_{min}$ is different from 0 Hz. That is, if $f_{min} = 300$ Hz or any value other than $f_{min} = 0$, then we apply FWN.
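A one-line sketch of eqn. (29), assuming the band edges $f_{min}$ and $f_{max}$ are known:

```python
import numpy as np

def warp_frequencies(f, f_min=300.0, f_max=8000.0):
    """Eqn. (29): map frequencies in [f_min, f_max] onto [0, pi]."""
    return (np.asarray(f) - f_min) / (f_max - f_min) * np.pi

# warp_frequencies([300.0, 4150.0, 8000.0])  ->  array([0., pi/2, pi])
```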
3.2 Model Domain Compensation
Model domain compensation is the most popular and useful approach for SR, and methods for compensating adverse effects in this domain are found in abundance in the literature. GMM based SR using MAP adaptation is the most popular, state-of-the-art technique for text-independent SR in adverse environments. In this technique, the speaker models (during training or enrolment) are derived from a speaker-independent common GMM, known as the UBM, using MAP adaptation. Here the UBM is built, before the training session, from the clean speech of all enrolled speakers together with additional speakers. Normally only the mean vectors are adapted while the weights and covariance matrices are left unchanged, as described in (2.2.4). Similarly, the UBM super vector is formed by the concatenation of the UBM mean vectors. After this operation the GMM super vector is formed for channel factor compensation, i.e., removal of the channel factor. The channel factor adaptation of the $i$-th utterance and $j$-th GMM super vector is computed in the super vector domain as
$$\boldsymbol{\mu}_{ij} = \boldsymbol{\mu}_j + \mathbf{U}\,\mathbf{x}_{ij} \qquad (30)$$
where $\boldsymbol{\mu}_j$ is the original super vector of the $j$-th GMM and $\boldsymbol{\mu}_{ij}$ is the $i$-th adapted super vector. $\mathbf{U}$ is the low rank matrix which projects the channel factor subspace into the super vector domain, and the vector $\mathbf{x}_{ij}$ contains the channel factors of the $i$-th utterance with respect to the $j$-th GMM super vector. We apply eqn. (30) during the testing step only, and not during the training step; $\boldsymbol{\mu}_j$ is adapted using MAP during training. The score is computed as the MLL of the test utterance using the compensated super vector [104]. The channel factor subspace, modelled by the low rank matrix $\mathbf{U}$, captures the distortion due to intersession variability; $\mathbf{U}$ is computed using the EM algorithm as described in [93].
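The compensation of eq. (30) is a single matrix-vector operation once the supervectors and the low-rank matrix are available. The NumPy sketch below uses randomly generated placeholders for quantities that are estimated elsewhere in the pipeline (the MAP-adapted supervector, the EM-trained matrix U, and the channel factors); the sizes and variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes: M mixture components, D-dimensional MFCCs, R channel factors
M, D, R = 64, 13, 10
rng = np.random.default_rng(0)

mu_j = rng.normal(size=M * D)      # supervector of speaker j (stacked MAP-adapted means)
U = rng.normal(size=(M * D, R))    # low-rank channel subspace matrix (trained with EM in practice)
x_ij = rng.normal(size=R)          # channel factors of test utterance i against speaker j

# Eq. (30): channel-compensated supervector used when scoring the test utterance
mu_ij = mu_j + U @ x_ij
print(mu_ij.shape)                 # (832,) = M * D
```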
3.3 Score Domain Compensation
In SR, score normalization is very important because the scores of a test speaker depend strongly on the data, which can influence the scores in different ways. To make the scores scale independent (as well as test-trial independent), normalization of scores is essential in SV. Another reason for score normalization in SV is that the decision threshold ($\theta$) is generally chosen at the point where the EER holds (i.e., $FAR = FRR$) on the Detection Error Trade-off (DET) curve, computed with the help of multiple test trials. For SI, the additional advantage of score normalization is that it makes the score independent of background noise, channel (device) and environment in mismatched conditions. Thus, in SI, normalization is not as important for the matched condition; for SI and SV in mismatched conditions, however, score normalization is equally important. A score domain normalization technique maps the scores $\bar{S}_\xi$ of the test speaker, corresponding to the models $\lambda_\xi$, into a standard range of scores. The most popular normalization techniques are TNorm, ZNorm and HNorm [105,106,107]. Among them, the Test Normalization (TNorm) score is generally computed for SI. Let the TNorm scores of the test speaker corresponding to all the enrolled speakers be $\varphi_T(\bar{S}_\xi)$ and the original scores be $\bar{S}_\xi$ for $1 \le \xi \le S$. Let $\mu_s$ and $\sigma_s$ be the mean and standard deviation of the scores over all the $S$ speakers. Then we have
$$\mu_s = \frac{1}{S}\sum_{\xi=1}^{S} \bar{S}_\xi \qquad (31)$$
$$\sigma_s = \sqrt{\frac{1}{S-1}\sum_{\xi=1}^{S} \big(\bar{S}_\xi - \mu_s\big)^2} \qquad (32)$$
$$\varphi_T(\bar{S}_\xi) = \frac{\bar{S}_\xi - \mu_s}{\sigma_s}, \qquad 1 \le \xi \le S \qquad (33)$$
The other normalization techniques that are used in SV can be found in [105,106,107].
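Equations (31)-(33) amount to standardizing one test utterance's scores against all enrolled speaker models. The minimal NumPy sketch below follows that definition; the example raw scores are invented for illustration.

```python
import numpy as np

def tnorm(scores):
    """Apply eqs. (31)-(33): standardize a test utterance's scores over the S enrolled models."""
    s = np.asarray(scores, dtype=float)
    mu_s = s.mean()                 # eq. (31)
    sigma_s = s.std(ddof=1)         # eq. (32), with (S - 1) in the denominator
    return (s - mu_s) / sigma_s     # eq. (33)

raw = [-1520.3, -1498.7, -1510.2, -1535.9, -1502.4]  # raw log-likelihood scores, one per speaker model
print(tnorm(raw))                                    # the argmax (index 1 here) is unchanged by TNorm
```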
4 Conceptual Comparison of Approaches
SR has an array of methodologies along with features. The reason for developing so many methodologies and features is to increase the accuracy of the SR system. In this section we discuss a conceptual comparison among the features (or feature vectors) and methodologies. In the early days the frame-wise spectrogram was used as a feature, but the spectrogram contains several other factors along with the speaker specific information. In the classification step, all the spectrograms of the training speakers are compared with that of the test speaker for recognition. The accuracy was therefore not good enough, although clean speech shows a little improvement [108]. To combat these difficulties, background noise removal and filtering techniques are used to get rid of the noise and of factors that are not speaker specific, and to form a sparse representation of the spectrogram which is a more speaker specific feature. Other, now largely obsolete, features include pitch (F0), formants (F1, F2), zero-crossing rate (ZCR), frame energy and many more. All these features can be brought together and concatenated to form a feature vector for each frame, and SR based on such feature vectors shows a small improvement over spectrogram based SR. At present, feature vectors are extracted through frame-by-frame processing of the speech signal. Examples of such features are MFCC and GFCC; they are called low level features because they are extracted from a low level frequency domain representation of the signal (frame) using the Mel and Gammatone filter banks respectively. To extract these features, the power spectrum (computed from the FFT of a frame) is passed through the filters to form the filter-bank energies; at the same time this enhances the higher frequencies and attenuates some unwanted frequencies. The higher frequencies of the power spectrum contain more speaker specific information, which is why MFCC and GFCC have become state-of-the-art features for SR. Besides these, high level features like the super vector and the i-vector have also become state-of-the-art features. They are called high level (or model based) features because they are extracted from the GMM and UBM-GMM models of the MFCC feature vectors. With the advancement of machine learning and deep learning, classifiers like the Artificial Neural Network (ANN), Convolutional Neural Network (CNN) [32], Multi-Layer Perceptron (MLP) and Deep Neural Network (DNN) are used at present [12,14]. Indeed, in the case of a DNN, we do not have to compute hand crafted features (like MFCC): the network extracts features from the raw data automatically. However, we must still provide the input data (speech signal) in a well understood, mathematically computed numerical form to the classifier before recognizing the speaker. Examples of such raw data are the pre-processed frames of the complete speech signal, the power spectrum, and the spectrogram (time-frequency representation of the speech signal) of every frame [17]. If we use the spectrogram, then we can think of SR as an image processing problem, because the spectrogram is basically a plot (image) of the processed speech signal (frames). The high level features can also be used in ANN, CNN and DNN for SR; the difference is that we provide inputs such as the super vector or i-vector to the classifiers.
After the feature extraction (e.g., extraction of MFCCs) we have employed two different ways of classification: one without using a data model and the other using a data model. In the first case, after feature extraction we store the feature vectors of all speakers. When an unknown/test speaker arrives, the feature vectors of the test speaker are compared with the feature vectors of every speaker to generate distances (e.g., Euclidean distance) or scores (e.g., inverse of distance) with respect to every speaker. This method is called template matching; the minimum distance or maximum score yields the classified speaker. This method suffers from several drawbacks. If we have $m$ MFCC vectors for a training (known) speaker and $n$ MFCC vectors for the test speaker, then $m \times n$ comparisons are required to evaluate a single score (or distance). If the number of MFCC vectors is large, the SR system takes too much time to recognize a speaker, which makes it unusable in real-time applications. Besides this, we require a very large memory to store all the MFCCs of the known (training) speakers, even though the accuracy of this approach is not very impressive. Here comes the concept of model based classification. In this approach, the known speakers' MFCCs are not used directly for classification; instead, a model is built for every speaker. We do not store the MFCCs in memory; rather, for every speaker we store the model parameters which characterize the model of that speaker.
In this paper we used three models, namely VQ, GMM and UBM-GMM
and their combinations giving five classifiers VQ, GMM, VQ-GMM, UBM-
GMM and VQ-UBM-GMM. The performances of these classifiers are evaluated
over the speech recorded on five different recording devices, given in section 5.
In VQ based SR, the MFCC vectors of the train and test speakers are represented by a codebook $\mathcal{C}$ of size $C$ (the number of representative codewords), where $C$ is much smaller than the number of MFCC vectors of both the train and test speakers. Representing each speaker by a codebook, we store the codebooks (VQ models) of all the train speakers, thereby saving time and space, and there is also a small improvement in SR performance over template matching.
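For illustration, VQ-based identification scores a test utterance by the average distance of its MFCC vectors to the nearest codeword of each stored codebook. The sketch below is a NumPy illustration that assumes the codebooks have already been trained (e.g., with the LBG algorithm or k-means); the array shapes and function names are assumptions for the example.

```python
import numpy as np

def vq_distortion(test_mfcc, codebook):
    """Average Euclidean distance from each test MFCC vector to its nearest codeword."""
    # test_mfcc: (n, D) test vectors; codebook: (C, D) codewords of one enrolled speaker
    d = np.linalg.norm(test_mfcc[:, None, :] - codebook[None, :, :], axis=-1)  # (n, C) distances
    return d.min(axis=1).mean()

def identify_vq(test_mfcc, codebooks):
    """Closed-set identification: the speaker whose codebook gives the minimum distortion wins."""
    return int(np.argmin([vq_distortion(test_mfcc, cb) for cb in codebooks]))
```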
In GMM based SR, for every speaker we compute the parameters of an $M$-component GMM, namely the weights ($\omega$), means ($\mu$) and covariances ($\Sigma$), using the training MFCCs $X_{train}$ of that speaker. Thus for the $s$-th speaker the set of parameters is $\Phi_s = \{\omega_h, \mu_h, \Sigma_h\}$ with $h \in [1, M]$, and since $M$ is much smaller than the number of MFCC vectors, both memory and computational time are saved in GMM based SR. We shall observe that the performance is very stable (meaning that it does not differ much across the GMM based classifiers for the five recording devices) and improves significantly; this is why GMM based classifiers are considered very reliable. The computational time for estimating the GMM parameters is somewhat large, but it is much less than the computational time for building the codebook in VQ based SR; in both GMM and VQ the computational time depends on the number of MFCC vectors of the train speakers. Indeed, in GMM the model parameters $\Phi$ depend on the initial guess of $\omega$, $\mu$ and $\Sigma$, and this initialization is random. Hence different speakers are initialized differently, the final GMM is highly dependent on these initial values, and the final parameters may be biased towards values that are not speaker specific. This random (and possibly biased) initialization is eliminated in UBM-GMM based SR, where every speaker is initialized with the same parameter values. Since these come from a UBM trained on the pooled MFCCs of all speakers, they convey speaker-independent common properties, and the final GMM then depends on the training MFCCs of the speaker rather than on a random initial guess.
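At test time, each speaker's GMM is scored with the maximum log-likelihood (MLL) rule over the test MFCCs. The following NumPy sketch assumes diagonal covariances (a common simplification, assumed here for the example) and illustrative names; it is not the authors' implementation.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Total log-likelihood of test MFCCs X (T, D) under an M-component diagonal-covariance GMM."""
    X, w = np.asarray(X, dtype=float), np.asarray(weights, dtype=float)
    mu, var = np.asarray(means, dtype=float), np.asarray(variances, dtype=float)
    log_comp = (np.log(w)[None, :]                                          # mixture weights
                - 0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)[None, :]  # Gaussian normalisers
                - 0.5 * np.sum((X[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :], axis=2))
    return np.logaddexp.reduce(log_comp, axis=1).sum()  # log-sum over components, summed over frames

def identify_gmm(X, speaker_models):
    """MLL rule: pick the speaker whose GMM gives the highest log-likelihood for the test utterance."""
    scores = [gmm_log_likelihood(X, *model) for model in speaker_models]  # model = (weights, means, variances)
    return int(np.argmax(scores))
```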
In UBM-GMM based SR, a speaker independent model, itself a GMM, is built first; we call this model the Universal Background Model (UBM). To build the UBM, we collect a small amount of MFCCs from a very large number of speakers; every train speaker of the database must be represented in this pool, and the UBM also contains some MFCCs from speakers outside the database. From this mixed pool of MFCCs a common GMM is built in the usual way, which is then called the UBM, with a set of parameters $\Phi_0 = \{\omega_0, \mu_0, \Sigma_0\}$. We can think of the UBM as a speech model, because it is built from the speech of all kinds of speakers; in effect, we obtain a model of speech for a specific language. Another important point is that, since we collect MFCCs from a large number of speakers, this method expands the region covered by the MFCC feature vectors for both the UBM and the UBM-GMM, and ensures that the GMM of each train speaker lies within the region of the UBM's MFCCs, i.e., the final GMM of a train speaker, called the UBM-GMM, is bounded within the region of the UBM. To compute the UBM-GMM of the train speakers we initialize every speaker's model with the UBM parameters $\Phi_0$, i.e., all speakers are initialized with the same parameter values $\Phi_0$. Then we apply EM iterations to compute the final parameters $\Phi_s$, for $s = 1, 2, \ldots, S$, of the UBM-GMMs of all train speakers; with each iteration the UBM gradually takes the form of the training MFCCs of the particular speaker. We shall observe that UBM-GMM provides much better and more stable performance for all devices, i.e., the SR system shows robustness.
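Where only the means are adapted from the UBM (as in the model-domain compensation of section 3.2), a standard mean-only MAP update can be sketched as follows. This is a NumPy illustration in the spirit of the widely used Reynolds-style adaptation; the relevance factor r, the diagonal-covariance assumption and all array names are assumptions for the example, not the paper's exact implementation.

```python
import numpy as np

def map_adapt_means(ubm_weights, ubm_means, ubm_vars, X, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM to one speaker's MFCCs X (T, D)."""
    M, D = ubm_means.shape
    # Log-posterior responsibility of each UBM component for each frame
    diff = X[:, None, :] - ubm_means[None, :, :]                              # (T, M, D)
    log_post = (np.log(ubm_weights)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * ubm_vars), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / ubm_vars[None, :, :], axis=2))     # (T, M)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)

    n = post.sum(axis=0)                       # zeroth-order (soft count) statistics, shape (M,)
    first = post.T @ X                         # first-order statistics, shape (M, D)
    alpha = (n / (n + r))[:, None]             # data-dependent adaptation coefficients
    ml_means = first / np.maximum(n[:, None], 1e-10)
    return alpha * ml_means + (1.0 - alpha) * ubm_means   # adapted means; weights/covariances kept
```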
In VQ-GMM based SR, we first compute the codebooks of the training and testing speakers from their MFCCs. Since the size of the codebooks is much smaller than the number of computed MFCC vectors, the computation of the GMM speeds up. However, we pay for this with the computational time of VQ, since computing the codebooks is itself quite time consuming. In the experiments, although the performance of SR improves in some cases with VQ-GMM based SR, this classifier is not as robust as UBM-GMM based SR.
In VQ-UBM-GMM based SR, MFCCs are accumulated from many speakers (possibly including speakers outside the database) and a codebook is computed to reduce this large collection of MFCC vectors. The speaker independent UBM is then built from the codebook. This UBM is adapted using the MAP adaptation technique described earlier to build the UBM-GMM of every speaker, which is stored in the back-end. The classification/identification is similar to that of the VQ-GMM classifier.
5 Experimental Results and Discussions
A comprehensive analysis of the performance of the SR system is undertaken here. The various performance measures used for evaluating an SR system are described, and some reported performance analyses are also discussed. Then the performance evaluation of an SR system based on the MFCC feature and GMM is carried out on three databases, namely Hyke-2011, ELSDSR and IITG-MV SR.
5.1 Performance Measure for SR systems
The performance metrics of an SR system differ for SI and SV. The performance of an SV system is measured by the False Acceptance Rate (FAR) and the False Rejection Rate (FRR). These two rates are measured as follows:
$$FAR = \frac{\#\ \text{accepted impostors}}{\text{Total}\ \#\ \text{speakers}} \times 100\%, \qquad FRR = \frac{\#\ \text{rejected true speakers}}{\text{Total}\ \#\ \text{speakers}} \times 100\% \qquad (34)$$
Here the symbol '#' denotes 'number of'. For evaluating the performance of SV systems, researchers often use the Detection Error Trade-off (DET) curve, which is the plot of FAR vs FRR. The decision for accepting or rejecting a speaker is based on a threshold value chosen by inspecting the DET curve. By changing the threshold value, different pairs of (FAR, FRR) are generated, and the point on the DET curve where FAR and FRR become equal is called the Equal Error Rate (EER); the threshold value is chosen at this point.
The performance of an SI system is measured by the percentage of correct identifications, which is a single value, unlike in the SV case. Hence the accuracy ($\eta$) is evaluated by the following equation:
$$\eta = \frac{\#\ \text{speakers correctly classified}}{\text{Total}\ \#\ \text{speakers}} \times 100\% \qquad (35)$$
The performance of closed-set SI is measured by equation (35), whereas the performance of open-set SI can be measured by either of equations (34) and (35), or both. This is because open-set SI is very similar to SV, in the sense that in both cases a verification is performed to decide whether the claimed identity is accepted or whether the speaker is present in the database.
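Equations (34) and (35) translate directly into code; the snippet below is a small Python illustration with made-up counts used only as an example.

```python
def far_frr(accepted_impostors, rejected_true_speakers, total_speakers):
    """Eq. (34): verification error rates in percent."""
    far = 100.0 * accepted_impostors / total_speakers
    frr = 100.0 * rejected_true_speakers / total_speakers
    return far, frr

def identification_accuracy(correctly_classified, total_speakers):
    """Eq. (35): closed-set identification accuracy in percent."""
    return 100.0 * correctly_classified / total_speakers

print(far_frr(3, 5, 100))                 # -> (3.0, 5.0)
print(identification_accuracy(98, 100))   # -> 98.0
```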
5.2 Some Reported Performance of SR systems
Togneri et al. [86] conducted experiments using GMM, GMM-UBM and GMM-SVM, and a comparative discussion was made in [86]. In that work, GMM-SVM achieves superior recognition over the GMM-UBM system by around 3%. The experiments were conducted for both the original feature set, i.e., 13 MFCC + 13 $\Delta$MFCC + 13 $\Delta^2$MFCC, resulting in a vector dimension $D = 39$, and a reduced feature set (temporal derivatives excluded; only a 13-dimensional vector is taken, including the $C_1$ coefficient). These MFCC vectors are extracted from 25 ms frames generated every 10 ms (i.e., the frame shift is 10 ms) using a Hamming window and a pre-emphasis factor $\alpha = 0.97$, and CMN is applied for enhancement of the speech signal. The authors showed that the best performance of the GMM classifier on the TIMIT database (every speaker has 10 utterances, of which 8 are used for training and 2 for testing) is 99.2% with 32 Gaussian mixtures, and that the performance deteriorates as the number of Gaussian mixtures increases. If the training data is insufficient (i.e., 3 utterances are used for training and 2 for testing), the best result for the GMM system is only 79.7% with 16 mixtures, but the performance of the GMM-UBM and GMM-SVM classifiers improves with the number of Gaussian mixtures and the best system performance is achieved with 128 Gaussian mixtures; in this case GMM-SVM performs better than GMM-UBM [86]. In [15] the authors proposed an adaptive variational mode decomposition approach to enhance the speech signal and provided a performance analysis; the best accuracies of GMM-UBM and GMM-SVM are 96% and 93% respectively. When additive noise is induced in the TIMIT database for every speaker to create a noise mismatch condition, a significant degradation in accuracy is observed: in the mismatch condition between training and testing data, the accuracy degrades by around 20%. Besides the MFCC feature, GFCC is also very popular, and in some cases GFCCs perform better than MFCCs. It is also observed that GFCC is more robust in some adverse (mismatch) conditions, and cepstral liftering (detailed in 3.1.4) of GFCC improves the accuracy over MFCC [88,85]. Here we choose MFCC because its computation is simpler than that of GFCC, and when the noise is very high MFCC performs better than GFCC. It is worth mentioning that our database contains speech signals contaminated with noise.
The dimension of feature vector plays a crucial role in computational cost
for training and testing stages as well as computation of MFCC. The number of
vectors is also an important factor in computational cost. The main advantage
of using VQ-GMM system is that it reduces the number of vectors considerably
without significant loss in recognition accuracy if enough training and testing
data are considered. In our experiment with the three databases, we reduced
the feature vectors to 1024 vectors using VQ technique and then the GMM
is built over the 1024 vectors. Even the testing procedure is carried out using
1024 cluster centroids.
5.3 Performance Analysis of Presented SR System for IITG-MV SR,
ELSDSR and Hyke-2011
The SR experiment is carried out extensively over three databases: 1) the IITG Multi-Variability Speaker Recognition database (IITG-MV SR Phase I & II) in both matched and mismatched conditions, and 2) ELSDSR and 3) Hyke-2011 in the matched condition. Databases (2) and (3) have no mismatch condition and contain clean speech. The speech languages of IITG-MV SR are English and Indian regional languages (like Bengali, Hindi, Tamil, etc.) [10]. The IITG-MV SR Phase I and II data contain speech recorded with five devices, namely a digital recorder (D01), a headset (H01) and a tablet PC (T01) in both phases, and a Nokia 5130c mobile (M01) and a Sony Ericsson W350i mobile (M02) in Phase I. In our experiment we have used only the Phase I data, because it satisfies all three conditions considered in this work (text, spoken style and channel independent SR). Phase I is recorded in a noisy office environment and Phase II in noisy multi-environment conditions (other than office, such as laboratory and hostel rooms). For Phase I, each recording device has two sets of speech signals for every speaker, namely session 1 and session 2. Session 1 contains the speech of two languages, English and Indian regional languages, of 100 speakers in two modes (reading style and conversational style). We use the reading-style speech signal (in .wav format) as training data and the conversational-style speech (in .wav format) as testing data. In Phase III(a), there are 200 speakers in truly conversational mode (there is no post-processing to separate the speakers), recorded with a mobile phone handset at a sampling frequency of 8 kHz for the "single speaker recognition" experiment, and in Phase III(b) there are 198 speakers (99 speaker pairs) in truly conversational mode, recorded with a mobile phone handset at 8 kHz for the "two speaker recognition" experiment. Hence the Phase III(b) database can be used for "speaker diarization (who spoke when?)". The Phase IV database contains 144 speakers collected with a mobile phone handset at 8 kHz to facilitate UBM-GMM based speaker recognition, and it contains a large number of impostor speakers. The complete description of the database is found in [20]. ELSDSR and Hyke-2011, on the other hand, contain clean speech, i.e., the noise level is very low, and the speech is recorded with the same microphone, so there is no device mismatch between training and testing.
Hyke-2011 contains speech of the digits 0 to 9 only (no text), while ELSDSR contains read text [21]. For the IITG-MV SR database, the sampling frequency for D01, H01 and T01 is 16 kHz and for M01, M02, M03 and M04 it is 8 kHz, whereas that for ELSDSR and Hyke-2011 is 8 kHz. We chose a frame size of about 25 ms and an overlap of about 17 ms, i.e., a frame shift of (25 − 17) = 8 ms, for the 16 kHz speech signals, and a 50 ms frame size with about 34 ms overlap, i.e., a frame shift of (50 − 34) = 16 ms, for the 8 kHz speech signals. The pre-emphasis factor $\alpha$ is set to 0.97 and we have used a 1024-point FFT. For the mel scale frequency conversion, the minimum and maximum linear frequencies are chosen as $f_{min} = 0$ or 340 Hz and $f_{max} = 4500$ Hz. The number of triangular filters in the filter bank is $B = 26$, which produces 26 MFC coefficients; among them the first 13, excluding $C_1$, are chosen to create an MFCC feature vector of dimension $D = 13$. The accuracy rates for the mentioned databases are reported in [3] for the GMM and VQ-GMM based classifiers. In VQ we consider 1024 clusters, to reduce the large number of vectors, upon which the GMM is built using 5 EM iterations. In Barai et al. [2] it is shown that the accuracy on Hyke-2011 and ELSDSR is 100% for VQ, GMM and VQ-GMM based SR, thanks to the clean speech. We have also examined the results for Hyke-2011 and ELSDSR using the other classifiers mentioned in section 4, but we have not provided those accuracy figures in the tables because in those cases too we obtained high accuracy (from 99.6% to 100%) for all the classifiers [3,4]. We have also carried out spoken style, text and channel (recording device) independent SR experiments to provide benchmark accuracy for the IITG-MV database using the five classifiers presented in this paper.
For the experiment with spoken style variation, we use the reading style speech for training and the conversational style speech for testing, so there is a spoken style mismatch between the training and testing data. The experimental results are given in table 1 for all the devices D01, H01, T01, M01 and M02. In table 1 it is clearly observed that the accuracy varies between 43% and 96%. All these results are channel dependent, which means there is no channel mismatch between training and testing data, i.e., the recording device is the same for the training and testing data. An interesting result can be observed for devices D01 and T01 with the classifiers VQ and VQ-GMM respectively, where the accuracies are 60% and 43%. The cause of this drastic degradation is the singularity problem of the covariance matrix. We know that the covariance matrix is positive semi-definite, in other words $\Sigma_h \succeq 0$. Now if $\Sigma_h$ becomes singular, or nearly so, then equation (5) becomes undefined, because $\Sigma_h^{-1}$ cannot exist when the determinant $|\Sigma_h| = 0$ or $|\Sigma_h| \to 0$; also the term $\frac{1}{(2\pi)^{D/2}|\Sigma_h|^{1/2}} \to \infty$, which makes equation (5) inconsistent. The singularity problem is indicated by a '*' mark in the superscript position of the accuracy in all the tables presented in this paper. Indeed, the singularity problem is found only in very rare cases, and it may not occur at all: for example, in the databases Hyke-2011 and ELSDSR it does not occur, while for the IITG-MV SR database it occurs in very few cases.
We cannot say with certainty whether the covariance matrix $\Sigma_h$ will be singular or not before building the GMM based classifiers. If singularity occurs, then only the VQ based classifier outperforms the GMM based classifiers. The GMM based classifiers, i.e., GMM, UBM-GMM, VQ-GMM and VQ-UBM-GMM, may or may not suffer from the singularity problem, but for the VQ classifier singularity can never occur, because we measure the Euclidean distance between the trained codebooks and the testing MFCCs, and the minimum distance (which always yields a definite value) provides the classified speaker. If we neglect singularity, we can see that the accuracy for D01, H01 and T01 is better than for M01 and M02. The reason is the sampling frequency ($f_s$) of the speech signal: the sampling frequency of D01, H01 and T01 is $f_s = 16$ kHz while that of M01 and M02 is $f_s = 8$ kHz. Hence the mel filter bank covers a bandwidth of $f_s/2 = 8$ kHz for D01, H01 and T01, but only $f_s/2 = 4$ kHz for M01 and M02, which is much less than for the other three devices. D01, H01 and T01 can therefore cover more bandwidth (and hence more speaker specific information) than M01 and M02, which leads to better accuracy.
In table 1, the various data conditions are given. The first column displays the name of the classifier. The second column, "UBM with VQ", carries three types of tags, namely –, X and ×: '–' means that a UBM is not required (classifiers GMM, VQ and VQ-GMM), '×' means that vector quantization is not carried out before building the UBM, and 'X' means that it is. Generally, VQ is done before modelling in order to compute the codebook. The third and fourth columns, "VQ on Train MFCC" and "VQ on Test MFCC", carry two types of tags, X and ×: 'X' means that VQ is performed and '×' means that VQ is not performed before the UBM and GMM modelling. The same tag marks are used with similar meaning in the other tables of this paper.
Table 1: The accuracy in percentage (%) of the SR system for five devices using the five classifiers presented in the paper, with various combinations of training and testing data. Here there is no channel mismatch. The '*' mark indicates the singularity problem of the covariance matrix.

Classifier     UBM with VQ  VQ on Train MFCC  VQ on Test MFCC  D01  H01  T01  M01  M02
GMM            –            ×                 ×                89   90   91   81   76
UBM-GMM        ×            ×                 ×                88   88   86   80   87
VQ             –            X                 ×                89   94   96   80   76
VQ             –            X                 X                60   94   74   81   84
VQ             –            ×                 X                92   88   96   76   83
VQ-GMM         –            X                 X                89   71   91   78   76
VQ-GMM         –            X                 ×                89   50*  90   75   78
VQ-GMM         –            ×                 X                90   71   43*  79   94
VQ-UBM-GMM     X            X                 X                87   90   90   73   82
VQ-UBM-GMM     X            X                 ×                86   88   88   83   82
VQ-UBM-GMM     X            ×                 X                86   87   87   97   83
VQ-UBM-GMM     ×            X                 X                79   87   81   82   73
VQ-UBM-GMM     ×            X                 ×                80   87   87   80   73
VQ-UBM-GMM     ×            ×                 X                89   87   87   81   88
Table 1 is not spoken style independent, because here we use reading style speech for training and conversational style speech for testing. It is observed that in this spoken style dependent SR the range of accuracy across devices is rather large, 43% to 97%, i.e., a spread of 54%. Hence it initially seems that the classifier and/or feature are not robust, which is not true: this large spread occurs due to either a singular covariance matrix and/or the spoken style mismatch, and it can be noted that the VQ-UBM-GMM classifier performs better than the others. In the literature it has been established that the MFCC feature is a robust feature. What, then, about the robustness of the classifiers?
Table 2: The accuracy in percentage (%) of the ASR system for five devices using the five classifiers presented in this paper, for the spoken style independent experiment with cross validation. Here there is no channel mismatch.

Classifier     UBM with VQ  VQ on Train MFCC  VQ on Test MFCC  D01  H01  T01  M01  M02
GMM            –            ×                 ×                98   98   98   98   98
UBM-GMM        ×            ×                 ×                97   97   97   97   98
VQ             –            X                 ×                97   97   97   97   98
VQ             –            X                 X                97   97   97   97   98
VQ             –            ×                 X                98   96   97   97   98
VQ-GMM         –            X                 X                86   84   80   83   82
VQ-GMM         –            X                 ×                83   81   76   80   81
VQ-GMM         –            ×                 X                97   97   97   97   98
VQ-UBM-GMM     X            X                 X                97   97   97   97   98
VQ-UBM-GMM     X            X                 ×                97   97   97   97   98
VQ-UBM-GMM     X            ×                 X                97   97   97   97   98
VQ-UBM-GMM     ×            X                 X                98   98   98   98   98
VQ-UBM-GMM     ×            X                 ×                98   98   98   98   98
VQ-UBM-GMM     ×            ×                 X                98   98   98   98   98
The robustness of the classifiers is checked by a cross validation method; in the experiment we apply 3-fold cross validation. To do so, the feature vectors (MFCCs) of the training data (speech in reading style) and the testing data (speech in conversational style) are computed for every speaker and then randomized. Then, for every speaker, we divide the MFCCs randomly into three groups; any two of the groups are mixed together and taken as the training MFCCs while the remaining group is taken as the testing MFCCs. There are three such combinations of training and testing MFCCs, hence each classifier produces three accuracies, and the average accuracy over the three combinations is reported in table 2. This cross validation can also be viewed as a spoken style independent experiment, because the MFCCs of the reading style and conversational style are mixed. The table clearly shows that the accuracy of all the classifiers varies between 76% and 98%, i.e., a spread of 22%, which is much less than the 54% spread of table 1, and again the VQ-UBM-GMM classifier performs better than the others. Fortunately, the singularity problem of the covariance matrix does not occur here. The accuracy in table 2 is better than that in table 1; the improvement is due to the expansion, in terms of variability, of the training data in the feature space, i.e., the MFCC feature space expands because the reading style and conversational style MFCCs are mixed. The difference (distance) between the $k$-fold training and testing data is therefore reduced, the similarity increases, and this leads to better performance. Also, we have created the three ($k = 3$) training and testing pairs randomly from the same fixed region of the feature space, where the training and testing data are mixed.
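The data mixing for the 3-fold cross validation described above can be sketched as follows; this is a NumPy illustration of the mixing and splitting only, and the function and variable names are ours rather than the paper's.

```python
import numpy as np

def three_fold_splits(reading_mfcc, conversational_mfcc, seed=0):
    """Pool one speaker's reading- and conversational-style MFCCs, shuffle them,
    and yield the three (train, test) partitions used for 3-fold cross validation."""
    X = np.vstack([reading_mfcc, conversational_mfcc])   # (T, D) pooled frames, style-independent
    rng = np.random.default_rng(seed)
    rng.shuffle(X)                                        # randomize frame order in place
    folds = np.array_split(X, 3)
    for k in range(3):
        test = folds[k]
        train = np.vstack([folds[j] for j in range(3) if j != k])
        yield train, test
```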
So far we have discussed the spoken style dependent and independent experiments, but both experiments are channel dependent: the training MFCCs and testing MFCCs are taken from the same device for each speaker. Now we examine the accuracy of the channel (device) independent experiment, with and without cross validation. The accuracy of the device dependent experiment can be found in [2,4]. The benchmark accuracy of the device independent experiment without cross validation is given in table 3.
To make the experiment device independent, we need to prepare the data for every speaker: we mix the reading style speech signals of each speaker recorded by all devices for training, and the conversational style speech of each speaker recorded by all devices for testing. The experiment thus becomes device independent but remains spoken style dependent. Here we can see that the accuracy of the VQ classifier varies in the range 74% to 96%, i.e., a spread of 22%; hence VQ is not a robust classifier. If we neglect the singularity problem, the accuracy of the GMM based classifiers varies in the range 86% to 91%, i.e., a spread of 5%. Hence we can say that the GMM based classifiers are more robust, provided we can make sure that the singularity problem does not occur. Next we examine the performance of the classifiers with cross validation, which makes the experiment both device independent and spoken style independent.
Initially, we mix the reading style MFCCs and conversational style MFCCs together for each speaker from each device.
Table 3: The accuracy in percentage (%) of the ASR system for the channel independent and spoken style dependent case. The '*' mark indicates the singularity problem of the covariance matrix.

Classifier     UBM with VQ  VQ on Train MFCC  VQ on Test MFCC  Accuracy (%)
GMM            –            ×                 ×                91
UBM-GMM        ×            ×                 ×                86
VQ             –            X                 ×                96
VQ             –            X                 X                74
VQ             –            ×                 X                96
VQ-GMM         –            X                 X                90
VQ-GMM         –            X                 ×                43*
VQ-GMM         –            ×                 X                91
VQ-UBM-GMM     X            X                 X                87
VQ-UBM-GMM     X            X                 ×                87
VQ-UBM-GMM     X            ×                 X                87
VQ-UBM-GMM     ×            X                 X                81
VQ-UBM-GMM     ×            X                 ×                88
VQ-UBM-GMM     ×            ×                 X                88
Then we mix the MFCCs of all devices for every speaker. Here we use the 3-fold cross validation described earlier. The accuracy is given in table 4.
Table 4: The accuracy in percentage (%) of the ASR system for the device independent, spoken style independent experiment with cross validation, using the five classifiers presented in this paper.

Classifier     UBM with VQ  VQ on Train MFCC  VQ on Test MFCC  Accuracy (%)
GMM            –            ×                 ×                98
UBM-GMM        ×            ×                 ×                97
VQ             –            X                 ×                98
VQ             –            X                 X                97
VQ             –            ×                 X                96
VQ-GMM         –            X                 X                97
VQ-GMM         –            X                 ×                97
VQ-GMM         –            ×                 X                96
VQ-UBM-GMM     X            X                 X                98
VQ-UBM-GMM     X            X                 ×                97
VQ-UBM-GMM     X            ×                 X                98
VQ-UBM-GMM     ×            X                 X                98
VQ-UBM-GMM     ×            X                 ×                97
VQ-UBM-GMM     ×            ×                 X                98
It is observed that the accuracy varies in the range 96% to 98%, i.e., a spread of only 2%. The reason for this high accuracy is the very large MFCC feature space, obtained because we mix the MFCCs of both spoken styles as well as the MFCCs of all the devices.
5.4 Development of SR System in MATLAB
The SR system is implemented in MATLAB R2015a with the help of two MATLAB toolboxes, namely VOICEBOX [110] and NETLAB 3 [111]. The digital signal processing for feature extraction is carried out using functions from VOICEBOX, and the modelling and classification tasks are carried out using functions from NETLAB 3. The VOICEBOX toolbox contains functions for the following purposes:
– Audio File Input/Output - Read and write WAV and other speech file formats;
– Frequency Scales - Convert between Hz, Mel, Erb and MIDI frequency scales;
– Fourier/DCT/Hartley Transforms - Various related transforms;
– Random Number and Probability Distributions - Generate random vectors and noise signals;
– Vector Distances - Calculate distances between vector lists;
– Speech Analysis - Active level estimation, spectrograms;
– LPC Analysis of Speech - Linear Predictive Coding routines;
– Speech Synthesis - Text-to-speech synthesis and glottal waveform models;
– Speech Enhancement - Spectral noise subtraction;
– Speech Coding - PCM coding, vector quantisation;
– Speech Recognition - Front-end processing for recognition;
– Signal Processing - Miscellaneous signal processing functions;
– Information Theory - Routines for entropy calculation and symbol codes;
– Computer Vision - Routines for 3D rotation;
– Printing and Display Functions - Utilities for printing and graphics;
– Voicebox Parameters and System Interface - Get or set VOICEBOX and WINDOWS system parameters;
– Utility Functions - Miscellaneous utility functions.
It can be seen that VOICEBOX provides MATLAB functions for feature extraction as well as some for modelling/classification, while NETLAB 3 provides MATLAB functions for modelling/classification and data visualisation. Together, they contain all the MATLAB functions required for the SR system and for its performance evaluation.
6 Conclusion
There are several variations of SR, based on the application area [10]: for example, SR in a noisy environment, SR in mismatched conditions, recognition of speakers from a single mixed speech signal of more than one speaker (a very hard task, called SR after source separation) [112], speaker segmentation and recognition using the speech segments of individual speakers during a conversation, and many more. In this paper, we consider the "text-independent closed-set speaker identification" experiment in various training and testing conditions, with a focus on spoken style and channel match/mismatch conditions and with 3-fold cross validation; analytical results are reported in this context as well.
In this paper, model based speaker recognition for the matched condition using time-frequency features is emphasized, and the channel (device) and spoken style dependencies are examined. The number of vectors is reduced using VQ techniques to lower the computation in parameter estimation for the GMM, and the other existing classification/modelling techniques and methods are mentioned. At present, researchers focus on recognition in noisy/mismatched and reverberant conditions; that is why we have mentioned the features, methods and techniques for speaker recognition in noisy/mismatched conditions. Also, sufficient references are given to help the reader find the state-of-the-art as well as novel methods and techniques for the very recent problems in this field.
Though the identification rate is very high for clean speech in the matched condition, it degrades drastically in noisy environments or in the mismatched condition (training and testing environments are different). Studies by researchers such as Douglas A. Reynolds, Roberto Togneri and Richard C. Rose revealed that the performance of SI and SV systems using the MFCC feature is better than that of SI and SV systems using other features like LPC, PLPC and spectral features [86]. But the performance of SI and SV systems with the MFCC feature degrades drastically in noisy or mismatched conditions; in that case, the GF and GFCC features give better results [85,88]. The gender dependency of SI and SV systems is also a relevant factor. Kenny et al. showed in [92,93,109] that the i-vector feature performs well with generative modelling (like GMM, LDA, PLDA, and so on). The computational cost of SI and SV systems is also very high, due to complex feature extraction techniques, the large dimension of the feature vectors and complex modelling techniques; in the GMM, the estimation of the parameters also suffers from a high computational cost. Dimension reduction techniques like Principal Component Analysis (PCA) and Kernel PCA can therefore be very useful to reduce the dimension of the feature vectors. In the classification/modelling part, we can also use VQ techniques to reduce the number of vectors so that the parameter estimation can be carried out with considerably fewer training vectors, which in turn reduces the computational cost. The accuracy of an SR system mainly depends on the number of cepstral coefficients taken to form the feature vector and on the number of Gaussian components taken in the GMM. If the number of speakers is large, the numbers of cepstral coefficients and Gaussian components should be increased to obtain a higher recognition rate. To increase the vector dimension, $\Delta$MFCC and $\Delta^2$MFCC are concatenated with the original MFCC. Cepstral coefficients beyond the 14th do not contain useful information, and if the original MFCC dimension is increased beyond 14 the performance of the SR system degrades; so it is better to choose fewer than 14 MFC coefficients. In our experiments it is observed that 13 MFC coefficients for every MFCC vector provide stable accuracy for all the classifiers.
6.1 Future Research Directions
The initialization of the GMM and the singularity of the covariance matrix are very crucial for SI and SV. It is important to note that the singularity of the covariance matrix depends on the initialization of the GMM; proper initialization, and ensuring that the covariance matrix will not become singular, still remain topics of research. Besides this, blind elimination of session and channel effects, reverberation and background noise, without heavy manipulation of the training and testing speech data, is still a difficult task in SR.
Another important observation is that deep learning approaches like CNN do not perform as well as the VQ and GMM based methods on the IITG-MV SR database, due to the lack of a balanced training data-set in which each class contributes equally to the overall loss estimation. Moreover, some speakers have an extremely short duration of audio (around one tenth of that of other speakers). Data augmentation techniques, along with some newly introduced hybrid CNN architectures designed to overcome the limitation of short utterances, may be used in the future to improve the performance on this database. Other experiments, for example text dependent/independent, channel dependent/independent, reading style dependent/independent, and session dependent/independent experiments using various approaches, still remain at the centre of attention among researchers.
Acknowledgment
This project is partially supported by the CMATER research laboratory of the
Computer Science and Engineering Department, Jadavpur University, India;
UPE-II project, Government of India and DBT project (No. BT/PR16356/BID/
7/596/2016 ), Ministry of Science and Technology, Government of India un-
der Dr. Subhadip Basu. Bidhan Barai is partially supported by the RGNF
Research Award (F1-17.1/2014-15/RGNF-2014-15-SC-WES-67459/(SA-III))
from UGC, Government of India.
References
1. Pal, S.K. and Majumder, D.D., Fuzzy sets and decision making approaches in vowel and
speaker recognition, IEEE Transactions on Systems, Man, and Cybernetics, 7(8), pp.625-
629 (1977).
2. Barai B., Das D., Das N., Basu S., Nasipuri M., VQ/GMM-Based Speaker Identification
with Emphasis on Language Dependency, Advanced Computing and Systems for Secu-
rity(ACSS), Advances in Intelligent Systems and Computing, vol 883. Springer, Singapore
(2019)
3. Barai, B., Das, D., Das, N., Basu, S. and Nasipuri, M., Closed-set text-independent auto-
matic speaker recognition system using VQ/GMM, In Intelligent Engineering Informatics
pp. 337-346. Springer, Singapore (2018).
4. Barai B., Das D., Das N., Basu S., and Nasipuri M., An ASR system using MFCC
and VQ/GMM with emphasis on environmental dependency, IEEE Calcutta Conference
(CALCON), Kolkata, pp. 362-366 (2017 ).
5. Fortuna, J., Sivakumaran, P., Ariyaeeinia, A. and Malegaonkar, A., Open-set speaker
identification using adapted Gaussian mixture models. In Ninth European Conference on
Speech Communication and Technology (2005).
6. D. Matrouf, W. Ben Kheder, P. Bousquet, M. Ajili and J. Bonastre, Dealing with additive
noise in speaker recognition systems based on i-vector approach, 23rd European Signal
Processing Conference (EUSIPCO), Nice, 2015, pp. 2092-2096 (2015).
7. Wang, N., Ching, P.C., Zheng, N.H. and Lee, T., Robust speaker recognition using both
vocal source and vocal tract features estimated from noisy input utterances. In 2007 IEEE
International Symposium on Signal Processing and Information Technology (pp. 772-777)
(2007).
8. Rao KS, Sarkar S., Robust speaker recognition in noisy environments. Cham: Springer
International Publishing; Jun 21 (2014 ).
9. Fujihara, H., Kitahara, T., Goto, M., Komatani, K., Ogata, T. and Okuno, H.G., Speaker
identification under noisy environments by using harmonic structure extraction and reli-
able frame weighting. In Ninth International Conference on Spoken Language Processing
( 2006).
10. Haris, B.C., Pradhan, G., Misra, A., Prasanna, S.R.M., Das, R.K. and Sinha, R., Multi-
variability speaker recognition database in Indian scenario. International Journal of Speech
Technology, 15(4), pp.441-453 (2012).
11. Mandasari, M.I., Saeidi, R., McLaren, M. and van Leeuwen, D.A., Quality measure
functions for calibration of speaker recognition systems in various duration conditions.
IEEE Transactions on Audio, Speech, and Language Processing, 21(11), pp.2425-2438
(2013).
12. Reyes-Díaz, F.J., Hernández-Sierra, G. and de Lara, J.R.C., 2021. DNN and i-vector
combined method for speaker recognition on multi-variability environments. International
Journal of Speech Technology, 24(2), pp.409-418.
13. Ganchev, T., Potamitis, I., Fakotakis, N. and Kokkinakis, G., 2004. Text-independent
speaker verification for real fast-varying noisy environments. International Journal of
Speech Technology, 7(4), pp.281-292.
14. Murthy, Y.S., Koolagudi, S.G. and Raja, T.J., 2021. Singer identification for Indian
singers using convolutional neural networks. International Journal of Speech Technology,
pp.1-16.
15. Ram, R. and Mohanty, M.N., 2018. Performance analysis of adaptive variational mode
decomposition approach for speech enhancement. International Journal of Speech Tech-
nology, 21(2), pp.369-381.
16. Mandasari, M.I., Saeidi, R. and van Leeuwen, D.A., Quality measures based calibration
with duration and noise dependency for speaker recognition. Speech Communication, 72,
pp.126-137 (2015).
17. Chakraborty, T., Barai, B., Chatterjee, B., Das, N., Basu, S., Nasipuri,M., Closed-
set device-independent speaker identification using cnn, in: International Conference on
Intelligent Computing and Communication (ICICC - 2019), Springer ( 2019).
18. Liu, Z., Wu, Z., Li, T., Li, J. and Shen, C., GMM and CNN hybrid method for short ut-
terance speaker recognition. IEEE Transactions on Industrial informatics, 14(7), pp.3244-
3252 (2018).
19. Anand, P., Singh, A.K., Srivastava, S. and Lall, B., Few Shot Speaker Recognition using
Deep Neural Networks. arXiv preprint arXiv:1904.08775 (2019).
20. Reda, A., Panjwani, S. and Cutrell, E., June. Hyke: a low-cost remote attendance track-
ing system for developing regions. In Proceedings of the 5th ACM workshop on Networked
systems for developing regions, pp.15-20, ACM (2011).
21. Feng, L. and Hansen, L.K., A new database for speaker recognition. IMM, Informatik
og Matematisk Modelling, DTU (2005).
22. Rose, P., Technical forensic speaker recognition: Evaluation, types and testing of evi-
dence. Computer Speech & Language, 20(2-3), pp.159-191 (2006).
23. Singh, N., Khan, R.A. and Shree, R., Applications of speaker recognition. Procedia
engineering, 38, pp.3122-3126 (2012).
24. Lleida, E. and Rodriguez-Fuentes, L.J., Speaker and language recognition and charac-
terization: Introduction to the CSL special issue (2018).
25. Abd El-Moneim, S., Sedik, A., Nassar, M.A., El-Fishawy, A.S., Sharshar, A.M., Hassan,
S.E., Mahmoud, A.Z., Dessouky, M.I., El-Banby, G.M., Abd El-Samie, F.E. and El-Rabaie,
E.S.M., 2021. Text-dependent and text-independent speaker recognition of reverberant
speech based on CNN. International Journal of Speech Technology, pp.1-14.
26. Pal, S.K. and Mitra, P., Pattern recognition algorithms for data mining. Chapman and
Hall/CRC, (2004).
27. Fan, X. and Hansen, J.H., April. Speaker identification with whispered speech based on
modified LFCC parameters and feature mapping. In 2009 IEEE International Conference
on Acoustics, Speech and Signal Processing (pp. 4553-4556) IEEE, (2009).
28. Lawson, A., Vabishchevich, P., Huggins, M., Ardis, P., Battles, B. and Stauffer, A.,
May. Survey and evaluation of acoustic features for speaker recognition. In 2011 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5444-
5447) IEEE (2011).
29. Hourri, S., Nikolov, N.S. and Kharroubi, J., 2020. A deep learning approach to inte-
grate convolutional neural networks in speaker recognition. International Journal of Speech
Technology, 23, pp.615-623.
30. Lerato, L. and Mashao, D.J., September. Enhancement of GMM speaker identifica-
tion performance using complementary feature sets. In 2004 IEEE Africon. 7th Africon
Conference in Africa (IEEE Cat. No. 04CH37590) (Vol. 1, pp. 257-261) IEEE (2004).
31. Nakagawa, S., Wang, L. and Ohtsuka, S., Speaker identification and verification by com-
bining MFCC and phase information. IEEE transactions on audio, speech, and language
processing, 20(4), pp.1085-1095 (2011).
32. Hourri, S., Nikolov, N.S. and Kharroubi, J., 2021. Convolutional neural network vectors
for speaker recognition. International Journal of Speech Technology, 24(2), pp.389-400.
33. Shahamiri, S.R. and Salim, S.S.B., Artificial neural networks as speech recognisers for
dysarthric speech: Identifying the best-performing set of MFCC parameters and studying
a speaker-independent approach. Advanced Engineering Informatics, 28(1), pp.102-110
(2014).
34. Furui, S., Digital speech processing: synthesis, and recognition, CRC Press (2018).
35. Rabiner, L.R. and Schafer, R.W., Theory and applications of digital speech processing
(Vol. 64). Upper Saddle River, NJ: Pearson ( 2011).
36. Nica, A., Caruntu, A., Toderean, G. and Buza, O., Analysis and synthesis of vowels
using Matlab. In 2006 IEEE International Conference on Automation, Quality and Testing,
Robotics (Vol. 2, pp. 371-374) IEEE (2006, May).
37. Hegde, R.M., Murthy, H.A. and Gadde, V.R.R., Significance of the modified group
delay feature in speech recognition. IEEE Transactions on Audio, Speech, and Language
Processing, 15(1), pp.190-202 (2006).
38. Grimaldi, M. and Cummins, F., Speaker identification using instantaneous frequen-
cies. IEEE Transactions on Audio, Speech, and Language Processing, 16(6), pp.1097-1111
(2008).
39. Tsiakoulis, P., Potamianos, A. and Dimitriadis, D., Instantaneous frequency and band-
width estimation using filterbank arrays. In 2013 IEEE International Conference on Acous-
tics, Speech and Signal Processing (pp. 8032-8036) IEEE (2013, May).
40. McCowan, I., Dean, D., McLaren, M., Vogt, R. and Sridharan, S., The delta-phase spec-
trum with application to voice activity detection and speaker recognition. IEEE Transac-
tions on Audio, Speech, and Language Processing, 19(7), pp.2026-2038 (2011).
41. Murty, K.S.R. and Yegnanarayana, B., Combining evidence from residual phase and
MFCC features for speaker recognition. IEEE signal processing letters, 13(1), pp.52-55
(2005).
42. Vijayan, K., Reddy, P.R. and Murty, K.S.R., Significance of analytic phase of speech
signals in speaker verification. Speech Communication, 81, pp.54-71 (2016).
43. Vijayan, K., Kumar, V. and Murty, K.S.R., Feature extraction from analytic phase of
speech signals for speaker verification. In Fifteenth Annual Conference of the International
Speech Communication Association (2014).
44. Qawaqneh, Z., Mallouh, A.A. and Barkana, B.D., Deep neural network framework and
transformed MFCCs for speaker’s age and gender classification. Knowledge-Based Sys-
tems, 115, pp.5-14 (2017).
45. Khosravani, A. and Homayounpour, M.M., A PLDA approach for language and text
independent speaker recognition. Computer Speech & Language, 45, pp.457-474 (2017).
46. Zhao, X., Shao, Y. and Wang, D., CASA-based robust speaker identification. IEEE
Transactions on Audio, Speech, and Language Processing, 20(5), pp.1608-1616 (2012).
47. Rouat, J., Computational auditory scene analysis: Principles, algorithms, and applica-
tions (wang, d. and brown, gj, eds.; 2006)[book review]. IEEE Transactions on Neural
Networks, 19(1), pp.199-199 (2008).
48. Shi, X., Yang, H. and Zhou, P., Robust speaker recognition based on improved GFCC.
In 2016 2nd IEEE International Conference on Computer and Communications (ICCC)
(pp. 1927-1931) IEEE(2016, October).
49. Zhang, Y. and Abdulla, W.H., Gammatone auditory filterbank and independent com-
ponent analysis for speaker identification. In Ninth International Conference on Spoken
Language Processing (2006).
50. Linde, Y., Buzo, A. and Gray, R., An algorithm for vector quantizer design. IEEE
Transactions on communications, 28(1), pp.84-95 (1980).
51. Kohonen, T., The self-organizing map. Proceedings of the IEEE, 78(9), pp.1464-1480
(1990).
52. Han, C.C., Chen, Y.N., Lo, C.C. and Wang, C.T., A novel approach for vector quanti-
zation using a neural network, mean shift, and principal component analysis-based seed
re-initialization. Signal Processing, 87(5), pp.799-810 (2007).
53. Wang, L., Minami, K., Yamamoto, K. and Nakagawa, S., Speaker recognition by combin-
ing MFCC and phase information in noisy conditions. IEICE transactions on information
and systems, 93(9), pp.2397-2406 (2010).
54. Tirumala, S.S., Shahamiri, S.R., Garhwal, A.S. and Wang, R., Speaker identification
features extraction methods: A systematic review. Expert Systems with Applications, 90,
pp.250-271 (2017).
55. Li, Q. and Huang, Y., Robust speaker identification using an auditory-based feature.
In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (pp.
4514-4517) IEEE(2010, March).
56. Madikeri, S.R. and Murthy, H.A., Mel filter bank energy-based slope feature and its ap-
plication to speaker recognition. In 2011 National Conference on Communications (NCC)
(pp. 1-4) IEEE(2011, January).
57. Novoselov, S., Pekhovsky, T., Kudashev, O., Mendelev, V.S. and Prudnikov, A., Non-
linear PLDA for i-vector speaker verification. In Sixteenth Annual Conference of the In-
ternational Speech Communication Association (2015).
58. Campbell, W.M., Sturim, D.E., Reynolds, D.A. and Solomonoff, A., SVM based speaker
verification using a GMM supervector kernel and NAP variability compensation. In 2006
IEEE International Conference on Acoustics Speech and Signal Processing Proceedings
(Vol. 1, pp. I-I) IEEE (2006, May).
59. Yaman, S., Pelecanos, J. and Sarikaya, R., Bottleneck features for speaker recognition.
In Odyssey 2012-The Speaker and Language Recognition Workshop (2012).
60. Lozano-Diez, A., Silnova, A., Matejka, P., Glembek, O., Plchot, O., Pesan, J., Burget, L.
and Gonzalez-Rodriguez, J., Analysis and Optimization of Bottleneck Features for Speaker
Recognition. In Odyssey (Vol. 2016, pp. 21-24) (2016).
61. Zeinali, H., Sameti, H. and Burget, L., HMM-based phrase-independent i-vector extrac-
tor for text-dependent speaker verification. IEEE/ACM Transactions on Audio, Speech,
and Language Processing, 25(7), pp.1421-1435 (2017).
62. Khosravani, A. and Homayounpour, M.M., Nonparametrically trained PLDA for short
duration i-vector speaker verification. Computer Speech & Language, 52, pp.105-122
(2018).
63. Li, M. and Narayanan, S., Simplified supervised i-vector modeling with application to
robust and efficient language identification and speaker verification. Computer speech &
language, 28(4), pp.940-958 (2014).
64. Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P. and Ouellet, P., Front-end factor
analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language
Processing, 19(4), pp.788-798 (2010).
65. Dehak, N., Plchot, O., Bahari, M.H., Burget, L. and Dehak, R., GMM weights adap-
tation based on subspace approaches for speaker verification. Proceedings Odyssey 2014,
pp.48-53 (2014).
66. Ghahabi, O. and Hernando, J., Restricted Boltzmann machines for vector representation
of speech in speaker recognition. Computer Speech & Language, 47, pp.16-29 (2018).
67. Avci, E., A new optimum feature extraction and classification method for speaker recog-
nition: GWPNN. Expert Systems with Applications, 32(2), pp.485-498 (2007).
68. Mary, L. and Yegnanarayana, B., Extraction and representation of prosodic features for
language and speaker recognition. Speech communication, 50(10), pp.782-796 (2008).
69. Djellali, H. and Laskri, M.T., Random vector quantisation modelling in automatic
speaker verification. International Journal of Biometrics, 5(3-4), pp.248-265 (2013).
70. Campbell, W.M., Sturim, D.E. and Reynolds, D.A., Support vector machines using
GMM supervectors for speaker verification. IEEE signal processing letters, 13(5), pp.308-
311 (2006).
71. Reynolds, D.A., Speaker identification and verification using Gaussian mixture speaker
models. Speech communication, 17(1-2), pp.91-108 (1995).
72. Markov, K. and Nakagawa, S., Frame level likelihood normalization for text-independent
speaker identification using Gaussian mixture models. In Proceeding of Fourth Inter-
national Conference on Spoken Language Processing. ICSLP’96 (Vol. 3, pp. 1764-1767)
IEEE(1996, October).
73. Susan, S. and Sharma, S., A fuzzy nearest neighbor classifier for speaker identification. In
2012 Fourth International Conference on Computational Intelligence and Communication
Networks (pp. 842-845) IEEE(2012, November).
74. Zeinali, H., Sameti, H. and Burget, L., Text-dependent speaker verification based on
i-vectors, Neural Networks and Hidden Markov Models. Computer Speech & Language,
46, pp.53-71 (2017).
75. McLaren, M., Castan, D., Ferrer, L. and Lawson, A., On the Issue of Calibration in
DNN-Based Speaker Recognition Systems. In INTERSPEECH (pp. 1825-1829) (2016,
September).
76. Matějka, P., Glembek, O., Novotný, O., Plchot, O., Grézl, F., Burget, L. and Černocký, J.H., Analysis of DNN approaches to speaker identification. In 2016 IEEE inter-
national conference on acoustics, speech and signal processing (ICASSP) (pp. 5100-5104)
IEEE(2016, March).
77. Richardson, F., Reynolds, D. and Dehak, N., Deep neural network approaches to speaker
and language recognition. IEEE signal processing letters, 22(10), pp.1671-1675 (2015).
78. Richardson, F., Reynolds, D. and Dehak, N., A unified deep neural network for speaker
and language recognition. arXiv preprint arXiv:1504.00923 (2015).
79. Matějka, P., Glembek, O., Castaldo, F., Alam, M.J., Plchot, O., Kenny, P., Burget, L. and Černocký, J., Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verifi-
cation. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP) (pp. 4828-4831) IEEE(2011, May).
80. You, C.H., Lee, K.A. and Li, H., GMM-SVM kernel with a Bhattacharyya-based dis-
tance for speaker recognition. IEEE Transactions on Audio, Speech, and Language Pro-
cessing, 18(6), pp.1300-1312 (2009).
81. Nguyen, V.X., Nguyen, V.P. and Pham, T.V., Robust speaker identification based on
hybrid model of VQ and GMM-UBM. In 2015 International Conference on Advanced
Technologies for Communications (ATC) (pp. 490-495) IEEE(2015, October).
82. Ling, Z. and Hong, Z., The improved VQ-MAP and its combination with LS-SVM for
speaker recognition. In IEEE Conference Anthology (pp. 1-4) IEEE(2013, January).
83. Ming, J., Stewart, D. and Vaseghi, S., 2005, March. Speaker identification in unknown
noisy conditions-a universal compensation approach. In Proceedings.(ICASSP’05). IEEE
International Conference on Acoustics, Speech, and Signal Processing, 2005. (Vol. 1, pp.
I-617). IEEE.
84. Shao, Y., Srinivasan, S. and Wang, D., 2007, April. Incorporating auditory feature
uncertainties in robust speaker identification. In 2007 IEEE International Conference on
Acoustics, Speech and Signal Processing-ICASSP’07 (Vol. 4, pp. IV-277). IEEE.
85. Shao, Y. and Wang, D., Robust speaker identification using auditory features and
computational auditory scene analysis. In 2008 IEEE International Conference on
Acoustics, Speech and Signal Processing (pp. 1589-1592) IEEE(2008, March).
86. Togneri, R. and Pullella, D., An overview of speaker identification: Accuracy and ro-
bustness issues. IEEE circuits and systems magazine, 11(2), pp.23-61 (2011).
87. Garcia-Romero, D., Zhou, X. and Espy-Wilson, C.Y., Multicondition training of Gaus-
sian PLDA models in i-vector space for noise and reverberation robust speaker recogni-
tion. In 2012 IEEE international conference on acoustics, speech and signal processing
(ICASSP) (pp. 4257-4260) IEEE(2012, March).
88. Zhao, X. and Wang, D., Analyzing noise robustness of MFCC and GFCC features in
speaker identification. In 2013 IEEE international conference on acoustics, speech and
signal processing (pp. 7204-7208) IEEE(2013, May).
89. Cooke, M., Green, P., Josifovski, L. and Vizinho, A., Robust automatic speech recogni-
tion with missing and unreliable acoustic data. Speech communication, 34(3), pp.267-285
(2001).
90. Kuhn, R., Nguyen, P., Junqua, J.C. and Boman, R., Panasonic Corp, Speaker verifica-
tion and speaker identification based on eigenvoices. U.S. Patent 6,141,644 (2000).
91. Vogt, R.J., Baker, B.J. and Sridharan, S., Modelling session variability in text indepen-
dent speaker verification (2005).
92. Kenny, P., Joint factor analysis of speaker and session variability: Theory and algo-
rithms. CRIM, Montreal,(Report) CRIM-06/08-13, 14, pp.28-29 (2005).
93. Kenny, P., Boulianne, G. and Dumouchel, P., Eigenvoice modeling with sparse
training data. IEEE transactions on speech and audio processing, 13(3), pp.345-354 (2005).
94. Kenny, P., Stafylakis, T., Ouellet, P. and Alam, M.J., JFA-based front ends for speaker
recognition. In 2014 IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP) (pp. 1705-1709) IEEE(2014, May).
95. Novoselov, S., Pekhovsky, T., Shulipa, A. and Sholokhov, A., Text-dependent GMM-JFA
system for password based speaker verification. In 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) (pp. 729-737) IEEE(2014, May).
96. Cumani, S. and Laface, P., Speaker recognition using e-vectors. IEEE/ACM Transac-
tions on Audio, Speech, and Language Processing, 26(4), pp.736-748 (2018).
97. Martin, A.F., Greenberg, C.S., Stanford, V.M., Howard, J.M., Doddington, G.R. and
Godfrey, J.J., Performance factor analysis for the 2012 NIST speaker recognition evalua-
tion. In Fifteenth Annual Conference of the International Speech Communication Associ-
ation (2014).
98. Kanagasundaram, A., Dean, D. and Sridharan, S., JFA based speaker recognition using
delta-phase and MFCC features. In SST 2012 14th Australasian International Conference
on Speech Science and Technology (2012, December).
99. Rajan, P., Afanasyev, A., Hautamäki, V. and Kinnunen, T., From single to multiple en-
rollment i-vectors: Practical PLDA scoring variants for speaker verification. Digital Signal
Processing, 31, pp.93-101 (2014).
100. Garcia, A.A. and Mammone, R.J., Channel-robust speaker identification using
modified-mean cepstral mean normalization with frequency warping. In 1999 IEEE Inter-
national Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99
(Cat. No. 99CH36258) (Vol. 1, pp. 325-328) IEEE(1999, March).
101. Juang, B.H., Rabiner, L. and Wilpon, J.G., On the use of bandpass liftering in speech
recognition. IEEE Transactions on acoustics, speech, and signal processing, 35(7), pp.947-
954 (1987).
102. Paliwal, K.K., Decorrelated and liftered filter-bank energies for robust speech recogni-
tion. In Sixth European Conference on Speech Communication and Technology (1999).
103. Chapaneri, S.V., Spoken digits recognition using weighted MFCC and improved fea-
tures for dynamic time warping. International Journal of Computer Applications, 40(3),
pp.6-12 (2012).
104. Colibro, D., Vair, C., Castaldo, F., Dalmasso, E. and Laface, P., Speaker recognition
using channel factors feature compensation. In 2006 14th European Signal Processing
Conference (pp. 1-5) IEEE(2006, September).
105. Aronowitz, H. and Aronowitz, V., Efficient score normalization for speaker recognition.
In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (pp.
4402-4405) IEEE(2010, March).
106. Büyük, O. and Arslan, M.L., Model selection and score normalization for text-
dependent single utterance speaker verification. Turkish Journal of Electrical Engineering
and Computer Science, 20(Sup. 2), pp.1277-1295 (2012).
107. Zheng, R., Zhang, S. and Xu, B., A comparative study of feature and score normal-
ization for speaker verification. In International Conference on Biometrics (pp. 531-538).
Springer, Berlin, Heidelberg (2006, January).
108. Bolt, R.H., Cooper, F.S., David, E.E., Denes, P.B., Pickett, J.M. and Stevens, K.N.,
Identification of a speaker by speech spectrograms. Science, 166(3903), pp.338-343 (1969).
109. Kenny, P., Ouellet, P., Dehak, N., Gupta, V. and Dumouchel, P., A study of
interspeaker variability in speaker verification. IEEE Transactions on Audio, Speech, and
Language Processing, 16(5), pp.980-988 (2008).
110. Brookes, M., Voicebox: Speech processing toolbox for matlab. Software, available
[Mar. 2011] from www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html, 47 (1997).
111. Nabney, I., NETLAB: algorithms for pattern recognition. Springer Science &
Business Media (2002).
112. Sawada, H., Mukai, R., Araki, S. and Makino, S., A robust and precise method
for solving the permutation problem of frequency-domain blind source separation. IEEE
transactions on speech and audio processing, 12(5), pp.530-538 (2004).