Figure 5 - uploaded by Homayoon Beigi
Speaker Verification Results 

Source publication
Conference Paper
Full-text available
This paper presents a hierarchical approach to the Large-Scale Speaker Recognition problem. Here the authors present a binary-tree database approach for arranging the trained speaker models based on a distance measure designed for comparing two sets of distributions. The combination of this hierarchical structure and the distance measure[1] pro...

Contexts in source publication

Context 1
... case of Speaker Identification, the claimant's test model may be compared, using the same distance measure, to all the models in the database, including that of the background model. This may be expedited with a top-down sweep of the tree, each time descending toward the child with the smallest distance, arriving at the correct leaf in only log2(N) comparisons. Note that this may constitute a rejection if the background model is the closest model to the test model. Speaker Classification, a direct product of the tree building, is useful on many occasions, including narrowing the search space for Speaker Recognition. The systems presented in [3, 4, 5] use speaker classification for performing speaker segmentation as well as for improving speech recognition accuracy through adaptation. Note also that if the claimant is an imposter who just happens to be closest to the claimed identity in the picked cohort, a false acceptance is reached with probability 1/(CohortSize). The first row of results in the table of figure 3 presents the false-rejection and false-acceptance results for 60 speakers out of a population of 184 speakers in the database. This data was collected using nine different microphones, including Tie-Clip, Hand-Held, and Far-Field microphones. The training data lasts an average of 40 seconds, and the test was performed using an average of 6 seconds of independent data; 60 of the 184 speakers were randomly chosen for testing. The next section presents two novel techniques, the Complementary Model Techniques, for solving the false-acceptance problem of verification. The first technique creates a single model representing all the models in the tree and outside the tree (given some background data); the authors call this the Cumulative Complementary Model (CCM). 
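As a rough sketch of the top-down sweep described above, assuming toy scalar "models" and a plain absolute-difference distance in place of the paper's distribution distance measure:

```python
# A rough sketch of the top-down sweep: at each level, descend into the
# child whose model is closest to the test model, reaching a leaf after
# about log2(N) levels instead of comparing against all N leaf models.
class Node:
    def __init__(self, model, children=None):
        self.model = model               # placeholder speaker model
        self.children = children or []   # an empty list marks a leaf

def distance(a, b):
    # Stand-in for the distance measure between two speaker models.
    return abs(a - b)

def identify(root, test_model):
    node, levels = root, 0
    while node.children:
        node = min(node.children, key=lambda c: distance(c.model, test_model))
        levels += 1
    return node.model, levels

# Balanced tree over 4 leaf speakers => 2 levels of comparisons.
leaves = [Node(m) for m in (1.0, 2.0, 8.0, 9.0)]
root = Node(5.0, [Node(1.5, leaves[:2]), Node(8.5, leaves[2:])])
best, levels = identify(root, 8.8)
```

In a balanced tree the number of levels grows logarithmically with the number of enrolled speakers, which is the source of the log2(N) speedup.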
The CCM is basically a merged model based on the complement of the cohort. Figure 5 shows a speaker tree with a graphic representation of the models used to create the CCM for an example cohort. Note that this is a very quick computation, since the tree structure is used to minimize the work. The following lists summarize the model production and the pros and cons of the two techniques.

CCM:
• The complementary model for each node is computed by merging the node's siblings with the complementary model of its parent while traveling down the tree.
• No confidence information is available from the rejection mechanism. Also, similar and dissimilar data are merged, giving a non-robust merged model; too many merges are done, and since the merging is suboptimal, this degrades accuracy.
• Decoding is faster than in GCM, since the modified cohort, consisting of the original cohort and the CCM, is smaller. Training is slower due to the many merges.

GCM:
• The complementary model for each node is the model merged from all its siblings; see figure 5.
• When building the modified cohort, the complementary models of the node and of its parents are added to the cohort list, and if verification finds one of these complementary models to be closest to the test speaker, the claim is rejected.
• There is an inherent confidence level associated with this method: the higher the level (closer to the root), the more confident the rejection decision.
• No merges are necessary, hence training is faster than for CCM, but testing is slower.

The background model denoted in figure 5 may be computed by obtaining a large amount of data not present in the tree and pooling it to create a single model. This allows further rejection capability for imposters who were never enrolled in the database. The table of figure 3 shows a drastic reduction in the false acceptance of the verification system when using the two proposed complementary models. As expected, the GCM produces much better results. 
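The two construction rules, merging siblings only (GCM) versus accumulating the parent's complement while walking down the tree (CCM), can be sketched with a deliberately simplified representation: each "model" here is just the set of leaf speakers it covers, and merging is set union, not the paper's actual model-merging procedure.

```python
# Illustrative only: models are sets of speaker labels; merging is union.
def merge(models):
    out = set()
    for m in models:
        out |= m
    return out

def gcm_complement(siblings):
    # GCM: the complementary model of a node is the merge of its siblings.
    return merge(siblings)

def ccm(path_siblings):
    # CCM: walking down from the root to the cohort leaf, merge each node's
    # siblings with the (already accumulated) complement of its parent.
    out = set()
    for sibs in path_siblings:
        out |= gcm_complement(sibs)
    return out

# Tree over speakers {a, b, c, d}: the leaf "a" has sibling {b}, and its
# parent {a, b} has sibling {c, d}; the CCM for cohort {"a"} should cover
# everything outside the cohort.
path_sibs = [[{"c", "d"}], [{"b"}]]
```

Under this toy representation, the CCM for cohort {"a"} comes out as the full complement {"b", "c", "d"}, which is exactly the property the technique relies on.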
In fact, the GCM reduces the false acceptance of the system to 0 with little degradation in the false rejection. To perform a quick speaker identification in log2(N) distance computations rather than N, the tree should be optimized for better top-down performance. This also allows an Identify-and-Verify scheme for better verification performance, compared with using the claimed ID as the cohort identifier. The authors are currently working on this optimization problem. Using the Likelihood-Based scheme, we have obtained the following preliminary results, which take mismatch conditions into account. All training data for a given speaker was collected from only one of 8 microphones. The testing data for the speaker was collected on the training microphone (the matched case) as well as on one of the other microphones (the mismatched case). The imposter trials could come from any of the 8 microphones. In the experiments, 28 speakers (male and female) were used; however, for any given piece of training or testing data, the gender was unknown. In addition, we tried to obtain an even distribution of microphones across training and testing, and we limited the amount of training and testing data to approximately 10 seconds. There were a total of 125 speakers in the tree, with 199 matched verification tests, 214 mismatched tests, and 382 imposter tests. The imposters were taken from a population that excluded all enrolled speakers. The equal error rate was 13 . ...
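The equal error rate quoted above is the standard operating point where false rejection and false acceptance coincide. A generic sketch of how it is estimated from verification scores (not the paper's exact evaluation code):

```python
# Sweep a decision threshold over all observed scores; false rejection (FR)
# counts genuine scores below the threshold, false acceptance (FA) counts
# imposter scores at or above it. The EER is taken where |FR - FA| is
# smallest.
def equal_error_rate(genuine, imposter):
    best = None
    for t in sorted(set(genuine) | set(imposter)):
        fr = sum(g < t for g in genuine) / len(genuine)     # false rejections
        fa = sum(i >= t for i in imposter) / len(imposter)  # false acceptances
        if best is None or abs(fr - fa) < abs(best[0] - best[1]):
            best = (fr, fa)
    return (best[0] + best[1]) / 2
```

For example, with genuine scores [0.9, 0.8, 0.7, 0.3] and imposter scores [0.1, 0.2, 0.4, 0.6], the sweep balances at one rejected genuine trial and one accepted imposter trial, giving an EER of 0.25.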

Similar publications

Conference Paper
Full-text available
Wideband communications permit the transmission of an extended frequency range compared to the traditional narrowband. While benefits for automatic speaker recognition can be expected, the extent of the contribution of the additional bandwidth in wideband is still unclear. This work compares the i-vector speaker verification performances employing...

Citations

... Audio signal changes randomly and continuously through time. As an example, music and audio signals have strong energy content in the low frequencies and weaker energy content in the high frequencies [31,32]. Figure 2 depicts a generalized time and frequency spectra of audio signals [33]. ...
Chapter
Full-text available
This chapter addresses the classification and separation of audio and music signals, a very important and challenging research area. Classifying a stream of sounds matters for building two different libraries: a speech library and a music library. The separation process is sometimes needed in a cocktail-party problem to separate speech from music and remove the undesired component. In this chapter, some existing algorithms for the classification and separation processes are presented and discussed thoroughly. The classification algorithms are divided into three categories: the first includes most of the real-time (time-domain) approaches, the second most of the frequency-domain approaches, and the third some approaches based on time-frequency distributions. The time-domain approaches discussed in this chapter are short-time energy (STE), the zero-crossing rate (ZCR), a modified version of the ZCR and the STE with positive derivative, neural networks, and the roll-off variance. The frequency-spectrum approaches are the spectral roll-off, the spectral centroid and its variance, the spectral flux and its variance, the cepstral residual, and the delta pitch. Time-frequency-domain approaches have not yet been tested thoroughly for the classification and separation of audio and music signals; therefore, the spectrogram and the evolutionary spectrum are introduced and discussed. In addition, some algorithms for the separation and segregation of music and audio signals, such as Independent Component Analysis, pitch cancellation, and artificial neural networks, are introduced.
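The two time-domain features named above, STE and ZCR, are simple enough to define in a few lines. This is a generic illustration of the standard definitions, not the chapter's exact formulation or normalization:

```python
# frame: a list of audio samples.
def short_time_energy(frame):
    # Mean squared amplitude over the frame.
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    # Fraction of consecutive sample pairs whose signs differ.
    crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)
```

Speech tends to alternate high-energy voiced segments (high STE, low ZCR) with noisy unvoiced segments (low STE, high ZCR), which is what makes these two features useful for speech/music discrimination.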
... In terms of feature extraction, the very common time-domain features are short-time energy (STE) [26][27][28] and the zero-crossing rate (ZCR) [29,30]. Signal energy [31][32][33], fundamental frequency [34], Mel frequency cepstral coefficients (MFCC) [35][36][37] are the most used frequency-domain features. Recently, a few studies focused on speech and song/music discrimination [38][39][40]. ...
Article
Full-text available
A robust approach to audio content classification (ACC) is proposed in this paper, especially for variable noise-level conditions. Speech, music, and background noise (also called silence) are usually mixed in a noisy audio signal. Based on this observation, we propose a hierarchical ACC approach consisting of three parts: voice activity detection (VAD), speech/music discrimination (SMD), and post-processing. First, entropy-based VAD is used to segment the input signal into noisy audio and noise, even under variable noise levels. One-dimensional subband energy information (1D-SEI) and two-dimensional textural image information (2D-TII) are then combined into a hybrid feature set. Hybrid-based SMD is achieved by feeding this feature set into a support vector machine (SVM) classifier. Finally, rule-based post-processing of the segments smooths the output of the ACC system, and the noisy audio is classified into noise, speech, and music. Experimental results show that the hierarchical ACC system, using hybrid-feature-based SMD and entropy-based VAD, is successfully evaluated on three available datasets and is comparable with existing methods even in a variable noise-level environment. In addition, test results with the VAD scheme and hybrid features show that the proposed architecture improves audio content discrimination.
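Entropy-based VAD exploits the fact that a flat noise spectrum has near-maximal entropy, while voiced frames concentrate energy in a few bins and are therefore lower in entropy. A generic sketch of the idea (the paper's exact formulation may differ; `threshold` is a hypothetical tuning parameter):

```python
import math

def spectral_entropy(power_spectrum):
    # Normalize the spectrum into a probability distribution over bins,
    # then compute its Shannon entropy (natural log).
    total = sum(power_spectrum)
    probs = [p / total for p in power_spectrum if p > 0]
    return -sum(p * math.log(p) for p in probs)

def is_active(power_spectrum, threshold):
    # Lower entropy (peakier spectrum) suggests voice activity.
    return spectral_entropy(power_spectrum) < threshold
```

Because entropy is computed on the normalized spectrum, the decision is insensitive to the overall signal level, which is what makes it attractive under variable noise levels.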
... Another method is hierarchically clustering the UBM mixtures (Xiang, Berger, 2003;Saeidi et al., 2010). Some of the other methods are speaker clustering at feature level (Xiong et al., 2006), and speaker clustering at model level (Beigi et al., 1999;De Leon, Apsingekar, 2007;Apsingekar, De Leon, 2009). However, there is a tradeoff between the identification rate and identification time, since not all the mixtures are scored, or not all the speakers' models are considered. ...
Article
Full-text available
Conventional speaker recognition systems use the Universal Background Model (UBM) as the imposter model for all speakers. In this paper, speaker models are clustered to obtain better imposter model representations for speaker verification. First, a UBM is trained, and speaker models are adapted from the UBM. Then, the k-means algorithm with the Euclidean distance measure is applied to the speaker models. The speakers are divided into two, three, four, and five clusters, and the resulting cluster centers are used as the background models of their respective speakers. Experiments showed that the proposed method consistently produced lower Equal Error Rates (EER) than the conventional UBM approach for 3-, 10-, and 30-second test utterances, as well as under channel mismatch conditions. The proposed method is also compared with the i-vector approach; the three-cluster model achieved the best performance, with a 12.4% relative EER reduction on average compared to the i-vector method. The statistical significance of the results is also given.
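A minimal 1-D k-means sketch of the clustering step described above, with toy scalar "models" standing in for the UBM-adapted speaker models and cluster centers playing the role of per-cluster background models (illustrative only, not the paper's implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm: assign each point to its nearest center, then
    # move each center to the mean of its assigned points.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups of "speaker models".
centers, clusters = kmeans([1.0, 1.2, 0.8, 10.0, 10.2, 9.8], 2)
```

Each speaker is then scored against the center of its own cluster instead of a single global UBM, which is the substitution the paper evaluates.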
... At the temporal level, descriptors such as the signal's zero-crossing rate [4, 34] and short-term energy [5, 29] were introduced. The three descriptors mainly used at the frequency level have been the fundamental frequency [37], the spectral energy of the signal [1, 3], and the Mel Frequency Cepstral Coefficients (MFCC) [6, 8, 23]. Other descriptors closer to human perception have also been used, namely the loudness measure [7, 31] and the energy contained in sub-bands [12]. ...
Conference Paper
Full-text available
Singing is a salient element of a song, and its automatic detection within a track is a widely studied challenge. This article proposes an approach for discriminating the musical tracks that contain singing in a large music database. The approach previously proposed by Ghosal et al. [9] bases its decision on descriptors analyzed at the scale of the whole song. Here we generate a probability of singing presence at the frame scale in order to make a global decision. A first method proposed for this classification uses the probability density of the predictions, and a second uses n-grams over the frames assumed to contain singing. The results of these new methods improve on those obtained by [9] and show better robustness as the size of the music database grows: classification accuracy drops by only 3.6%, versus 13.1% for [9], when the test database is multiplied by 16.
... ZCR (zero crossing rate) [2], [3], [4] and STE (short time energy) [5], [6], [4] are the most widely used time domain features. Features like signal bandwidth, spectral centroid, signal energy [7], [8], [9], fundamental frequency [1], melfrequency cepstral co-efficients (MFCC) [10], [11] belong to the category of frequency domain features. Roughness and loudness measures [12] have been tried to capture the perceptual aspect. ...
... ZCR (zero crossing rate) (West and Cox 2004;Downie 2004) and STE (short time energy) (Saunders 1996;El-Maleh et al. 2000) are the commonly used time domain features. Features like signal bandwidth, spectral centroid, signal energy (Beigi et al. 1999;McKay and Fujinaga 2004;West and Cox 2005), fundamental frequency (Zhang and Kuo 1998), mel-frequency cepstral co-efficients (MFCC) (Eronen and Klapuri 2000;Foote 1997) belong to the category of frequency domain features. Roughness and loudness measures (Fastl and Zwicker 2007) have been presented to capture the perceptual aspect. ...
Article
Full-text available
Audio classification is the fundamental step for many applications, such as content-based audio retrieval and audio indexing. In this work, we present a novel scheme for classifying an audio signal into three categories: speech, music without voice (instrumental), and music with voice (song). A hierarchical approach is adopted. At the first stage, signals are categorized as speech or music using an audio texture derived from simple features such as ZCR and STE; the proposed audio texture captures contextual information and summarizes the frame-level features. At the second stage, music is further classified as instrumental or song based on Mel frequency cepstral coefficients (MFCC). A classifier based on Random Sample Consensus (RANSAC), capable of handling a wide variety of data, is utilized. Experimental results indicate the effectiveness of the proposed scheme.
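RANSAC, used as the classifier in the abstract above, is easiest to see in its classic line-fitting form: repeatedly fit a candidate model to a random minimal sample and keep the candidate with the most inliers. The sketch below uses toy 2-D points, not the paper's audio features; only the consensus idea carries over.

```python
import random

def ransac_line(points, iters=100, tol=0.5, seed=1):
    # Fit y = a*x + b through two random points per iteration; score each
    # candidate by how many points lie within `tol` of the line.
    rng = random.Random(seed)
    best, best_inliers = None, -1
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical pair: skip this candidate
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = sum(abs(y - (a * x + b)) <= tol for x, y in points)
        if inliers > best_inliers:
            best, best_inliers = (a, b), inliers
    return best, best_inliers

# Six points on y = 2x + 1 plus two gross outliers.
pts = [(float(x), 2.0 * x + 1) for x in range(6)] + [(3.0, 50.0), (1.0, -30.0)]
(slope, intercept), inliers = ransac_line(pts)
```

Because candidates fitted through an outlier attract almost no consensus, the recovered line matches the six clean points and ignores the two outliers, which is the robustness property that motivates using RANSAC on heterogeneous audio data.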
... ZCR (zero crossing rate) [7], [8] and STE (short time energy) [9], [10] are the most widely used time domain features. Frequency domain approaches include features like signal bandwidth, spectral centroid, signal energy [11], [12], [13], [14], fundamental frequency [1], mel-frequency cepstral co-efficients (MFCC) [15], [16] etc. Perceptual/psychoacoustic features include measures for roughness [17], loudness [17], etc. In [18], a model representing the temporal envelope processing by the human auditory system has been proposed which yields 62 features describing the auditory filterbank temporal envelope (AFTE). ...
... Short time energy (STE) [18,6,11] and zero crossing rate (ZCR) [23,5,11] are very common time domain features. Mostly used frequency domain features are signal energy [1,15,24], fundamental frequency [27], mel-frequency cepstral co-efficients (MFCC) [7,10,12]. Perceptual features such as loudness, sharpness and spread incorporate the human hearing process [16,31] to describe the sounds. ...
Conference Paper
Full-text available
Music classification is a fundamental step in any music retrieval system. As a first step, we propose a scheme for discriminating music signals with voice (song) from those without voice (instrumental). The task is important, as song/instrumental discrimination is of immense value in the context of a multi-lingual country like India; moreover, it enables the subsequent classification of instrumentals based on the type of instrument. The spectrogram image of an audio signal shows the significance of different frequency components over the time scale. It has been observed that the spectrogram of an instrumental signal shows more stable peaks persisting over time, while this is not so for a song, which motivated us to look for spectrogram-image-based features. Contextual features are computed based on the occurrence pattern of the most significant frequency over the time scale and the overall texture pattern revealed by the time-frequency distribution of signal intensity. RANSAC is used to classify the signals. Experimental results indicate the effectiveness of the proposed scheme.
... ZCR (zero crossing rate) [2], [3], [4] and STE (short time energy) [5], [6], [4] are the most widely used time domain features. Features like signal bandwidth, spectral centroid, signal energy [7], [8], [9], fundamental frequency [1], melfrequency cepstral co-efficients (MFCC) [10], [11] belong to the category of frequency domain features. Roughness and loudness measures [12] have been tried to capture the perceptual aspect. ...
Article
Full-text available
In a music retrieval system, classification of the music data serves as the fundamental step for organizing the database to support faster access to the desired data. In this context, it is very important to classify the music signal into two sub-categories, namely instrumental and song; a robust system for such classification enables further classification based on instrument type or music genre. In this work, we present a simple but novel scheme using Mel Frequency Cepstral Coefficients (MFCC) as the signal descriptors. A classifier based on Random Sample Consensus (RANSAC), capable of handling a wide variety of data, is used. Experimental results indicate that the proposed scheme works well for a wide variety of music signals.