Figure 1: Pairing Speaker Models

Source publication
Conference Paper
Full-text available
This paper presents a hierarchical approach to the Large-Scale Speaker Recognition problem. Here the authors present a binary-tree database approach for arranging the trained speaker models based on a distance measure designed for comparing two sets of distributions. The combination of this hierarchical structure and the distance measure [1] pro...

Context in source publication

Context 1
This paper presents a hierarchical approach to the large-scale speaker recognition problem. Here the authors present a binary-tree database approach for arranging the trained speaker models based on a distance measure designed for comparing two sets of distributions. The combination of this hierarchical structure and the distance measure [1] provides the means for conducting a large-scale verification task. In addition, two techniques are presented for creating a model of the complement space to the cohort, which is used for rejection purposes. Results are presented for the drastic improvements achieved, mainly in reducing the false acceptance of the speaker verification system without any significant false-rejection degradation.

Let us consider a possible model for speech as being a collection of distributions (e.g., Gaussian distributions). To be able to rank speakers within a database based on the similarity of their speech characteristics, one needs a distance measure appropriate for comparing sets of distributions. Once this distance measure is established, a ranking process may be applied to order the speakers in a database in a hierarchical fashion for future reference. Last year, the authors presented a method for computing a meaningful distance between two collections of statistical distributions, which is very useful for ranking models consisting of collections of distributions [1]. Once such a hierarchical structure is established for the speakers in the database, the job of cohort computation becomes much easier.

This paper presents a classification technique for the speakers in a database based on a binary tree structure and provides the means for a quick computation of the cohort for any speaker in the tree. Then, false-rejection and false-acceptance results are given on a database of 184 speakers. In the way the speaker verification is implemented, the claimed speaker ID is used to find the cohort of the speaker from the binary tree by considering the speakers whose models are children of the same parent some number of generations up the tree. This verification scheme alone has limited rejection capabilities. Two techniques for reducing the false acceptance of this verification system are presented in the form of models created from a space complementary to the cohort space. The two techniques have pros and cons associated with them and are named by the authors the Graduated Complementary Model (GCM) and the Cumulative Complementary Model (CCM). The details of the derivation of these two complementary models, as well as implementation issues, are presented along with improvement results for the verification task.

The next section briefly describes the procedure for building the speaker models [2]. Then, the details of building the binary speaker tree using the distance measure of [1] are presented, after which a brief description of speaker recognition using these models and the speaker tree is given. Then, two very effective methods are established for creating a rejection mechanism used for speaker verification as well as open-set speaker identification. These methods are shown to reduce the false-acceptance rate of the speaker recognition by presenting results on a speaker verification task conducted over a 184-member database of speakers. Finally, some concluding remarks are given on improving the hierarchical structure to increase performance and accuracy.
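To make the "collection of distributions" view concrete, the following is a minimal sketch in Python of how such a speaker model might be represented, with each Gaussian component stored by its mean vector, (diagonal) covariance, and frame count. The class and field names are hypothetical illustrations, not identifiers from the original paper.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class GaussianComponent:
    """One d-dimensional Gaussian of a speaker model (diagonal covariance assumed)."""
    mean: np.ndarray        # mu_{i,j}, shape (d,)
    variance: np.ndarray    # diagonal of Sigma_{i,j}, shape (d,)
    count: float            # C_{i,j}: number of frames assigned to this component

@dataclass
class SpeakerModel:
    """Speaker model M_i: a collection of n_i Gaussian components for speaker i."""
    speaker_id: str
    components: List[GaussianComponent] = field(default_factory=list)
```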
Please note that the speaker recognition techniques presented here are text- and language-independent. As we mentioned in [2], a speaker model is created as a collection of parameters (means and covariances) for a set of multi-dimensional Gaussian distributions. These distributions model the features produced by the signal-processing front-end of the engine. Speaker model M_i is computed for the i-th speaker based on a sequence of M frames of speech, with d-dimensional feature vectors {f_m}_{m=1,...,M}. These models are stored in terms of their statistical parameters, such as {μ_{i,j}, Σ_{i,j}, C_{i,j}}_{j=1,...,n_i}, consisting of the mean vector, the covariance matrix, and the counts, for the case when a Gaussian distribution is selected. Each speaker i may end up with a model consisting of n_i distributions. The distance measure of [1] enables us to devise a speaker recognition system with capabilities for speaker identification, verification, and eventually clustering by creating a hierarchical structure.

A binary tree is constructed using the distance measure of [1]; see Figure 2. Each speaker model is computed as described above and in detail in [2]. Once the models are created, they are ranked using a bottom-up technique in which each individual model (a collection of multi-dimensional Gaussian distributions) is associated with a distinct speaker and constitutes a leaf of the tree. To perform the primary building operation of the tree, these models are compared with each other using the distance of [1]. Figure 1 shows a set of sorted distances δ_{km} associated with speakers i and j. Please note that k and m are the indices of the sorted list and generally differ from i and j. The sorting is done in a way that δ_{1m} = 0. Then, going down the table and left to right, the pair ij with the smallest distance δ_{km} is paired, based on the next available non-paired speakers with the smallest distance between their models. Due to the nature of the distance measure, these distance computations between models are orders of magnitude faster than a traditional maximum-likelihood approach.

Once all speaker pairs are determined, each pair of models is merged using the following technique for producing a new model with the characteristics of both contributing models. Figure 4 shows a small example with two models being merged, each having a different number of Gaussian distributions associated with it. The superscript in the notation denotes the model number and the subscript denotes the distribution number. Please note that the pairing of the distributions follows the techniques given in [1]. The merged Gaussian distribution with the left and right subscripts i and j denotes the distribution created from the i-th distribution of model 1 and the j-th distribution of model 2. The counts for the merged distribution are simply the sum of the counts of the two contributing distributions. The new model will have the same number of distributions as the maximum of the two models used in its creation. S_x and S_{x^2} denote the first- and second-order sums of the feature data. These parameters are used as an alternative set of parameters defining the Gaussian distributions of interest. Each merged pair of models creates a new parent model for the two, in the next level of the binary tree. If the number of models in a level is not divisible by two, the remaining model in that level may be merged with the members of the next generation (level) in the tree.
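As an illustration of the pairing-and-merging step described above, here is a hedged sketch of how one level of the binary speaker tree might be built. It reuses the GaussianComponent/SpeakerModel classes sketched earlier; `model_distance` stands in for the distance measure of [1] (not reproduced here), and `naive_pairing` is a simplified, index-based stand-in for the distribution-pairing technique of [1]. Merging adds the counts and the first- and second-order sums (S_x, S_x2) of the paired components, as in the text.

```python
import numpy as np
from itertools import combinations

def merge_components(a, b):
    """Merge two Gaussian components by adding their sufficient statistics.

    The first-order sum S_x and second-order sum S_x2 are recovered from the
    stored means/variances and counts, added, and converted back to a mean
    and a (diagonal) variance; the counts simply add.
    """
    n = a.count + b.count
    s_x = a.count * a.mean + b.count * b.mean
    s_x2 = (a.count * (a.variance + a.mean ** 2)
            + b.count * (b.variance + b.mean ** 2))
    mean = s_x / n
    variance = s_x2 / n - mean ** 2
    return GaussianComponent(mean=mean, variance=variance, count=n)

def naive_pairing(m1, m2):
    """Simplified stand-in for the distribution pairing of [1]: pair by index.

    Returns index pairs plus the unpaired components of the larger model, so
    the parent ends up with max(n_1, n_2) components, as stated in the paper.
    """
    n_small = min(len(m1.components), len(m2.components))
    pairs = [(k, k) for k in range(n_small)]
    larger = m1 if len(m1.components) >= len(m2.components) else m2
    return pairs, larger.components[n_small:]

def merge_models(m1, m2, pair_components=naive_pairing):
    """Merge two speaker models into a parent model for the next tree level."""
    pairs, leftovers = pair_components(m1, m2)
    merged = [merge_components(m1.components[i], m2.components[j]) for i, j in pairs]
    merged.extend(leftovers)
    return SpeakerModel(speaker_id=f"({m1.speaker_id}+{m2.speaker_id})",
                        components=merged)

def build_tree_level(models, model_distance):
    """Pair models greedily by smallest distance and merge each pair (one tree level)."""
    distances = sorted(
        ((model_distance(a, b), ia, ib)
         for (ia, a), (ib, b) in combinations(enumerate(models), 2)),
        key=lambda t: t[0])
    paired, parents = set(), []
    for _, ia, ib in distances:
        if ia in paired or ib in paired:
            continue
        paired.update({ia, ib})
        parents.append(merge_models(models[ia], models[ib]))
    # an odd leftover model is carried up and merged at the next level, as in the paper
    parents.extend(m for i, m in enumerate(models) if i not in paired)
    return parents
```

Calling `build_tree_level` repeatedly until a single model remains would yield the root, mirroring the layer-by-layer construction described next.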
As each level of the tree is created, the new models in that generation are treated as new speaker models containing their two children, and the process is continued layer by layer until one root model is reached at the top of the tree. In this structure, finding the cohort of a speaker is as simple as matching the label of the claimed ID with one of the leaf members; going up the tree by as many layers as desired (based on the required size of the cohort); and finally going back down from the resulting parent to all the leaves leading to that parent. The models in these leaves will be the closest speakers to the claimed speaker. The training data is stored using a hierarchical structure so that accessing the models is optimized at recognition time.

Speaker verification is implemented by extracting a set of speakers (with their models) from the training database, considering only those speakers in close proximity, as given by the distance measure of [1], to the speaker with the claimed ID. The claimant's sample speech is then used to generate a test model, which is compared to the models in the cohort set. The models are sorted by distance, and the training model with the smallest distance from the test model is used to obtain the verification result. If the background model or any speaker other than the claimant comes up at the top of the sorted list, the claim is rejected; otherwise, it is accepted. Alternatively, a thresholding method may be used to compare the likelihood of the input speech given the claimant model versus the average likelihood given the rest of the cohort members.

M_i = {μ_{i,j}, Σ_{i,j}, p_{i,j}}_{j=1,...,32} = {Θ_{i,j}}_{j=1,...,32} denotes the set of speaker models, consisting of the mean vector, diagonal covariance matrix, and mixture weight for each of the 32 components of the i-th 12-dimensional Gaussian Mixture Model (GMM) used to model the training data. The test data is denoted as O = {f_n}_{n=1,...,N}, and we assume that it is i.i.d. Let Σ_{i,j}(k) denote the variance of the k-th dimension. Given the observed testing data and an identity claim i, verification proceeds by first ...
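Below is a hedged sketch of the cohort lookup and the nearest-model verification decision described above, again reusing the SpeakerModel class from the earlier sketch. The tree-node structure, helper names, and `model_distance` (standing in for the distance measure of [1]) are illustrative assumptions, and the alternative likelihood-thresholding rule is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TreeNode:
    """A node of the binary speaker tree: a leaf holds one speaker's model,
    an internal node holds the merged model of its two children."""
    model: SpeakerModel
    parent: Optional["TreeNode"] = None
    children: List["TreeNode"] = field(default_factory=list)

def collect_leaves(node):
    """Return all leaf nodes in the subtree rooted at `node`."""
    if not node.children:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(collect_leaves(child))
    return leaves

def cohort(claimed_leaf, generations):
    """Cohort of a claimed speaker: go up `generations` levels, then back down to all leaves."""
    node = claimed_leaf
    for _ in range(generations):
        if node.parent is None:
            break
        node = node.parent
    return [leaf.model for leaf in collect_leaves(node)]

def verify(test_model, claimed_leaf, generations, model_distance, background_model=None):
    """Accept the claim only if the claimant's model is the closest cohort member
    (optionally including a background model) to the model built from the test speech."""
    candidates = cohort(claimed_leaf, generations)
    if background_model is not None:
        candidates = candidates + [background_model]
    best = min(candidates, key=lambda m: model_distance(test_model, m))
    return best.speaker_id == claimed_leaf.model.speaker_id
```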

Similar publications

Conference Paper
Full-text available
Wideband communications permit the transmission of an extended frequency range compared to the traditional narrowband. While benefits for automatic speaker recognition can be expected, the extent of the contribution of the additional bandwidth in wideband is still unclear. This work compares the i-vector speaker verification performances employing...

Citations

... Audio signals change randomly and continuously through time. As an example, music and audio signals have strong energy content in the low frequencies and weaker energy content in the high frequencies [31,32]. Figure 2 depicts generalized time and frequency spectra of audio signals [33]. ...
Chapter
Full-text available
This chapter addresses the topic of classification and separation of audio and music signals. It is a very important and challenging research area. The importance of the classification process for a stream of sounds comes from the need to build two different libraries: a speech library and a music library. The separation process, however, is sometimes needed in a cocktail-party problem to separate speech from music and remove the undesired one. In this chapter, some existing algorithms for the classification process and the separation process are presented and discussed thoroughly. The classification algorithms are divided into three categories. The first category includes most of the real-time approaches. The second category includes most of the frequency-domain approaches. The third category introduces some of the approaches in the time-frequency distribution. The time-domain approaches discussed in this chapter are the short-time energy (STE), the zero-crossing rate (ZCR), a modified version of the ZCR and the STE with positive derivative, neural networks, and the roll-off variance. The frequency-domain approaches are specifically the roll-off of the spectrum, the spectral centroid and the variance of the spectral centroid, the spectral flux and the variance of the spectral flux, the cepstral residual, and the delta pitch. The time-frequency domain approaches have not yet been tested thoroughly in the process of classification and separation of audio and music signals. Therefore, the spectrogram and the evolutionary spectrum are introduced and discussed. In addition, some algorithms for separation and segregation of music and audio signals, such as Independent Component Analysis, pitch cancelation, and artificial neural networks, are introduced.
... In terms of feature extraction, the very common time-domain features are short-time energy (STE) [26][27][28] and the zero-crossing rate (ZCR) [29,30]. Signal energy [31][32][33], fundamental frequency [34], Mel frequency cepstral coefficients (MFCC) [35][36][37] are the most used frequency-domain features. Recently, a few studies focused on speech and song/music discrimination [38][39][40]. ...
Article
Full-text available
A robust approach for the application of audio content classification (ACC) is proposed in this paper, especially for variable noise-level conditions. Speech, music, and background noise (also called silence) are usually mixed in a noisy audio signal. Based on this, we propose a hierarchical ACC approach consisting of three parts: voice activity detection (VAD), speech/music discrimination (SMD), and post-processing. First, entropy-based VAD is successfully used to segment the input signal into noisy audio and noise, even when the noise level varies. One-dimensional subband energy information (1D-SEI) and two-dimensional textural image information (2D-TII) are then combined to form a hybrid feature set. Hybrid-feature-based SMD is achieved by feeding this feature set into a support vector machine (SVM) classifier. Finally, rule-based post-processing of segments is utilized to smoothly determine the output of the ACC system. The noisy audio is thus classified into noise, speech, and music. Experimental results show that the hierarchical ACC system using hybrid-feature-based SMD and entropy-based VAD is successfully evaluated on three available datasets and is comparable with existing methods even in a variable noise-level environment. In addition, our test results with the VAD scheme and hybrid features also show that the proposed architecture increases the performance of audio content discrimination.
... Another method is hierarchically clustering the UBM mixtures (Xiang, Berger, 2003;Saeidi et al., 2010). Some of the other methods are speaker clustering at feature level (Xiong et al., 2006), and speaker clustering at model level (Beigi et al., 1999;De Leon, Apsingekar, 2007;Apsingekar, De Leon, 2009). However, there is a tradeoff between the identification rate and identification time, since not all the mixtures are scored, or not all the speakers' models are considered. ...
Article
Full-text available
Conventional speaker recognition systems use the Universal Background Model (UBM) as an imposter for all speakers. In this paper, speaker models are clustered to obtain better imposter model representations for speaker verification purposes. First, a UBM is trained, and speaker models are adapted from the UBM. Then, the k-means algorithm with the Euclidean distance measure is applied to the speaker models. The speakers are divided into two, three, four, and five clusters. The resulting cluster centers are used as background models of their respective speakers. Experiments showed that the proposed method consistently produced lower Equal Error Rates (EER) than the conventional UBM approach for 3, 10, and 30 second long test utterances, and also for channel mismatch conditions. The proposed method is also compared with the i-vector approach. The three-cluster model achieved the best performance, with a 12.4% relative EER reduction on average compared to the i-vector method. The statistical significance of the results is also given.
... In the time domain, descriptors such as the zero-crossing rate of the audio signal [4, 34] and the short-term energy of the signal [5, 29] have been introduced. The three descriptors mainly used in the frequency domain have been the fundamental frequency [37], the spectral energy of the signal [1, 3], and the Mel Frequency Cepstral Coefficients (MFCC) [6, 8, 23]. Other descriptors closer to human perception have also been used, namely the loudness measure [7, 31] and the energy contained in the sub-bands [12]. ...
Conference Paper
Full-text available
Singing is a salient element of a song, and its automatic detection within a track is a widely studied challenge. This article proposes an approach for discriminating the music titles that contain singing in a large music database. The approach previously proposed by Ghosal et al. [9] bases the decision on the analysis of descriptors at the song level. Here we generate a probability of the presence of singing at the frame level in order to make a global decision. A first method proposed for this classification uses the probability density of the predictions, and a second uses n-grams over the frames assumed to contain singing. The results of these new methods improve on those obtained by [9] and show better robustness as the size of the music database increases. Classification accuracy thus drops by only 3.6%, versus 13.1% for [9], when the test database is multiplied by 16.
... ZCR (zero crossing rate) [2], [3], [4] and STE (short time energy) [5], [6], [4] are the most widely used time domain features. Features like signal bandwidth, spectral centroid, signal energy [7], [8], [9], fundamental frequency [1], melfrequency cepstral co-efficients (MFCC) [10], [11] belong to the category of frequency domain features. Roughness and loudness measures [12] have been tried to capture the perceptual aspect. ...
... ZCR (zero crossing rate) (West and Cox 2004;Downie 2004) and STE (short time energy) (Saunders 1996;El-Maleh et al. 2000) are the commonly used time domain features. Features like signal bandwidth, spectral centroid, signal energy (Beigi et al. 1999;McKay and Fujinaga 2004;West and Cox 2005), fundamental frequency (Zhang and Kuo 1998), mel-frequency cepstral co-efficients (MFCC) (Eronen and Klapuri 2000;Foote 1997) belong to the category of frequency domain features. Roughness and loudness measures (Fastl and Zwicker 2007) have been presented to capture the perceptual aspect. ...
Article
Full-text available
Audio classification acts as the fundamental step for many applications such as content-based audio retrieval and audio indexing. In this work, we have presented a novel scheme for classifying an audio signal into three categories, namely speech, music without voice (instrumental), and music with voice (song). A hierarchical approach has been adopted to classify the signals. At the first stage, signals are categorized as speech or music using audio texture derived from simple features like ZCR and STE. The proposed audio texture captures contextual information and summarizes the frame-level features. At the second stage, music is further classified as instrumental or song based on Mel frequency cepstral coefficients (MFCC). A classifier based on Random Sample and Consensus (RANSAC), capable of handling a wide variety of data, has been utilized. Experimental results indicate the effectiveness of the proposed scheme.
... ZCR (zero crossing rate) [7], [8] and STE (short time energy) [9], [10] are the most widely used time domain features. Frequency domain approaches include features like signal bandwidth, spectral centroid, signal energy [11], [12], [13], [14], fundamental frequency [1], mel-frequency cepstral co-efficients (MFCC) [15], [16] etc. Perceptual/psychoacoustic features include measures for roughness [17], loudness [17], etc. In [18], a model representing the temporal envelope processing by the human auditory system has been proposed which yields 62 features describing the auditory filterbank temporal envelope (AFTE). ...
... Short time energy (STE) [18,6,11] and zero crossing rate (ZCR) [23,5,11] are very common time-domain features. The most widely used frequency-domain features are signal energy [1,15,24], fundamental frequency [27], and mel-frequency cepstral coefficients (MFCC) [7,10,12]. Perceptual features such as loudness, sharpness, and spread incorporate the human hearing process [16,31] to describe the sounds. ...
Conference Paper
Full-text available
Music classification is a fundamental step in any music retrieval system. As the first step towards this, we have proposed a scheme for discriminating music signals with voice (song) from those without voice (instrumental). The task is important, as song/instrumental discrimination is of immense importance in the context of a multi-lingual country like India. Moreover, it enables the subsequent classification of instrumentals based on the type of instrument. The spectrogram image of an audio signal shows the significance of different frequency components over the time scale. It has been observed that the spectrogram image of an instrumental signal shows more stable peaks persisting over time, whereas this is not so for a song. This has motivated us to look for spectrogram-image-based features. Contextual features have been computed based on the occurrence pattern of the most significant frequency over the time scale and the overall texture pattern revealed by the time-frequency distribution of signal intensity. RANSAC has been used to classify the signals. Experimental results indicate the effectiveness of the proposed scheme.
... ZCR (zero crossing rate) [2], [3], [4] and STE (short time energy) [5], [6], [4] are the most widely used time domain features. Features like signal bandwidth, spectral centroid, signal energy [7], [8], [9], fundamental frequency [1], melfrequency cepstral co-efficients (MFCC) [10], [11] belong to the category of frequency domain features. Roughness and loudness measures [12] have been tried to capture the perceptual aspect. ...
Article
Full-text available
In a music retrieval system, classification of music data serves as the fundamental step for organizing the database to support faster access to the desired data. In this context, it is very important to classify the music signal into two sub-categories, namely instrumental and song. A robust system for such classification will enable further classification based on instrument type or music genre. In this work, we have presented a simple but novel scheme using Mel Frequency Cepstral Coefficients (MFCC) as the signal descriptors. A classifier based on Random Sample and Consensus (RANSAC), capable of handling a wide variety of data, has been used. Experimental results indicate that the proposed scheme works well for a wide variety of music signals.