Figure 5 - uploaded by Homayoon Beigi
Speaker Verification Results 

Source publication
Conference Paper
Full-text available
This paper presents a hierarchical approach to the Large-Scale Speaker Recognition problem. Here the authors present a binary-tree database approach for arranging the trained speaker models based on a distance measure designed for comparing two sets of distributions. The combination of this hierarchical structure and the distance measure[1] pro...

Contexts in source publication

Context 1
... case of Speaker Identification, the claimant's test model may be compared, using the same distance measure, to all the models in the database, including that of the background model. This may be expedited with a top-down sweep of the tree, each time descending toward the child with the smallest distance, arriving at the correct leaf in only log2(N) comparisons. Note that this may constitute a rejection if the background model is the closest model to the test model. Speaker Classification, a direct product of the tree building, is useful on many occasions, including narrowing the search space for Speaker Recognition. The systems presented in [3, 4, 5] use speaker classification for performing speaker segmentation as well as for improving speech recognition accuracy through adaptation. Note also that if the claimant is an imposter who just happens to be closest to the claimed identity in the picked cohort, a false acceptance is reached with probability 1/(CohortSize). The first row of results in the table of figure 3 presents the false-rejection and false-acceptance results for 60 speakers out of a population of 184 speakers in the database. This data was collected using nine different microphones, including Tie-Clip, Hand-Held, and Far-Field microphones. The training data lasts an average of 40 seconds, and the test was performed using an average of 6 seconds of independent data; 60 of the 184 speakers were randomly chosen for testing. The next section presents two novel techniques, the Complementary Model Techniques, for solving the false-acceptance problem of verification. The first technique creates a single model representing all the models in the tree and outside the tree (given some background data); the authors call this the Cumulative Complementary Model (CCM). 
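As a rough sketch of the top-down sweep described above, assuming toy scalar "models" and a plain absolute-difference distance in place of the paper's distribution distance measure:

```python
# A rough sketch of the top-down sweep: at each level, descend into the
# child whose model is closest to the test model, reaching a leaf after
# about log2(N) levels instead of comparing against all N leaf models.
class Node:
    def __init__(self, model, children=None):
        self.model = model               # placeholder speaker model
        self.children = children or []   # an empty list marks a leaf

def distance(a, b):
    # Stand-in for the distance measure between two speaker models.
    return abs(a - b)

def identify(root, test_model):
    node, levels = root, 0
    while node.children:
        node = min(node.children, key=lambda c: distance(c.model, test_model))
        levels += 1
    return node.model, levels

# Balanced tree over 4 leaf speakers => 2 levels of comparisons.
leaves = [Node(m) for m in (1.0, 2.0, 8.0, 9.0)]
root = Node(5.0, [Node(1.5, leaves[:2]), Node(8.5, leaves[2:])])
best, levels = identify(root, 8.8)
```

In a balanced tree the number of levels grows logarithmically with the number of enrolled speakers, which is the source of the log2(N) speedup.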
The CCM is basically a merged model based on the complement of the cohort. Figure 5 shows a speaker tree with a graphic representation of the models used to create the CCM for an example cohort. Note that this is a very quick computation, since the tree structure is used to minimize the work. The following lists summarize the model production and the pros and cons of the two techniques.

CCM:
• The complementary model for each node is computed by merging the node's siblings with the complementary model of its parent while traveling down the tree.
• No confidence information is available from the rejection mechanism. Also, similar and dissimilar data are merged, giving a non-robust merged model; too many merges are done, and since the merging is suboptimal, this degrades accuracy.
• Decoding is faster than in GCM, since the modified cohort, consisting of the original cohort and the CCM, is smaller. Training is slower due to the many merges.

GCM:
• The complementary model for each node is the model merged from all its siblings; see figure 5.
• When building the modified cohort, the complementary models of the node and of its parents are added to the cohort list, and if verification finds one of these complementary models to be closest to the test speaker, the claim is rejected.
• There is an inherent confidence level associated with this method: the higher the level (closer to the root), the more confident the rejection decision.
• No merges are necessary, hence training is faster than for CCM, but testing is slower.

The background model denoted in figure 5 may be computed by obtaining a large amount of data not present in the tree and pooling it to create a single model. This allows further rejection capability for imposters who were never enrolled in the database. The table of figure 3 shows a drastic reduction in the false acceptance of the verification system when using the two proposed complementary models. As expected, the GCM produces much better results. 
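The two construction rules, merging siblings only (GCM) versus accumulating the parent's complement while walking down the tree (CCM), can be sketched with a deliberately simplified representation: each "model" here is just the set of leaf speakers it covers, and merging is set union, not the paper's actual model-merging procedure.

```python
# Illustrative only: models are sets of speaker labels; merging is union.
def merge(models):
    out = set()
    for m in models:
        out |= m
    return out

def gcm_complement(siblings):
    # GCM: the complementary model of a node is the merge of its siblings.
    return merge(siblings)

def ccm(path_siblings):
    # CCM: walking down from the root to the cohort leaf, merge each node's
    # siblings with the (already accumulated) complement of its parent.
    out = set()
    for sibs in path_siblings:
        out |= gcm_complement(sibs)
    return out

# Tree over speakers {a, b, c, d}: the leaf "a" has sibling {b}, and its
# parent {a, b} has sibling {c, d}; the CCM for cohort {"a"} should cover
# everything outside the cohort.
path_sibs = [[{"c", "d"}], [{"b"}]]
```

Under this toy representation, the CCM for cohort {"a"} comes out as the full complement {"b", "c", "d"}, which is exactly the property the technique relies on.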
In fact, the GCM reduces the false acceptance of the system to 0 with little degradation in the false rejection. To perform a quick speaker identification in log2(N) distance computations rather than N, the tree should be optimized for better top-down performance. This also allows an Identify-and-Verify scheme for better verification performance, compared with using the claimed ID as the cohort identifier. The authors are currently working on this optimization problem. Using the Likelihood-Based scheme, we have obtained the following preliminary results, which take mismatch conditions into account. All training data for a given speaker was collected from only one of 8 microphones. The testing data for the speaker was collected on the training microphone (the matched case) as well as on one of the other microphones (the mismatched case). The imposter trials could come from any of the 8 microphones. In the experiments, 28 speakers (male and female) were used; however, for any given piece of training or testing data, the gender was unknown. In addition, we tried to obtain an even distribution of microphones across training and testing, and we limited the amount of training and testing data to approximately 10 seconds. There were a total of 125 speakers in the tree, with 199 matched verification tests, 214 mismatched tests, and 382 imposter tests. The imposters were taken from a population that excluded all enrolled speakers. The equal error rate was 13 . ...
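The equal error rate quoted above is the standard operating point where false rejection and false acceptance coincide. A generic sketch of how it is estimated from verification scores (not the paper's exact evaluation code):

```python
# Sweep a decision threshold over all observed scores; false rejection (FR)
# counts genuine scores below the threshold, false acceptance (FA) counts
# imposter scores at or above it. The EER is taken where |FR - FA| is
# smallest.
def equal_error_rate(genuine, imposter):
    best = None
    for t in sorted(set(genuine) | set(imposter)):
        fr = sum(g < t for g in genuine) / len(genuine)     # false rejections
        fa = sum(i >= t for i in imposter) / len(imposter)  # false acceptances
        if best is None or abs(fr - fa) < abs(best[0] - best[1]):
            best = (fr, fa)
    return (best[0] + best[1]) / 2
```

For example, with genuine scores [0.9, 0.8, 0.7, 0.3] and imposter scores [0.1, 0.2, 0.4, 0.6], the sweep balances at one rejected genuine trial and one accepted imposter trial, giving an EER of 0.25.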

Similar publications

Conference Paper
Full-text available
Wideband communications permit the transmission of an extended frequency range compared to the traditional narrowband. While benefits for automatic speaker recognition can be expected, the extent of the contribution of the additional bandwidth in wideband is still unclear. This work compares the i-vector speaker verification performances employing...

Citations

... Audio signal changes randomly and continuously through time. As an example, music and audio signals have strong energy content in the low frequencies and weaker energy content in the high frequencies [31,32]. Figure 2 depicts a generalized time and frequency spectra of audio signals [33]. ...
Chapter
Full-text available
This chapter addresses the classification and separation of audio and music signals, a very important and challenging research area. Classifying a stream of sounds matters for building two different libraries: a speech library and a music library. The separation process is sometimes needed in a cocktail-party problem to separate speech from music and remove the undesired component. In this chapter, some existing algorithms for the classification and separation processes are presented and discussed thoroughly. The classification algorithms are divided into three categories: the first includes most of the real-time (time-domain) approaches, the second most of the frequency-domain approaches, and the third some approaches based on time-frequency distributions. The time-domain approaches discussed in this chapter are short-time energy (STE), the zero-crossing rate (ZCR), a modified version of the ZCR and the STE with positive derivative, neural networks, and the roll-off variance. The frequency-spectrum approaches are the spectral roll-off, the spectral centroid and its variance, the spectral flux and its variance, the cepstral residual, and the delta pitch. Time-frequency-domain approaches have not yet been tested thoroughly for the classification and separation of audio and music signals; therefore, the spectrogram and the evolutionary spectrum are introduced and discussed. In addition, some algorithms for the separation and segregation of music and audio signals, such as Independent Component Analysis, pitch cancellation, and artificial neural networks, are introduced.
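The two time-domain features named above, STE and ZCR, are simple enough to define in a few lines. This is a generic illustration of the standard definitions, not the chapter's exact formulation or normalization:

```python
# frame: a list of audio samples.
def short_time_energy(frame):
    # Mean squared amplitude over the frame.
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    # Fraction of consecutive sample pairs whose signs differ.
    crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)
```

Speech tends to alternate high-energy voiced segments (high STE, low ZCR) with noisy unvoiced segments (low STE, high ZCR), which is what makes these two features useful for speech/music discrimination.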
... In terms of feature extraction, the very common time-domain features are short-time energy (STE) [26][27][28] and the zero-crossing rate (ZCR) [29,30]. Signal energy [31][32][33], fundamental frequency [34], Mel frequency cepstral coefficients (MFCC) [35][36][37] are the most used frequency-domain features. Recently, a few studies focused on speech and song/music discrimination [38][39][40]. ...
Article
Full-text available
A robust approach to audio content classification (ACC) is proposed in this paper, especially for variable noise-level conditions. Speech, music, and background noise (also called silence) are usually mixed in a noisy audio signal. Based on this observation, we propose a hierarchical ACC approach consisting of three parts: voice activity detection (VAD), speech/music discrimination (SMD), and post-processing. First, entropy-based VAD is used to segment the input signal into noisy audio and noise, even under variable noise levels. One-dimensional subband energy information (1D-SEI) and two-dimensional textural image information (2D-TII) are then combined into a hybrid feature set. Hybrid-based SMD is achieved by feeding this feature set into a support vector machine (SVM) classifier. Finally, rule-based post-processing of the segments smooths the output of the ACC system, and the noisy audio is classified into noise, speech, and music. Experimental results show that the hierarchical ACC system, using hybrid-feature-based SMD and entropy-based VAD, is successfully evaluated on three available datasets and is comparable with existing methods even in a variable noise-level environment. In addition, test results with the VAD scheme and hybrid features show that the proposed architecture improves audio content discrimination.
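Entropy-based VAD exploits the fact that a flat noise spectrum has near-maximal entropy, while voiced frames concentrate energy in a few bins and are therefore lower in entropy. A generic sketch of the idea (the paper's exact formulation may differ; `threshold` is a hypothetical tuning parameter):

```python
import math

def spectral_entropy(power_spectrum):
    # Normalize the spectrum into a probability distribution over bins,
    # then compute its Shannon entropy (natural log).
    total = sum(power_spectrum)
    probs = [p / total for p in power_spectrum if p > 0]
    return -sum(p * math.log(p) for p in probs)

def is_active(power_spectrum, threshold):
    # Lower entropy (peakier spectrum) suggests voice activity.
    return spectral_entropy(power_spectrum) < threshold
```

Because entropy is computed on the normalized spectrum, the decision is insensitive to the overall signal level, which is what makes it attractive under variable noise levels.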
... Another method is hierarchically clustering the UBM mixtures (Xiang, Berger, 2003;Saeidi et al., 2010). Some of the other methods are speaker clustering at feature level (Xiong et al., 2006), and speaker clustering at model level (Beigi et al., 1999;De Leon, Apsingekar, 2007;Apsingekar, De Leon, 2009). However, there is a tradeoff between the identification rate and identification time, since not all the mixtures are scored, or not all the speakers' models are considered. ...
Article
Full-text available
Conventional speaker recognition systems use the Universal Background Model (UBM) as the imposter model for all speakers. In this paper, speaker models are clustered to obtain better imposter model representations for speaker verification. First, a UBM is trained, and speaker models are adapted from the UBM. Then, the k-means algorithm with the Euclidean distance measure is applied to the speaker models. The speakers are divided into two, three, four, and five clusters, and the resulting cluster centers are used as the background models of their respective speakers. Experiments showed that the proposed method consistently produced lower Equal Error Rates (EER) than the conventional UBM approach for 3-, 10-, and 30-second test utterances, as well as under channel mismatch conditions. The proposed method is also compared with the i-vector approach; the three-cluster model achieved the best performance, with a 12.4% relative EER reduction on average compared to the i-vector method. The statistical significance of the results is also given.
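A minimal 1-D k-means sketch of the clustering step described above, with toy scalar "models" standing in for the UBM-adapted speaker models and cluster centers playing the role of per-cluster background models (illustrative only, not the paper's implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm: assign each point to its nearest center, then
    # move each center to the mean of its assigned points.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups of "speaker models".
centers, clusters = kmeans([1.0, 1.2, 0.8, 10.0, 10.2, 9.8], 2)
```

Each speaker is then scored against the center of its own cluster instead of a single global UBM, which is the substitution the paper evaluates.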
... At the temporal level, descriptors such as the signal's zero-crossing rate [4, 34] and short-term energy [5, 29] were introduced. The three descriptors mainly used at the frequency level have been the fundamental frequency [37], the spectral energy of the signal [1, 3], and the Mel Frequency Cepstral Coefficients (MFCC) [6, 8, 23]. Other descriptors closer to human perception have also been used, namely the loudness measure [7, 31] and the energy contained in sub-bands [12]. ...
Conference Paper
Full-text available
Singing is a salient element of a song, and its automatic detection within a track is a widely studied challenge. This article proposes an approach for discriminating the musical tracks that contain singing in a large music database. The approach previously proposed by Ghosal et al. [9] bases its decision on descriptors analyzed at the scale of the whole song. Here we generate a probability of singing presence at the frame scale in order to make a global decision. A first method proposed for this classification uses the probability density of the predictions, and a second uses n-grams over the frames assumed to contain singing. The results of these new methods improve on those obtained by [9] and show better robustness as the size of the music database grows: classification accuracy drops by only 3.6%, versus 13.1% for [9], when the test database is multiplied by 16.
... ZCR (zero crossing rate) [2], [3], [4] and STE (short time energy) [5], [6], [4] are the most widely used time domain features. Features like signal bandwidth, spectral centroid, signal energy [7], [8], [9], fundamental frequency [1], melfrequency cepstral co-efficients (MFCC) [10], [11] belong to the category of frequency domain features. Roughness and loudness measures [12] have been tried to capture the perceptual aspect. ...
... ZCR (zero crossing rate) (West and Cox 2004;Downie 2004) and STE (short time energy) (Saunders 1996;El-Maleh et al. 2000) are the commonly used time domain features. Features like signal bandwidth, spectral centroid, signal energy (Beigi et al. 1999;McKay and Fujinaga 2004;West and Cox 2005), fundamental frequency (Zhang and Kuo 1998), mel-frequency cepstral co-efficients (MFCC) (Eronen and Klapuri 2000;Foote 1997) belong to the category of frequency domain features. Roughness and loudness measures (Fastl and Zwicker 2007) have been presented to capture the perceptual aspect. ...
Article
Full-text available
Audio classification is the fundamental step for many applications, such as content-based audio retrieval and audio indexing. In this work, we present a novel scheme for classifying an audio signal into three categories: speech, music without voice (instrumental), and music with voice (song). A hierarchical approach is adopted. At the first stage, signals are categorized as speech or music using an audio texture derived from simple features such as ZCR and STE; the proposed audio texture captures contextual information and summarizes the frame-level features. At the second stage, music is further classified as instrumental or song based on Mel frequency cepstral coefficients (MFCC). A classifier based on Random Sample Consensus (RANSAC), capable of handling a wide variety of data, is utilized. Experimental results indicate the effectiveness of the proposed scheme.
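RANSAC, used as the classifier in the abstract above, is easiest to see in its classic line-fitting form: repeatedly fit a candidate model to a random minimal sample and keep the candidate with the most inliers. The sketch below uses toy 2-D points, not the paper's audio features; only the consensus idea carries over.

```python
import random

def ransac_line(points, iters=100, tol=0.5, seed=1):
    # Fit y = a*x + b through two random points per iteration; score each
    # candidate by how many points lie within `tol` of the line.
    rng = random.Random(seed)
    best, best_inliers = None, -1
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical pair: skip this candidate
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = sum(abs(y - (a * x + b)) <= tol for x, y in points)
        if inliers > best_inliers:
            best, best_inliers = (a, b), inliers
    return best, best_inliers

# Six points on y = 2x + 1 plus two gross outliers.
pts = [(float(x), 2.0 * x + 1) for x in range(6)] + [(3.0, 50.0), (1.0, -30.0)]
(slope, intercept), inliers = ransac_line(pts)
```

Because candidates fitted through an outlier attract almost no consensus, the recovered line matches the six clean points and ignores the two outliers, which is the robustness property that motivates using RANSAC on heterogeneous audio data.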
... ZCR (zero crossing rate) [7], [8] and STE (short time energy) [9], [10] are the most widely used time domain features. Frequency domain approaches include features like signal bandwidth, spectral centroid, signal energy [11], [12], [13], [14], fundamental frequency [1], mel-frequency cepstral co-efficients (MFCC) [15], [16] etc. Perceptual/psychoacoustic features include measures for roughness [17], loudness [17], etc. In [18], a model representing the temporal envelope processing by the human auditory system has been proposed which yields 62 features describing the auditory filterbank temporal envelope (AFTE). ...
... Short time energy (STE) [18,6,11] and zero crossing rate (ZCR) [23,5,11] are very common time domain features. Mostly used frequency domain features are signal energy [1,15,24], fundamental frequency [27], mel-frequency cepstral co-efficients (MFCC) [7,10,12]. Perceptual features such as loudness, sharpness and spread incorporate the human hearing process [16,31] to describe the sounds. ...
Conference Paper
Full-text available
Music classification is a fundamental step in any music retrieval system. As a first step, we propose a scheme for discriminating music signals with voice (song) from those without voice (instrumental). The task is important, as song/instrumental discrimination is of immense value in the context of a multi-lingual country like India; moreover, it enables the subsequent classification of instrumentals based on the type of instrument. The spectrogram image of an audio signal shows the significance of different frequency components over the time scale. It has been observed that the spectrogram of an instrumental signal shows more stable peaks persisting over time, while this is not so for a song, which motivated us to look for spectrogram-image-based features. Contextual features are computed based on the occurrence pattern of the most significant frequency over the time scale and the overall texture pattern revealed by the time-frequency distribution of signal intensity. RANSAC is used to classify the signals. Experimental results indicate the effectiveness of the proposed scheme.
... ZCR (zero crossing rate) [2], [3], [4] and STE (short time energy) [5], [6], [4] are the most widely used time domain features. Features like signal bandwidth, spectral centroid, signal energy [7], [8], [9], fundamental frequency [1], melfrequency cepstral co-efficients (MFCC) [10], [11] belong to the category of frequency domain features. Roughness and loudness measures [12] have been tried to capture the perceptual aspect. ...
Article
Full-text available
In a music retrieval system, classification of the music data serves as the fundamental step for organizing the database to support faster access to the desired data. In this context, it is very important to classify the music signal into two sub-categories, namely instrumental and song; a robust system for such classification enables further classification based on instrument type or music genre. In this work, we present a simple but novel scheme using Mel Frequency Cepstral Coefficients (MFCC) as the signal descriptors. A classifier based on Random Sample Consensus (RANSAC), capable of handling a wide variety of data, is used. Experimental results indicate that the proposed scheme works well for a wide variety of music signals.