Figure 4: Block diagram of i-vector extraction

Source publication
Chapter
Full-text available
This paper aims to ameliorate the performance of a text-independent speaker recognition system in a noisy environment and with cross-channel recordings of the utterances. It presents the combination of Gammatone Frequency Cepstral Coefficients (GFCC), to handle the noisy environment, with i-vectors, to handle session variability. Experiments a...

Context in source publication

Context 1
... calculating the Universal Background Model (UBM) [8], first-order Baum-Welch statistics are used. Figure 4 gives a high-level description of i-vector feature extraction. ...
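To make the pipeline of Figure 4 concrete, below is a minimal NumPy sketch, not the authors' code, of i-vector extraction: zeroth- and centered first-order Baum-Welch statistics are accumulated against a diagonal-covariance UBM, and the i-vector is the point estimate w = (I + T'Σ⁻¹N T)⁻¹ T'Σ⁻¹F. The UBM parameters and the total variability matrix T are assumed to be already trained; all names and shapes are illustrative.

```python
import numpy as np

def extract_ivector(frames, ubm_means, ubm_covs, ubm_weights, T):
    """Illustrative i-vector extraction.
    frames: (num_frames, D) acoustic features (e.g. GFCC).
    ubm_means/ubm_covs: (C, D) diagonal-covariance UBM, ubm_weights: (C,).
    T: (C*D, R) total variability matrix."""
    C, D = ubm_means.shape
    R = T.shape[1]

    # Posterior responsibility of each UBM component for each frame
    # (diagonal-Gaussian log-likelihoods plus log mixture weights).
    log_like = -0.5 * (((frames[:, None, :] - ubm_means) ** 2) / ubm_covs
                       + np.log(2 * np.pi * ubm_covs)).sum(axis=2)
    log_post = log_like + np.log(ubm_weights)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Zeroth- and centered first-order Baum-Welch statistics.
    N = post.sum(axis=0)                                   # (C,)
    F = post.T @ frames - N[:, None] * ubm_means           # (C, D)

    # i-vector: w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F
    Sigma_inv = 1.0 / ubm_covs.reshape(-1)                 # (C*D,)
    N_rep = np.repeat(N, D)                                # (C*D,)
    L = np.eye(R) + (T * (Sigma_inv * N_rep)[:, None]).T @ T
    b = T.T @ (Sigma_inv * F.reshape(-1))
    return np.linalg.solve(L, b)                           # (R,) i-vector
```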

Similar publications

Article
Full-text available
Transformer models are now widely used for speech processing tasks due to their powerful sequence modeling capabilities. Previous work determined an efficient way to model speaker embeddings using the Transformer model by combining transformers with convolutional networks. However, traditional global self-attention mechanisms lack the ability to ca...

Citations

... where a is the peak value, n the order of the filter, b the bandwidth, f_c the characteristic frequency and φ the initial phase. f_c and b can be derived from the Equivalent Rectangular Bandwidth (ERB) scale using the following equation [24]: b = 1.019 ERB(f_c), with ERB(f_c) = 24.7 (4.37 f_c / 1000 + 1). For GFCC, the FFT-treated speech signal is multiplied by the Gammatone filter bank and converted back by IFFT; noise is suppressed by decimating it to 100 Hz, and the signal is rectified using a non-linear process. The rectification is carried out by applying a cubic-root operation to the absolute-valued input [24]. ...
... f_c and b can be derived from the Equivalent Rectangular Bandwidth (ERB) scale using the following equation [24]: b = 1.019 ERB(f_c), with ERB(f_c) = 24.7 (4.37 f_c / 1000 + 1). For GFCC, the FFT-treated speech signal is multiplied by the Gammatone filter bank and converted back by IFFT; noise is suppressed by decimating it to 100 Hz, and the signal is rectified using a non-linear process. The rectification is carried out by applying a cubic-root operation to the absolute-valued input [24]. Approximately the first 22 features are called GFCC, and these may be very useful in speaker identification. ...
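For readers of the excerpts above, here is a minimal sketch of the gammatone/GFCC computation they describe, assuming the standard Glasberg-Moore ERB formula and cubic-root rectification; the 100 Hz decimation step is omitted, and function names and parameters are illustrative, not taken from the cited works.

```python
import numpy as np
from scipy.fftpack import dct

def erb_bandwidth(fc):
    """Bandwidth b derived from centre frequency fc (Hz) via the ERB scale."""
    return 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, n=4, a=1.0, phase=0.0, duration=0.025):
    """Impulse response g(t) = a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase)."""
    t = np.arange(int(duration * fs)) / fs
    b = erb_bandwidth(fc)
    return a * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)

def gfcc(frame_power_spectrum, filterbank, num_ceps=22):
    """GFCC per the excerpt: gammatone filter-bank energies, cubic-root
    rectification of the absolute values, then DCT to cepstral coefficients.
    filterbank: (num_filters, num_fft_bins) magnitude responses."""
    energies = filterbank @ frame_power_spectrum          # filter-bank energies
    rectified = np.abs(energies) ** (1.0 / 3.0)           # cubic-root rectification
    return dct(rectified, norm='ortho')[:num_ceps]        # keep ~22 coefficients
```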
Article
Full-text available
Audio is one of the most used ways of human communication, but at the same time it can be easily misused to trick people. With the revolution of AI, the related technologies are now accessible to almost everyone, thus making it simple for criminals to commit crimes and forgeries. In this work, we introduce a neural network method to develop a classifier that will blindly classify an input audio as real or mimicked; the word ‘blindly’ refers to the ability to detect mimicked audio without references or real sources. We propose a deep neural network following a sequential model that comprises three hidden layers, with alternating dense and dropout layers. The proposed model was trained on a set of 26 important features extracted from a large dataset of audios to get a classifier that was tested on the same set of features from different audios. The data was extracted from two raw datasets, especially composed for this work: an all-English dataset and a mixed dataset (Arabic plus English). For the purpose of comparison, the audios were also classified through human inspection with the subjects being the native speakers. The ensuing results were interesting and exhibited formidable accuracy, as we were able to get at least 94% correct classification of the test cases, as against the 85% accuracy in the case of human observers.
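As a rough illustration of the architecture this abstract describes (three hidden layers with alternating dense and dropout layers over 26 input features, ending in a binary real-vs-mimicked decision), a hypothetical tf.keras sketch might look as follows; the layer widths, dropout rates and optimizer are assumptions, not the authors' settings.

```python
import tensorflow as tf

# Hypothetical sequential classifier: 26 audio features in, binary output.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(26,)),
    tf.keras.layers.Dense(128, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer 3
    tf.keras.layers.Dense(1, activation="sigmoid"),  # real (0) vs mimicked (1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```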
... However, early works on speaker verification demonstrated that the MFCC performance decreases in adverse noisy conditions. In this context, numerous studies have been carried out to find alternative features based on the short-term power spectrum, such as GFCC (Shao & Wang, 2008; Jeevan et al., 2017; Krobba et al., 2023). In this study, GFCC are assessed as inputs of speaker embedding encoders in noisy, unconstrained and far-field conditions. ...
Article
Full-text available
Current speaker verification systems achieve impressive results in quiet and controlled environments. However, using these systems in real-life conditions significantly impacts their ability to continue delivering satisfactory performance. In this paper, we present a novel approach that addresses this challenge by optimizing the text-independent speaker verification task in noisy and far-field conditions and when it is subject to spoofing attacks. To perform this optimization, gammatone frequency cepstral coefficients (GFCC) are used as input features of a new factorized time delay neural network (FTDNN) speaker embedding encoder using a time-restricted self-attention mechanism (Att-FTDNN), at the end of the frame level. The Att-FTDNN-based speaker verification system is then integrated into a spoofing-aware configuration to measure the ability of this encoder to prevent false accepts due to spoofing attacks. The in-depth evaluation carried out in noisy and far-field conditions, as well as in the context of spoofing-aware speaker verification, demonstrated the effectiveness of the proposed Att-FTDNN encoder. The results showed that compared to the FDNN- and TDNN-based baseline systems, the proposed Att-FTDNN encoder using GFCC achieves 6.85% relative improvement in terms of minDCF for the VOiCES test set. A noticeable decrease of the equal error rate is also observed when the proposed encoder is integrated within a spoofing-aware speaker verification system tested with the ASVSpoof19 dataset.
... GFCC is particularly beneficial for low-frequency sensitive applications like music and animal sounds. However, it can be more vulnerable to noise in specific scenarios [10,11]. In general, MFCC is more commonly used in speech-related applications due to its accessibility and good performance with traditional machine learning methods. ...
... This simplification allows for the retention of the complex speech signal representation necessary for CNNs, while simultaneously reducing the overall computational burden. Although it is possible to combine multiple feature extraction techniques to leverage their respective strengths, Mel-spectrograms alone are often enough to train accurate models [11]. ...
Article
Full-text available
This paper investigates the implementation of a lightweight Siamese neural network for enhancing speaker identification accuracy and inference speed in embedded systems. Integrating speaker identification into embedded systems can improve portability and versatility. Siamese neural networks achieve speaker identification by comparing input voice samples to reference voices in a database, effectively extracting features and classifying speakers accurately. Considering the trade-off between accuracy and complexity, as well as hardware constraints in embedded systems, various neural networks could be applied to speaker identification. This paper compares the incorporation of CNN architectures targeted for embedded systems, MCUNet, SqueezeNet and MobileNetv2, to implement Siamese neural networks on a Raspberry Pi. Our experiments demonstrate that MCUNet achieves 85% accuracy with a 0.23-second inference time. In comparison, the larger MobileNetv2 attains 84.5% accuracy with a 0.32-second inference time. Additionally, contrastive loss was superior to binary cross-entropy loss in the Siamese neural network. The system using contrastive loss had almost 68% lower loss scores, resulting in a more stable performance and more accurate predictions. In conclusion, this paper establishes that an appropriate lightweight Siamese neural network, combined with contrastive loss, can significantly improve speaker identification accuracy, and enable efficient deployment on resource-constrained platforms.
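The contrastive loss mentioned in the abstract could, under the usual Hadsell-style formulation for Siamese networks, be sketched as below; this is an illustrative NumPy version, not the authors' implementation, and the margin value is an assumption.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Contrastive loss for a Siamese speaker-ID setup: pulls embeddings of the
    same speaker together and pushes different speakers at least `margin` apart.
    emb_a, emb_b: (batch, dim) embeddings; same_speaker: (batch,) in {0, 1}."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)                  # Euclidean distance
    pos = same_speaker * d ** 2                                # same-speaker term
    neg = (1 - same_speaker) * np.maximum(margin - d, 0) ** 2  # different-speaker term
    return np.mean(pos + neg)
```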
... The research work in [12] used a combination of GFCC and MFCC features to develop speaker identification using Deep Neural Network (DNN) classifiers in noisy conditions; the results show that the fusion of both features performs better than the individual features. The robustness of the GFCC feature to additive noise and white Gaussian noise is also evaluated in speaker recognition using i-vectors in the study [13], and the results show that ...
Article
Full-text available
The performance of speaker recognition systems is very good on datasets without noise and mismatch. However, the performance degrades with environmental noise, channel variation, and physical and behavioral changes in the speaker. The type of speaker-related feature plays a crucial role in improving the performance of speaker recognition systems. Gammatone Frequency Cepstral Coefficient (GFCC) features have been widely used to develop robust speaker recognition systems with conventional machine learning, achieving better performance than Mel Frequency Cepstral Coefficient (MFCC) features in noisy conditions. Recently, deep learning models have shown better performance in speaker recognition than conventional machine learning. Most previous deep learning-based speaker recognition models have used the Mel Spectrogram and similar inputs rather than handcrafted features like MFCC and GFCC. However, the performance of Mel Spectrogram features degrades under high noise and mismatch in the utterances. Similar to the Mel Spectrogram, the Cochleogram is another important feature for deep learning speaker recognition models. Like GFCC features, the Cochleogram represents utterances on the Equivalent Rectangular Bandwidth (ERB) scale, which is important in noisy conditions. However, no studies have analyzed the noise robustness of the Cochleogram and the Mel Spectrogram in speaker recognition. In addition, only limited studies have used the Cochleogram to develop speech-based models in noisy and mismatched conditions using deep learning. In this study, an analysis of the noise robustness of Cochleogram and Mel Spectrogram features in speaker recognition using deep learning models is conducted at Signal to Noise Ratio (SNR) levels from 5 dB to 20 dB. Experiments are conducted on the VoxCeleb1 and noise-added VoxCeleb1 datasets using basic 2DCNN, ResNet-50, VGG-16, ECAPA-TDNN and TitaNet model architectures. The speaker identification and verification performance of both the Cochleogram and the Mel Spectrogram is evaluated. The results show that the Cochleogram performs better than the Mel Spectrogram in both speaker identification and verification under noisy and mismatched conditions.
... A twenty-five-millisecond Hamming window is used to split the input signal into temporal segments short enough that the properties of the signal do not have time to change within each segment [21]. The signal is passed through the Fast Fourier Transform to obtain a short-term power spectrum [34]. The resultant signal is passed through a second-order filter and a normalized gammachirp auditory filter bank. ...
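A minimal sketch of the front end this excerpt describes (25 ms Hamming-windowed frames followed by an FFT power spectrum, before the filter-bank stage) might look as follows; the hop length and FFT size are assumptions.

```python
import numpy as np

def short_term_power_spectrum(signal, fs, frame_ms=25, hop_ms=10, n_fft=512):
    """Split the signal into 25 ms Hamming-windowed frames and return the
    per-frame FFT power spectrum; n_fft should be >= the frame length."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spectra = np.abs(np.fft.rfft(np.asarray(frames), n=n_fft, axis=1)) ** 2
    return spectra   # shape: (num_frames, n_fft // 2 + 1)
```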
... Therefore, this method is widely used in several studies, including speaker recognition [6], speaker verification [7], dialect recognition [8], emotion recognition [9], as well as age and gender recognition [10]. In addition to MFCC, several other similar feature extraction methods have begun to be studied, including the IMFCC [11], LFCC [12], GFCC [13], and BFCC [14]. The main difference among these extraction methods lies in the filter bank used: the mel, inverse-mel, gammatone, bark, or linear filter bank. ...
... A cochlear filtering model known as the gammatone filter is used to observe the psychophysics of sound. The non-linear rectification process in the GFCC can reduce the noise in the MFCC method so that the features improve [13]. The gammatone function is shown by equation (6). ...
... GFCC and BFCC feature extraction process [13][14] ...
Conference Paper
Speech recognition is a technology application that communicates with machines by identifying the speaker's words. The main processes in speech recognition are pre-processing, feature extraction and classification. The feature extraction method is one of the most studied topics in speech recognition research because it plays an important role in obtaining optimal features. MFCC has become a popular feature extraction method widely applied to speech recognition systems. In addition, the IMFCC, GFCC, BFCC, and LFCC methods have also begun to be widely applied. Each method can model some information, and combining them is considered able to represent more information. The combined feature extraction system generates more feature data, so optimization is performed to reduce the number of feature coefficients. This paper discusses testing machine learning models on each feature extraction method and on combinations of feature extraction methods to determine the best level of accuracy. Furthermore, this paper also discusses the effect of applying the PCA method to combinations of feature extraction methods to reduce the number of features and find the best number of features. A support vector machine (SVM) is proposed for the classification process in this research. The dataset consisted of 800 voice samples in the form of words spoken in Indonesian, namely the numbers zero to nine (0–9). The results showed that combining feature extraction methods resulted in better accuracy than the individual methods. The combined GFCC+LFCC method produces the best accuracy of 99.38%, while among the individual methods the best accuracy of 99.38% is obtained using the GFCC method. Applying the PCA method to GFCC+LFCC can reduce the feature dimensions to 16 coefficients while maintaining the highest level of accuracy. This paper is expected to serve as a reference for further speech recognition research.
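As an illustration of the combined-feature pipeline described above (GFCC+LFCC fusion, PCA reduction to 16 coefficients, SVM classification), a hypothetical scikit-learn sketch is shown below; the variable names, kernel and scaling step are assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_combined_classifier(X_gfcc, X_lfcc, y):
    """X_gfcc, X_lfcc: per-utterance feature matrices of shape
    (num_utterances, num_coeffs); y: digit labels 0-9.
    Assumes the combined feature dimension is at least 16 for PCA."""
    X = np.hstack([X_gfcc, X_lfcc])                       # GFCC+LFCC fusion
    clf = make_pipeline(StandardScaler(),                 # scale features
                        PCA(n_components=16),             # reduce to 16 coefficients
                        SVC(kernel="rbf"))                # SVM classifier
    return clf.fit(X, y)
```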
... The time-domain features include Short-Time Energy (STEN), Pitch Frequency (PFCY), Formant Frequency (FFCY) and Average Speech Speed (AVSS) [7]. The cepstrum features include MFCC, the Gammatone Frequency Cepstrum Coefficient (GFCC) [8], the Bark Frequency Cepstrum Coefficient (BFCC) [9], the Normalized Gammachirp Cepstrum Coefficient (NGCC) [10], the Magnitude-based Spectral Root Cepstral Coefficient (MSRCC), the Phase-based Spectral Root Cepstral Coefficient (PSRCC) [11] and the Linear Frequency Cepstrum Coefficient (LFCC) [12]. ...
Article
Full-text available
In the field of Human-Computer Interaction (HCI), speech emotion recognition technology plays an important role. Faced with a small amount of speech emotion data, a novel speech emotion recognition method based on feature construction and ensemble learning is proposed in this paper. Firstly, acoustic features are extracted from the speech signal and combined to form different original feature sets. Secondly, based on the Light Gradient Boosting Machine (LightGBM) and the Sequential Forward Selection (SFS) method, a novel feature selection method named L-SFS is proposed. Then, a softmax regression model is used to automatically learn the weights of four single weak learners: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Extreme Gradient Boosting (XGBoost) and LightGBM. Lastly, based on the automatically learned weights and a weighted average probability voting strategy, an ensemble classification model named Sklex is constructed, which integrates the above four single weak learners. In conclusion, the method demonstrates the effectiveness of feature construction and the superiority and stability of ensemble learning, and achieves good speech emotion recognition accuracy.
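The weighted average-probability voting described in the abstract could be sketched as follows; this assumes the learner weights have already been obtained (e.g., by the paper's softmax regression step), and the function name is illustrative.

```python
import numpy as np

def weighted_soft_vote(probas, weights):
    """Weighted average-probability voting over base learners.
    probas: list of (num_samples, num_classes) predict_proba outputs
    (e.g. from SVM, KNN, XGBoost, LightGBM); weights: one weight per learner."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                     # normalise the weights
    avg = sum(w * p for w, p in zip(weights, probas))     # weighted average probability
    return np.argmax(avg, axis=1)                         # predicted emotion class
```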
... pyAudioProcessing aims to provide an end-to-end processing solution for converting between audio file formats, visualizing time and frequency domain representations, cleaning audio by removing silence and low-activity segments, building features from raw audio samples, and training a machine learning model that can then be used to classify unseen raw audio samples (e.g., into categories such as music, speech, etc.). This library allows the user to extract features such as Mel Frequency Cepstral Coefficients (MFCC) [CD14], Gammatone Frequency Cepstral Coefficients (GFCC) [JDHP17], spectral features, chroma features and other beat-based and cepstrum-based features from audio to use with one's own classification backend or with the scikit-learn classifiers that have been built into pyAudioProcessing. The classifier implementation examples that are part of this software aim to give users a sample solution to audio classification problems and help build the foundation to tackle new and unseen problems. ...
... Furthermore, such countermeasures often rely on standard time-frequency techniques constrained by assumptions such as stationarity or linearity of the underlying speech signal. The speech community has proposed multiple variations of these classical methods to overcome the aforementioned issues (see for example [16], [17], [18], [19], [20], [21]), thus dealing with different aspects faced by ASV systems in discriminating spoofed and real voices. The traditional practice foresees the extraction or engineering of features from the raw speech data and then conducts the classification task by stacking them within a vector. ...
Article
Full-text available
Statistical analysis of speech is an emerging area of machine learning. In this paper, we tackle the biometric challenge of Automatic Speaker Verification (ASV) of differentiating between samples generated by two distinct populations of utterances, those of an authentic human voice and those generated by a synthetic one. Solving such an issue through a statistical perspective foresees the definition of a decision rule function and a learning procedure to identify the optimal classifier. Classical state-of-the-art countermeasures rely on strong assumptions such as stationarity or local-stationarity of speech that may be atypical to encounter in practice. We explore in this regard a robust non-linear and non-stationary signal decomposition method known as the Empirical Mode Decomposition combined with the Mel-Frequency Cepstral Coefficients in a novel fashion with a refined classifier technique known as multi-kernel Support Vector machine. We undertake significant real data case studies covering multiple ASV systems using different datasets, including the ASVSpoof 2019 challenge database. The obtained results overwhelmingly demonstrate the significance of our feature extraction and classifier approach versus existing conventional methods in reducing the threat of cyber-attack perpetrated by synthetic voice replication seeking unauthorised access.
... The LP signal analysis of this work uses a Gaussian mixture autoregressive model to compress the spectrum parameters. Besides this, it is shown in [11] that, even at low SNRs of environmental noise, the Gammatone filter bank and cubic-root rectification provide more robustness to the features than the Mel filter bank and log nonlinearity. ...
Article
Full-text available
In this paper, we present a Mixture Linear Prediction based approach for robust Gammatone Cepstral Coefficient extraction (MLPGCCs). The proposed method provides a performance improvement for Automatic Speaker Verification (ASV) using i-vector and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA) modeling under transmission channel noise. The performance of the extracted MLPGCCs was evaluated using the NIST 2008 database, where a single-channel microphone recorded conversational speech. The system is analyzed in the presence of different channel transmission noises, such as Additive White Gaussian Noise (AWGN) and Rayleigh fading, at various Signal to Noise Ratio (SNR) levels. The evaluation results show that the MLPGCC features are a promising way to address the ASV task. Indeed, speaker verification performance using the proposed MLPGCC features is significantly improved compared to the conventional Gammatone Frequency Cepstral Coefficient (GFCC) and Mel Frequency Cepstral Coefficient (MFCC) features. For speech signals corrupted with AWGN noise at SNRs ranging from -5 dB to 15 dB, we obtain a significant reduction of the Equal Error Rate (EER), ranging from 9.41% to 6.65% and from 3.72% to 1.50% compared with conventional MFCC and GFCC features, respectively. In addition, when the test speech signals are corrupted with a Rayleigh fading channel, we achieve an EER reduction ranging from 23.63% to 7.8% and from 10.88% to 6.8% compared with conventional MFCC and GFCC features, respectively. We also found that the combination of GFCC and MLPGCC features gives the highest speaker verification system performance. The best combination achieves an EER of around 0.43% to 0.59% and 1.92% to 3.88%.