Figure 4: Block diagram of i-vector extraction

Source publication
Chapter
Full-text available
This paper aims to ameliorate the performance of a text-independent speaker recognition system in a noisy environment and with cross-channel recordings of the utterances. It presents the combination of Gammatone Frequency Cepstral Coefficients (GFCC), to handle the noisy environment, with i-vectors, to handle session variability. Experiments a...

Context in source publication

Context 1
... calculating the Universal Background Model (UBM) [8], first-order Baum-Welch statistics are used. Figure 4 gives a high-level description of i-vector feature extraction. ...
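To make the pipeline of Figure 4 concrete, below is a minimal NumPy sketch, not the authors' code, of i-vector extraction: zeroth- and centered first-order Baum-Welch statistics are accumulated against a diagonal-covariance UBM, and the i-vector is the point estimate w = (I + T'Σ⁻¹N T)⁻¹ T'Σ⁻¹F. The UBM parameters and the total variability matrix T are assumed to be already trained; all names and shapes are illustrative.

```python
import numpy as np

def extract_ivector(frames, ubm_means, ubm_covs, ubm_weights, T):
    """Illustrative i-vector extraction.
    frames: (num_frames, D) acoustic features (e.g. GFCC).
    ubm_means/ubm_covs: (C, D) diagonal-covariance UBM, ubm_weights: (C,).
    T: (C*D, R) total variability matrix."""
    C, D = ubm_means.shape
    R = T.shape[1]

    # Posterior responsibility of each UBM component for each frame
    # (diagonal-Gaussian log-likelihoods plus log mixture weights).
    log_like = -0.5 * (((frames[:, None, :] - ubm_means) ** 2) / ubm_covs
                       + np.log(2 * np.pi * ubm_covs)).sum(axis=2)
    log_post = log_like + np.log(ubm_weights)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Zeroth- and centered first-order Baum-Welch statistics.
    N = post.sum(axis=0)                                   # (C,)
    F = post.T @ frames - N[:, None] * ubm_means           # (C, D)

    # i-vector: w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F
    Sigma_inv = 1.0 / ubm_covs.reshape(-1)                 # (C*D,)
    N_rep = np.repeat(N, D)                                # (C*D,)
    L = np.eye(R) + (T * (Sigma_inv * N_rep)[:, None]).T @ T
    b = T.T @ (Sigma_inv * F.reshape(-1))
    return np.linalg.solve(L, b)                           # (R,) i-vector
```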

Similar publications

Article
Full-text available
Transformer models are now widely used for speech processing tasks due to their powerful sequence modeling capabilities. Previous work determined an efficient way to model speaker embeddings using the Transformer model by combining transformers with convolutional networks. However, traditional global self-attention mechanisms lack the ability to ca...

Citations

... where a is the peak value, n the order of the filter, b the bandwidth, f_c the characteristic frequency and φ the initial phase. f_c and b can be derived from the Equivalent Rectangular Bandwidth (ERB) scale using the following equation [24]: b = 1.019 ERB(f_c), with ERB(f_c) = 24.7 (4.37 f_c / 1000 + 1). For GFCC, the FFT-treated speech signal is multiplied by the Gammatone filter bank and converted back by IFFT; noise is suppressed by decimating it to 100 Hz, and the signal is rectified using a non-linear process. The rectification is carried out by applying a cubic-root operation to the absolute-valued input [24]. ...
... f_c and b can be derived from the Equivalent Rectangular Bandwidth (ERB) scale using the following equation [24]: b = 1.019 ERB(f_c), with ERB(f_c) = 24.7 (4.37 f_c / 1000 + 1). For GFCC, the FFT-treated speech signal is multiplied by the Gammatone filter bank and converted back by IFFT; noise is suppressed by decimating it to 100 Hz, and the signal is rectified using a non-linear process. The rectification is carried out by applying a cubic-root operation to the absolute-valued input [24]. Approximately the first 22 features are called GFCC, and these may be very useful in speaker identification. ...
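For readers of the excerpts above, here is a minimal sketch of the gammatone/GFCC computation they describe, assuming the standard Glasberg-Moore ERB formula and cubic-root rectification; the 100 Hz decimation step is omitted, and function names and parameters are illustrative, not taken from the cited works.

```python
import numpy as np
from scipy.fftpack import dct

def erb_bandwidth(fc):
    """Bandwidth b derived from centre frequency fc (Hz) via the ERB scale."""
    return 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, n=4, a=1.0, phase=0.0, duration=0.025):
    """Impulse response g(t) = a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase)."""
    t = np.arange(int(duration * fs)) / fs
    b = erb_bandwidth(fc)
    return a * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)

def gfcc(frame_power_spectrum, filterbank, num_ceps=22):
    """GFCC per the excerpt: gammatone filter-bank energies, cubic-root
    rectification of the absolute values, then DCT to cepstral coefficients.
    filterbank: (num_filters, num_fft_bins) magnitude responses."""
    energies = filterbank @ frame_power_spectrum          # filter-bank energies
    rectified = np.abs(energies) ** (1.0 / 3.0)           # cubic-root rectification
    return dct(rectified, norm='ortho')[:num_ceps]        # keep ~22 coefficients
```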
Article
Full-text available
Audio is one of the most used ways of human communication, but at the same time it can be easily misused to trick people. With the revolution of AI, the related technologies are now accessible to almost everyone, thus making it simple for criminals to commit crimes and forgeries. In this work, we introduce a neural network method to develop a classifier that will blindly classify an input audio as real or mimicked; the word ‘blindly’ refers to the ability to detect mimicked audio without references or real sources. We propose a deep neural network following a sequential model that comprises three hidden layers, with alternating dense and dropout layers. The proposed model was trained on a set of 26 important features extracted from a large dataset of audios to get a classifier that was tested on the same set of features from different audios. The data was extracted from two raw datasets, especially composed for this work: an all-English dataset and a mixed dataset (Arabic plus English). For the purpose of comparison, the audios were also classified through human inspection with the subjects being the native speakers. The ensuing results were interesting and exhibited formidable accuracy, as we were able to get at least 94% correct classification of the test cases, as against the 85% accuracy in the case of human observers.
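As a rough illustration of the architecture this abstract describes (three hidden layers with alternating dense and dropout layers over 26 input features, ending in a binary real-vs-mimicked decision), a hypothetical tf.keras sketch might look as follows; the layer widths, dropout rates and optimizer are assumptions, not the authors' settings.

```python
import tensorflow as tf

# Hypothetical sequential classifier: 26 audio features in, binary output.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(26,)),
    tf.keras.layers.Dense(128, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer 3
    tf.keras.layers.Dense(1, activation="sigmoid"),  # real (0) vs mimicked (1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```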
... However, early works on speaker verification demonstrated that the MFCC performance decreases in adverse noisy conditions. In this context, numerous studies have been carried out to find alternative features based on the short-term power spectrum, such as GFCC (Shao & Wang, 2008; Jeevan et al., 2017; Krobba et al., 2023). In this study, GFCC are assessed as inputs of speaker embedding encoders in noisy, unconstrained and far-field conditions. ...
Article
Full-text available
Current speaker verification systems achieve impressive results in quiet and controlled environments. However, using these systems in real-life conditions significantly impacts their ability to continue delivering satisfactory performance. In this paper, we present a novel approach that addresses this challenge by optimizing the text-independent speaker verification task in noisy and far-field conditions and when it is subject to spoofing attacks. To perform this optimization, gammatone frequency cepstral coefficients (GFCC) are used as input features of a new factorized time delay neural network (FTDNN) speaker embedding encoder using a time-restricted self-attention mechanism (Att-FTDNN), at the end of the frame level. The Att-FTDNN-based speaker verification system is then integrated into a spoofing-aware configuration to measure the ability of this encoder to prevent false accepts due to spoofing attacks. The in-depth evaluation carried out in noisy and far-field conditions, as well as in the context of spoofing-aware speaker verification, demonstrated the effectiveness of the proposed Att-FTDNN encoder. The results showed that compared to the FDNN- and TDNN-based baseline systems, the proposed Att-FTDNN encoder using GFCC achieves 6.85% relative improvement in terms of minDCF for the VOiCES test set. A noticeable decrease of the equal error rate is also observed when the proposed encoder is integrated within a spoofing-aware speaker verification system tested with the ASVSpoof19 dataset.
... GFCC is particularly beneficial for low-frequency sensitive applications like music and animal sounds. However, it can be more vulnerable to noise in specific scenarios [10,11]. In general, MFCC is more commonly used in speech-related applications due to its accessibility and good performance with traditional machine learning methods. ...
... This simplification allows for the retention of the complex speech signal representation necessary for CNNs, while simultaneously reducing the overall computational burden. Although it is possible to combine multiple feature extraction techniques to leverage their respective strengths, Mel-spectrograms alone are often enough to train accurate models [11]. ...
Article
Full-text available
This paper investigates the implementation of a lightweight Siamese neural network for enhancing speaker identification accuracy and inference speed in embedded systems. Integrating speaker identification into embedded systems can improve portability and versatility. Siamese neural networks achieve speaker identification by comparing input voice samples to reference voices in a database, effectively extracting features and classifying speakers accurately. Considering the trade-off between accuracy and complexity, as well as hardware constraints in embedded systems, various neural networks could be applied to speaker identification. This paper compares the incorporation of CNN architectures targeted for embedded systems, MCUNet, SqueezeNet and MobileNetv2, to implement Siamese neural networks on a Raspberry Pi. Our experiments demonstrate that MCUNet achieves 85% accuracy with a 0.23-second inference time. In comparison, the larger MobileNetv2 attains 84.5% accuracy with a 0.32-second inference time. Additionally, contrastive loss was superior to binary cross-entropy loss in the Siamese neural network. The system using contrastive loss had almost 68% lower loss scores, resulting in a more stable performance and more accurate predictions. In conclusion, this paper establishes that an appropriate lightweight Siamese neural network, combined with contrastive loss, can significantly improve speaker identification accuracy, and enable efficient deployment on resource-constrained platforms.
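The contrastive loss mentioned in the abstract could, under the usual Hadsell-style formulation for Siamese networks, be sketched as below; this is an illustrative NumPy version, not the authors' implementation, and the margin value is an assumption.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Contrastive loss for a Siamese speaker-ID setup: pulls embeddings of the
    same speaker together and pushes different speakers at least `margin` apart.
    emb_a, emb_b: (batch, dim) embeddings; same_speaker: (batch,) in {0, 1}."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)                  # Euclidean distance
    pos = same_speaker * d ** 2                                # same-speaker term
    neg = (1 - same_speaker) * np.maximum(margin - d, 0) ** 2  # different-speaker term
    return np.mean(pos + neg)
```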
... The research work in [12] used a combination of GFCC and MFCC features to develop speaker identification using Deep Neural Network (DNN) classifiers in noisy conditions; the results show that the fusion of both features performs better than the individual features. The robustness of the GFCC feature to additive noise and white Gaussian noise is also evaluated in speaker recognition using i-vectors in the study [13], and the results show that ...
Article
Full-text available
The performance of speaker recognition systems is very good on datasets without noise and mismatch. However, the performance degrades with environmental noise, channel variation, and physical and behavioral changes in the speaker. The type of speaker-related feature plays a crucial role in improving the performance of speaker recognition systems. Gammatone Frequency Cepstral Coefficient (GFCC) features have been widely used to develop robust speaker recognition systems with conventional machine learning, achieving better performance than Mel Frequency Cepstral Coefficient (MFCC) features in noisy conditions. Recently, deep learning models have shown better performance in speaker recognition than conventional machine learning. Most previous deep learning-based speaker recognition models have used the Mel Spectrogram and similar inputs rather than handcrafted features like MFCC and GFCC. However, the performance of Mel Spectrogram features degrades under high noise and mismatch in the utterances. Similar to the Mel Spectrogram, the Cochleogram is another important feature for deep learning speaker recognition models. Like GFCC features, the Cochleogram represents utterances on the Equivalent Rectangular Bandwidth (ERB) scale, which is important in noisy conditions. However, no studies have analyzed the noise robustness of the Cochleogram and the Mel Spectrogram in speaker recognition. In addition, only limited studies have used the Cochleogram to develop speech-based models in noisy and mismatched conditions using deep learning. In this study, an analysis of the noise robustness of Cochleogram and Mel Spectrogram features in speaker recognition using deep learning models is conducted at Signal to Noise Ratio (SNR) levels from 5 dB to 20 dB. Experiments are conducted on the VoxCeleb1 and noise-added VoxCeleb1 datasets using basic 2DCNN, ResNet-50, VGG-16, ECAPA-TDNN and TitaNet model architectures. The speaker identification and verification performance of both the Cochleogram and the Mel Spectrogram is evaluated. The results show that the Cochleogram performs better than the Mel Spectrogram in both speaker identification and verification under noisy and mismatched conditions.
... A twenty-five-millisecond Hamming window is used to split the input signal into temporal segments short enough that the properties of the signal do not have time to change within each segment [21]. The signal is passed through the Fast Fourier Transform to obtain a short-term power spectrum [34]. The resultant signal is passed through a second-order filter and a normalized gammachirp auditory filter bank. ...
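A minimal sketch of the front end this excerpt describes (25 ms Hamming-windowed frames followed by an FFT power spectrum, before the filter-bank stage) might look as follows; the hop length and FFT size are assumptions.

```python
import numpy as np

def short_term_power_spectrum(signal, fs, frame_ms=25, hop_ms=10, n_fft=512):
    """Split the signal into 25 ms Hamming-windowed frames and return the
    per-frame FFT power spectrum; n_fft should be >= the frame length."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spectra = np.abs(np.fft.rfft(np.asarray(frames), n=n_fft, axis=1)) ** 2
    return spectra   # shape: (num_frames, n_fft // 2 + 1)
```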
... Therefore, this method is widely used in several studies, including speaker recognition [6], speaker verification [7], dialect recognition [8], emotion recognition [9], as well as age and gender recognition [10]. In addition to MFCC, several other similar feature extraction methods have begun to be studied, including the IMFCC [11], LFCC [12], GFCC [13], and BFCC [14]. The main difference among these extraction methods lies in the filter bank used: the mel, inverse-mel, gammatone, bark, or linear filter bank. ...
... A cochlear filtering model known as the gammatone filter is used to observe the psychophysics of sound. The non-linear rectification process in the GFCC can reduce the noise in the MFCC method so that the features improve [13]. The gammatone function is shown by equation (6). ...
... GFCC and BFCC feature extraction process [13][14] ...
Conference Paper
Speech recognition is a technology application that communicates with machines by identifying the speaker's words. The main processes in speech recognition are pre-processing, feature extraction and classification. The feature extraction method is one of the most studied topics in speech recognition research because it plays an important role in obtaining optimal features. MFCC has become a popular feature extraction method widely applied to speech recognition systems. In addition, the IMFCC, GFCC, BFCC, and LFCC methods have also begun to be widely applied. Each method can model some information, and combining them is considered able to represent more information. The combined feature extraction system generates more feature data, so optimization is performed to reduce the number of feature coefficients. This paper discusses testing machine learning models on each feature extraction method and on combinations of feature extraction methods to determine the best level of accuracy. Furthermore, this paper also discusses the effect of applying the PCA method to combinations of feature extraction methods to reduce the number of features and find the best number of features. A support vector machine (SVM) is proposed for the classification process in this research. The dataset consisted of 800 voice samples in the form of words spoken in Indonesian, namely the numbers zero to nine (0–9). The results showed that combining feature extraction methods resulted in better accuracy than the individual methods. The combined GFCC+LFCC method produces the best accuracy of 99.38%, while among the individual methods the best accuracy of 99.38% is obtained using the GFCC method. Applying the PCA method to GFCC+LFCC can reduce the feature dimensions to 16 coefficients while maintaining the highest level of accuracy. This paper is expected to serve as a reference for further speech recognition research.
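As an illustration of the combined-feature pipeline described above (GFCC+LFCC fusion, PCA reduction to 16 coefficients, SVM classification), a hypothetical scikit-learn sketch is shown below; the variable names, kernel and scaling step are assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_combined_classifier(X_gfcc, X_lfcc, y):
    """X_gfcc, X_lfcc: per-utterance feature matrices of shape
    (num_utterances, num_coeffs); y: digit labels 0-9.
    Assumes the combined feature dimension is at least 16 for PCA."""
    X = np.hstack([X_gfcc, X_lfcc])                       # GFCC+LFCC fusion
    clf = make_pipeline(StandardScaler(),                 # scale features
                        PCA(n_components=16),             # reduce to 16 coefficients
                        SVC(kernel="rbf"))                # SVM classifier
    return clf.fit(X, y)
```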
... The time-domain features include Short-Time Energy (STEN), Pitch Frequency (PFCY), Formant Frequency (FFCY) and Average Speech Speed (AVSS) [7]. The cepstrum features include MFCC, the Gammatone Frequency Cepstrum Coefficient (GFCC) [8], the Bark Frequency Cepstrum Coefficient (BFCC) [9], the Normalized Gammachirp Cepstrum Coefficient (NGCC) [10], the Magnitude-based Spectral Root Cepstral Coefficient (MSRCC), the Phase-based Spectral Root Cepstral Coefficient (PSRCC) [11] and the Linear Frequency Cepstrum Coefficient (LFCC) [12]. ...
Article
Full-text available
In the field of Human-Computer Interaction (HCI), speech emotion recognition technology plays an important role. Faced with a small amount of speech emotion data, a novel speech emotion recognition method based on feature construction and ensemble learning is proposed in this paper. Firstly, acoustic features are extracted from the speech signal and combined to form different original feature sets. Secondly, based on the Light Gradient Boosting Machine (LightGBM) and the Sequential Forward Selection (SFS) method, a novel feature selection method named L-SFS is proposed. Then, a softmax regression model is used to automatically learn the weights of four single weak learners: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Extreme Gradient Boosting (XGBoost) and LightGBM. Lastly, based on the automatically learned weights and a weighted average probability voting strategy, an ensemble classification model named Sklex is constructed, which integrates the above four single weak learners. In conclusion, the method demonstrates the effectiveness of feature construction and the superiority and stability of ensemble learning, and achieves good speech emotion recognition accuracy.
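The weighted average-probability voting described in the abstract could be sketched as follows; this assumes the learner weights have already been obtained (e.g., by the paper's softmax regression step), and the function name is illustrative.

```python
import numpy as np

def weighted_soft_vote(probas, weights):
    """Weighted average-probability voting over base learners.
    probas: list of (num_samples, num_classes) predict_proba outputs
    (e.g. from SVM, KNN, XGBoost, LightGBM); weights: one weight per learner."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                     # normalise the weights
    avg = sum(w * p for w, p in zip(weights, probas))     # weighted average probability
    return np.argmax(avg, axis=1)                         # predicted emotion class
```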
... pyAudioProcessing aims to provide an end-to-end processing solution for converting between audio file formats, visualizing time and frequency domain representations, cleaning audio by removing silence and low-activity segments, building features from raw audio samples, and training a machine learning model that can then be used to classify unseen raw audio samples (e.g., into categories such as music, speech, etc.). This library allows the user to extract features such as Mel Frequency Cepstral Coefficients (MFCC) [CD14], Gammatone Frequency Cepstral Coefficients (GFCC) [JDHP17], spectral features, chroma features and other beat-based and cepstrum-based features from audio to use with one's own classification backend or with the scikit-learn classifiers that have been built into pyAudioProcessing. The classifier implementation examples that are part of this software aim to give users a sample solution to audio classification problems and help build the foundation to tackle new and unseen problems. ...
... Furthermore, such countermeasures often rely on standard time-frequency techniques constrained by assumptions such as stationarity or linearity of the underlying speech signal. The speech community has proposed multiple variations of these classical methods to overcome the aforementioned issues (see for example [16], [17], [18], [19], [20], [21]), thus dealing with different aspects faced by ASV systems in discriminating spoofed and real voices. The traditional practice foresees the extraction or engineering of features from the raw speech data and then conducts the classification task by stacking them within a vector. ...
Article
Full-text available
Statistical analysis of speech is an emerging area of machine learning. In this paper, we tackle the biometric challenge of Automatic Speaker Verification (ASV) of differentiating between samples generated by two distinct populations of utterances, those of an authentic human voice and those generated by a synthetic one. Solving such an issue through a statistical perspective foresees the definition of a decision rule function and a learning procedure to identify the optimal classifier. Classical state-of-the-art countermeasures rely on strong assumptions such as stationarity or local-stationarity of speech that may be atypical to encounter in practice. We explore in this regard a robust non-linear and non-stationary signal decomposition method known as the Empirical Mode Decomposition combined with the Mel-Frequency Cepstral Coefficients in a novel fashion with a refined classifier technique known as multi-kernel Support Vector machine. We undertake significant real data case studies covering multiple ASV systems using different datasets, including the ASVSpoof 2019 challenge database. The obtained results overwhelmingly demonstrate the significance of our feature extraction and classifier approach versus existing conventional methods in reducing the threat of cyber-attack perpetrated by synthetic voice replication seeking unauthorised access.
... The LP signal analysis of this work uses a Gaussian mixture autoregressive model to compress the spectrum parameters. Besides this, it is shown in [11] that, even at low SNRs of environmental noise, the Gammatone filter bank and cubic-root rectification provide more robustness to the features than the Mel filter bank and log nonlinearity. ...
Article
Full-text available
In this paper, we present a Mixture Linear Prediction based approach for robust Gammatone Cepstral Coefficient extraction (MLPGCCs). The proposed method provides a performance improvement for Automatic Speaker Verification (ASV) using i-vector and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA) modeling under transmission channel noise. The performance of the extracted MLPGCCs was evaluated using the NIST 2008 database, where a single-channel microphone recorded conversational speech. The system is analyzed in the presence of different channel transmission noises, such as Additive White Gaussian Noise (AWGN) and Rayleigh fading, at various Signal to Noise Ratio (SNR) levels. The evaluation results show that the MLPGCC features are a promising way to address the ASV task. Indeed, speaker verification performance using the proposed MLPGCC features is significantly improved compared to the conventional Gammatone Frequency Cepstral Coefficient (GFCC) and Mel Frequency Cepstral Coefficient (MFCC) features. For speech signals corrupted with AWGN noise at SNRs ranging from -5 dB to 15 dB, we obtain a significant reduction of the Equal Error Rate (EER), ranging from 9.41% to 6.65% and from 3.72% to 1.50% compared with conventional MFCC and GFCC features, respectively. In addition, when the test speech signals are corrupted with a Rayleigh fading channel, we achieve an EER reduction ranging from 23.63% to 7.8% and from 10.88% to 6.8% compared with conventional MFCC and GFCC features, respectively. We also found that the combination of GFCC and MLPGCC features gives the highest speaker verification system performance. The best combination achieves an EER of around 0.43% to 0.59% and 1.92% to 3.88%.