Conference Paper · PDF Available

Robust speech recognition in noise: an evaluation using the SPINE corpus

Authors:
  • J. H. L. Hansen, R. Sarikaya, U. Yapanel, B. Pellom

... This includes features extracted from low-dimensional, short-term representations of the speech signal, such as Mel Frequency Cepstrum Coefficients (MFCCs) [16]. The performance of these systems degrades in real-world conditions [17], [18]. These systems also depend on the human ability to design useful features, which is a limitation of the approach. ...
Preprint
Full-text available
In recent years, an association has been established between the faces and voices of celebrities by leveraging large-scale audio-visual information from YouTube. The availability of large-scale audio-visual datasets is instrumental in developing speaker recognition methods based on standard Convolutional Neural Networks. Thus, the aim of this paper is to leverage large-scale audio-visual information to improve the speaker recognition task. To achieve this, we propose a two-branch network to learn joint representations of faces and voices in a multimodal system. Features are then extracted from the two-branch network to train a classifier for speaker recognition. We evaluated our proposed framework on a large-scale audio-visual dataset named VoxCeleb1. Our results show that the addition of facial information improves the performance of speaker recognition. Moreover, our results indicate that there is an overlap between face and voice information.
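The two-branch design described in this abstract can be sketched in a few lines; the layer sizes, input feature dimensions, and the use of simple linear layers below are illustrative assumptions, not the architecture from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedder(nn.Module):
    """Toy two-branch network: one branch embeds a face descriptor, the
    other a voice descriptor, into a shared space where representations
    of the same identity can be compared or fed to a classifier."""

    def __init__(self, face_dim=512, voice_dim=192, joint_dim=128):
        super().__init__()
        self.face_branch = nn.Sequential(
            nn.Linear(face_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, joint_dim))
        self.voice_branch = nn.Sequential(
            nn.Linear(voice_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, joint_dim))

    def forward(self, face_feat, voice_feat):
        f = F.normalize(self.face_branch(face_feat), dim=-1)
        v = F.normalize(self.voice_branch(voice_feat), dim=-1)
        return f, v  # joint embeddings, e.g. concatenated for a classifier

# Example usage with random descriptors
model = TwoBranchEmbedder()
f, v = model(torch.randn(4, 512), torch.randn(4, 192))
print(f.shape, v.shape)  # torch.Size([4, 128]) torch.Size([4, 128])
```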
... We analyze both SPINE-1 and SPINE-2 with ClipDaT to produce the characteristic profile in Fig. 14. For SPINE-2, both the unprocessed audio (original microphone recordings) and the processed audio (passed through various voice coders; Hansen et al., 2001) are included in the analysis. The audio consists of conversational recordings of 64 two-person collaborative tasks. ...
Article
Full-text available
Speech, speaker, and language systems have traditionally relied on carefully collected speech material for training acoustic models. Meanwhile, there is an enormous amount of freely accessible audio content. A major challenge, however, is that such data is not professionally recorded, and therefore may contain a wide diversity of background noise, nonlinear distortions, or other unknown environmental or technology-based contamination or mismatch. There is a crucial need for automatic analysis to screen such unknown data sets before acoustic model training, or to perform input audio purity screening prior to classification. In this study, we propose a waveform-based clipping detection algorithm for naturalistic audio streams and examine the impact of clipping at different severities on speech quality measurements and automatic speaker recognition systems. We use the TIMIT and NIST SRE08 corpora as case studies. The results show, as expected, that clipping introduces a nonlinear distortion into clean speech data, which reduces speech quality and performance for speaker recognition. We also investigate what degree of clipping can be present while still sustaining effective speech system performance. The proposed detection system, which will be released, could contribute to massive new audio collections for speech and language technology development (e.g., Google AudioSet (Gemmeke et al., 2017) and the CRSS-UTDallas Apollo Fearless Steps corpus (Yu et al., 2014), 19,000 h of naturalistic audio from NASA Apollo missions).
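A minimal sketch of the intuition behind waveform-based clipping detection (not the ClipDaT algorithm itself; the tolerance below is an illustrative assumption): hard clipping pins long runs of samples at the same extreme value, so the fraction of samples sitting at the observed peak is a simple severity indicator.

```python
import numpy as np

def clipping_ratio(x, tol=1e-4):
    """Fraction of samples lying within `tol` of the observed extremes.
    Hard clipping produces runs of samples pinned at the same maximum
    value, so this ratio is near zero for clean audio and grows with
    clipping severity. `tol` is an illustrative setting."""
    x = np.asarray(x, dtype=float)
    peak = np.max(np.abs(x))
    return np.mean(np.abs(np.abs(x) - peak) <= tol)

def hard_clip(x, level):
    """Simulate clipping at a given absolute level, for testing."""
    return np.clip(x, -level, level)

# Example: speech-like random signal vs. a deliberately clipped copy
rng = np.random.default_rng(0)
clean = rng.normal(scale=0.1, size=16000)
print(clipping_ratio(clean))                  # ~1/16000: only the true peak
print(clipping_ratio(hard_clip(clean, 0.15))) # much larger fraction
```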
... Moreover, one reason behind the extensive use of MFCC features by researchers is their high performance: the mel filter bands are positioned logarithmically, so these features are more in line with the auditory characteristics of the human ear (Sun et al., 2019). However, the performance of MFCC features degrades under noisy conditions (Hansen, Sarikaya, Yapanel, & Pellom, 2001). Furthermore, MFCC features capture the whole spectral envelope of short frames and lack speaker-discriminative cues such as pitch information (Almaadeed et al., 2016; Nagrani et al., 2017). ...
Article
Speech is a powerful medium of communication that conveys rich and useful information, such as the gender, accent, and other unique characteristics of a speaker. These unique characteristics enable researchers to recognize human voices using artificial intelligence techniques, which is important in areas such as forensic voice verification, security and surveillance, electronic voice eavesdropping, mobile banking, and mobile shopping. Recent advancements in deep learning and hardware have gained the attention of researchers working in the field of automatic speaker identification (SI). However, to the best of our knowledge, no in-depth survey is available that critically appraises and summarizes the existing techniques with their strengths and weaknesses for SI. Hence, this study identifies and discusses various areas of SI, presents a comprehensive survey of existing studies, and outlines the future research challenges that require significant research effort in the field of SI systems.
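The logarithmic placement of the mel filter bands mentioned in the excerpt above follows directly from the mel scale; a minimal sketch of the standard mapping and of filter-bank edges spaced equally on that scale (textbook formulas, not code from the cited works):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Standard mel-scale mapping: roughly linear below ~1 kHz and
    logarithmic above, mirroring human pitch perception."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(n_filters=26, f_min=0.0, f_max=8000.0):
    """Filter-bank edge frequencies equally spaced on the mel scale;
    the edges crowd together at low frequencies and spread out at
    high frequencies."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    return mel_to_hz(mels)

print(mel_filter_edges()[:5])   # dense low-frequency edges
print(mel_filter_edges()[-5:])  # sparse high-frequency edges
```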
... Other researchers suggest advanced feature processing, such as cepstral normalization techniques (e.g., cepstral mean normalization, CMN, and variable cepstral mean normalization, VCMN) or other techniques that try to estimate the cepstral parameters of the undistorted speech given the cepstral parameters of the noisy speech; this is occasionally combined with multi-condition training, i.e., training acoustic models on speech distorted with several noise types and signal-to-noise ratios (SNRs) (Hansen et al. 2001; Deng et al. 2001). Using sparse-representation-based classification improves robustness, though it requires considerable processing power. ...
Article
Full-text available
Automatic Speech Recognition (ASR) for Amazigh speech, particularly Moroccan Tarifit-accented speech, is a little-researched area. This paper focuses on the analysis and evaluation of the first ten Amazigh digits under noisy conditions from an ASR perspective based on the signal-to-noise ratio (SNR). Our testing experiments were performed under two types of noise, with environmental noise added at SNRs ranging from 5 to 45 dB for each type. Different formalisms, such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs), are used to develop speaker-independent Amazigh speech recognition. The experimental results under noisy conditions show performance degradation for all digits to different degrees; recognition rates decrease less in the car-noise environment than under grinder noise, with differences of 2.84% and 8.42% at SNRs of 5 dB and 25 dB, respectively. We also observed that the most affected digits are those that contain the letter "S".
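Two of the ingredients mentioned in this entry and in the excerpt preceding it are easy to make concrete: cepstral mean normalization (subtract the per-utterance mean of each cepstral coefficient so a fixed channel bias cancels) and mixing noise into clean speech at a prescribed SNR. A minimal sketch of both, under generic assumptions about array shapes; this is not the cited authors' code:

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """CMN: remove the per-utterance mean of each cepstral coefficient.
    cepstra: (n_frames, n_coeffs) array, e.g. MFCCs. A fixed convolutional
    (channel) distortion appears as an additive constant in the cepstral
    domain, so subtracting the mean removes it."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db,
    then add it to `speech`. Both are 1-D float arrays of equal length."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_p_noise / p_noise)
```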
... Traditional SV models, such as GMM-UBM [5] and i-vector [7], were the state-of-the-art approaches for a long time. All the above-mentioned methods rely on low-dimensional input features such as mel-frequency cepstral coefficients (MFCCs); however, MFCCs are known to suffer from performance degradation under real-world noise conditions, as demonstrated by [22,23]. Deep Convolutional Neural Networks (DCNNs) have proven effective at extracting intrinsic features from noisy data, and various speech applications [24,25,26] based on DCNNs have been proposed. ...
Preprint
Full-text available
Text-independent speaker verification is an important artificial intelligence problem with a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. The purpose of text-independent speaker verification is to determine whether two given uncontrolled utterances originate from the same speaker. Extracting speech features for each speaker using deep neural networks is a promising direction to explore, and a straightforward solution is to train the discriminative feature extraction network with a metric learning loss function. However, a single loss function often has limitations. Thus, we use deep multi-metric learning to address the problem and introduce three different losses: triplet loss, n-pair loss, and angular loss. The three loss functions work cooperatively to train a feature extraction network equipped with residual connections and squeeze-and-excitation attention. We conduct experiments on the large-scale VoxCeleb2 dataset, which contains over a million utterances from over 6,000 speakers, and the proposed deep neural network obtains an equal error rate of 3.48%, which is a very competitive result. Code for both training and testing and pretrained models are available at https://github.com/GreatJiweix/DmmlTiSV, the first publicly available code repository for large-scale text-independent speaker verification with performance on par with state-of-the-art systems.
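Of the three losses combined above, the triplet loss is the simplest to write down; a minimal sketch using squared Euclidean distances between L2-normalized embeddings (the margin value is illustrative, and this is not the authors' released code):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull same-speaker embeddings together and push different-speaker
    embeddings apart by at least `margin`, using squared Euclidean
    distances. anchor, positive, negative: (batch, dim) L2-normalized
    embeddings; margin=0.2 is an illustrative choice."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)
    d_an = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + margin).mean()

# Example with random embeddings
a = F.normalize(torch.randn(8, 128), dim=-1)
p = F.normalize(torch.randn(8, 128), dim=-1)
n = F.normalize(torch.randn(8, 128), dim=-1)
print(triplet_loss(a, p, n))
```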
... Audio recognition, namely the problem of classifying sounds, has traditionally been addressed by means of models such as Gaussian Mixture Models (GMMs) [22] and Support Vector Machines (SVMs) [23] trained on hand-crafted low-dimensional features such as Mel Frequency Cepstrum Coefficients (MFCCs) or i-vectors [24]. However, the performance of MFCCs in audio recognition degrades rapidly in "unconstrained" environments that include real-world noise [25,26]. More recently, the success of deep learning has motivated approaches based on CNNs [5,27] or RNNs [28,29,30]. ...
Preprint
Full-text available
Our goal is to collect a large-scale audio-visual dataset with low label noise from videos in the wild using computer vision techniques. The resulting dataset can be used for training and evaluating audio recognition models. We make three contributions. First, we propose a scalable pipeline based on computer vision techniques to create an audio dataset from open-source media. Our pipeline involves obtaining videos from YouTube; using image classification algorithms to localize audio-visual correspondence; and filtering out ambient noise using audio verification. Second, we use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes. Third, we investigate various Convolutional Neural Network (CNN) architectures and aggregation approaches to establish audio recognition baselines for our new dataset. Compared to existing audio datasets, VGGSound ensures audio-visual correspondence and is collected under unconstrained conditions. Code and the dataset are available at http://www.robots.ox.ac.uk/~vgg/data/vggsound/
Chapter
Speaker verification is the process used to verify a speaker from his/her voice characteristics. Given a speech segment as input and the target speaker's data, the system automatically determines whether the target speaker spoke the test segment. There are many methods of biometric verification, such as fingerprints, iris scanning, and signatures. Among these, speech-based authentication is not as reliable as the other methods. Hence, we would like to develop a reliable speaker verification model. Recent advances in deep learning have facilitated the design of speaker verification systems that directly take raw waveforms as input. Though developing a model on raw waveforms is complex in speech processing, it yields an end-to-end system, which reduces the time and power spent on feature extraction. To achieve end-to-end speaker verification, we propose to use raw waveforms as input. The development of such a system is possible without much domain knowledge of feature extraction. Moreover, the availability of a large dataset eases the development of the end-to-end system. The later part of the proposed system also includes analyzing the model's performance on a short-utterance dataset to make the model more user-friendly and reduce computation power. Hence, we plan to analyze and improve RawNet (Jung et al. in Proceedings of Interspeech, pp. 3583–3587, 2020 [1]) for short utterances.
Keywords: Speaker verification, End-to-end system, Raw waveforms, RawNet, VoxCeleb dataset, Short utterances
Conference Paper
Full-text available
Initial efforts to make Sphinx, a continuous-speech speaker-independent recognition system, robust to changes in the environment are reported. To deal with differences in noise level and spectral tilt between close-talking and desktop microphones, two novel methods based on additive corrections in the cepstral domain are proposed. In the first algorithm, the additive correction depends on the instantaneous SNR of the signal. In the second technique, expectation-maximization techniques are used to best match the cepstral vectors of the input utterances to the ensemble of codebook entries representing a standard acoustical ambience. Use of the algorithms dramatically improves recognition accuracy when the system is tested on a microphone other than the one on which it was trained.
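A minimal sketch of the first idea above (an additive cepstral correction whose size depends on the instantaneous SNR); the interpolation form and the correction vectors here are assumptions for illustration, not the published algorithm:

```python
import numpy as np

def snr_dependent_correction(cepstrum, snr_db,
                             corr_low, corr_high,
                             snr_floor=0.0, snr_ceil=30.0):
    """Add a correction vector interpolated between a low-SNR and a
    high-SNR correction according to the frame's instantaneous SNR.

    cepstrum:  (n_coeffs,) cepstral vector for one frame.
    corr_low:  correction learned for very noisy frames (assumed given).
    corr_high: correction learned for clean frames (assumed given).
    snr_floor/snr_ceil: illustrative interpolation range in dB.
    """
    w = np.clip((snr_db - snr_floor) / (snr_ceil - snr_floor), 0.0, 1.0)
    return cepstrum + (1.0 - w) * corr_low + w * corr_high
```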
Conference Paper
Full-text available
For a speech recognition system based on a continuous density hidden Markov model (CDHMM), it is shown that speaker adaptation of the parameters of the CDHMM can be formulated as a Bayesian learning procedure and integrated into the segmental k-means training algorithm. Some results are reported for adapting both the mean and the diagonal covariance matrix of the Gaussian state observation densities of a CDHMM. When the speaker adaptation procedure is tested on a 39-word English alpha-digit vocabulary in isolated-word mode, the results indicate that the procedure achieves better performance than a speaker-independent system when only one training token from each word is used to perform speaker adaptation. It is also shown that much better performance can be achieved when two or more training tokens are used for speaker adaptation.
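The core of Bayesian (MAP) mean adaptation can be stated compactly: the adapted mean is a count-weighted interpolation between the speaker-independent prior mean and the sample mean of the adaptation frames assigned to that Gaussian. A minimal sketch of that standard update with a generic relevance factor, not the exact formulation in the paper:

```python
import numpy as np

def map_adapt_mean(prior_mean, adapt_frames, relevance=16.0):
    """MAP adaptation of a single Gaussian mean.

    prior_mean:   (dim,) speaker-independent mean.
    adapt_frames: (n, dim) feature frames assigned to this Gaussian.
    relevance:    prior weight (illustrative value); larger values keep
                  the adapted mean closer to the prior when data is scarce.
    """
    n = adapt_frames.shape[0]
    if n == 0:
        return prior_mean
    sample_mean = adapt_frames.mean(axis=0)
    alpha = n / (n + relevance)
    return alpha * sample_mean + (1.0 - alpha) * prior_mean
```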
Article
Full-text available
This paper addresses the problem of automatic speech recognition in the presence of interfering noise. It focuses on the parallel model combination (PMC) scheme, which has been shown to be a powerful technique for achieving noise robustness. Most experiments reported on PMC to date have been on small, 10-50 word vocabulary systems. Experiments on the Resource Management (RM) database, a 1000-word continuous speech recognition task, reveal compensation requirements not highlighted by the smaller vocabulary tasks; in particular, it is necessary to compensate the dynamic parameters as well as the static parameters to achieve good recognition performance. The database used for these experiments was the RM speaker-independent task with either Lynx helicopter noise or operations-room noise from the NOISEX-92 database added. The experiments reported here used the HTK RM recognizer developed at CUED, modified to include PMC-based compensation for the static, delta and delta-delta parameters. After training on clean speech data, the performance of the recognizer was found to be severely degraded when noise was added to the speech signal at between 10 and 18 dB. However, using PMC the performance was restored to a level comparable with that obtained when training directly in the noise-corrupted environment.
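The static-parameter part of parallel model combination is often explained as: map the clean-speech and noise model means into the linear spectral domain, add them, and map back. A minimal sketch of that mean combination in the log filter-bank domain (the cepstral DCT/inverse-DCT steps and the variance combination are omitted; this is an illustrative simplification, not the HTK implementation):

```python
import numpy as np

def pmc_combine_log_means(speech_log_mean, noise_log_mean, gain=1.0):
    """Combine clean-speech and noise means in the linear spectral domain.

    speech_log_mean, noise_log_mean: (n_bins,) log filter-bank energy
    means (i.e., after the inverse DCT if starting from cepstra, which is
    omitted here). `gain` scales the noise for a target SNR.
    Returns the log-domain mean of the approximate noisy-speech model.
    """
    return np.log(np.exp(speech_log_mean) + gain * np.exp(noise_log_mean))
```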
Article
It is well known that the introduction of acoustic background distortion and the variability resulting from environmentally induced stress causes speech recognition algorithms to fail. In this paper, several causes for recognition performance degradation are explored. It is suggested that recent studies based on a Source Generator Framework can provide a viable foundation in which to establish robust speech recognition techniques. This research encompasses three inter-related issues: (i) analysis and modeling of speech characteristics brought on by workload task stress, speaker emotion/stress or speech produced in noise (Lombard effect), (ii) adaptive signal processing methods tailored to speech enhancement and stress equalization, and (iii) formulation of new recognition algorithms which are robust in adverse environments. An overview of a statistical analysis of a Speech Under Simulated and Actual Stress (SUSAS) database is presented. This study was conducted on over 200 parameters in the domains of pitch, duration, intensity, glottal source and vocal tract spectral variations. These studies motivate the development of a speech modeling approach entitled Source Generator Framework in which to represent the dynamics of speech under stress. This framework provides an attractive means for performing feature equalization of speech under stress. In the second half of this paper, three novel approaches for signal enhancement and stress equalization are considered to address the issue of recognition under noisy stressful conditions. The first method employs (Auto:I,LSP:T) constrained iterative speech enhancement to address background noise and maximum likelihood stress equalization across formant location and bandwidth. The second method uses a feature enhancing artificial neural network which transforms the input stressed speech feature set during parameterization for keyword recognition. The final method employs morphological constrained feature enhancement to address noise and an adaptive Mel-cepstral compensation algorithm to equalize the impact of stress. Recognition performance is demonstrated for speech under a range of stress conditions, signal-to-noise ratios and background noise types.
Article
Achieving reliable performance for a speech recogniser is an important challenge, especially in the context of mobile telephony applications where the user can access telephone functions through voice. The breakthrough of such a technology is appealing, since the driver can concentrate completely and safely on his task while composing and conversing in a "full" hands-free mode. This paper addresses the problem of speaker-dependent discrete utterance recognition in noise. Special reference is made to the mismatch effects that arise because training and testing are performed in different environments. A novel technique for noise compensation is proposed: nonlinear spectral subtraction (NSS). Robust variance estimates and robust pdf evaluations (projection) are also introduced and combined with NSS into the HMM framework. We show that the lower limit of applicability of the projection (low SNR values) can be loosened after combination with NSS. Experimental results are reported. The performance of an HMM-based recogniser rises from 56% (no compensation) to 98% after speech enhancement. More than 3300 utterances have been used to evaluate the systems (three databases, two European languages). This result is achieved by the use of robust training/recognition schemes and by preprocessing the noisy speech with NSS.
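Spectral subtraction in its basic (linear) form subtracts a noise magnitude estimate from each short-time spectrum and floors the result; the "nonlinear" variant makes the amount subtracted depend on the local SNR. A minimal sketch of the basic operation with an SNR-dependent over-subtraction factor (illustrative values and rule, not the specific NSS scheme in the paper):

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, floor=0.02):
    """Per-bin magnitude spectral subtraction with an SNR-dependent
    over-subtraction factor and a spectral floor.

    noisy_mag: (n_bins,) magnitude spectrum of a noisy frame.
    noise_mag: (n_bins,) estimated noise magnitude spectrum.
    floor:     fraction of the noisy magnitude kept as a floor
               (illustrative value) to avoid negative magnitudes
               and musical-noise artifacts.
    """
    snr = noisy_mag / (noise_mag + 1e-12)
    alpha = np.where(snr < 2.0, 2.0, 1.0)  # subtract more in low-SNR bins
    cleaned = noisy_mag - alpha * noise_mag
    return np.maximum(cleaned, floor * noisy_mag)
```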
Article
The performance levels of most current speech recognizers degrade significantly when environmental noise occurs during use. Such performance degradation is mainly caused by mismatches in training and operating environments. During recent years much effort has been directed to reducing this mismatch. This paper surveys research results in the area of digital techniques for single microphone noisy speech recognition classified in three categories: noise resistant features and similarity measurement, speech enhancement, and speech model compensation for noise. The survey indicates that the essential points in noisy speech recognition consist of incorporating time and frequency correlations, giving more importance to high SNR portions of speech in decision making, exploiting task-specific a priori knowledge both of speech and of noise, using class-dependent processing, and including auditory models in speech processing.
Conference Paper
The authors aim at the formulation of similarity measures for robust speech recognition. Their consideration focuses on the speech cepstrum derived from linear prediction coefficients (the LPC cepstrum). By using common models for noisy speech, they analytically and empirically show how the ambient noise can affect some important attributes of the LPC cepstrum such as the vector norm, coefficient order, and the direction perturbation. The new findings led them to propose a family of distortion measures based on the projection between two cepstral vectors. Performance evaluation of these measures has been conducted in both speaker-dependent and speaker-independent isolated word recognition tasks. Experimental results show that the new measures cause no degradation in recognition accuracy at high SNR, but perform significantly better when tested under noisy conditions using only clean reference templates. At an SNR of 5 dB, the new measures are shown to be able to achieve a recognition rate equivalent to that obtained by the filtered cepstral measure at 20 dB SNR, demonstrating a gain of 15 dB.
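The projection idea can be illustrated as follows: instead of the plain Euclidean distance between a noisy test cepstrum and a clean reference cepstrum, score similarity by the projection of the test vector onto the reference direction, which is less sensitive to the norm shrinkage that additive noise causes. A minimal sketch of that contrast (an illustrative form, not the exact family of measures proposed in the paper):

```python
import numpy as np

def projection_similarity(test_cep, ref_cep):
    """Projection of the test cepstral vector onto the reference direction.
    Larger is more similar; less affected by noise-induced norm shrinkage
    than the Euclidean distance."""
    ref_norm = np.linalg.norm(ref_cep) + 1e-12
    return np.dot(test_cep, ref_cep) / ref_norm

def euclidean_distance(test_cep, ref_cep):
    """Baseline measure for comparison (smaller is more similar)."""
    return np.linalg.norm(test_cep - ref_cep)
```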
Conference Paper
A set of iterative speech enhancement techniques using spectral constraints is extended and evaluated. The approaches apply inter- and intraframe spectral constraints to ensure optimum speech quality across all classes of speech. Constraints are applied on the basis of the presence of perceptually important speech characteristics found during the enhancement procedure. Results show improvement over past techniques for additive white noise distortions. Three points are addressed in the present study. First, a convenient and consistent terminating point for the iterative technique is presented which was previously unavailable. Second, the techniques have been generalized to allow for slowly varying, colored noise. Finally, a comparative evaluation has been performed to determine their usefulness as preprocessors for recognition in extremely noisy environments in the vicinity of 0 dB SNR.
Article
It is well known that the performance of speech recognition algorithms degrades in the presence of adverse environments where a speaker is under stress, emotion, or the Lombard (1911) effect. This study evaluates the effectiveness of traditional features in recognition of speech under stress and formulates new features which are shown to improve stressed speech recognition. The focus is on formulating robust features which are less dependent on the speaking conditions, rather than on applying compensation or adaptation techniques. The stressed speaking styles considered are simulated angry and loud speech, Lombard effect speech, and noisy actual stressed speech from the SUSAS database, which is available on CD-ROM through the NATO IST/TG-01 research group and the LDC. In addition, this study investigates the immunity of the linear prediction power spectrum and the fast Fourier transform power spectrum to the presence of stress. Our results show that, unlike the fast Fourier transform's (FFT) immunity to noise, the linear prediction power spectrum is more immune than the FFT to stress as well as to a combination of a noisy and stressful environment. Finally, the effects of various parameter processing choices, such as fixed versus variable preemphasis, liftering, and fixed versus cepstral mean normalization, are studied. Two alternative frequency partitioning methods are proposed and compared with traditional mel-frequency cepstral coefficient (MFCC) features for stressed speech recognition. It is shown that the alternative filterbank frequency partitions are more effective for recognition of speech under both simulated and actual stressed conditions.