Conference Paper

RASTA-PLP speech analysis technique

Authors:

Abstract

Most speech parameter estimation techniques are easily influenced by the frequency response of the communication channel. The authors have developed a technique that is more robust to such steady-state spectral factors in speech. The approach is conceptually simple and computationally efficient. The new method is described, and experimental results are presented that show significant advantages of the proposed method.
[Figure: block diagram of RASTA-PLP analysis: SPEECH → DISCRETE FOURIER TRANSFORM → LOGARITHM → FILTERING → INVERSE LOGARITHM → EQUAL-LOUDNESS CURVE → POWER-LAW OF HEARING → INVERSE DISCRETE FOURIER TRANSFORM → SOLVING OF SET OF LINEAR EQUATIONS (DURBIN) → CEPSTRAL RECURSION → CEPSTRAL COEFFICIENTS OF RASTA-PLP MODEL]
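The band-pass filtering stage is what distinguishes RASTA-PLP from plain PLP: the time trajectory of each log critical-band energy is filtered so that slowly varying, channel-dependent components are suppressed. The Python sketch below applies a commonly cited form of the RASTA transfer function, H(z) = 0.1·(2 + z⁻¹ − z⁻³ − 2z⁻⁴)/(1 − 0.98z⁻¹), using SciPy; the function name and array layout are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectrum):
    """Band-pass filter each critical band's trajectory over time.

    log_spectrum: (n_frames, n_bands) array holding the logarithm of the
    critical-band energies, one row per analysis frame.
    """
    # Commonly cited RASTA transfer function:
    #   H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)
    numerator = np.array([0.2, 0.1, 0.0, -0.1, -0.2])
    denominator = np.array([1.0, -0.98])
    # Filter along the time axis (axis 0); the first few output frames are
    # transient while the filter memory fills up.
    return lfilter(numerator, denominator, log_spectrum, axis=0)
```

The filtered log spectrum would then pass through the inverse logarithm, equal-loudness weighting, power-law compression, and all-pole modelling steps listed in the diagram.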
... There are several approaches for speech description, such as Mel Frequency Cepstral Coefficients (MFCC) [9], Perceptual Linear Predictive (PLP) [13], Relative Spectral Perceptual Linear Predictive (RASTA-PLP) [14] and Power Normalized Cepstral Coefficients (PNCC) [20]. They can allow high performance for a recognizer depending on how the samples have been recorded: factors such as the presence of noise, the signal-to-noise ratio, the distance between the microphone and the speaker, the speed of pronunciation, and silent periods may impair the efficacy of a descriptor. ...
... Find the winner node: Win(t) ← min_j d_j(t); apply the weight update equation: w_ji ← ρ(t) · T_{j,Win(t)}(t) · (P_i − w_ji) ...
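The snippet above quotes a self-organizing-map training step from a citing work. For reference, the sketch below shows the standard additive form of that update, w_j ← w_j + ρ(t)·T_{j,Win(t)}(t)·(P − w_j); the function name, array shapes, and neighborhood interface are assumptions introduced here for illustration.

```python
import numpy as np

def som_update(weights, sample, rho, neighborhood_fn):
    """One training step of a self-organizing map (standard additive form).

    weights:         (n_nodes, n_features) matrix of node weights w_j
    sample:          (n_features,) input pattern P
    rho:             learning rate rho(t)
    neighborhood_fn: maps (node j, winner index) -> T_{j,Win(t)}(t)
    """
    # Find the winner node: the unit whose weights are closest to the input.
    distances = np.linalg.norm(weights - sample, axis=1)
    winner = int(np.argmin(distances))
    # Move every node toward the sample, scaled by the learning rate and
    # its neighborhood value relative to the winner.
    for j in range(weights.shape[0]):
        weights[j] += rho * neighborhood_fn(j, winner) * (sample - weights[j])
    return weights, winner
```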
Article
Full-text available
In automatic speech recognition (ASR) systems, the minimization of noxious effects caused by different background noises between training and operating situations has been a challenging task for many years. An ASR robust to noise that can deal with different types of speech and various speakers is still an open research problem. Typically, conventional ASR models for missing-feature reconstruction and robust speech descriptors employ acoustic features and statistical methods. In spite of improved performance in dealing with noise, such methods still degrade in performance when different background noises co-exist with the main signal. More recent approaches use neural networks, particularly deep learning models, for ASR purposes. Such models increase performance, but at a high training cost. In order to mitigate such limitations, we proposed an ASR model called Self-Organizing Speech Recognizer (SOSR). Unlike most conventional ASRs, SOSR is characterized by using acoustic and articulatory features, employing unsupervised and incremental learning, and being suitable for real-time applications due to its quick training stage. SOSR processes an audio signal simultaneously along two branches. In the first path, the acoustic features are extracted from the original signal, whereas in the second path an acoustic-to-articulatory inversion is performed by several Self-Organizing Maps. The signal from both paths is delivered to a Self-Organizing Map with a time-varying structure, which is responsible for recognizing the input speech signal. Four datasets (TIMIT, Aurora 2, Aurora 4, and CHIME 2) were used for SOSR assessment. The Word Error Rate (WER) was the chosen metric to compare the experimental results of the tests with different noise levels and signal variations. The experimental results suggest that SOSR can learn quickly, and that it can handle noisy signals, various speakers, different types of speech, and assorted lengths of utterances.
... The main aim of performing this step is to derive the appropriate/relevant information. In this section, we discuss five important feature extraction techniques, namely mel spectrograms [13], MFCC [23], PLP [24], RASTA-PLP [25], and SDC [17]. These five features were selected after reviewing several works [26,27,28,29,30] on speech-related applications. ...
Preprint
Identifying user-defined keywords is crucial for personalizing interactions with smart devices. Previous approaches to user-defined keyword spotting (UDKWS) have relied on short-term spectral features such as mel frequency cepstral coefficients (MFCC) to detect the spoken keyword. However, these features may face challenges in accurately identifying closely related pronunciations of audio-text pairs, due to their limited capability in capturing the temporal dynamics of the speech signal. To address this challenge, we propose to use shifted delta coefficients (SDC), which help in capturing pronunciation variability (transitions between connecting phonemes) by incorporating long-term temporal information. The performance of the SDC feature is compared with various baseline features across four different datasets using a cross-attention based end-to-end system. Additionally, various configurations of SDC are explored to find a suitable temporal context for the UDKWS task. The experimental results reveal that the SDC feature outperforms the MFCC baseline feature, exhibiting an improvement of 8.32% in area under the curve (AUC) and 8.69% in equal error rate (EER) on the challenging Libriphrase-hard dataset. Moreover, the proposed approach demonstrated superior performance when compared to state-of-the-art UDKWS techniques.
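Shifted delta coefficients stack delta features taken at several time offsets, so that each frame carries longer-term temporal context than a single delta. The Python sketch below follows the usual N-d-P-k convention; the default parameters (a 7-1-3-7-style setup is common) and the zero-filling past the end of the utterance are assumptions for illustration.

```python
import numpy as np

def shifted_delta_coefficients(cepstra, d=1, P=3, k=7):
    """Shifted delta coefficients (SDC) with the usual N-d-P-k convention.

    cepstra: (n_frames, N) base cepstral features (e.g. MFCCs)
    d:       delta half-window (delta at t uses frames t+d and t-d)
    P:       shift between the k stacked delta blocks
    k:       number of delta blocks stacked per frame
    """
    n_frames = cepstra.shape[0]
    # Simple delta: difference of frames d apart, with edge padding.
    padded = np.pad(cepstra, ((d, d), (0, 0)), mode='edge')
    delta = padded[2 * d:] - padded[:-2 * d]              # (n_frames, N)
    # Stack k delta blocks taken at shifts 0, P, 2P, ... for each frame.
    blocks = []
    for i in range(k):
        shifted = np.roll(delta, -i * P, axis=0)
        shifted[max(n_frames - i * P, 0):] = 0.0          # zero past the end
        blocks.append(shifted)
    return np.concatenate(blocks, axis=1)                 # (n_frames, N * k)
```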
... This is the first Telugu database created by IIT Kharagpur [28]. The recordings were done by radio artists, with 15 sentences spoken by 10 different speakers. ...
Article
Full-text available
The objective of speech emotion recognition (SER) is to enhance the man–machine interface. It can also be used to cover the physiological state of a person in critical situations. In recent times, speech emotion recognition has also found applications in medicine and forensics. A new feature extraction technique using the Teager energy operator (TEO) is proposed for the detection of stressed emotions, called the Teager energy-autocorrelation envelope (TEO-Auto-Env). TEO is designed to increase the energies of stressed speech signals, whose energies are reduced during the speech production process, and is therefore used in this analysis. A stressed speech emotion recognition (SSER) system is developed using TEO-Auto-Env and spectral feature combinations for detecting the emotions. The spectral features considered are Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), and relative spectra–perceptual linear prediction (RASTA-PLP). EMO-DB (German), EMOVO (Italian), IITKGP (Telugu), and EMA (English) databases are used in this analysis. The classification of the emotions is carried out using the k-nearest neighbor (k-NN) classifier for gender-dependent (GD) and speaker-independent (SI) cases. The proposed SSER system provides improved accuracy compared to existing ones. Average recall is used for performance evaluation. The highest classification accuracy is achieved using the feature combination of TEO-Auto-Env, MFCC, and LPCC features with 91.4% (SI), 91.4% (GD-male), and 93.1% (GD-female) for EMO-DB; 68.5% (SI), 68.5% (GD-male), and 74.6% (GD-female) for EMOVO; 90.6% (SI), 91% (GD-male), and 92.3% (GD-female) for EMA; and 95.1% (GD-female) for the IITKGP female database.
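The Teager energy operator underlying TEO-Auto-Env is a three-sample nonlinear operator, Ψ[x(n)] = x(n)² − x(n−1)·x(n+1), which tracks the instantaneous energy of a signal. The Python sketch below computes it for one speech frame; the edge handling is an assumption, and the autocorrelation-envelope stage described in the abstract is not reproduced here.

```python
import numpy as np

def teager_energy(frame):
    """Discrete Teager energy operator: psi[x(n)] = x(n)^2 - x(n-1) * x(n+1).

    frame: 1-D array holding one windowed speech frame (at least 3 samples).
    """
    x = np.asarray(frame, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]   # replicate the edge values
    return psi
```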
... • Video: BRISQUE [12], CORNIA [13], V-BLIINDS [14], TLVQM [15], VIDEVAL [16], and RAPIQUE [17]. • Audio: Mel frequency cepstral coefficient (MFCC), RASTA-PLP [18] and NRMusic [19]. The AQA models mentioned above extract features from each audio segment and calculate the means and stds over all audio segments to produce audio quality-aware features. ...
Preprint
With the explosive increase of User Generated Content (UGC), UGC video quality assessment (VQA) becomes more and more important for improving users' Quality of Experience (QoE). However, most existing UGC VQA studies only focus on the visual distortions of videos, ignoring that the user's QoE also depends on the accompanying audio signals. In this paper, we conduct the first study to address the problem of UGC audio and video quality assessment (AVQA). Specifically, we construct the first UGC AVQA database named the SJTU-UAV database, which includes 520 in-the-wild UGC audio and video (A/V) sequences, and conduct a user study to obtain the mean opinion scores of the A/V sequences. The content of the SJTU-UAV database is then analyzed from both the audio and video aspects to show the database characteristics. We also design a family of AVQA models, which fuse the popular VQA methods and audio features via support vector regressor (SVR). We validate the effectiveness of the proposed models on the three databases. The experimental results show that with the help of audio signals, the VQA models can evaluate the perceptual quality more accurately. The database will be released to facilitate further research.
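The AVQA models described above fuse video-quality and audio-quality features with a support vector regressor trained against the mean opinion scores. The Python sketch below (scikit-learn) illustrates that late-fusion setup on synthetic placeholder data; the feature dimensions, kernel, and hyperparameters are assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_clips = 100
video_features = rng.normal(size=(n_clips, 8))   # placeholder VQA features
audio_features = rng.normal(size=(n_clips, 4))   # placeholder audio means/stds
mos = rng.uniform(1.0, 5.0, size=n_clips)        # placeholder opinion scores

# Late fusion: concatenate audio and video features, regress onto the MOS.
fused = np.hstack([video_features, audio_features])
model = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0))
model.fit(fused, mos)
predicted_quality = model.predict(fused)
```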
Article
Full-text available
Speech is one of the most fundamental and essential forms of human communication. Humans and computers interact through what is known as a human–computer interface. Speech can be used to communicate with the computer. Speech recognition is utilised not only in mobile devices, but also in embedded systems, modern desktop and laptop computers, operating systems, and browsers. This is beneficial to children, senior citizens, and people who are blind or have impaired eyesight. It is especially important for physically handicapped individuals who rely solely on this medium to interact with computer systems. The field of voice recognition research is becoming more inventive. Researchers are trying to expand the ways in which computers can use human speech. This review article aims to classify methods for translating human speech into a format that computers can understand. Challenges of the current most popular speech recognition systems are analysed and solutions are presented. This review paper is intended to provide a summary for researchers who are working in speech recognition. Both feature extraction and classification are critical components of a speech recognition system. The focus of this study is to present a review of the literature on feature extraction and classification strategies for speech recognition systems.
Article
We introduce Shennong, a Python toolbox and command-line utility for audio speech features extraction. It implements a wide range of well-established state-of-the-art algorithms: spectro-temporal filters such as Mel-Frequency Cepstral Filterbank or Predictive Linear Filters, pre-trained neural networks, pitch estimators, speaker normalization methods, and post-processing algorithms. Shennong is an open source, reliable and extensible framework built on top of the popular Kaldi speech processing library. The Python implementation makes it easy to use by non-technical users and integrates with third-party speech modeling and machine learning tools from the Python ecosystem. This paper describes the Shennong software architecture, its core components, and implemented algorithms. Then, three applications illustrate its use. We first present a benchmark of speech features extraction algorithms available in Shennong on a phone discrimination task. We then analyze the performances of a speaker normalization model as a function of the speech duration used for training. We finally compare pitch estimation algorithms on speech under various noise conditions.
Article
Full-text available
Deep learning has been widely adopted in automatic emotion recognition and has led to significant progress in the field. However, due to insufficient training data, pre-trained models are limited in their generalisation ability, leading to poor performance on novel test sets. To mitigate this challenge, transfer learning performed by fine-tuning pre-trained models on novel domains has been applied. However, the fine-tuned knowledge may overwrite and/or discard important knowledge learnt in pre-trained models. In this paper, we address this issue by proposing a PathNet-based meta-transfer learning method that is able to (i) transfer emotional knowledge learnt from one visual/audio emotion domain to another domain and (ii) transfer emotional knowledge learnt from multiple audio emotion domains to one another to improve overall emotion recognition accuracy. To show the robustness of our proposed method, extensive experiments on facial expression-based emotion recognition and speech emotion recognition are carried out on three benchmarking data sets: SAVEE, EMODB, and eNTERFACE. Experimental results show that our proposed method achieves superior performance compared with existing transfer learning methods.
Article
Full-text available
A model-based spectral estimation algorithm is derived that improves the robustness of speech recognition systems to additive noise. The algorithm is tailored for filter-bank-based systems, where the estimation should seek to minimize the distortion as measured by the recognizer's distance metric. This estimation criterion is approximated by minimizing the Euclidean distance between spectral log-energy vectors, which is equivalent to minimizing the nonweighted, nontruncated cepstral distance. Correlations between frequency channels are incorporated in the estimation by modeling the spectral distribution of speech as a mixture of components, each representing a different speech class, and assuming that spectral energies at different frequency channels are uncorrelated within each class. The algorithm was tested with SRI's continuous-speech, speaker-independent, hidden Markov model recognition system using the large-vocabulary NIST "Resource Management Task." When trained on a clean-speech database and tested with additive white Gaussian noise, the new algorithm has an error rate half of that with MMSE estimation of log spectral energies at individual frequency channels, and it achieves a level similar to that with the ideal condition of training and testing at constant SNR. The algorithm is also very efficient with additive environmental noise, recorded with a desktop microphone.
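The estimation criterion described above (minimizing the Euclidean distance between log-energy vectors under a mixture of speech classes with per-class uncorrelated channels) amounts to an MMSE estimate of the clean log-spectral vector. The block below restates it compactly; the symbols (y for the noisy observation, x for the clean log-energy vector, k for the mixture class) are notation introduced here, not the paper's.

```latex
% y: observed (noisy) log-energy vector, x: clean log-energy vector,
% k: speech class in the mixture model (notation introduced here).
\hat{\mathbf{x}}
  = \arg\min_{\tilde{\mathbf{x}}}
    E\!\left[\lVert \mathbf{x}-\tilde{\mathbf{x}} \rVert^{2} \,\middle|\, \mathbf{y}\right]
  = E[\mathbf{x}\mid\mathbf{y}]
  = \sum_{k} P(k \mid \mathbf{y})\, E[\mathbf{x}\mid\mathbf{y},k]
% where E[x_i | y, k] can be computed channel by channel, since the spectral
% energies are assumed uncorrelated across channels within each class k.
```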
Article
Full-text available
In this paper we discuss recent results from our efforts to make SPHINX, the CMU continuous-speech speaker-independent recognition system, robust to changes in the environment. To deal with differences in noise level and spectral tilt between close-talking and desk-top microphones, we describe two novel methods based on additive corrections in the cepstral domain. In the first algorithm, an additive correction is imposed that depends on the instantaneous SNR of the signal. In the second technique, EM techniques are used to best match the cepstral vectors of the input utterances to the ensemble of codebook entries representing a standard acoustical ambience. Use of these algorithms dramatically improves recognition accuracy when the system is tested on a microphone other than the one on which it was trained. In this paper we present two algorithms for speech normalization based on additive corrections in the cepstral domain and compare them to techniques that operate in the frequency domain. We have chosen the cepstral domain rather than the frequency domain so that we can work directly with the parameters that SPHINX uses, and because speech can be characterized with a smaller number of parameters in the cepstral domain than in the frequency domain. The first algorithm, SNR-dependent cepstral normalization (SDCN), is simple and effective, but it cannot be applied to new microphones without microphone-specific training. The second algorithm, codeword-dependent cepstral normalization (CDCN), uses the speech knowledge represented in a codebook to estimate the noise and spectral equalization necessary for environmental normalization. We also describe an interpolated SDCN algorithm (iSDCN), which combines the simplicity of SDCN and the normalization capabilities of CDCN. These algorithms are evaluated with a number of microphones using an alphanumeric database in which utterances were recorded simultaneously with two different microphones.
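SDCN, as described above, adds to each cepstral frame a correction vector selected by the frame's instantaneous SNR, with the corrections learned offline from simultaneously recorded close-talking and desk-top speech. The Python sketch below shows only the application step; the binning scheme, array shapes, and function name are assumptions introduced here for illustration.

```python
import numpy as np

def sdcn_normalize(cepstra, frame_snr_db, corrections, snr_edges):
    """Apply SNR-dependent additive cepstral corrections (sketch).

    cepstra:      (n_frames, n_ceps) cepstral vectors to normalize
    frame_snr_db: (n_frames,) instantaneous SNR estimate per frame, in dB
    corrections:  (n_bins, n_ceps) additive correction vector per SNR bin,
                  assumed to be learned offline from stereo recordings
    snr_edges:    (n_bins - 1,) SNR bin boundaries in dB
    """
    bins = np.digitize(frame_snr_db, snr_edges)   # SNR bin index per frame
    return cepstra + corrections[bins]            # add that bin's correction
```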
Article
A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is presented and examined. This technique uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum: (1) the critical-band spectral resolution, (2) the equal-loudness curve, and (3) the intensity-loudness power law. The auditory spectrum is then approximated by an autoregressive all-pole model. A 5th-order all-pole model is effective in suppressing speaker-dependent details of the auditory spectrum. In comparison with conventional linear predictive (LP) analysis, PLP analysis is more consistent with human hearing. The effective second formant F2' and the 3.5-Bark spectral-peak integration theories of vowel perception are well accounted for. PLP analysis is computationally efficient and yields a low-dimensional representation of speech. These properties are found to be useful in speaker-independent automatic-speech recognition.
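PLP approximates the auditory spectrum (critical-band analysis, equal-loudness weighting, cube-root compression) with a low-order all-pole model. The Python sketch below shows that final modelling step, assuming the auditory spectrum has already been computed; a generic Toeplitz solver stands in for the Durbin recursion, and the function name is illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def plp_all_pole(auditory_spectrum, order=5):
    """Fit an all-pole (autoregressive) model to an auditory spectrum.

    auditory_spectrum: (n_bands,) critical-band values already weighted by
    the equal-loudness curve and compressed by the cube-root power law.
    """
    # Inverse DFT of the power spectrum yields autocorrelation lags.
    autocorr = np.fft.irfft(auditory_spectrum)
    r = autocorr[:order + 1]
    # Normal (Yule-Walker) equations; a Toeplitz solver stands in for the
    # Durbin recursion used in practice.
    a = solve_toeplitz(r[:-1], r[1:])
    lpc = np.concatenate(([1.0], -a))      # predictor polynomial A(z)
    gain = r[0] - np.dot(a, r[1:])         # prediction error (model gain)
    return lpc, gain
```

The 5th-order default mirrors the model order the abstract reports as effective for suppressing speaker-dependent spectral detail; cepstral coefficients can then be obtained from the predictor by the usual cepstral recursion.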
Article
Listeners identified both constituents of double vowels created by summing the waveforms of pairs of synthetic vowels with the same duration and fundamental frequency. Accuracy of identification was significantly above chance. Effects of introducing such double vowels by visual or acoustical precursor stimuli were examined. Precursors specified the identity of one of the two constituent vowels. Performance was scored as the accuracy with which the other vowel was identified. Visual precursors were standard English spellings of one member of the vowel pair; acoustical precursors were 1-sec segments of one member of the vowel pair. Neither visual precursors nor contralateral acoustical precursors improved performance over the condition with no precursor. Thus, knowledge of the identity of one of the constituents of a double vowel does not help listeners to identify the other constituent. A significant improvement in performance did occur with ipsilateral acoustical precursors, consistent with earlier demonstrations that frequency components which undergo changes in spectral amplitude achieve enhanced auditory prominence relative to unchanging components. This outcome demonstrates the joint but independent operation of auditory and perceptual processes underlying the ability of listeners to understand speech despite adversely peaked frequency responses in communication channels.
Conference Paper
A new speech analysis technique applicable to speech recognition is proposed considering the auditory mechanism of speech perception which emphasizes spectral dynamics and which compensates for the spectral undershoot associated with coarticulation. A speech wave is represented by the LPC cepstrum and logarithmic energy sequences, and the time sequences over short periods are expanded by the first- and second-order polynomial functions at every frame period. The dynamics of the cepstrum sequences are then emphasized by the linear combination of their polynomial expansion coefficients, that is, derivatives, and their instantaneous values. Speaker-independent word recognition experiments using time functions of the dynamics-emphasized cepstrum and the polynomial coefficient for energy indicate that the error rate can be largely reduced by this method.
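The first-order polynomial expansion described above is the familiar delta-cepstrum: a regression coefficient of each cepstral (or log-energy) trajectory over a short window. The Python sketch below computes it; the window half-length K and the edge padding are assumptions, and the second-order (delta-delta) expansion can be obtained by applying the same regression to the delta sequence.

```python
import numpy as np

def delta_features(cepstra, K=2):
    """First-order regression (delta) coefficients over a +/-K frame window.

    cepstra: (n_frames, n_ceps) cepstral or log-energy time sequence
    """
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    padded = np.pad(cepstra, ((K, K), (0, 0)), mode='edge')
    n_frames = cepstra.shape[0]
    delta = np.zeros_like(cepstra, dtype=float)
    for k in range(1, K + 1):
        delta += k * (padded[K + k:K + k + n_frames]
                      - padded[K - k:K - k + n_frames])
    return delta / denom
```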