Fig 7 - uploaded by Osamu Ichikawa
Word Error Rate using the proposed Method 1 for various values of the over-subtraction factor γ and the length of the adaptive weights L, for the Command Recognition Task.

Source publication
Conference Paper
Recently, automatic speech recognition in a car has practical uses for applications like car-navigation and hands-free telephone dialers. For noise robustness, the current successes are based on the assumption that there is only a stationary cruising noise. Therefore, the recognition rate is greatly reduced when there is music or news coming from a...

Similar publications

Preprint
In this paper we address the problem of multichannel speech enhancement in the short-time Fourier transform (STFT) domain and in the framework of sequence-to-sequence deep learning. A long short-term memory (LSTM) network takes as input a sequence of STFT coefficients associated with a frequency bin of multichannel noisy-speech signals. The network...
Article
Endpoint detection of speech has proven beneficial for speech recognition and speech enhancement. But the traditional endpoint detection methods lose efficiency in either low signal-to-noise ratio (SNR) environments or nonstationary noise environments. To improve the accuracy of speech endpoint detection in low SNR environments, an endpoint det...
Conference Paper
In this paper, we investigated the enhancement of speech by applying a Kalman filter. Noise removal is very important in many applications like telephone conversation, speech recognition, etc. The corruption of speech due to the presence of additive background noise causes severe difficulties in various communication environments. If the background noise...
Article
In this paper, we propose a statistical model-based speech enhancement technique using the spectral difference scheme for the speech recognition in virtual reality. In the analyzing step, two principal parameters, the weighting parameter in the decision-directed (DD) method and the long-term smoothing parameter in noise estimation, are uniquely det...

Citations

... We next generalize and express the criterion introduced in Sect. 3. ...
... We compare Proposed with conventional methods that combine a LAF and nonlinear echo suppressor using STSA estimation. These are referred to as follows: Conventional 1, consisting of both [1] without noise reduction and [2]; Conventional 2 consisting of [3] without a noise canceller; and Conventional 3 [5]. ...
Article
Hands-free communications between cellular phones must be robust enough to withstand echo-path variation, and highly nonlinear echoes must be suppressed at low cost, when acoustic echo cancellation or suppression is applied to them. This paper proposes a spectrum-selective nonlinear echo suppression (SS-ES) approach as a solution to these issues. SS-ES is characterized by the selection of either a spectrum of the residual signal from an adaptive filter or a spectrum of the sending input signal depending on the amount of linear echo cancellation in an adaptive filter. Compared to conventional methods, the objective evaluation results of the SS-ES approach show an improvement of approximately 0.8-2.2dB, 0.23-2.39dB, and 0.26-0.50 in average echo return loss enhancement (ERLE), average root-mean-square log-spectral distortion (RMS-LSD), and the perceptual evaluation of speech quality (PESQ) value, respectively, under echo-path variation and double-talk conditions.
... We compare the proposed SS-ES method with other conventional methods, each a combination of a linear adaptive filter and a nonlinear echo suppressor using STSA estimation. These conventional methods are referred to as follows: Conventional 1 consisting of both [1] without noise reduction and [2]; Conventional 2 containing both [3] without noise canceller; and Conventional 3 [5]. In all methods, the same linear adaptive filter with the NLMS algorithm and a double-talk detector [10], [11] is used. ...
Conference Paper
When acoustic echo cancellation or suppression is applied to hands-free communications in cellular phones, it is important for it to be robust against echo path variation and to suppress highly nonlinear echo at low cost. This paper proposes the spectrum selective nonlinear echo suppression (SS-ES) approach as a solution to these issues. SS-ES is characterized by selecting either a spectrum of the residual signal from an adaptive filter or a spectrum of the sending input signal depending on the contribution from linear echo cancellation in an adaptive filter. In the SS-ES approach, the results of objective evaluation show that the average ERLE is approximately 1.2–2.7 dB, the average LSD is approximately 0.19–0.62 dB, and the PESQ value is approximately 0.36–0.53 better than the conventional methods under conditions of echo path variation and double-talk.
... By far, the most important issue constitutes the lack of ASR robustness to the noisy automobile acoustic environment, for example due to road, wind, and engine noise, background conversations, radio music, and other audio sources. The problem is exacerbated by the fact that the overall car noise environment is nonstationary and difficult to model a priori, which in turn reduces the effectiveness of traditional noise-robust ASR techniques, for example Wiener filtering [3], echo cancellation [4], beamforming [5], spectral subtraction [6], the Algonquin framework [7], or adaptation techniques [8], to name a few. As a result, in practice, ASR systems in automotive environments are typically used in the so-called "push-to-talk" or "push-to-activate" mode. ...
Conference Paper
We present a system for automatically detecting driver's speech in the automobile domain using visual-only information extracted from the driver's mouth region. The work is motivated by the desire to eliminate manual push-to-talk activation of the speech recognition engine in newly designed voice interfaces in the typically noisy car environment, aiming at reducing driver cognitive load and increasing naturalness of the interaction. The proposed system uses a camera mounted on the rearview mirror to monitor the driver, detect face boundaries and facial features, and finally employ lip motion clues to recognize visual speech activity. In particular, the designed algorithm has very low computational cost, which allows real-time implementation on currently available inexpensive embedded platforms, as described in the paper. Experiments are also reported on a small multi-speaker database collected in moving automobiles, that demonstrate promising accuracy.
Article
The accuracy of automatic speech recognition in a car is significantly degraded in a very low SNR (Signal to Noise Ratio) situation such as "Fan high" or "Window open". In such cases, speech signals are often buried in broadband noise. Although several existing noise reduction algorithms are known to improve the accuracy, other approaches that can work with them are still required for further improvement. One of the candidates is enhancement of the harmonic structures in human voices. However, most conventional approaches are based on comb filtering, and it is difficult to use them in practical situations, because their assumptions for F0 detection and for voiced/unvoiced detection are not accurate enough in realistic noisy environments. In this paper, we propose a new approach that does not rely on such detection. An observed power spectrum is directly converted into a filter for speech enhancement, by retaining only the local peaks considered to be harmonic structures in the human voice. In our experiments, this approach reduced the word error rate by 17% in realistic automobile environments. Also, it showed further improvement when used with existing noise reduction methods.
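
The idea in the last abstract, turning an observed power spectrum into a filter by keeping only its local peaks, can be illustrated with a rough sketch. This is not the authors' algorithm: the peak test (a bin strictly larger than both neighbours), the function name `local_peak_filter`, and the `floor` gain applied to non-peak bins are all assumptions made for illustration.

```python
import numpy as np

def local_peak_filter(power_spectrum, floor=0.1):
    """Return a per-bin gain: 1.0 at local peaks, `floor` elsewhere.

    `floor` is a hypothetical parameter; the abstract does not say how
    non-peak bins are treated.
    """
    p = np.asarray(power_spectrum, dtype=float)
    gain = np.full(p.shape, floor)
    # Assumed peak test: strictly greater than both neighbouring bins.
    is_peak = np.zeros(p.shape, dtype=bool)
    is_peak[1:-1] = (p[1:-1] > p[:-2]) & (p[1:-1] > p[2:])
    gain[is_peak] = 1.0
    return gain

# Toy frame: two harmonic-like peaks (bins 2 and 6) over a flat noise floor.
spectrum = np.array([1.0, 1.0, 5.0, 1.0, 1.0, 1.0, 4.0, 1.0])
gain = local_peak_filter(spectrum)
enhanced = gain * spectrum  # peaks kept, other bins attenuated
```

No F0 estimate or voiced/unvoiced decision is needed here, which is the property the abstract emphasizes. A practical system would presumably smooth the gain and restrict it to plausible harmonic spacings.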