Fig. 1.
Classification of noise types and the suitable noise reduction method. SI: spatial inverse; AF: acoustic focus; W/SS: Wiener filter and spectral subtraction; and AdvS: advanced single-microphone method.

Source publication
Article
Full-text available
A method of speech enhancement using microphone-array signal processing based on the subspace method is proposed and evaluated. The method consists of the following two stages corresponding to the different types of noise. In the first stage, less-directional ambient noise is reduced by eliminating the noise-dominant subspace. It is realized by wei...
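As a rough sketch of the first stage described in this abstract, the noise-dominant subspace can be identified from an eigendecomposition of the spatial covariance matrix and projected out. The function below is illustrative only: names and sizes are assumptions, and the abstract indicates a weighting-based realization, whereas this sketch uses a hard projection.

```python
import numpy as np

def subspace_denoise(X, n_signal):
    """X: (n_mics, n_frames) complex STFT snapshots at one frequency bin.
    n_signal: assumed dimension of the signal-dominant subspace."""
    R = X @ X.conj().T / X.shape[1]       # spatial covariance estimate
    w, V = np.linalg.eigh(R)              # eigenvalues in ascending order
    Vs = V[:, -n_signal:]                 # signal-dominant eigenvectors
    P = Vs @ Vs.conj().T                  # projector onto the signal subspace
    return P @ X                          # noise-dominant subspace removed
```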

Contexts in source publication

Context 1
... to improve the rate of recognition. Various kinds of speech enhancement/noise reduction techniques have been studied for improving the signal-to-noise ratio (S/N) at the input of ASR. However, since the types of noise vary greatly according to the environment, no single speech enhancement technique is able to cover the whole range of noise. Fig. 1 shows a rough classification of noise and the corresponding suitable speech enhancement methods. Speech enhancement techniques can be roughly divided into the multi-microphone approach and the single-microphone approach. The multi-microphone approach can further be divided into the spatial inverse type and the acoustic focus type. ...
Context 2
... in the experiment in this paper), conventional single-microphone methods such as the Wiener filter and spectral subtraction show comparable performance as long as the noise is stationary. In the single-microphone method, significant improvement has been made for nonstationary noise [4] (denoted as an advanced single-microphone method in Fig. 1). However, it is still difficult to cover all kinds of nonstationary noise. This is due to the fact that the single-microphone methods utilize a priori knowledge of the noise. The acoustic-focus-type method utilizes only the difference of the spatial characteristics of the signal and noise, and is effective for both stationary and nonstationary ...
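For reference, magnitude-domain spectral subtraction of the kind mentioned here can be sketched in a few lines; the oversubtraction factor and spectral floor below are common but assumed choices, not parameters from the paper.

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, beta=0.01):
    """Power-spectral subtraction with a spectral floor.
    noisy_mag: (n_bins, n_frames) magnitudes; noise_mag: (n_bins,) noise estimate."""
    sub = noisy_mag**2 - alpha * noise_mag[:, None]**2   # subtract noise power
    floor = beta * noisy_mag**2                          # floor avoids negative power
    return np.sqrt(np.maximum(sub, floor))               # enhanced magnitude
```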
Context 3
... room used in the experiment is a meeting room, the size of which is 8.3 m × 7.2 m × 3.2 m. The reverberation time was 0.42 s. Target source A, directional noise source B1/B2, and ambient noise source C were located as depicted in Fig. 10. To simulate ambient noise, source C was placed facing a corner of the room. Then, the impulse responses from these sources to the microphones were measured. The microphone input was generated by convolving these impulse responses with the source signals. In the impulse responses from the ambient noise source C, direct sound was ...
Context 4
... input was generated by convolving these impulse responses with the source signals. In the impulse responses from the ambient noise source C, direct sound was eliminated to generate diffused noise. The noise sources used were white noise and the noise of an elevator (low-frequency dominant). The spectrum of the elevator noise is shown in Fig. 11. The parameters of NSR and ASR are the same as those of the previous section. The following two cases were investigated: 1) [directional speech A + ambient noise C]; 2) [directional speech A + directional noise ...
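The simulation setup described in these excerpts (convolving measured impulse responses with the source signals and summing at each microphone) might be sketched as follows; the dictionary layout is an assumption for illustration.

```python
import numpy as np

def simulate_mic_inputs(src_signals, imp_responses):
    """src_signals: {source_name: 1-D source signal}.
    imp_responses: {(source_name, mic_index): measured impulse response}.
    To emulate diffuse noise as in the excerpt, zero the taps around the
    direct-path arrival in the ambient source's impulse responses first."""
    mics = sorted({m for (_, m) in imp_responses})
    out = {}
    for m in mics:
        parts = [np.convolve(sig, imp_responses[(s, m)])
                 for s, sig in src_signals.items()]
        n = max(len(p) for p in parts)
        out[m] = sum(np.pad(p, (0, n - len(p))) for p in parts)
    return out
```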
Context 5
... case 1) [A + C] was tested. Fig. 12(a) shows the results for white noise as the ambient noise source. "S/N w/o" corresponds to the case when ambient noise C does not exist. Even in this case, the recognition rate was around 70% for both DS-NSR and DS. This is due to the reverberation of the directional speech. The reason for there being no improvement for DS-NSR as ...
Context 6
... the case where elevator noise was employed as the ambient noise source, the recognition rate was much reduced compared with the case in which white noise was used. This is due to the fact that a large portion of noise energy was concentrated in the low frequencies as shown in Fig. 11. Therefore, in this case, a Wiener filter was further applied to the low frequency range of the output of the array processing to reduce the low frequency component of the noise. The setup of the Wiener filter was the same as that of Section V-C. The range for application of the Wiener filter was the lower 10 mel-frequency bands with a ...
Context 7
... output of the array processing to reduce the low frequency component of the noise. The setup of the Wiener filter was the same as that of Section V-C. The range for application of the Wiener filter was the lower 10 mel-frequency bands, with center frequencies from 0 to 1388 Hz. This range was determined so that the recognition score was the highest. Fig. 12(b) shows the results for the array signal processing combined with the Wiener filter. As shown by this figure, an improvement was found for DS-NSR as compared with ...
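A band-limited Wiener post-filter of the kind described here might look like the sketch below. It uses a plain frequency cutoff in the STFT domain instead of the paper's lower 10 mel-frequency bands, so it is an approximation, not the authors' exact setup.

```python
import numpy as np

def lowband_wiener(noisy_stft, noise_psd, freqs, f_max=1388.0):
    """Apply a Wiener gain only to bins below f_max (Hz); pass the rest through.
    noisy_stft: (n_bins, n_frames); noise_psd: (n_bins,); freqs: (n_bins,) in Hz."""
    noisy_psd = np.abs(noisy_stft)**2
    gain = np.maximum(1.0 - noise_psd[:, None] / np.maximum(noisy_psd, 1e-12), 0.0)
    gain[freqs > f_max, :] = 1.0          # leave the upper band untouched
    return gain * noisy_stft
```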
Context 8
... case 2) [A + B1] was investigated. Though ambient noise source C was not employed, the reverberation for A and B1 existed as natural ambient noise. Therefore, this is a case where directional noise and real ambient noise coexist. Fig. 13 shows the directivity pattern of the MV beamformer. Fig. 13(a) is the case when , while Fig. 13(b) is the case when . In 13(a), a deep valley appeared in the direction of B1, while an increase in the gain was found in the lower frequencies in directions other than A and B1. On the other hand, in 13(b), the valley in the direction of B1 is shallower, while the increase in the gain in the low frequency range is relatively small. ...
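The minimum-variance (MV) beamformer whose directivity patterns Fig. 13 shows is standard enough to sketch: given a spatial covariance matrix R and a steering vector a toward the target, the weights and the resulting pattern are computed as below (array shapes are assumptions).

```python
import numpy as np

def mv_weights(R, a):
    """Minimum-variance (MVDR) weights: w = R^{-1} a / (a^H R^{-1} a)."""
    Rinv_a = np.linalg.solve(R, a)
    return Rinv_a / (a.conj() @ Rinv_a)

def directivity_db(w, steering):
    """steering: (n_mics, n_angles) candidate steering vectors.
    Returns the beamformer gain |w^H a(theta)| in dB for each angle."""
    return 20 * np.log10(np.abs(w.conj() @ steering) + 1e-12)
```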
Context 9
... averaging is required, especially when the target signal coexists with the other directional interferences. This is mainly due to the cross terms of the target and the other directional components not being zero in [15]. These cross terms are theoretically zero if the target and the other directional components are mutually independent. Fig. 15 shows the response of the MV beamformer in the direction of the directional interference derived under the existence of the target (curve A). In comparison with the response derived without the target (curve B), the reduction in gain of curve A is much smaller. Curve C shows the response with the virtual correlation matrix, which ...
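The cross-term argument can be illustrated numerically: with two mutually independent sources, the cross term in the frame-averaged covariance shrinks roughly as 1/sqrt(n_frames), which is why sufficient averaging is needed. Everything below (array size, toy steering vectors) is assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mics = 4
a_t = np.exp(1j * rng.uniform(0, 2 * np.pi, n_mics))  # toy target steering vector
a_i = np.exp(1j * rng.uniform(0, 2 * np.pi, n_mics))  # toy interferer steering vector

for n_frames in (10, 100, 1000):
    s = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
    v = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
    X = np.outer(a_t, s) + np.outer(a_i, v)                    # two independent sources
    cross = np.outer(a_t, a_i.conj()) * np.mean(s * v.conj())  # cross term in R
    print(n_frames, np.linalg.norm(cross))                     # decays ~ 1/sqrt(n_frames)
```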

Similar publications

Article
Full-text available
In this paper, we propose a convolutive transfer function generalized sidelobe canceler (CTF-GSC), which is an adaptive beamformer designed for multichannel speech enhancement in reverberant environments. Using a complete system representation in the short-time Fourier transform (STFT) domain, we formulate a constrained minimization problem of tota...
Book
Full-text available
This work addresses this problem in the short-time Fourier transform (STFT) domain. We divide the general problem into five basic categories depending on the number of microphones being used and whether the interframe or interband correlation is considered. The first category deals with the single-channel problem where STFT coefficients at differen...
Conference Paper
Full-text available
This paper deals with the problem of microphone array speech enhancement using a hybrid Generalized Sidelobe Canceller (GSC), Near-Field Super-Directive (NFSD) beamformer, and post-filter. In this research, we employ a near field compensation block before the blocking matrix (of the GSC) to prevent signal leakage in the reference noise and a Linear...
Conference Paper
Full-text available
In speech communication systems the received microphone signals are often degraded by competing speakers, noise signals and room reverberation. Microphone arrays are commonly utilized to enhance the desired speech signal. In this paper two important design criteria, namely the minimum variance distortionless response (MVDR) and the linearly constra...

Citations

... Classic SE techniques such as spectral subtraction [1], Wiener filtering [2], minimum mean square error estimation [3], and subspace methods [4] have been widely used and have been shown to be effective in certain situations. However, they have limitations when dealing with non-stationary noise and low Signal-to-Noise Ratio (SNR) conditions. ...
Preprint
Full-text available
Speech enhancement (SE) is crucial for reliable communication devices or robust speech recognition systems. Although conventional artificial neural networks (ANN) have demonstrated remarkable performance in SE, they require significant computational power, along with high energy costs. In this paper, we propose a novel approach to SE using a spiking neural network (SNN) based on a U-Net architecture. SNNs are suitable for processing data with a temporal dimension, such as speech, and are known for their energy-efficient implementation on neuromorphic hardware. As such, SNNs are interesting candidates for real-time applications on devices with limited resources. The primary objective of the current work is to develop an SNN-based model with comparable performance to a state-of-the-art ANN model for SE. We train a deep SNN using surrogate-gradient-based optimization and evaluate its performance using perceptual objective tests under different signal-to-noise ratios and real-world noise conditions. Our results demonstrate that the proposed energy-efficient SNN model outperforms the Intel Neuromorphic Deep Noise Suppression Challenge (Intel N-DNS Challenge) baseline solution and achieves acceptable performance compared to an equivalent ANN model.
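For orientation, the leaky integrate-and-fire (LIF) dynamics that such SNN layers are typically built on can be sketched as below; this is a generic forward pass, not the paper's U-Net model, and all parameters are assumed.

```python
import numpy as np

def lif_forward(inp, tau=0.9, v_th=1.0):
    """Leaky integrate-and-fire neurons over time.
    inp: (T, n) input currents; returns binary spike trains of the same shape.
    Surrogate-gradient training replaces the threshold's zero/undefined
    derivative with a smooth surrogate during backpropagation."""
    v = np.zeros(inp.shape[1])
    spikes = np.zeros_like(inp)
    for t in range(inp.shape[0]):
        v = tau * v + inp[t]                       # leaky membrane integration
        spikes[t] = (v >= v_th).astype(float)      # fire when threshold is crossed
        v = v * (1.0 - spikes[t])                  # reset fired neurons
    return spikes
```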
... 6.8] and parameter estimation [30], [31]. It was also applied in speech enhancement, first in single-channel [35], [36] and later in array-based methods [37], [38]. In speech enhancement, noise components in the signal subspace are typically further reduced using a signal-dependent post-filter and several estimators have been proposed for that purpose [39]. ...
Article
Psychoacoustic experiments have shown that directional properties of the direct sound, salient reflections, and the late reverberation of an acoustic room response can have a distinct influence on the auditory perception of a given room. Spatial room impulse responses (SRIRs) capture those properties and thus are used for direction-dependent room acoustic analysis and virtual acoustic rendering. This work proposes a subspace method that decomposes SRIRs into a direct part, which comprises the direct sound and the salient reflections, and a residual, to facilitate enhanced analysis and rendering methods by providing individual access to these components. The proposed method is based on the generalized singular value decomposition and interprets the residual as noise that is to be separated from the other components of the reverberation. Large generalized singular values are attributed to the direct part, which is then obtained as a low-rank approximation of the SRIR. By advancing from the end of the SRIR toward the beginning while iteratively updating the residual estimate, the method adapts to spatio-temporal variations of the residual. The method is evaluated using a spatio-spectral error measure and simulated SRIRs of different rooms, microphone arrays, and ratios of direct sound to residual energy. The proposed method creates lower errors than existing approaches in all tested scenarios, including a scenario with two simultaneous reflections. A case study with measured SRIRs shows the applicability of the method under real-world acoustic conditions. A reference implementation is provided.
... The frequency smoothing is applied to decorrelate the sources and to increase the rank of the sources' cross-correlation matrix. Many speaker localization methods employ the CSS method to overcome problems due to reverberation [6]–[12]. These methods assume that frequency smoothing fully decorrelates the coherent reflections. ...
... Experimental studies demonstrated the improved decorrelation obtained by the proposed weights and the enhanced localization accuracy when combined with a DPD-test-based method for speaker localization. These results suggest that the proposed weights may be useful for other frequency-smoothing-based methods, including localization methods of coherent sources other than speech [17], [19], [23], spatial filtering [22], [35], [36], speech enhancement [12] and focusing frequency selection [21]. ...
Article
Full-text available
The coherent signal subspace method may be used in order to apply subspace localization methods (e.g., MUSIC) to coherent sources. This method involves a focusing process followed by frequency smoothing, which is intended to decorrelate source signals from coherent sources. In practice, however, only moderate decorrelation is obtained, which may lead to performance degradation. Although decorrelation can be improved by widening the smoothing bandwidth, a wider bandwidth may increase focusing error and the smoothing bandwidth is limited by the bandwidth of the actual signal. In this paper, a weighted frequency smoothing that improves decorrelation for a given bandwidth is proposed. It is shown that better decorrelation is obtained by selecting the weights to be inversely proportional to the source signal power at the given frequency. However, since the power of the source is not known, it is estimated by the trace of the array spatial covariance matrix. An experimental study is presented that investigates the effect of the proposed weighting on DOA estimation of speech sources in a reverberant environment.
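The weighting this abstract describes (weights inversely proportional to per-frequency source power, with power estimated by the trace of the spatial covariance matrix) admits a direct sketch; the array shapes are assumptions.

```python
import numpy as np

def weighted_frequency_smoothing(R_f):
    """R_f: (n_freqs, n_mics, n_mics) focused covariance matrices.
    Returns their weighted average, with weights ~ 1 / estimated source power."""
    p = np.real(np.trace(R_f, axis1=1, axis2=2))   # power proxy: trace per bin
    w = 1.0 / np.maximum(p, 1e-12)                 # inverse-power weights
    w /= w.sum()                                   # normalize
    return np.tensordot(w, R_f, axes=1)            # smooth across frequency
```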
... Further, the multi-band algorithm works on the fact that the noise does not affect the speech signal uniformly. In Asano et al. (2000), a speech enhancement algorithm based on the subspace approach has been proposed. The proposed algorithm in Asano et al. (2000) reduces the ambient noise by eliminating the noise-dominant eigenvalues. The basic principle of the subspace approach is that the clean signal might be confined to a subspace of the noisy Euclidean space. ...
Article
Full-text available
Speech enables easy human-to-human communication as well as human-to-machine interaction. However, the quality of speech degrades due to background noise in the environment, such as drone noise embedded in speech during search and rescue operations. Similarly, helicopter noise, airplane noise, and station noise reduce the quality of speech. Speech enhancement algorithms reduce background noise, resulting in a crystal clear and noise-free conversation. For many applications, it is also necessary to process these noisy speech signals at the edge node level. Thus, we propose an implicit Wiener filter-based algorithm for speech enhancement using an edge computing system. In the proposed algorithm, a first-order recursive equation is used to estimate the noise. The performance of the proposed algorithm is evaluated for two speech utterances, one uttered by a male speaker and the other by a female speaker. Both utterances are degraded by different types of non-stationary noises such as exhibition, station, drone, helicopter, airplane, and white Gaussian stationary noise with different signal-to-noise ratios. Further, we compare the performance of the proposed speech enhancement algorithm with the conventional spectral subtraction algorithm. Performance evaluations using objective speech quality measures demonstrate that the proposed speech enhancement algorithm outperforms the spectral subtraction algorithm in estimating the clean speech from the noisy speech. Finally, we implement the proposed speech enhancement algorithm, in addition to the spectral subtraction algorithm, on the Raspberry Pi 4 Model B, which is a low-power edge computing device.
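The abstract's first-order recursive noise estimate is presumably of the standard exponential-smoothing form sketched below; the smoothing constant and the per-frame update are assumptions, since the paper's exact recursion is not quoted here.

```python
import numpy as np

def recursive_noise_psd(noisy_psd, alpha=0.98):
    """Track a slowly varying noise floor with a first-order recursion.
    noisy_psd: (n_bins, n_frames) power spectra of the noisy speech."""
    est = noisy_psd[:, 0].copy()                   # initialize from the first frame
    out = np.empty_like(noisy_psd)
    for t in range(noisy_psd.shape[1]):
        est = alpha * est + (1.0 - alpha) * noisy_psd[:, t]
        out[:, t] = est
    return out
```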
... Source number estimation is typically performed through the analysis of the eigenvalues of the spatial covariance matrix of the array or ambisonic signals [10]. The identification of dominant eigenvalues and the respective eigenvectors then permits the segregation of the signal and noise subspaces, which are a common input for DoA estimation [11], [12] and spatial filtering methods [13], [14]. ...
... This subspace processing paradigm also gives rise to an alternative primary-ambience decomposition approach, which does not require DoA estimation. In the simplest case, the noise subspace eigenvalues and eigenvectors can be made to re-assemble a spatial covariance matrix corresponding to the scene ambience [14]. The ambient signals may then be estimated using a MWF. ...
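Re-assembling an ambience covariance from the noise-subspace eigenpairs and feeding it to a multi-channel Wiener filter (MWF) might look like the following sketch; the single-dominant-source assumption and the MWF form W = R_x^{-1} R_a are illustrative simplifications, not the cited paper's exact formulation.

```python
import numpy as np

def ambience_covariance(R):
    """Rebuild a covariance for the ambience from the noise subspace,
    assuming one dominant (primary) source."""
    w, V = np.linalg.eigh(R)                  # eigenvalues in ascending order
    Vn, wn = V[:, :-1], w[:-1]                # drop the largest (primary) pair
    return (Vn * wn) @ Vn.conj().T            # sum_i wn_i v_i v_i^H

def mwf_ambience(R_x, R_a):
    """MWF estimating the ambience: W = R_x^{-1} R_a; apply as W^H x per frame."""
    return np.linalg.solve(R_x, R_a)
```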
Conference Paper
Full-text available
Spatial audio coding and reproduction methods are often based on the estimation of primary directional and secondary ambience components. This paper details a study into the estimation and subsequent reproduction of the ambient components found in ambisonic sound scenes. More specifically, two different ambience estimation approaches are investigated. The first estimates the ambient Ambisonic signals through a source-separation and spatial subtraction approach, and therefore requires an estimate of both the number of sources and their directions. The second instead requires only the number of sources to be known, and employs a multi-channel Wiener filter (MWF) to obtain the estimated ambient signals. One approach for reproducing estimated ambient signals is through a signal processing chain of: a plane-wave decomposition, signal decorrelation, and subsequent spatialisation for the target playback setup. However, this reproduction approach may be sensitive to spatial and signal fidelity degradations incurred during the beamforming and decorrelation operations. Therefore, an optimal mixing alternative is proposed for this reproduction task, which achieves spatially incoherent rendering of ambience directly for the target playback setup, bypassing intermediate plane-wave decomposition and excessive decorrelation. Listening tests indicate improved perceived quality when using the proposed reproduction method in conjunction with both tested ambience estimation approaches.
... There are several preprocessing techniques performed before DOA estimation, such as speech enhancement based on the subspace method [34], blind source separation [35,36], sub-band-based clustering [37], and the adaptive directional time-frequency distributions (ADTFD) method [38]. ...
... The technique introduced by Asano et al. [34] comprised two stages corresponding to the different types of noise. In the first stage, ambient noise, which was less directional, was reduced by eliminating the noise-dominant subspaces. ...
Article
Full-text available
Direction of arrival (DOA) is one of the essential topics in array signal processing that has many applications in communications, smart antennas, seismology, acoustics, radars, and many more. As the applications of DOA estimation are broadened, the challenges in implementing a DOA algorithm arise. Different environments require different modifications to the existing methods. This paper reviews the DOA algorithms in the literature. It evaluates and compares the performance of three well-known algorithms, including MUSIC, ESPRIT, and Eigenvalue Decomposition (EVD), with and without using adaptive directional time–frequency distributions (ADTFD) at the preprocessing stage. We simulated a case with four sources and three receivers. The sources were well separated. Signals were received at each sensor with SNR values of −5 dB, 0 dB, 5 dB, and 10 dB. The angles of the sources were 15, 30, 45, and 60 degrees. The simulation results show that the ADTFD algorithm significantly improved the performance of MUSIC, while it did not provide similar results for the ESPRIT and EVD methods. As expected, the computation time of the algorithms was increased by implementing the ADTFD algorithm as a preprocessing step.
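As a point of reference for the MUSIC algorithm this review evaluates, a toy far-field pseudospectrum for a linear array can be computed as below; the array geometry and signal model are assumptions, not the paper's simulation setup.

```python
import numpy as np

def music_spectrum(R, n_src, angles, mic_pos, freq, c=343.0):
    """R: (M, M) spatial covariance; angles in radians; mic_pos: (M,) in metres.
    Returns the MUSIC pseudospectrum over the candidate angles."""
    _, V = np.linalg.eigh(R)
    En = V[:, :-n_src]                        # noise-subspace eigenvectors
    k = 2 * np.pi * freq / c                  # wavenumber
    spec = []
    for th in angles:
        a = np.exp(-1j * k * mic_pos * np.sin(th))        # steering vector
        spec.append(1.0 / np.real(a.conj() @ En @ En.conj().T @ a))
    return np.array(spec)
```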
... This work was extended in [20] to cope with coloured noise by using a generalized eigenvalue decomposition (GEVD) to jointly diagonalize the speech and noise covariance matrices. Subspace-based methods have also been extended to multi-channel systems [21], [22] but they do not fully exploit spatial information to minimize speech distortion [7], as will be seen in Section VI. ...
... A large number of contributions are based on the assumption of narrowband signal models, e.g., [19]- [22]. Single-channel approaches exploit temporal correlations while multi-channel approaches capture spatio-temporal correlations. ...
Article
Full-text available
Speech enhancement is important for applications such as telecommunications, hearing aids, automatic speech recognition and voice-controlled systems. Enhancement algorithms aim to reduce interfering noise and reverberation while minimizing any speech distortion. In this work for speech enhancement, we propose to use polynomial matrices to model the spatial, spectral and temporal correlations between the speech signals received by a microphone array and polynomial matrix eigenvalue decomposition (PEVD) to decorrelate in space, time and frequency simultaneously. We then propose a blind and unsupervised PEVD-based speech enhancement algorithm. Simulations and informal listening examples involving diverse reverberant and noisy environments have shown that our method can jointly suppress noise and reverberation, thereby achieving speech enhancement without introducing processing artefacts into the enhanced signal. Listening examples and code are available: https://vwn09.github.io/pevd-enhance/
... This method was chosen for comparison because it is also based on the DPD approach and can be applied to arbitrary arrays. However, [14] is strictly used for DoA estimation, while the proposed method can be used to compute smoothed cross-spectrum matrices for other applications, for example, speech enhancement [16], blind source separation [17], and beamforming [18]. The proposed method demonstrated performance comparable with the DPD test proposed in [14]. ...
... This result verifies that the focusing and smoothing technique works well. However, the LSDD-DPD test [14] is strictly used for DoA estimation while the proposed method can be used to compute smoothed cross-spectrum matrices for other applications, such as speech enhancement [16], blind source separation [17], and beamforming [18]. Moreover, the measure proposed in [14] is based on a steered beam response; thus, it may have resolution limitations, especially when applied to arrays with a small number of microphones, such as a binaural array. ...
Article
Full-text available
The coherent signal subspace method (CSSM) enables the direction-of-arrival (DoA) estimation of coherent sources with subspace localization methods. The focusing process that aligns the signal subspaces within a frequency band to its central frequency is central to the CSSM. Within current focusing approaches, a direction-independent focusing approach may be more suitable for reverberant environments since no initial estimation of the sources' DoAs is required. However, these methods use integrals over the steering function, and cannot be directly applied to arrays around complex scattering structures, such as robot heads. In this paper, current direction-independent focusing methods are extended to arrays for which the steering function is available only for selected directions, typically in a numerical form. Spherical harmonics decomposition of the steering function is then employed to formulate several aspects of the focusing error. A case of two coherent sources is studied and guidelines for the selection of the frequency smoothing bandwidth are suggested. The performance of the proposed methods is then investigated for an array that is mounted on a robot head. The focusing process is integrated within the direct-path dominance (DPD) test method for speaker localization, originally designed for spherical arrays, extending its application to arrays with arbitrary configurations. Finally, experiments with real data verify the feasibility of the proposed method to successfully estimate the DoAs of multiple speakers under real-world conditions.
... Like the multi-channel Wiener filter (MWF), this approach does not fully exploit spatial information to minimize speech distortion [11]. A different approach was adopted in [12,13], in which KLT is applied to the spatial covariance matrix between different microphones for different frequency bins. This approach, however, processes frequency subbands independently and, therefore, neglects correlations between bands and phase continuities across band boundaries. ...
Conference Paper
Full-text available
The enhancement of noisy speech is important for applications involving human-to-human interactions, such as telecommunications and hearing aids, as well as human-to-machine interactions, such as voice-controlled systems and robot audition. In this work, we focus on reverberant environments. It is shown that, by exploiting the lack of correlation between speech and the late reflections, further noise reduction can be achieved. This is verified using simulations involving actual acoustic impulse responses and noise from the ACE corpus. The simulations show that even without using a noise estimator, our proposed method simultaneously achieves noise reduction, and enhancement of speech quality and intelligibility, in reverberant environments over a wide range of SNRs. Furthermore, informal listening examples highlight that our approach does not introduce any significant processing artefacts such as musical noise.