Figure 3. Directivity Index vs. frequency.

Source publication
Conference Paper
Full-text available
In this paper we describe a novel algorithm for postprocessing a microphone array’s beamformer output to achieve better spatial filtering under noise and reverberation. For each audio frame and frequency bin the algorithm estimates the spatial probability for sound source presence and applies a spatio-temporal filter towards the look-up direction....
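The abstract's core idea, a per-bin gain driven by the probability that the sound arrives from the look-up direction, can be illustrated with a minimal sketch. The Gaussian spatial model, the function name, and all parameter values below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def spatial_postfilter_gain(doa_est, look_dir, sigma=0.2, floor=0.1):
    """Per-frequency-bin gain from the probability that the sound
    arrives from the look-up direction.

    doa_est : estimated DOAs (radians), one per frequency bin
    look_dir: look-up direction of the beamformer (radians)
    sigma   : assumed angular spread of the desired source (radians)
    floor   : minimum gain, limits musical noise from aggressive masking
    """
    # Gaussian model for the probability that each bin's sound
    # originates from the look-up direction (illustrative assumption).
    diff = np.angle(np.exp(1j * (doa_est - look_dir)))  # wrap to [-pi, pi]
    prob = np.exp(-0.5 * (diff / sigma) ** 2)
    return np.maximum(prob, floor)

# Applying the gain to one frame of beamformer output Y (complex spectrum):
# Y_enhanced = spatial_postfilter_gain(doa_est, look_dir) * Y
```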

Context in source publication

Context 1
... results were used to compute the directivity pattern and directivity index in eight logarithmically spaced frequency subbands. Figure 2 shows the measured directivity patterns for the band with center frequency of 1000 Hz; Figure 3 shows the directivity index as a function of frequency. The improvement of the directivity index in the 500-3000 Hz band is 3-8 dB. ...
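For reference, the directivity index in each subband can be computed from a measured pattern roughly as follows. This sketch assumes a horizontal-plane pattern sampled uniformly over azimuth (a 2D simplification; the paper's measurement procedure may integrate over the full sphere):

```python
import numpy as np

def directivity_index_db(pattern_db, look_idx=0):
    """Directivity index from a horizontal-plane pattern sampled
    uniformly over azimuth: ratio of power in the look direction
    to the spatially averaged power, in dB."""
    p = 10.0 ** (np.asarray(pattern_db) / 10.0)  # dB -> power
    return 10.0 * np.log10(p[look_idx] / np.mean(p))

# Example: a cardioid pattern sampled every 5 degrees gives ~4.3 dB (2D).
az = np.deg2rad(np.arange(0, 360, 5))
cardioid = 0.5 * (1 + np.cos(az))
print(directivity_index_db(10 * np.log10(np.maximum(cardioid**2, 1e-12))))
```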

Similar publications

Article
Full-text available
In many daily life communication situations, several sound sources are simultaneously active. While normal-hearing listeners can easily distinguish the target sound source from interfering sound sources (as long as target and interferers are spatially or spectrally separated) and concentrate on the target, hearing-impaired listeners and cochlear impl...
Conference Paper
Full-text available
A randomly positioned microphone array is considered in this work. In many applications, the locations of the array elements are known up to a certain degree of random mismatch. We derive a novel statistical model for performance analysis of the multi-channel Wiener filter (MWF) beamformer under random mismatch in sensor locations. We consider the...

Citations

... Another effective method is based on a coherence algorithm that calculates the correlation of two input signals to estimate a filter that attenuates the interference components [36,50]. In addition, post-filtering [42] is commonly used for further noise reduction. Usually, when the speech covariance matrices and the direction of arrival can be accurately estimated, the above methods are capable of achieving good speech enhancement performance. ...
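A hedged sketch of the kind of two-channel coherence algorithm the snippet refers to: the magnitude-squared coherence stays near one for a coherent (desired) source and drops for diffuse interference, so it can serve directly as a per-bin attenuation. The recursive smoothing and all parameter values are illustrative assumptions:

```python
import numpy as np

def msc_gain(X1, X2, alpha=0.9, eps=1e-12):
    """Frame-recursive magnitude-squared coherence (MSC) between two
    channels, used as a crude per-bin gain.
    X1, X2: (frames, bins) complex STFTs of the two microphones."""
    S11 = S22 = S12 = np.zeros(X1.shape[1], dtype=complex)
    gains = np.empty(X1.shape, dtype=float)
    for t in range(X1.shape[0]):
        # Recursively smoothed auto- and cross-power spectra.
        S11 = alpha * S11 + (1 - alpha) * X1[t] * np.conj(X1[t])
        S22 = alpha * S22 + (1 - alpha) * X2[t] * np.conj(X2[t])
        S12 = alpha * S12 + (1 - alpha) * X1[t] * np.conj(X2[t])
        gains[t] = np.abs(S12) ** 2 / (np.abs(S11) * np.abs(S22) + eps)
    return gains
```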
Article
Full-text available
Although deep learning-based methods have greatly advanced speech enhancement, their performance degrades severely under non-Gaussian noises. To combat this problem, a correntropy-based multi-objective multi-channel speech enhancement method is proposed. First, the log-power spectra (LPS) of multi-channel noisy speech are fed to a bidirectional long short-term memory network with the aim of predicting the intermediate log ideal ratio mask (LIRM) and the LPS of clean speech in each channel. Then, the intermediate LPS and LIRM features obtained from each channel are separately integrated into a single-channel LPS and a single-channel LIRM by fusion layers. Next, the two single-channel features are further fused into a single-channel LPS and finally fed to a deep neural network to predict the LPS of clean speech. During training, a multi-loss function is constructed based on correntropy with the aim of reducing the impact of outliers and improving the performance of the overall network. Experimental results show that the proposed method achieves significant improvements in suppressing non-Gaussian noises and reverberation and has good robustness to different noises, signal-to-noise ratios and source-array distances.
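The correntropy-based loss at the heart of this training objective can be sketched as follows; the Gaussian kernel form is standard for correntropy, but the kernel width and the exact multi-loss weighting used in the paper are not reproduced here:

```python
import numpy as np

def correntropy_loss(pred, target, sigma=1.0):
    """Negative correntropy between prediction and target with a
    Gaussian kernel. Unlike MSE, large errors (outliers, impulsive
    non-Gaussian noise) saturate the kernel and contribute little,
    which is the robustness property being exploited.
    sigma is an assumed hyperparameter (kernel width)."""
    err = pred - target
    kernel = np.exp(-err**2 / (2.0 * sigma**2))
    return 1.0 - np.mean(kernel)  # 0 when pred == target
```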
... In [TSA05; TA06] the difference between the estimated IDOA and the expected DOA of the target signal is used to compute a gain function. In [STA07], a machine learning approach is employed to track the distribution of signal and noise IDOA features using a Gaussian model. ...
Thesis
Full-text available
One of the challenges for far-field speech communication and recognition applications is that the acquired speech signal is impacted by reverberation and noise. It is therefore often required to apply signal processing techniques for dereverberation and noise reduction. Particularly effective are techniques which exploit spatial information about the sound field from multichannel microphone signals. One approach for modeling the spatial characteristics of reverberation and noise is to use spatial coherence functions. These depend only on acoustic properties which are relatively similar between different rooms, and require a minimum of assumptions about the acoustic scenario, which provides the motivation for focusing this thesis on signal enhancement approaches exploiting spatial coherence models. As a foundation, the applicability of different spatial coherence models to reverberation, and their dependency on acoustic properties of the room, are investigated. Existing methods for signal enhancement are reviewed, with a focus on spectral enhancement methods which use a short-time coherence estimate to infer the power ratio between desired coherent and undesired diffuse sound field components. Known spectral enhancement methods are expressed in this framework, and novel estimators are proposed which have both theoretical and practical advantages over existing methods. Based on these estimators, an effective dereverberation system is proposed which can operate without knowledge of the position of the desired source, solely by exploiting the characteristic spatial coherence of reverberation. Furthermore, a more experimental dereverberation system is proposed which additionally accounts for the effect of early signal reflections in the room, showing that this approach can provide promising directions for future research. Finally, the problem of how to effectively use spatial information in an automatic speech recognizer based on a deep neural network acoustic model is investigated. A novel way of exploiting spatial information for reverberation-robust automatic speech recognition is proposed, where a spatial feature vector is extracted from short-time coherence estimates and then supplied as input to the neural network. It is shown that this approach can exceed the improvements that are obtained by the application of signal enhancement methods for dereverberation.
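One building block of such coherence-based spectral enhancement can be sketched concretely: estimating the coherent-to-diffuse power ratio (CDR) from a short-time coherence estimate between two microphones, under a spherically diffuse noise model. This is one of several estimator forms discussed in this line of work, shown here under assumed parameter names, not the thesis's specific estimator:

```python
import numpy as np

def cdr_direct(Gx, f, d, tau, c=343.0):
    """Direct coherent-to-diffuse ratio estimate from a short-time
    coherence estimate Gx at frequency f (Hz), microphone spacing
    d (m), and a known target TDOA tau (s). Uses the mixing model
    Gx = (CDR * Gs + Gdiff) / (CDR + 1) and solves for CDR."""
    Gdiff = np.sinc(2.0 * f * d / c)           # spherically diffuse field
    Gs = np.exp(1j * 2.0 * np.pi * f * tau)    # coherence of a plane wave
    cdr = (Gdiff - Gx) / (Gx - Gs + 1e-12)
    return np.maximum(np.real(cdr), 0.0)       # keep physically valid values
```

A Wiener-like gain such as CDR/(CDR + 1) can then be applied per bin for dereverberation.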
... In addition, it cannot be applied when the sources are spatially close to each other. Conventional postfiltering techniques, which are mainly based on signal statistics and conventional single-channel speech enhancement [5], [1] or spatial filters computed using phase information [6], [7], [8], [1], usually cannot achieve high-quality noise reduction in reverberant multi-source environments. ...
... Narrowband DOAs have been previously used for desired speech detection in the literature. For instance, in [22], the authors use narrowband DOAs to control the a priori desired speech presence probability (DSPP) in a Gaussian signal model, while in [23] a Gaussian DOA model is used to compute a DSPP and apply it as a single-channel gain to the output of a spatial filter. We propose a different statistical model for the narrowband DOA estimates which is used for desired signal detection and estimation of the propagation vectors and the PSD matrices in an ISF framework. ...
... Therefore, we first estimate the desired signal RTF vector ĝ_1(t, k) and compute an MVDR filter using (31). The complete source extraction framework is summarized in Fig. 2. In addition, the DSPP p(H_s | θ_{t,k}) can be applied as a multiplicative factor to the output of the MVDR filter, which has a similar role as the single-channel DOA-based gain in [23] and the DOA-based TF mask common in source separation [36]. Applying the DSPP as a multiplicative factor provides additional undesired signal reduction; however, when inaccurately estimated, it causes audible distortion to the desired signal. ...
... In this experiment, the proposed system D_dm and the output of the DSB are multiplied by the a posteriori DSPP. A system where a DOA-based DSPP is applied at the output of a fixed spatial filter is proposed in [23], and the goal of the current experiment is to confirm that the benefit of the DSPP is even larger when it is used in combination with a data-dependent, informed spatial filter rather than a fixed spatial filter. The experiment is repeated with the two DOA estimators discussed in Section 3. ...
Article
Full-text available
A desired speech signal in hands-free communication systems is often degraded by noise and interfering speech. Even though the number and locations of the interferers are often unknown in practice, it is justified in certain applications to assume that the direction-of-arrival (DOA) of the desired source is approximately known. Using the known DOA, fixed spatial filters such as the delay-and-sum beamformer can be steered to extract the desired source. However, it is well known that fixed, data-independent spatial filters do not provide sufficient reduction of directional interferers. Instead, the DOA information can be used to estimate the statistics of the desired and the undesired signals and to compute optimal data-dependent spatial filters. One way the DOA is exploited for optimal spatial filtering in the literature is by designing DOA-based narrowband detectors to determine whether a desired or an undesired signal is dominant at each time-frequency (TF) bin. Subsequently, the statistics of the desired and the undesired signals can be estimated during the TF bins where the respective signal is dominant. In a similar manner, a Gaussian signal model-based detector which does not incorporate DOA information has been used in scenarios where the undesired signal consists of stationary background noise. However, when the undesired signal is non-stationary, resulting for example from interfering speakers, such a Gaussian signal model-based detector is unable to robustly distinguish desired from undesired speech. To this end, we propose a DOA model-based detector to determine the dominant source at each TF bin and estimate the desired and undesired signal statistics. We demonstrate that data-dependent spatial filters that use the statistics estimated by the proposed framework achieve very good undesired signal reduction, even when using only three microphones.
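The combination described above, a data-dependent spatial filter plus a DSPP applied as a single-channel gain, can be sketched as follows. The MVDR formula is standard; the regularization, function names, and the scalar DSPP interface are illustrative assumptions:

```python
import numpy as np

def mvdr_weights(Phi_u, g, eps=1e-6):
    """MVDR filter from the undesired-signal PSD matrix Phi_u (M x M)
    and the desired-source RTF vector g (M,): w = Phi^-1 g / (g^H Phi^-1 g)."""
    M = len(g)
    reg = eps * np.trace(Phi_u).real / M * np.eye(M)  # diagonal loading
    num = np.linalg.inv(Phi_u + reg) @ g
    return num / (np.conj(g) @ num)

def masked_mvdr_output(X, Phi_u, g, dspp):
    """Apply MVDR, then the desired speech presence probability (DSPP)
    as a single-channel multiplicative gain, as in the cited approach.
    X: (M,) multichannel STFT bin; dspp: scalar probability for this bin."""
    w = mvdr_weights(Phi_u, g)
    return dspp * (np.conj(w) @ X)
```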
... The output of the localizer is also used by post-filters to additionally suppress unwanted sound sources [1], [2]. The idea of spatial noise suppression was proposed in [3] and further developed in [4], where a de facto sound source localizer is used per frequency bin to suppress sound sources coming from undesired directions in each bin separately. The probability of the sound source coming from a given direction is estimated based on phase differences only. ...
Conference Paper
Full-text available
The sound source localizer is an important part of any microphone array processing block. Its major purpose is to determine the direction of arrival of the sound source and let a beamformer aim its beam towards this direction. In addition, the direction of arrival can be used for meeting diarization, pointing a camera, and sound source separation. Multiple algorithms and approaches exist, targeting different settings and microphone arrays. In this paper we treat the sound source localizer as a classifier and use as features the phase differences and magnitude proportions in the microphone channels. To determine the proper mix, we propose a novel cost function to measure the localization capability. The resulting algorithm is fast and suitable for real-time implementations. It works well with different microphone array geometries with both omnidirectional and unidirectional microphones.
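A minimal sketch of the steered-response-power style of localization this line of work builds on: evaluate candidate azimuths and pick the one maximizing output power. This uses phase information only (the paper's classifier additionally mixes in magnitude proportions), and the steering-phase sign convention depends on the STFT definition:

```python
import numpy as np

def srp_localize(X, mic_pos, freqs, angles, c=343.0):
    """Steer the array to each candidate azimuth, sum the aligned
    channels, and return the angle with maximum output power.
    X: (M, bins) one STFT frame; mic_pos: (M, 2) positions in meters;
    freqs: (bins,) in Hz; angles: candidate azimuths in radians."""
    powers = []
    for az in angles:
        u = np.array([np.cos(az), np.sin(az)])               # plane-wave direction
        delays = mic_pos @ u / c                              # per-mic delays (s)
        steer = np.exp(2j * np.pi * np.outer(delays, freqs))  # (M, bins)
        y = np.sum(X * steer, axis=0)                         # delay-and-sum per bin
        powers.append(np.sum(np.abs(y) ** 2))
    return angles[int(np.argmax(powers))]
```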
... is the captured signal at the m-th microphone, which takes into account the signal originating from the source, the decay due to distance, and the microphone directivity. To achieve a higher degree of directivity, a real-time post-processor is used, which acts as a spatio-temporal filter that evaluates, for each audio frame, the presence of the source in terms of spatial probability [38]. Manufacturing tolerances of the elements inside the microphones and preamplifiers can cause slight variations among the receiving channels; in addition, the dependency on external factors (atmospheric pressure, temperature, etc.) needs to be considered. ...
Research
Full-text available
The Microsoft Kinect motion sensor can capture information of different kinds and has given rise to several applications in the field of human-computer interaction. To achieve natural interaction, we propose an application that detects and follows the user's movements and simultaneously tracks his voice when more than one person is in front of the sensor. Using gesture recognition to identify the active speaker, the microphone array is virtually pointed at him through a beamforming technique to follow his movements. Consequently, the user can speak without needing to be close to the sensor. Excellent results are obtained regarding accuracy and execution speed.
... The beamformer output is typically enhanced with a (single-channel) post-filter, and post-filters for different types of noise have been designed by Zelinski (1988), Simmer et al. (2001), McCowan and Bourlard (2003), and Lefkimmiatis and Maragos (2007). The spatial post-filter can also suppress point-wise noise sources (Tashev and Acero, 2006). Seltzer et al. (2007) presented post-filtering using phase differences and spectral observations for a linear array. ...
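The Zelinski-style post-filter mentioned above admits a compact sketch: assuming a desired signal coherent across channels and incoherent noise, averaged cross-spectra estimate the signal PSD and averaged auto-spectra the signal-plus-noise PSD, yielding a Wiener-like gain. In practice the spectra are smoothed over time; this single-frame version is illustrative:

```python
import numpy as np

def zelinski_postfilter(X, eps=1e-12):
    """Zelinski-style post-filter gain for one STFT frame.
    X: (M, bins) time-aligned multichannel spectra."""
    M = X.shape[0]
    cross = 0.0
    # Average the real parts of all pairwise cross-spectra (signal PSD).
    for i in range(M):
        for j in range(i + 1, M):
            cross = cross + np.real(X[i] * np.conj(X[j]))
    cross *= 2.0 / (M * (M - 1))
    # Average auto-spectra (signal-plus-noise PSD).
    auto = np.mean(np.abs(X) ** 2, axis=0)
    return np.clip(cross / (auto + eps), 0.0, 1.0)
```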
Article
Speech separation algorithms face the difficult task of producing a high degree of separation without introducing unwanted artifacts. The time-frequency (T-F) masking technique applies a real-valued (or binary) mask on top of the signal's spectrum to filter out unwanted components. The practical difficulty lies in the mask estimation. Often, using efficient masks engineered for separation performance leads to the presence of unwanted musical noise artifacts in the separated signal. This lowers the perceptual quality and intelligibility of the output. Microphone arrays have long been studied for processing of distant speech. This work uses a feed-forward neural network for mapping a microphone array's spatial features into a T-F mask. A Wiener filter is used as the desired mask for training the neural network using speech examples in a simulated setting. The T-F masks predicted by the neural network are combined to obtain an enhanced separation mask that exploits the information regarding interference between all sources. The final mask is applied to the delay-and-sum beamformer (DSB) output. The algorithm's objective separation capability, in conjunction with the separated speech intelligibility, is tested with recorded speech from distant talkers in two rooms from two distances. The results show improvement in an instrumental measure of intelligibility and in frequency-weighted SNR over a complex-valued non-negative matrix factorization (CNMF) source separation approach, spatial sound source separation, and conventional beamforming methods such as the DSB and minimum variance distortionless response (MVDR) beamformer.
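The oracle training target described in this abstract, a Wiener mask computed from the simulated clean and interfering signals, can be sketched as:

```python
import numpy as np

def wiener_mask(S, N, eps=1e-12):
    """Oracle Wiener mask used as the training target for the
    mask-predicting network: speech power over speech-plus-noise
    power per time-frequency bin. S, N: complex STFTs of the clean
    source and the interference in the simulated training data."""
    Ps = np.abs(S) ** 2
    Pn = np.abs(N) ** 2
    return Ps / (Ps + Pn + eps)

# At inference, the predicted mask is applied to the DSB output:
# Y_enhanced = predicted_mask * Y_dsb
```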
... A spatial post-filter can also suppress point-wise noise sources. Tashev et al. derived the instantaneous DOA (IDOA) filter in [15], in which phase-difference measurements form a likelihood function for post-filter estimation. Seltzer et al. [16] proposed a statistical generative model to estimate speech and noise parameters as Gaussian random variables, with application to post-filtering using phase-difference and spectral observations for a four-microphone linear array. ...
... As in [10], the phase-based features are dependent on the angle of the source. While spatial filtering achieves impressive suppression of noise, as evident in [15], it can also produce unwanted artifacts that lead to lower perceptual quality than that of the simple DSB. Therefore, it is important to investigate the noise suppression capability of spatial filtering in conjunction with perceptual quality. ...
... Following the azimuth-angle IDOA filter definition of [15], and omitting the time index for brevity, the expression of the IDOA for a DOA vector k is ...
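Although the snippet's equation is truncated, the general shape of the IDOA feature is straightforward to sketch: per frequency bin, the vector of phase differences between a reference channel and the remaining channels, which is then compared against the differences expected for a candidate DOA. A hedged sketch, not the exact definition from [15]:

```python
import numpy as np

def idoa_features(X):
    """Instantaneous DOA (IDOA) feature per frequency bin: phase
    differences between the first channel and every other channel,
    forming an (M-1)-dimensional point whose distance to the
    differences expected for a candidate DOA drives the post-filter
    gain. X: (M, bins) one multichannel STFT frame."""
    return np.angle(X[1:] * np.conj(X[0]))  # (M-1, bins), in (-pi, pi]
```

Comparing this feature against the expected phase differences for the look-up direction yields a per-bin gain, in the spirit of the sketch given after the source abstract above.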
Conference Paper
A high level of noise reduces the perceptual quality and intelligibility of speech. Therefore, enhancing the captured speech signal is important in everyday applications such as telephony and teleconferencing. Microphone arrays are typically placed at a distance from a speaker and require processing to enhance the captured signal. Beamforming provides directional gain towards the source of interest and attenuation of interference. It is often followed by a single-channel post-filter to further enhance the signal. Non-linear spatial post-filters are capable of providing high noise suppression but can produce unwanted musical noise that lowers the perceptual quality of the output. This work proposes an artificial neural network (ANN) to learn the structure of naturally occurring post-filters that enhance speech degraded by interfering noise. The ANN uses phase-based features obtained from a multichannel array as input. Simulations are used to train the ANN in a supervised manner. The performance is measured with objective scores on speech recorded in an office environment. The post-filters predicted by the ANN are found to improve the perceptual quality over delay-and-sum beamforming while maintaining the high noise suppression characteristic of spatial post-filters.
... We examine how we can encode small amounts of information in these chirps to distinguish different phones. To calculate the angle of arrival, we build microphone arrays that can be used in conjunction with SSL (sound source localization) algorithms [17] that we have customized for our use. ...
... Instead, Steered Response Power (SRP) algorithms are often used. Those are based on evaluating multiple angles and picking the one that maximizes certain criteria (power of the signal from the sound source [18], spatial probability [17], eigenvalues [14], etc.). In Daredevil, we use a modified and improved version of the algorithm described in [17]. On every audio processing frame we run a Voice Activity Detector (VAD), like the one described in [15] but modified for the type of audio we generate, and engage the sound source localizer only if there is a signal (real signal or interfering signal) present. ...
Article
Full-text available
A variety of techniques have been used by prior work on the problem of smartphone location. In this paper, we propose a novel approach using sound source localization (SSL) with microphone arrays to determine where in a room a smartphone is located. In our system called Daredevil, smartphones emit sound at particular times and frequencies, which are received by microphone arrays. Using SSL that we modified for our purposes, we can calculate the angle between the center of each microphone array and the phone, and thereby triangulate the phone's position. In this early work, we demonstrate the feasibility of our approach and present initial results. Daredevil can locate smartphones in a room with an average precision of 3.19 feet. We identify a number of challenges in realizing the system in large deployments, and we hope this work will benefit researchers who pursue such techniques.
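The triangulation step, intersecting the bearing lines from two arrays, can be sketched as below. This is an illustrative reconstruction under assumed names, not Daredevil's actual implementation:

```python
import numpy as np

def triangulate(p1, az1, p2, az2):
    """Intersect two bearing lines: array centers p1, p2 (x, y) and
    measured azimuths az1, az2 (radians) toward the phone. Solves
    p1 + t1*d1 = p2 + t2*d2 for the crossing point."""
    d1 = np.array([np.cos(az1), np.sin(az1)])
    d2 = np.array([np.cos(az2), np.sin(az2)])
    A = np.column_stack([d1, -d2])          # t1*d1 - t2*d2 = p2 - p1
    t = np.linalg.solve(A, np.asarray(p2) - np.asarray(p1))
    return np.asarray(p1) + t[0] * d1

# Example: two arrays 10 ft apart, bearings 45 and 135 degrees.
print(triangulate((0, 0), np.deg2rad(45), (10, 0), np.deg2rad(135)))
```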