Fig 4 - uploaded by Franz Pernkopf
Measured directivity pattern for 2 Audix-TM1 microphones placed d = 5 cm apart. 

Source publication
Conference Paper
Full-text available
In this paper, we present a multi-channel Directional-to-Diffuse Postfilter (DD-PF), relying on the assumption of a directional speech signal embedded in diffuse noise. Our postfilter uses the output of a superdirective beamformer like the Generalized Sidelobe Canceller (GSC), which is projected back to the microphone inputs to separate the sound f...

Context in source publication

Context 1
... ratio can be determined given the ATF estimate A_l and the beamforming filter W_l at each time frame l. The PSD matrix Γ_NN is constant. The directional and diffuse components of the sound field, Ẑ′ and Ẑ″, are estimated online using Eqn. (3). Their respective PSDs are found by recursive averaging, e.g. Φ_{Ẑ′Ẑ′,l} = α Φ_{Ẑ′Ẑ′,l−1} + (1 − α) Ẑ′ Ẑ′^H. The DD-SNR ξ(jΩ) is obtained using Eqn. (6). We used the well-known Optimally-Modified Log-Spectral Amplitude Estimator (OM-LSA) algorithm [13] to calculate the real-valued gain G(jΩ). Finally, an estimate of the clean speech signal is obtained by X(jΩ) = Y(jΩ) G(jΩ). Our first experiment demonstrates the enhanced directivity of a two-microphone beamforming array with an aperture of d = 5 cm. Since the postfilter depends on the beamformer's state through the ATF estimate Â(jΩ) and the beamforming filter W(jΩ), we can easily incorporate the postfilter into the overall directivity pattern. We used the procedure described in [14] to simulate the directivity pattern of the GSC with our postfilter; the beamformer is steered towards 0°. To measure the real directivity pattern for comparison, we used a physical array consisting of two microphones, the room from Figure 1, a turntable, and an exponential chirp to measure the frequency response. Figure 4 shows that the measured beampattern is slightly sharper than the theoretical result; a likely cause is minor gain differences between the microphones used. The most relevant property, however, is the high spatial selectivity at low frequencies: signals impinging from outside ±20° are completely suppressed, even with only two microphones. Increasing the number of microphones to four does not change the directivity pattern significantly. 
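The per-bin estimation chain described above (recursive PSD averaging, DD-SNR, spectral gain) can be sketched as follows. This is a minimal illustration with toy single-bin values: the smoothing constant alpha and the signal statistics are assumptions, and the plain Wiener gain stands in for the more elaborate OM-LSA rule used in the paper.

```python
import numpy as np

def recursive_psd(phi_prev, z_frame, alpha=0.9):
    """One step of recursive PSD averaging: Phi_l = alpha*Phi_{l-1} + (1-alpha)*z z^H."""
    return alpha * phi_prev + (1.0 - alpha) * np.abs(z_frame) ** 2

# Toy single-frequency-bin simulation (illustrative values only).
phi_dir, phi_dif = 0.0, 0.0
rng = np.random.default_rng(0)
for _ in range(100):
    z_dir = rng.normal() + 1j * rng.normal()            # directional component Z'
    z_dif = 0.3 * (rng.normal() + 1j * rng.normal())    # diffuse component Z''
    phi_dir = recursive_psd(phi_dir, z_dir)
    phi_dif = recursive_psd(phi_dif, z_dif)

xi = phi_dir / phi_dif        # directional-to-diffuse SNR, xi(jOmega)
gain = xi / (1.0 + xi)        # plain Wiener gain; OM-LSA refines this step
# Enhanced spectrum: X(jOmega) = Y(jOmega) * gain
```

With a stronger directional than diffuse component, the gain approaches 1 and the bin is passed through; diffuse-dominated bins are attenuated.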
In order to test the speech quality of our MCSE system against a significant amount of speech data, the TIMIT [15], KCORS [16], and KCOSS [17] speech corpora have been used. All three databases use f_s = 16 kHz and contain data from both male and female speakers. The speech signals were replayed through the loudspeaker at the 0.5 m position (see Figure 1). For the noise data, recordings from various sources, e.g. traffic noise, industrial parks, subway stations, and the NOIZEUS database, were replayed through the loudspeaker at the 5 m position. In total, about 60 minutes of test material was generated. For comparison, we also implemented two other postfilter approaches, the MC-SPP and the TBRR. We use the PEASS Toolkit [9, 10] to evaluate the performance of the algorithms in terms of speech quality; it explicitly aims at a psycho-acoustically motivated quality assessment of audio source separation algorithms. In a wider sense, beamforming can also be viewed as a source separation task, i.e. the enhanced speech signal is an estimate of the speech source at the reference microphone. PEASS delivers four scores: the Target Perceptual Score (TPS) measures the perceptual quality of the desired speech signal contained in the postfilter output; the Interference Perceptual Score (IPS) measures the influence of the residual noise components in the beamformer output; the Artifact Perceptual Score (APS) measures the influence of artifacts such as musical noise generated by the algorithm; and the Overall Perceptual Score (OPS) provides a global measure of the perceptual quality of the enhanced output. Each score ranges from 0 to 100, and larger values indicate better performance. Each algorithm is tested at signal-to-interference ratios (SIR) ranging from -20 dB to +20 dB in 5 dB steps. Figure 5 compares the postfilters in terms of the PEASS measures. The OPS scores of the TBRR and MC-SPP postfilters are roughly equal. 
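The SIR sweep used in this evaluation amounts to scaling the noise track relative to the speech track before mixing. A minimal sketch, assuming plain sample arrays; `mix_at_sir` is a hypothetical helper, not part of the PEASS Toolkit:

```python
import numpy as np

def mix_at_sir(speech, noise, sir_db):
    """Scale the noise so the mixture attains the requested SIR in dB."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10.0 ** (sir_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(1)
speech = rng.normal(size=16000)   # stand-in for a 1 s speech track at 16 kHz
noise = rng.normal(size=16000)    # stand-in for a noise recording
for sir_db in range(-20, 25, 5):  # -20 dB to +20 dB in 5 dB steps
    mix = mix_at_sir(speech, noise, sir_db)
```

Each mixture would then be processed by the beamformer and postfilter and scored with PEASS against the clean reference.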
However, the TBRR performs better than the MC-SPP in terms of the IPS and TPS scores, while the APS score indicates that the TBRR introduces the most artifacts. The MC-SPP algorithm has the lowest IPS score, as it relies on the inversion of the spatial noise PSD matrix, which is numerically unstable at low frequencies due to high signal correlation. The TBRR algorithm has the lowest APS score, as it relies on classical noise floor estimation techniques such as IMCRA or MS, which introduce musical artifacts depending on the nonstationarity of the noise. The speech quality of our DD-PF does not depend on spatial noise PSD estimation or a speech presence probability, but solely on the estimates of the directional and diffuse sound components, Φ_{Ẑ′Ẑ′} and Φ_{Ẑ″Ẑ″}. Their accuracy is determined by the shape of the noise sound field and the target leakage in the blocking matrix (e.g. speech leaking through the blocking matrix due to reflections). We found the target leakage to be quite low in our experiments, and the noise sound field was almost ideally diffuse. Therefore, we achieved both good speech quality and good noise suppression at the same time, even at low frequencies, as reflected by the OPS and TPS scores. In this paper, we introduced the Directional-to-Diffuse Postfilter (DD-PF), which splits the sound field at the microphones into its directional and diffuse components and calculates a Directional-to-Diffuse SNR from the PSDs of these components, from which a noise reduction Wiener filter is derived. Its performance depends only on the target leakage in the beamformer and the diffuse sound field assumption. In our experiments, we have shown that these conditions are met in a typical setup using four microphones and a variety of speech and noise tracks. The achieved directivity pattern is selective even at low frequencies, and the speech quality is significantly higher compared to the TBRR and MC-SPP ...

Similar publications

Article
Full-text available
Multichannel adaptive signal detection jointly uses the test and training data to form an adaptive detector, which then decides whether a target exists or not. Remarkably, the resulting adaptive detectors usually possess constant false alarm rate (CFAR) properties, and hence no additional CFAR processing is needed. Filtering is not ne...

Citations

... Most of the contributions relate to physical performance measures. While there exist some perceptual studies in beamforming, they either relied on objective models trained on perceptual features [7,21,22,23] such as PEASS [24], or performed a listening test only using speech and without stating an attribute to be rated [25]. ...
Article
Full-text available
Microphone array beamforming can be used to enhance and separate sound sources, with applications in the capture of object-based audio. Many beamforming methods have been proposed and assessed against each other. However, the effects of compact microphone array design on beamforming performance have not been studied for this kind of application. This study investigates how to maximize the quality of audio objects extracted from a horizontal sound field by filter-and-sum beamforming, through appropriate choice of microphone array design. Eight uniform geometries with practical constraints of a limited number of microphones and maximum array size are evaluated over a range of physical metrics. Results show that baffled circular arrays outperform the other geometries in terms of perceptually relevant frequency range, spatial resolution, directivity and robustness. Moreover, a subjective evaluation of microphone arrays and beamformers is conducted with regards to the quality of the target sound, interference suppression and overall quality of simulated music performance recordings. Baffled circular arrays achieve higher target quality and interference suppression than alternative geometries with wideband signals. Furthermore, subjective scores of beamformers regarding target quality and interference suppression agree well with beamformer on-axis and off-axis responses; with wideband signals the superdirective beamformer achieves the highest overall quality.
... Nevertheless, the results from TPS and APS did not fully agree with informal listening tests in terms of the quality of the target signal and the complete absence of artefacts. Neither of the previous beamforming evaluations with PEASS (Pfeifenberger & Pernkopf, 2014a,b) reported such a discrepancy between subjective (whether informal or not) and modelled results, perhaps due to the inability to uniquely assess the change in subjective quality in the presence of noise and reverberation from an actual capture; Pfeifenberger & Pernkopf (2014a) only concluded that PEASS represented the speech quality better than PESQ. However, for source separation evaluations, Cano et al. (2016) showed that with new stimuli and separation methods, PEASS scores do not correlate well with subjective results, suggesting that PEASS does not generalise well to source separation algorithms and/or test material outside those used in training. ...
Thesis
Full-text available
Microphone arrays can capture a sound scene and can be combined with signal processing to spatially filter or beamform the scene to extract the source of interest by suppressing unwanted sounds. Microphone array beamforming has been widely used for speech enhancement, giving rise to a vast number of beamforming methods to optimally suppress interfering sounds. However, the opportunities of these systems in broadcast and consumer audio recording have not been investigated, where wideband capture is a requirement. In this case, the microphone array design plays a significant role, yet despite the various designs from the literature, it is not clear which geometry provides the best performance under a range of criteria relevant for these applications. Moreover, the interactions between the array geometry, the beamformer and other design parameters and their impact on both physical and perceptual quality of extracted audio sources have not been established. The main contribution of this thesis is to determine the uniform microphone array design that maximises the quality of extracted audio sources (or objects) from horizontal sound scenes, since most sound scenes have much larger variation in azimuth than elevation. Both physical and perceptual performance evaluations are conducted with a range of microphone geometries and beamforming methods showing that baffled circular arrays outperform alternative geometries both objectively (in terms of frequency range, spatial resolution, directivity and robustness) and perceptually (based on interference suppression and quality of target and overall sounds). New insights of the interactions between array geometries and beamformers are provided. Moreover, a subjective evaluation of beamforming methods is undertaken showing the benefits of the on-axis distortionless response in combination with very high directivity from the superdirective beamformer, particularly for wideband signals. 
In addition to the array geometry, the effects of directivity order and regularisation are further investigated to synthesise frequency-invariant directional responses with the least-squares beamformer. The results exhibit the trade-offs between directivity and robustness with regularisation and between directivity and frequency range with directivity order. Baffled circular arrays perform best consistently for different orders and regularisation parameters. Furthermore, an optimal regularisation parameter is derived that minimises the error between the target and synthesised responses in presence of manifold errors, outperforming constant robustness constraints particularly for gain and positioning errors whose optimal regularised responses are frequency dependent. The combination of simulation and perceptual results presented in this thesis represents a significant addition to the beamforming literature, potentially influencing the design of future compact microphone arrays.
... To that end, PEASS [16] was used as an objective metric derived from subjective listening data. PEASS was also used in the context of microphone array beamforming in [1,25,26], whereas only the signal-to-interferer ratio, rather than the perceptual scores, was used in [27]. ...
Conference Paper
Full-text available
Frequency-invariant beamformers are useful for spatial audio capture since their attenuation of sources outside the look direction is consistent across frequency. In particular, the least-squares beamformer (LSB) approximates arbitrary frequency-invariant beampatterns with generic microphone configurations. This paper investigates the effects of array geometry, directivity order and regularization for robust hypercardioid synthesis up to 15th order with the LSB, using three 2D 32-microphone array designs (rectangular grid, open circular, and circular with cylindrical baffle). While the directivity increases with order, the frequency range is inversely proportional to the order and is widest for the cylindrical array. Regularization results in broadening of the mainlobe and reduced on-axis response at low frequencies. The PEASS toolkit was used to evaluate perceptually beamformed speech signals.
... with the projection matrix P [22]. The expression B = I − P can be identified as the blocking matrix [21]. ...
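The relation B = I − P from the excerpt above can be illustrated for a single look direction. A toy sketch: the steering vector a and the rank-one projection P = a aᴴ / (aᴴ a) are assumptions for illustration; by construction B annihilates signals from the look direction.

```python
import numpy as np

# Assumed steering vector for a 4-microphone broadside array (illustrative).
a = np.array([1.0, 1.0, 1.0, 1.0], dtype=complex)

# Rank-one projection onto the target direction, and its complement.
P = np.outer(a, a.conj()) / (a.conj() @ a)
B = np.eye(len(a)) - P     # blocking matrix: B @ a = 0

print(np.allclose(B @ a, 0))   # the look-direction signal is blocked -> True
```

Applying B to the microphone signals removes the target component, leaving a noise-only reference for the adaptive stage.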
... using the fixed beamformer (FBF) F, the adaptive interference canceler (AIC) H, and the blocking matrix (BM) B. In particular, we implemented the three GSC variants detailed in the following sub-sections. Details can be found in [2,15]. ...
... Our first postfilter is based on the GSC with MVDR and ABM. Similar to [15], the beamformer output Y (k, l) is back-projected to the microphones using the ATFs A(k, l). This way, the microphone inputs X can be split into their speech and noise components Ŝ and N̂: ...
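The back-projection step in this excerpt can be sketched for a single frequency bin. This is a minimal illustration under stated assumptions: the ATF vector, its reference-microphone normalization, and the distortionless weight vector are toy stand-ins, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 4                                                   # assumed microphone count
A = rng.normal(size=M) + 1j * rng.normal(size=M)        # ATF estimate for one bin
A /= A[0]                                               # normalize to reference mic

s = 0.8 + 0.2j                                          # toy source spectrum at the reference
X = A * s + 0.05 * (rng.normal(size=M) + 1j * rng.normal(size=M))  # mic inputs

Y = (A.conj() / (A.conj() @ A)) @ X   # distortionless beamformer output (sketch)
S_hat = A * Y                          # back-projected speech component at the mics
N_hat = X - S_hat                      # residual noise component
```

By construction X = Ŝ + N̂ holds exactly, and for signals arriving from the look direction the back-projection reproduces the microphone inputs up to the residual noise.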