Fig. 2: Three artificial convolutive mixtures of the sources from Fig. 1, simulating signals obtained by three microphones.


Source publication
Article
Full-text available
Time-domain algorithms for blind separation of audio sources can be classified as being based on either a partial or a complete decomposition of an observation space. The decomposition, especially the complete one, is usually performed under a constraint that reduces the computational burden. However, this constraint potentially restricts the performance. The...

Citations

... BSS finds application in smart voice assistants, hands-free teleconferencing, automatic meeting transcription, etc., where only mixed signals from single or multiple microphones are available. Several BSS algorithms have been developed based on different assumptions about the characteristics of the speech sources and the mixing systems [2][3][4][5][6][7][8][9]. Learning-based BSS approaches have recently received increased research attention due to advances in deep learning hardware and software. ...
Article
Full-text available
A three-stage approach is proposed for speaker counting and speech separation in noisy and reverberant environments. In the spatial feature extraction, a spatial coherence matrix (SCM) is computed using whitened relative transfer functions (wRTFs) across time frames. The global activity functions of each speaker are estimated from a simplex constructed using the eigenvectors of the SCM, while the local coherence functions are computed from the coherence between the wRTFs of a time-frequency bin and the global activity function-weighted RTF of the target speaker. In speaker counting, we use the eigenvalues of the SCM and the maximum similarity of the interframe global activity distributions between two speakers as the input features to the speaker counting network (SCnet). In speaker separation, a global and local activity-driven network (GLADnet) is used to extract each independent speaker signal, which is particularly useful for highly overlapping speech signals. Experimental results obtained from real meeting recordings show that the proposed system achieves superior speaker counting and speaker separation performance compared to previous publications, without prior knowledge of the array configuration.
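The cited system learns the speaker count with SCnet, but the reason the SCM eigenvalues are informative can be illustrated with a classical counting baseline. A minimal sketch under synthetic assumptions (random stand-in features instead of whitened RTFs, an arbitrary energy threshold), not the paper's method:

```python
# Minimal sketch (not the paper's SCnet): counting dominant sources from the
# eigenvalues of a spatial coherence matrix built from per-frame features.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_feat, n_spk = 200, 8, 3

# Hypothetical stand-in: each frame's feature is dominated by one speaker.
anchors = rng.standard_normal((n_spk, n_feat))
labels = rng.integers(0, n_spk, n_frames)
F = anchors[labels] + 0.1 * rng.standard_normal((n_frames, n_feat))
F /= np.linalg.norm(F, axis=1, keepdims=True)        # unit-norm features

scm = F.T @ F / n_frames                             # coherence matrix
eigvals = np.linalg.eigvalsh(scm)[::-1]              # descending order

# Count eigenvalues carrying a non-negligible share of the total energy
# (the 5% threshold is an illustrative guess, not a tuned value).
count = int(np.sum(eigvals / eigvals.sum() > 0.05))
print("estimated number of speakers:", count)
```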
... Commonly, the signals are analyzed in the short-time Fourier transform (STFT) domain, in which the convolutive mixtures are approximated by multiplicative mixtures. Various approaches for BASS exist, such as the independent component analysis (ICA) and independent vector analysis (IVA) separation methods [6][7][8][9][10], non-negative matrix factorization (NMF) [11][12][13][14][15], and, more recently, deep neural network (DNN)-based separation methods [16][17][18][19][20][21][22][23]. A problem related to acoustic source separation was recently investigated in the field of structural health monitoring based on acoustic emission, dealing with onset detection of overlapped acoustic emission waves [24] for accurate time-of-arrival estimation [25]. ...
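The narrowband approximation mentioned at the start of this excerpt, where a convolutive mixture becomes roughly multiplicative per STFT bin, can be checked numerically. A minimal sketch with a synthetic source and a short random impulse response (all values illustrative):

```python
# Sketch: x = h * s in the time domain becomes X(f, t) ≈ H(f) S(f, t)
# per frequency bin when the STFT window is much longer than the filter.
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(1)
fs, n = 16000, 4 * 16000
s = rng.standard_normal(n)                                 # source
h = rng.standard_normal(64) * np.exp(-np.arange(64) / 16)  # short, decaying IR
x = np.convolve(s, h)[:n]                                  # convolutive mixture

nperseg = 1024                                 # window >> len(h)
_, _, S = stft(s, fs=fs, nperseg=nperseg)
_, _, X = stft(x, fs=fs, nperseg=nperseg)
H = np.fft.rfft(h, nperseg)                    # filter frequency response

# Relative error of the multiplicative model X ≈ H[:, None] * S.
err = np.linalg.norm(X - H[:, None] * S) / np.linalg.norm(X)
print(f"narrowband approximation error: {err:.3f}")
```

The error shrinks as the window grows relative to the impulse response, which is why long room responses make the approximation harder in practice.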
... In the following, we show how the relation in (10) between the feature-wise correlations and the probability products can be exploited to estimate the activity probabilities, using either linear programming theory or convex geometry tools. ...
... Here, we utilize the equivalence between the feature-wise correlations and the probability products implied by (10) to obtain samples of the function t_l, where the sample t_l(n) is given by the correlation between the features associated with the l-th and the n-th frames. Note that we only have the function value t_l(n), but not the function parameters p(l) or the point q(l) at which the function was evaluated. ...
Article
Full-text available
Two novel methods for speaker separation of multi-microphone recordings that can also detect speakers with infrequent activity are presented. The proposed methods are based on a statistical model of the probability of activity of the speakers across time. Each method takes a different approach for estimating the activity probabilities. The first method is derived using a linear programming (LP) problem for maximizing the correlation function between different time frames. It is shown that the obtained maxima correspond to frames which contain a single active speaker. Accordingly, we propose an algorithm for successive identification of frames dominated by each speaker. The second method aggregates the correlation values associated with each frame in a correlation vector. We show that these correlation vectors lie in a simplex with vertices that correspond to frames dominated by one of the speakers. In this method, we utilize convex geometry tools to sequentially detect the simplex vertices. The correlation functions associated with single-speaker frames, which are detected by either of the two proposed methods, are used for recovering the activity probabilities. A spatial mask is estimated based on the recovered probabilities and is utilized for separation and enhancement by means of both spatial and spectral processing. Experimental results demonstrate the performance of the proposed methods in various conditions on real-life recordings with different reverberation and noise levels, outperforming a state-of-the-art separation method.
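The simplex-vertex detection in the second method can be illustrated with a generic successive-projection procedure on synthetic correlation vectors; this is a sketch of the convex-geometry idea, not the authors' exact algorithm:

```python
# Generic successive-projection sketch for finding simplex vertices from
# data points (synthetic stand-ins for the abstract's correlation vectors).
import numpy as np

rng = np.random.default_rng(2)
n_spk, n_frames, dim = 3, 300, 10
vertices = rng.standard_normal((n_spk, dim))

# Points are convex combinations of the vertices; a sparse Dirichlet prior
# ensures some frames are nearly "pure" (single-speaker dominated).
weights = rng.dirichlet(np.ones(n_spk) * 0.3, size=n_frames)
V = weights @ vertices + 0.01 * rng.standard_normal((n_frames, dim))

R = V.copy()
found = []
for _ in range(n_spk):
    i = int(np.argmax(np.linalg.norm(R, axis=1)))   # farthest remaining point
    found.append(i)
    u = R[i] / np.linalg.norm(R[i])
    R = R - np.outer(R @ u, u)                      # project out that direction
print("frames selected as simplex vertices:", found)
```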
... The second group uses component analysis techniques as the main approach. Independent component analysis (ICA) [14], [15], [16] and its variants [17], [18], [19], [20] are representative of this approach. Unlike digital filtering, ICA separates all the components in the signal, so it is usually used when the interfering signals are background sounds such as music or radio. ...
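As a minimal illustration of the ICA family referenced here, the following applies scikit-learn's FastICA to an instantaneous two-source mixture; real audio separation would require the convolutive extensions discussed elsewhere on this page:

```python
# Textbook ICA demo for the instantaneous (non-convolutive) mixing case.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 8000)
s1 = np.sign(np.sin(2 * np.pi * 5 * t))        # square-wave "source"
s2 = np.sin(2 * np.pi * 13 * t)                # sine "source"
S = np.c_[s1, s2]

A = np.array([[1.0, 0.6], [0.4, 1.0]])         # instantaneous mixing matrix
X = S @ A.T                                    # two observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                   # recovered up to scale/order
print("estimated mixing matrix:\n", ica.mixing_)
```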
... As shown in Table 6, in comparison with the other methods, including wavelet [12], time-frequency filter bank [13], ICA [17], and VAE [25], our model achieves strong results: 12.99 dB in SDR, the highest score, and 15.02 dB in SIR. ...
Article
Full-text available
Speech source separation is essential for speech-related applications because this process enhances the input speech signal for the main processing model. Most current approaches to this task focus on separating speech from common high-frequency noises or from a particular background sound. They cannot remove signals that overlap with human speech in its frequency range. To deal with this problem, we propose a hybrid approach combining a variational autoencoder (VAE) and a bandpass filter (BPF). This method can extract and enhance the speech signal in a mixture of many elements, such as the speech signal, high-frequency noises, and many kinds of background sounds that interfere with the speech. Experimental results showed that our model can effectively extract the speech signal, with 15.02 dB in Signal to Interference Ratio (SIR) and 12.99 dB in Signal to Distortion Ratio (SDR). Moreover, the passband can be adjusted to select the frequency range of the output signal for a particular application such as gender classification.
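The BPF stage of this hybrid is straightforward to sketch with an ordinary Butterworth band-pass filter in SciPy; the VAE stage is a trained model and is omitted here, and the passband edges below are illustrative assumptions rather than the paper's settings:

```python
# Band-pass filtering stage, sketched with a zero-phase Butterworth filter.
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 16000
lo, hi = 100.0, 4000.0                       # assumed speech passband (Hz)
sos = butter(6, [lo, hi], btype="bandpass", fs=fs, output="sos")

rng = np.random.default_rng(4)
noisy = rng.standard_normal(fs)              # stand-in for the VAE output
speech_band = sosfiltfilt(sos, noisy)        # zero-phase band-pass filtering
print(speech_band.shape)
```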
... These methods are able to obtain good text-independent separations with no or limited prior information [2]. This is a big improvement compared to previous methods like hidden Markov models [3,4,5], independent component analysis [6], computational auditory scene analysis [7,8] and non-negative matrix factorisation [9,10], which have limited separation performance or impose restrictions on the speakers and vocabulary. This improved performance comes at the cost of needing a lot of labelled training data (mixtures for which the desired separation is known) and demanding computations. ...
Preprint
Full-text available
This paper examines the applicability in realistic scenarios of two deep learning based solutions to the overlapping speaker separation problem. Firstly, we present experiments that show that these methods are applicable for a broad range of languages. Further experimentation indicates limited performance loss for untrained languages, when these have common features with the trained language(s). Secondly, it investigates how the methods deal with realistic background noise and proposes some modifications to better cope with these disturbances. The deep learning methods that will be examined are deep clustering and deep attractor networks.
... When h_mn is described by an impulse response of L samples that represents the delay and reverberation in a real-room situation, the problem is called convolutive BSS and the mixtures are modeled as $x_m(t) = \sum_n \sum_{l=0}^{L-1} h_{mn}(l)\, s_n(t-l)$. To cope with a real-room situation, we need to solve the convolutive BSS problem. Although time-domain approaches [69][70][71][72][73][74][75] to the convolutive BSS problem have been proposed, a more suitable approach for combining ICA and NMF is a frequency-domain approach [76][77][78][79][80][81][82][83][84][85], where we apply ...
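For concreteness, the convolutive mixing model reconstructed above can be simulated directly; the decaying random impulse responses here are a toy stand-in for measured room responses:

```python
# Building convolutive mixtures x_m(t) = sum_n sum_l h_mn(l) s_n(t - l)
# from synthetic sources and L-tap impulse responses.
import numpy as np

rng = np.random.default_rng(5)
n_src, n_mic, L, T = 2, 3, 128, 16000
S = rng.standard_normal((n_src, T))                  # sources s_n
H = rng.standard_normal((n_mic, n_src, L)) * \
    np.exp(-np.arange(L) / 32)                       # decaying taps h_mn

X = np.zeros((n_mic, T))
for m in range(n_mic):
    for n in range(n_src):
        X[m] += np.convolve(S[n], H[m, n])[:T]       # x_m = sum_n h_mn * s_n
print(X.shape)
```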
Article
Full-text available
This paper describes several important methods for the blind source separation of audio signals in an integrated manner. Two historically developed routes are featured. One started from independent component analysis and evolved to independent vector analysis (IVA) by extending the notion of independence from a scalar to a vector. In the other route, nonnegative matrix factorization (NMF) has been extended to multichannel NMF (MNMF). As a convergence point of these two routes, independent low-rank matrix analysis has been proposed, which integrates IVA and MNMF in a clever way. All the objective functions in these methods are efficiently optimized by majorization-minimization algorithms with appropriately designed auxiliary functions. Experimental results for a simple two-source two-microphone case are given to illustrate the characteristics of these five methods.
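A hedged sketch of running ILRMA, the convergence point described in this abstract, using the pyroomacoustics implementation; the function names, signatures, and array layouts below are taken from that library and should be verified against its documentation:

```python
# Sketch: ILRMA separation of a multichannel recording via pyroomacoustics.
import pyroomacoustics as pra

def separate(mics, nfft=2048):
    """mics: (n_samples, n_channels) time-domain mixture."""
    hop = nfft // 2
    win_a = pra.hann(nfft)
    win_s = pra.transform.stft.compute_synthesis_window(win_a, hop)
    # STFT analysis -> X with shape (n_frames, n_freq, n_channels)
    X = pra.transform.stft.analysis(mics, nfft, hop, win=win_a)
    # Independent low-rank matrix analysis (IVA + NMF source model)
    Y = pra.bss.ilrma(X, n_iter=30, n_components=2, proj_back=True)
    return pra.transform.stft.synthesis(Y, nfft, hop, win=win_s)
```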
... Blind signal separation (BSS), or blind source separation, is the framework used to separate speech signals in noisy conditions and also to separate the speech of two different speakers [5]. Since both the mixing process and the source signals are unknown, the process is commonly called "blind" [6]. ...
Article
Full-text available
In this research, a new method is introduced for solving the underdetermined blind speech signal separation problem, in which the number of observations is smaller than the number of sources, so that ICA is no longer applicable; the method improves the time complexity of signal separation. To achieve this, Improved Sparse Component Analysis (ISCA) is introduced to exploit the sparse nature of the time-frequency (TF) domain, adopting a two-step process consisting of mixing-matrix estimation followed by source separation. ISCA is based on fuzzy c-means clustering with a Particle Swarm Optimization (PSO) algorithm for mixing-matrix estimation. In our work, PSO is used to separate the correct voice signal from the randomly mixed signal by finding the best optimum in the clustering step. The source signal separation is then carried out based on the shortest path. This initial processing is implemented and verified in MATLAB; a hardware description is generated using HDL Coder and synthesized using Xilinx ISE. The final results illustrate that the proposed system has improved performance in terms of SNR, efficiency, and accuracy.
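The two-step sparse-component-analysis pipeline described here can be sketched with ordinary k-means in place of the paper's fuzzy c-means + PSO, and nearest-direction binary masking in place of the shortest-path step; everything below is an illustrative stand-in, not the ISCA algorithm:

```python
# Two-step SCA sketch: (1) cluster normalized TF points to estimate mixing
# directions, (2) mask each TF point by its nearest direction.
import numpy as np
from scipy.signal import stft, istft
from sklearn.cluster import KMeans

def sca_separate(x1, x2, n_src=3, fs=8000):
    _, _, X1 = stft(x1, fs=fs, nperseg=512)
    _, _, X2 = stft(x2, fs=fs, nperseg=512)
    pts = np.stack([X1.ravel(), X2.ravel()])            # 2 x K TF points
    mag = np.abs(pts).sum(axis=0)
    keep = mag > np.percentile(mag, 90)                 # keep sparse peaks
    feats = np.abs(pts[:, keep] / np.linalg.norm(pts[:, keep], axis=0)).T

    km = KMeans(n_clusters=n_src, n_init=10).fit(feats) # step 1: directions
    A_hat = km.cluster_centers_.T                       # estimated mixing cols

    # Step 2: binary masking by nearest estimated mixing direction.
    allf = np.abs(pts / (np.linalg.norm(pts, axis=0) + 1e-12)).T
    labels = km.predict(allf)
    outs = []
    for k in range(n_src):
        mask = (labels == k).reshape(X1.shape)
        _, s_k = istft(X1 * mask, fs=fs, nperseg=512)
        outs.append(s_k)
    return A_hat, outs
```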
... The accuracy of ICA-based approaches depends heavily on the fidelity of the estimated basis signals and the statistical separability of the target and interfering sources. This aspect, however, limits their applicability when all the sources are speech signals and therefore have similar statistical properties [35], [36], [37], [38]. ...
Preprint
Full-text available
Robust speech processing in multitalker acoustic environments requires automatic speech separation. While single-channel, speaker-independent speech separation methods have recently seen great progress, the accuracy, latency, and computational cost of speech separation remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of spectrogram representations for speech separation, and the long latency in calculating the spectrogram. To address these shortcomings, we propose the time-domain audio separation network (TasNet), which is a deep learning autoencoder framework for time-domain speech separation. TasNet uses a convolutional encoder to create a representation of the signal that is optimized for extracting individual speakers. Speaker extraction is achieved by applying a weighting function (mask) to the encoder output. The modified encoder representation is then inverted to the sound waveform using a linear decoder. The masks are found using a temporal convolutional network consisting of dilated convolutions, which allow the network to model the long-term dependencies of the speech signal. This end-to-end speech separation algorithm significantly outperforms previous time-frequency methods in terms of separating speakers in mixed audio, even when compared to the separation accuracy achieved with the ideal time-frequency mask of the speakers. In addition, TasNet has a smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward actualizing speech separation for real-world speech processing technologies.
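A heavily reduced encoder-mask-decoder skeleton in the spirit of TasNet (PyTorch) conveys the structure described here; the real model's temporal convolutional network is replaced by a single 1x1 convolution, so this is a shape-level sketch only:

```python
# Minimal encoder-mask-decoder skeleton: learned 1-D conv "basis",
# per-source masks on the encoder output, transposed-conv decoder.
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    def __init__(self, n_src=2, n_basis=128, kernel=16):
        super().__init__()
        stride = kernel // 2
        self.encoder = nn.Conv1d(1, n_basis, kernel, stride=stride)
        self.masker = nn.Sequential(                 # stand-in for the real
            nn.Conv1d(n_basis, n_basis * n_src, 1),  # temporal conv network
            nn.Sigmoid(),
        )
        self.decoder = nn.ConvTranspose1d(n_basis, 1, kernel, stride=stride)
        self.n_src, self.n_basis = n_src, n_basis

    def forward(self, mix):                          # mix: (batch, 1, T)
        w = torch.relu(self.encoder(mix))            # (B, N, T')
        m = self.masker(w).view(-1, self.n_src, self.n_basis, w.shape[-1])
        return torch.stack(
            [self.decoder(m[:, k] * w) for k in range(self.n_src)], dim=1
        )                                            # (B, n_src, 1, T)

est = TinyTasNet()(torch.randn(4, 1, 16000))
print(est.shape)
```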
... The source separation problem models many multi-sensor systems, such as antenna arrays, microphone arrays, or chemical sensors. It can therefore be applied to the fields of astronomy [1,2], audio [3,4], biomedical engineering [5][6][7][8][9], seismics [10,11], digital telecommunications [12][13][14][15][16][17], and fluorescence spectroscopy [18][19][20][21][22]. ...
Thesis
This thesis presents new algorithms for joint diagonalization by similarity. These algorithms make it possible, among other things, to solve the canonical polyadic decomposition problem for tensors. This decomposition is particularly used in source separation problems. Using joint diagonalization by similarity overcomes certain problems from which other types of canonical polyadic decomposition methods suffer, such as the convergence rate, sensitivity to overestimation of the number of factors, and sensitivity to correlated factors. Existing joint diagonalization by similarity algorithms that handle complex data either give good results when the noise level is low, or are more robust to noise but have a high computational cost. We therefore first propose joint diagonalization by similarity algorithms that treat real and complex data in the same way. Moreover, in several applications, the factor matrices of the canonical polyadic decomposition contain exclusively non-negative elements. Taking this non-negativity constraint into account makes canonical polyadic decomposition algorithms more robust to overestimation of the number of factors and to factors with a high degree of correlation. We therefore also propose joint diagonalization by similarity algorithms exploiting this constraint. The numerical simulations presented show that the first type of algorithm developed improves the estimation of the unknown parameters and reduces the computational cost. The simulations also show that the algorithms with the non-negativity constraint improve the estimation of the factor matrices when their columns have a high degree of correlation. Finally, our results are validated through two source separation applications, in digital telecommunications and in fluorescence spectroscopy.
... In a general sense, BSS involves Principal Component Analysis (PCA), Independent Component Analysis (ICA) [2], [11], Independent Vector Analysis (IVA) [12], [13], Nonnegative Matrix Factorization [14], etc. Some methods separate all of the one-dimensional components of s [15], [16], extract selected components only [17], or separate multidimensional components; see, e.g., [18], [19], [20], [21], [22]. The separation can also proceed in two steps, where a steering vector/matrix (e.g., H_1) is identified first, while the signals are separated in the second step using an array processor such as the minimum variance distortionless response (MVDR) beamformer [23]. ...
Article
Full-text available
Blind methods often separate or identify signals or signal subspaces up to an unknown scaling factor. Sometimes it is necessary to cope with this scaling ambiguity, which can be done by reconstructing the signals as they are received by the sensors, because the scales of the sensor responses (images) have known physical interpretations. In this paper, we analyze two approaches that are widely used for computing the sensor responses, especially in Frequency-Domain Independent Component Analysis. One approach is the least-squares projection, while the other assumes a regular mixing matrix and computes its inverse. Both estimators are invariant to the unknown scaling. Although frequently used, their differences have not been studied until now. The goal of this work is to fill this gap. The estimators are compared through a theoretical study, perturbation analysis, and simulations. We point out that the estimators are equivalent when the separated signal subspaces are orthogonal, and vice versa. Two applications are shown, one of which demonstrates a case where the estimators yield substantially different results.
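The two estimators compared in this abstract can be reproduced numerically in an idealized setting (exact demixing, independent sources), where they nearly coincide because the separated components are close to orthogonal; a sketch:

```python
# Sensor images of one separated component via (a) the inverse of the
# demixing matrix and (b) least-squares projection of the observations.
import numpy as np

rng = np.random.default_rng(7)
S = rng.standard_normal((3, 5000))              # independent sources
A = rng.standard_normal((3, 3))                 # regular mixing matrix
X = A @ S                                       # sensor observations

W = np.linalg.inv(A)                            # ideal demixing matrix
W[0] *= 3.7                                     # unknown per-row scaling
Y = W @ X                                       # separated, arbitrary scale

# (a) inverse-based estimator of the sensor images of component 0
img_inv = np.outer(np.linalg.inv(W)[:, 0], Y[0])

# (b) least-squares regression of each sensor signal on component 0
img_ls = np.outer(X @ Y[0] / (Y[0] @ Y[0]), Y[0])

# Both undo the arbitrary scaling; here they agree up to sampling error.
err = np.linalg.norm(img_inv - img_ls) / np.linalg.norm(img_inv)
print(f"relative difference: {err:.2e}")
```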