End-to-End Post-Filter for Speech Separation
With Deep Attention Fusion Features
Cunhang Fan , Student Member, IEEE, Jianhua Tao , Senior Member, IEEE, Bin Liu , Member, IEEE,
Jiangyan Yi , Member, IEEE, Zhengqi Wen, Member, IEEE, and Xuefei Liu, Member, IEEE
Abstract—In this article, we propose an end-to-end post-filter
method with deep attention fusion features for monaural speaker-
independent speech separation. At first, a time-frequency domain
speech separation method is applied as the pre-separation stage.
The aim of the pre-separation stage is to separate the mixture prelimi-
narily. Although this stage can separate the mixture, it still contains
the residual interference. In order to enhance the pre-separated
speech and improve the separation performance further, the end-
to-end post-filter (E2EPF) with deep attention fusion features is
proposed. The E2EPF can make full use of the prior knowledge of
the pre-separated speech, which contributes to speech separation.
It is a fully convolutional speech separation network and uses the
waveform as the input features. Firstly, the 1-D convolutional layer
is utilized to extract the deep representation features for the mixture
and pre-separated signals in the time domain. Secondly, to pay more
attention to the outputs of the pre-separation stage, an attention
module is applied to acquire deep attention fusion features, which
are extracted by computing the similarity between the mixture and
the pre-separated speech. These deep attention fusion features are
conducive to reducing the interference and enhancing the pre-separated
speech. Finally, these features are sent to the post-filter to estimate
each target signal. Experimental results on the WSJ0-2mix dataset
show that the proposed method outperforms the state-of-the-art
speech separation method. Compared with the pre-separation
method, our proposed method can acquire 64.1%, 60.2%, 25.6%
and 7.5% relative improvements in scale-invariant source-to-noise
ratio (SI-SNR), the signal-to-distortion ratio (SDR), the perceptual
evaluation of speech quality (PESQ) and the short-time objective
intelligibility (STOI) measures, respectively.
Manuscript received September 3, 2019; revised December 20, 2019 and
March 14, 2020; accepted March 16, 2020. Date of publication March 20,
2020; date of current version May 7, 2020. This work was supported in part
by the National Key Research and Development Plan of China under Grant
2017YFC0820602, in part by the National Natural Science Foundation of China
(NSFC) under Grants 61831022, 61771472, 61901473, and 61773379 and in
part by Inria-CAS Joint Research Project under Grants 173211KYSB20170061
and 173211KYSB20190049. The associate editor coordinating the review of
this manuscript and approving it for publication was Prof. Sven Erik Nordholm.
(Corresponding authors: Jianhua Tao; Bin Liu.)
Cunhang Fan is with the National Laboratory of Pattern Recognition, Institute
of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also
with the School of Artificial Intelligence, University of Chinese Academy of
Sciences, Beijing 100190, China (e-mail: cunhang.fan@nlpr.ia.ac.cn).
Jianhua Tao is with the National Laboratory of Pattern Recognition, Institute
of Automation, Chinese Academy of Sciences, Beijing 100190, China, with
the School of Artificial Intelligence, University of Chinese Academy of Sci-
ences, Beijing 100190, China, and also with the CAS Center for Excellence
in Brain Science and Intelligence Technology, Beijing 100190, China (e-mail:
jhtao@nlpr.ia.ac.cn).
Bin Liu, Jiangyan Yi, Zhengqi Wen, and Xuefei Liu are with the Na-
tional Laboratory of Pattern Recognition, Institute of Automation, Chinese
Academy of Sciences, Beijing 100190, China (e-mail: liubin@nlpr.ia.ac.cn;
jiangyan.yi@nlpr.ia.ac.cn; zqwen@nlpr.ia.ac.cn; xuefei.liu@nlpr.ia.ac.cn).
Digital Object Identifier 10.1109/TASLP.2020.2982029
Index Terms—Speech separation, end-to-end post-filter, deep
attention fusion features, deep clustering, permutation invariant
training.
I. INTRODUCTION
SPEECH separation aims to estimate the target sources from
a noisy mixture, which is known as the cocktail party
problem [1]–[3]. Monaural speech separation is a very challenging task because only a single channel is available.
This study focuses on monaural speaker-independent speech
separation.
Recently, deep learning has been applied to address speaker-
independent speech separation, which has obtained impressive
results [4]–[11]. The difficulty of speaker-independent speech
separation is label ambiguity or permutation problem [12], [13].
In order to deal with this problem, deep clustering (DC) [13]
is proposed, which is a state-of-the-art method for speaker-
independent speech separation. DC is usually formulated as a two-step process: embedding learning and embedding clustering.
Firstly, as for embedding learning, a bidirectional long short-term memory (BLSTM) network is trained to project each time-
frequency (T-F) bin of mixture spectrogram into an embedding
vector. The training objective is the Frobenius norm between the
affinity matrices of the embedding vector and the ideal binary
mask. In this way, if the T-F bins belong to the same speaker,
these embedding vectors are grouped closer together. Otherwise,
they become farther apart. Finally, in order to acquire the binary
mask of each source, K-means algorithm is applied to cluster
these embedding vectors, which is the embedding clustering.
Although DC gets good performance, it still has two limitations.
Firstly, the training objective is defined in the embedding vectors,
instead of the real separated sources. These embedding vectors
do not necessarily imply perfect separation of the sources in the
signal space. Secondly, DC applies the unsupervised K-means
clustering algorithm to estimate the binary masks of target
sources. Therefore, the performance of speech separation is
limited by the K-means clustering algorithm. To overcome the
training objective limitation of DC, the deep attractor network
(DANet) [14] method is proposed. Same as DC, the DANet
also maps the mixture spectrogram into a high-dimensional
embedding space. Different from DC, DANet firstly creates
attractor points at the embedding space. Then the similarities
between the embedded points and each attractor are applied
to estimate each source’s mask. However, at the test stage, it
still requires the unsupervised K-means clustering algorithm to
acquire the binary mask.
Frame-level permutation invariant training (PIT) [15] deals
with the permutation problem in a different way. During training,
the frame-level PIT (denoted by tPIT) computes all possible la-
bel permutations for each frame. Then tPIT uses the permutation
with the lowest mean square error (MSE) as the loss to train the
separation model. It can get a good performance for frame-level
separation. However, in the real-world conditions, the frame-
level permutation of separated signals is unknown. It means
that tPIT needs the speaker tracing step during inference. To
address this issue, utterance-level PIT (uPIT) [12] is proposed.
With uPIT, instead of choosing the permutation at frame-level,
the permutation corresponding to the minimum utterance-level
separation error is used for all frames in one utterance. In this
way, uPIT can effectively eliminate the speaker tracing problem.
However, tPIT and uPIT only reduce the distance between the same speakers; they do not increase the distance between different speakers. This may increase the possibility of remixing the separated sources.
In order to use both DC and PIT, the Chimera++ network [16] is applied for speech separation, which follows the Chimera network [17]. The Chimera++ network uses a multi-task learning
architecture to combine the DC and PIT. However, it simply
employs the DC and PIT as two outputs of the separation model
rather than fuses them deeply. Therefore, it does not solve
the limitations of DC and PIT. Computational auditory scene
analysis (CASA) [18] is a traditional speech separation method,
which is inspired by human auditory scene analysis. Deep CASA
[6] is another method to combine the DC and PIT. It adopts
the same divide-and-conquer strategy as CASA. Deep CASA is
a two-stage speech separation method. Firstly, tPIT is used to
estimate each source from the mixture spectrogram. Then, DC
is used as the speaker tracing step. In other words, DC is applied
to estimate the optimized permutation at frame-level. Although
deep CASA acquires good separation performance, it is also
limited by the K-means algorithm.
Motivated by PIT, DC and discriminative learning [2], [10],
[19]–[21], we proposed a discriminative learning method for
speaker-independent speech separation with deep embedding
features (denoted by uPIT+DEF+DL) in our previous work
[22]. uPIT+DEF+DL combines DC and PIT in a deep fusion
method and addresses the limitations of DC and PIT very well.
It utilizes the DC network as the extractor of deep embedding
features. Then instead of using K-means clustering algorithm to
estimate the target sources, uPIT+DEF+DL applies the uPIT
to separate the speech from these deep embedding features.
Although uPIT+DEF+DL can separate the mixture well, it still
has two drawbacks limiting its performance. Firstly, it uses the
separated magnitude and mixture phase to reconstruct target
signals by inverse short-time Fourier transformation (ISTFT),
which is mismatched for magnitude and phase. Secondly, the
separated signals by the uPIT+DEF+DL may still contain the
residual interference signals, which damages the performance
of speech separation.
In this study, in order to address the above issues, we propose
an end-to-end post-filter (E2EPF) method with deep attention
fusion features for monaural speaker-independent speech sepa-
ration. The proposed E2EPF utilizes the time-domain waveform
as the input features. The waveform contains all of the infor-
mation of the raw wave, including the magnitude and phase.
Therefore, separating the speech from waveform can solve the
mismatch problem of magnitude and phase. At first, the
uPIT+DEF+DL is used as the pre-separation stage to preliminar-
ily estimate target sources from the mixture spectrogram through
T-F domain. The separated speech by this stage may still contain
the residual interference. To further enhance the pre-separated
speech, the E2EPF with deep attention fusion features is applied.
The E2EPF can make full use of the prior knowledge of pre-
separated speech to help reduce the residual interference. Firstly,
the mixture and pre-separated signals are processed by the
1-D convolutional layer to extract deep representation features.
Secondly, instead of simply stacking these deep representation
features, an attention module is applied to compute the similarity
between the mixture and the pre-separated speech, which is
used as the extractor of deep attention fusion features. These
features can make the proposed model pay more attention to the
pre-separated signals so that the proposed E2EPF can reduce the
interference more easily and enhance the pre-separated speech.
The main contributions of this paper are two-fold. Firstly, we
propose the E2EPF to further enhance the pre-separated speech
and reduce the residual interference. Secondly, deep attention
fusion features are applied to compute the similarity between
the mixture and the pre-separated speech. Experiments are con-
ducted on WSJ0-2mix and WSJ0-3mix datasets [13]. Experi-
mental results show that our proposed method outperforms the
state-of-the-art speech separation method.
The rest of this paper is organized as follows. Section II
presents discriminative learning for monaural speech separa-
tion using deep embedding features. Section III introduces
the proposed end-to-end post-filter speech separation method.
The experimental setup is stated in Section IV. Section V
shows experimental results. Section VI shows the discussions.
Section VII draws conclusions.
II. DISCRIMINATIVE LEARNING FOR MONAURAL SPEECH
SEPARATION USING DEEP EMBEDDING FEATURES
The objective of monaural speech separation is to estimate target sources from the mixture speech recorded by a single channel:

$y(t) = \sum_{s=1}^{S} x_s(t)$   (1)

where $y(t)$ is the mixture speech, $t$ is the time index, $S$ is the number of sources, and $x_s(t), s = 1, \ldots, S$ are the target sources. The corresponding short-time Fourier transforms (STFT) of $y(t)$ and $x_s(t)$ are $Y(t, f)$ and $X_s(t, f)$.
Speech separation aims to estimate each source signal $x_s(t)$ from $y(t)$ or $Y(t, f)$. In this section, we introduce the discriminative learning method for speech separation with deep embedding features [22], which is based on uPIT. This method is denoted as uPIT+DEF+DL. We use this method as our pre-separation stage and our baseline.
Fig. 1. Schematic diagram of uPIT+DEF+DL speech separation system. DC
loss is the loss of deep clustering.
A. Deep Embedding Features
Fig. 1 shows the schematic diagram of uPIT+DEF+DL speech
separation system. Firstly, a BLSTM network is trained as the
extractor of deep embedding features (DEF). The aim of the
extractor is to project the mixed amplitude spectrum |Y(t, f)|
of each T-F bin into the D-dimensional deep embedding features
V.
$V = \gamma_\theta(|Y(t, f)|) \in \mathbb{R}^{TF \times D}$   (2)

where $TF$ is the number of T-F bins and $\gamma_\theta(\cdot)$ is the BLSTM mapping function. Here we consider a unit-norm embedding, so

$|v_i|^2 = 1, \quad v_i = \{v_{i,d}\}$   (3)

where $v_{i,d}$ is the value of the $d$-th dimension of the embedding for element $i$. We let the embeddings $V$ implicitly represent a $TF \times TF$ estimated affinity matrix $VV^T$.
As for the deep embedding features extractor, the loss function $J_{DC}$ is defined as follows:

$J_{DC} = \|VV^T - BB^T\|_F^2 = \|VV^T\|_F^2 - 2\|V^T B\|_F^2 + \|BB^T\|_F^2$   (4)

where $B \in \mathbb{R}^{TF \times S}$ is a binary matrix representing the source membership of each T-F bin: if the energy of source $s$ is the highest compared with the other sources, $B_{tf,s} = 1$; otherwise, $B_{tf,s} = 0$. $S$ denotes the number of sources and $\|\cdot\|_F^2$ is the squared Frobenius norm.
B. uPIT Based Speech Separation Model With Deep
Embedding Features
As for DC [13], the training objective is not the real sep-
arated sources. Besides, the unsupervised K-means clustering
algorithm is applied to acquire binary masks. Therefore, the
performance is limited by the K-means algorithm. In order
to address these issues, we use the deep embedding vectors
extracted by DC as the input of uPIT to directly learn each
source’s soft masks. In this way, on one hand, we directly use
the real separated sources as the training objective. In other
words, the DC and uPIT can be trained end-to-end. On the other
hand, the performance of speech separation is not limited by the
K-means algorithm.
The phase-sensitive mask (PSM) [23], [24] has been proven effective for speech separation because it makes full use of the phase information [12]. In this paper, we utilize the PSM for speech separation in the T-F domain. The ideal PSM is defined as:

$M_s(t, f) = \frac{|X_s(t, f)| \cos(\theta_y(t, f) - \theta_s(t, f))}{|Y(t, f)|}$   (5)

where $\theta_y(t, f)$ and $\theta_s(t, f)$ are the phases of the mixture speech and of target source $s$.
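For reference, the ideal PSM of Eq. 5 can be computed directly from the complex STFTs of the target source and the mixture. This is a hedged sketch; the array names and the optional clipping range are assumptions (the paper does not state whether the mask is truncated).

```python
import numpy as np

def ideal_psm(X_s, Y, clip=(0.0, 1.0), eps=1e-8):
    """Ideal phase-sensitive mask of Eq. 5.
    X_s, Y: complex STFTs of shape (T, F) for target source s and the mixture."""
    mask = np.abs(X_s) * np.cos(np.angle(Y) - np.angle(X_s)) / (np.abs(Y) + eps)
    # PSM values can fall outside [0, 1]; clipping is a common but optional choice
    return np.clip(mask, *clip) if clip is not None else mask
```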
uPIT computes the MSE for all possible speaker permutations at the utterance level. Then the minimum cost among all permutations $P$ is chosen as the optimal assignment:

$J_{uPIT} = \min_{\theta_s \in P} \sum_{s=1}^{S} \big\| |Y| \odot \hat{M}_s - |X_{\theta_s}| \odot \cos(\theta_y - \theta_{\theta_s}) \big\|_F^2$   (6)

where the number of permutations in $P$ is $N = S!$ ($!$ denotes the factorial). The indices $(t, f)$ are omitted in $\hat{M}_s$, $Y$, $X$, $\theta_y$ and $\theta_s$.
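The permutation search in Eq. 6 simply enumerates the $S!$ utterance-level assignments and keeps the one with the lowest error, which is feasible for small $S$. A minimal NumPy sketch with assumed variable names:

```python
import itertools
import numpy as np

def upit_psm_loss(Y_mag, masks, X_mags, cos_phase):
    """Eq. 6: utterance-level PIT with the phase-sensitive objective.
    Y_mag:     (T, F) mixture magnitude |Y|
    masks:     list of S estimated masks M_s, each (T, F)
    X_mags:    list of S target magnitudes |X_s|, each (T, F)
    cos_phase: list of S arrays cos(theta_y - theta_s), each (T, F)
    Returns the minimum summed squared error over all S! permutations."""
    S = len(masks)
    best = np.inf
    for perm in itertools.permutations(range(S)):
        err = sum(np.sum((Y_mag * masks[s] - X_mags[p] * cos_phase[p]) ** 2)
                  for s, p in enumerate(perm))
        best = min(best, err)
    return best
```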
C. Discriminative Learning
For uPIT, the target of minimizing Eq. 6 is to reduce the distance between the outputs and their corresponding target sources. To decrease the possibility of remixing the separated sources, discriminative learning (DL) is applied to our proposed model. DL not only reduces the distance between the prediction and the corresponding target, but also increases the distance between the prediction and the interfering sources. We assume that $\phi$ is the loss of the chosen permutation (the same as $J_{uPIT}$ in Eq. 6), which has the lowest MSE among all permutations. The discriminative learning loss function can then be defined as:

$J_{DL} = \phi - \sum_{\bar{\phi} \in P,\ \bar{\phi} \neq \phi} \alpha \bar{\phi}$   (7)

where $\bar{\phi}$ is the loss of a permutation from $P$ other than $\phi$, and $\alpha \geq 0$ is the regularization parameter of $\bar{\phi}$. When $\alpha = 0$, the loss function is the same as $J_{uPIT}$ in Eq. 6, i.e., no discriminative learning is applied.
D. Joint Training
To extract embedding features effectively, we apply the joint training framework to the proposed system. The loss function of joint training is defined as follows:

$J = \lambda J_{DC} + (1 - \lambda) J_{DL} = \lambda J_{DC} + (1 - \lambda)\Big(\phi - \sum_{\bar{\phi} \in P,\ \bar{\phi} \neq \phi} \alpha \bar{\phi}\Big)$   (8)

where $\lambda \in [0, 1]$ controls the weights of $J_{DC}$ and $J_{DL}$.
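Putting Eqs. 6-8 together, the pre-separation stage is trained on a weighted sum of the DC loss and the discriminative uPIT loss. The sketch below reuses the dc_loss helper and the permutation enumeration sketched above; the variable names are assumptions and the snippet is illustrative rather than the authors' code.

```python
import itertools
import numpy as np

def joint_dl_dc_loss(Y_mag, masks, X_mags, cos_phase, V, B, alpha=0.1, lam=0.5):
    """Eq. 8: J = lambda * J_DC + (1 - lambda) * J_DL, with J_DL from Eq. 7."""
    S = len(masks)
    # per-permutation phase-sensitive errors, as in Eq. 6
    perm_errors = []
    for perm in itertools.permutations(range(S)):
        err = sum(np.sum((Y_mag * masks[s] - X_mags[p] * cos_phase[p]) ** 2)
                  for s, p in enumerate(perm))
        perm_errors.append(err)
    idx = int(np.argmin(perm_errors))
    phi = perm_errors[idx]                            # chosen permutation's loss
    others = perm_errors[:idx] + perm_errors[idx + 1:]
    j_dl = phi - alpha * sum(others)                  # Eq. 7 (alpha = 0 recovers uPIT)
    j_dc = dc_loss(V, B)                              # Eq. 4, sketched in Section II-A
    return lam * j_dc + (1.0 - lam) * j_dl
```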
Fig. 2. (a): the diagram of the end-to-end post-filter. It contains three parts: feature extraction, deep attention fusion and post-filter. Features are extracted by
the 1-D convolution operation. Then the attention mechanism is leveraged for deep attention fusion. Finally, these features are input to the post-filter for speech separation. (b): the detailed block diagram of the post-filter. The post-filter is composed of 1-D convolution and a temporal convolutional network (TCN). (c): the design
of 1-D convolution block.
III. THE PROPOSED SPEECH SEPARATION METHOD
In this paper, we propose an end-to-end post-filter
(E2EPF) with deep attention fusion features for monaural
speaker-independent speech separation. Firstly, we use the
uPIT+DEF+DL to separate the mixture preliminarily in the T-F
domain, which is used as the pre-separation stage. The separated
speech by this method may still contain the residual interference.
In order to further enhance the separated speech and improve
the performance of speech separation, we utilize the E2EPF
with deep attention fusion features as another stage. The E2EPF
can make full use of the prior knowledge of the pre-separated
speech. The E2EPF is a fully convolutional network and applies
the waveform as the input feature. Besides, in order to make
the separation model pay more attention to the pre-separated
signals, an attention module is utilized to extract deep attention fusion features, which are computed from the similarity between the mixture and the pre-separated signals.
The E2EPF mainly solves two problems. Firstly, in the pre-
separation stage, it only enhances the magnitude and leaves the
phase spectrum unchanged. The mismatched magnitude and
phase are used to reconstruct estimated signals, which dam-
ages the performance of speech separation. The E2EPF does
the speech separation in the time domain so that it can enhance
the magnitude and phase spectrum simultaneously. Secondly, the
separated signals by the pre-separation stage may still contain
the residual interference. The E2EPF makes full use of the
prior knowledge of the pre-separated speech and applies the
deep attention fusion features to further remove the residual
interference and improve the performance of speech separation.
The E2EPF utilizes the waveform as the input features. It consists of three parts: feature extraction, deep attention fusion and post-filter, as shown in Fig. 2(a). In this section, we introduce these three parts in detail.
A. Feature Extraction
The input mixture speech $y(t)$ and the output sources $o_s(t), s = 1, 2, \ldots, S$ of the pre-separation stage can be divided into overlapping segments of length $L$. We denote them as $y_k \in \mathbb{R}^{1 \times L}$ and $o_{s,k} \in \mathbb{R}^{1 \times L}$, where $k = 1, \ldots, \hat{T}$ is the segment index and $\hat{T}$ denotes the total number of segments in $y(t)$ and $o_s(t)$.
The 1-D convolution operation is used to extract deep features from $y$ and $o_s$ (we drop the index $k$ and time $t$ from now on):

$w_y = \mathrm{ReLU}(y U_y)$   (9)

$w_s = \mathrm{ReLU}(o_s U_s), \quad s = 1, 2, \ldots, S$   (10)
where $w_y, w_s \in \mathbb{R}^{1 \times N}$ are the deep features extracted from $y$ and $o_s$, respectively, and $U_y \in \mathbb{R}^{N \times L}$ and $U_s \in \mathbb{R}^{N \times L}$ are the basis functions of the 1-D convolution operation, each containing $N$ vectors of length $L$. $\mathrm{ReLU}(\cdot)$ denotes the rectified linear unit, which is an optional nonlinear function.
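In practice, Eqs. 9 and 10 amount to applying learned 1-D convolutional encoders to the raw waveforms of the mixture and of each pre-separated output. The PyTorch sketch below uses the hyperparameters reported later in Section IV-C ($N = 256$, $L = 20$ samples); the 50% overlap (stride $L/2$) is an assumption, since the paper only states that the segments overlap.

```python
import torch
import torch.nn as nn

class WaveEncoder(nn.Module):
    """1-D convolutional encoder realizing Eqs. 9-10 (assumed hyperparameters)."""
    def __init__(self, num_filters=256, seg_len=20):
        super().__init__()
        # N = 256 basis filters of length L = 20 samples; 50% overlap assumed
        self.conv = nn.Conv1d(1, num_filters, kernel_size=seg_len,
                              stride=seg_len // 2, bias=False)

    def forward(self, wav):
        # wav: (batch, samples) -> deep features of shape (batch, N, num_segments)
        return torch.relu(self.conv(wav.unsqueeze(1)))

# separate encoders for the mixture y and the pre-separated outputs o_s
enc_y, enc_s = WaveEncoder(), WaveEncoder()
y   = torch.randn(1, 32000)          # placeholder: 4 s of 8 kHz mixture
o_1 = torch.randn(1, 32000)          # placeholder: one pre-separated output
w_y, w_1 = enc_y(y), enc_s(o_1)      # deep features of Eqs. 9 and 10
```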
B. Deep Attention Fusion
Recently, attention models have been successfully applied to
the sequence-to-sequence learning tasks [25]–[29]. In this study,
attention mechanism is leveraged to acquire the deep attention
fusion features.
The aim of the attention mechanism is to make the sepa-
ration model pay more attention to the output signals of the
pre-separation stage. It is used to compute the similarity between
the mixture and pre-separated signals. Therefore, the E2EPF can
further reduce the interference signals and improve the perfor-
mance of speech separation. In order to compute the similarity
between the mixture and the pre-separated signals, $w_y$ and $w_s$ are sent to another 1-D convolutional layer:

$w'_y = \mathrm{ReLU}(w_y U'_y)$   (11)

$w'_s = \mathrm{ReLU}(w_s U'_s), \quad s = 1, 2, \ldots, S$   (12)

where $U'_y \in \mathbb{R}^{N \times L}$ and $U'_s \in \mathbb{R}^{N \times L}$ are the basis functions of the 1-D convolution operation.
According to the global attention mechanism [28], the attention weight $\alpha_{t,t'}$ can be learned:

$\alpha_{t,t'} = \frac{\exp(d_{t,t'})}{\sum_{t'} \exp(d_{t,t'})}$   (13)

where $d_{t,t'}$ is the correlation between $w'_y$ and $w'_s$, which measures their similarity. The attention weight $\alpha_{t,t'}$ is the softmax of $d_{t,t'}$ over $t' \in [1, N]$. We follow the dot-based scoring function in [28] for $d_{t,t'}$, which is defined as:

$d_{t,t'} = w'^{T}_y w'_s$   (14)

The context vector $ct_s \in \mathbb{R}^{1 \times N}$ can be calculated as the weighted average of $w'_s$:

$ct_s = \sum_{t'} \alpha_{t,t'} w'_s$   (15)

As shown in Fig. 2(a), the gray area is the deep attention fusion part. Finally, the context vectors $ct_s$ and the mixture deep feature $w'_y$ are used as the deep attention fusion features and passed to the post-filter part.
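Eqs. 13-15 correspond to a dot-product (Luong-style) attention between the projected mixture features and each projected pre-separated stream. The PyTorch sketch below operates on framewise feature matrices of shape (number of segments, N); the names and shapes are assumptions, and the returned context vectors would then be concatenated with $w_s$ and $w'_y$ as in Eq. 16.

```python
import torch

def deep_attention_fusion(w_y_p, w_s_p_list):
    """Dot-product attention of Eqs. 13-15.
    w_y_p:      (T_hat, N) projected mixture features w'_y
    w_s_p_list: list of S tensors (T_hat, N), projected pre-separated features w'_s
    Returns the list of context vectors ct_s, each of shape (T_hat, N)."""
    contexts = []
    for w_s_p in w_s_p_list:
        d = w_y_p @ w_s_p.t()              # Eq. 14: dot-product similarity scores
        alpha = torch.softmax(d, dim=-1)   # Eq. 13: attention weights
        contexts.append(alpha @ w_s_p)     # Eq. 15: weighted average of w'_s
    return contexts                        # later fused with w_s and w'_y (Eq. 16)

# example with two pre-separated sources
T_hat, N = 100, 256
ct = deep_attention_fusion(torch.randn(T_hat, N),
                           [torch.randn(T_hat, N), torch.randn(T_hat, N)])
```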
C. Post-Filter
The detailed block diagram of the post-filter is shown in Fig. 2(b), which adopts a temporal convolutional network (TCN) similar to TasNet [30]. The TCN is leveraged in the end-to-end post-filter and has shown comparable or even better performance than RNNs in various sequence modeling tasks [30]–[36]. The post-
filter is a fully-convolutional module including stacked dilated
1-D convolutional blocks as shown in Fig. 2(c). Compared with
the TasNet [30], there are two main differences. Firstly, our
proposed post-filter makes full use of the prior knowledge of
the pre-separated speech and the post-filter is used as the second
stage to improve the separation performance. Secondly, to pay
more attention to the pre-separated speech, these deep attention
fusion features are applied.
TCNs are used in place of recurrent neural networks (RNNs), having shown comparable or even better performance in various sequence modeling tasks [30]–[35]. For each TCN, the 1-D convolutional blocks have increasing dilation factors ($1, 2, \ldots, 2^{M-1}$, where $M$ is the number of convolutional blocks), as shown in the light brown area of Fig. 2(b). These increasing dilation factors capture a large temporal context. To further increase the receptive field, the $M$ stacked dilated convolutional blocks are repeated $R = 4$ times.
Fig. 2(c) shows the stacked dilated 1-D convolutional block, which follows [37]. To avoid losing input information, a skip connection is utilized between the input and the next block. Depthwise separable convolution has been proven effective for image processing tasks [38], [39], so it is applied here to further decrease the number of parameters. A nonlinear activation function and a normalization operation are added after the first $1 \times 1$ conv block and the D-conv block, respectively. The parametric rectified linear unit (PReLU) [40] is applied because PReLU can improve model fitting with nearly zero extra computational cost and little overfitting risk [40]. The normalization is global layer normalization (gLN), because gLN outperforms all other normalization methods [35].
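A single block of this kind ($1 \times 1$ conv, PReLU, gLN, dilated depthwise conv, PReLU, gLN, $1 \times 1$ conv back, plus a residual connection) can be sketched in PyTorch as below. Channel sizes follow Section IV-C (256 bottleneck and 512 hidden channels, kernel size 3); gLN is approximated with nn.GroupNorm(1, C), which normalizes over channels and time per utterance; and only the residual path is shown, whereas Fig. 2(c) (following [37]) may also carry separate skip-connection outputs.

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Simplified dilated depthwise-separable 1-D conv block (Fig. 2(c) style)."""
    def __init__(self, in_ch=256, hid_ch=512, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2
        self.pointwise_in = nn.Conv1d(in_ch, hid_ch, 1)            # first 1x1 conv
        self.prelu1, self.norm1 = nn.PReLU(), nn.GroupNorm(1, hid_ch)
        self.depthwise = nn.Conv1d(hid_ch, hid_ch, kernel, padding=pad,
                                   dilation=dilation, groups=hid_ch)  # D-conv
        self.prelu2, self.norm2 = nn.PReLU(), nn.GroupNorm(1, hid_ch)
        self.pointwise_out = nn.Conv1d(hid_ch, in_ch, 1)           # back to in_ch

    def forward(self, x):                                          # x: (B, in_ch, T)
        h = self.norm1(self.prelu1(self.pointwise_in(x)))
        h = self.norm2(self.prelu2(self.depthwise(h)))
        return x + self.pointwise_out(h)                           # residual connection

# M = 8 blocks with dilations 1, 2, ..., 2^(M-1), repeated R = 4 times
tcn = nn.Sequential(*[DilatedConvBlock(dilation=2 ** m)
                      for _ in range(4) for m in range(8)])
```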
The output of the stacked dilated 1-D convolutional blocks is fed into a 1-D convolutional layer with a ReLU nonlinearity, and we denote these neural networks as $\gamma(\cdot)$. ReLU is used because we want the network to learn target masks, as in the T-F domain. The output of $\gamma(\cdot)$ is the estimated mask $m_s \in \mathbb{R}^{1 \times N}$ of each source, similar to the pre-separation stage:

$m_s = \gamma([w_s, ct_s; w'_y]), \quad s = 1, 2, \ldots, S$   (16)

Then the separated representation $e_s$ of source $s$ can be estimated as follows:

$e_s = w_y \odot m_s$   (17)

where $\odot$ denotes element-wise multiplication.
Finally, the estimated waveform $\hat{x}_s$ of source $s$ is reconstructed by the transposed 1-D convolution operator:

$\hat{x}_s = e_s U_e$   (18)

where $U_e \in \mathbb{R}^{N \times L}$ denotes the basis function of the transposed 1-D convolution operator.
D. Training Objective
In order to improve the separation performance, the training
objective of the end-to-end post-filter is to maximize the scale-
invariant source-to-noise ratio (SI-SNR) [41]. The SI-SNR is
defined as:

$x_{target} = \frac{\langle \hat{x}, x \rangle x}{\|x\|^2}$   (19)

$e_{noise} = \hat{x} - x_{target}$   (20)

$\mathrm{SI\mbox{-}SNR} = 10 \log_{10} \frac{\|x_{target}\|^2}{\|e_{noise}\|^2}$   (21)

where $\hat{x}$ and $x$ denote the estimated and target sources, respectively, and $\|x\|^2 = \langle x, x \rangle$ is the signal power. In order to solve the permutation problem, uPIT is utilized during training.
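Eqs. 19-21 can be computed directly; a short NumPy sketch is given below. Removing the mean of both signals before the projection is a common convention but is an assumption here, since the paper does not state it explicitly.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR of Eqs. 19-21 for one estimated/target pair."""
    est = est - est.mean()
    ref = ref - ref.mean()                  # zero-mean convention (assumed)
    x_target = np.dot(est, ref) * ref / (np.dot(ref, ref) + eps)    # Eq. 19
    e_noise = est - x_target                                        # Eq. 20
    return 10.0 * np.log10(np.dot(x_target, x_target)
                           / (np.dot(e_noise, e_noise) + eps))      # Eq. 21

# during training, uPIT selects the output permutation maximizing the summed SI-SNR
```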
IV. EXPERIMENTAL SETUP
A. Dataset
The WSJ0-2mix and WSJ0-3mix datasets [13] are used to conduct our experiments; they are derived from the WSJ0 corpus [42]. Each dataset has a training, validation and test set. The training set has 20,000 utterances (about 30 hours), the validation set has 5,000 utterances (about 10 hours), and the test set has 3,000 utterances (about 5 hours). All of the data are generated by randomly selecting utterances from the WSJ0 set and mixing them at signal-to-noise ratios (SNRs) between −5 dB and 5 dB. The training and validation sets are generated from the WSJ0 training set (si_tr_s). The test set is generated from the WSJ0 development set (si_dt_05) and evaluation set (si_et_05). All waveforms are sampled at 8000 Hz.
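For illustration only, a two-speaker mixture at a random SNR in the [−5, 5] dB range could be generated as sketched below; the official WSJ0-2mix scripts handle utterance selection, truncation and level normalization in more detail, so this is an assumption-laden simplification.

```python
import numpy as np

def mix_pair(s1, s2, snr_db):
    """Scale s2 so that the s1-to-s2 power ratio equals snr_db, then sum."""
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    p1, p2 = np.mean(s1 ** 2), np.mean(s2 ** 2)
    gain = np.sqrt(p1 / (p2 * 10.0 ** (snr_db / 10.0) + 1e-12))
    return s1 + gain * s2

snr = np.random.uniform(-5.0, 5.0)   # random SNR between -5 dB and 5 dB
```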
In order to evaluate the separation performance, the validation set is used as the closed condition (CC) and the test set is used as the open condition (OC).
B. Baseline Model
In this paper, we use the uPIT+DEF+DL as our baseline
model. To compute the short-time Fourier transform (STFT), the
Hamming window length is 32 ms and the window shift is 16 ms. Therefore,
the dimension of the spectral magnitude is 129. We use the
normalized amplitude spectrum of the mixture speech as the
input features.
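For reference, at 8 kHz a 32 ms window corresponds to 256 samples and a 16 ms shift to 128 samples, so the one-sided spectrum has 256/2 + 1 = 129 frequency bins, matching the stated feature dimension. The SciPy call below is only an illustrative way to obtain such features, not the authors' pipeline.

```python
import numpy as np
from scipy.signal import stft

fs = 8000
win_len = int(0.032 * fs)     # 256 samples (32 ms Hamming window)
hop = int(0.016 * fs)         # 128 samples (16 ms shift)

y = np.random.randn(4 * fs)   # placeholder 4 s mixture
_, _, Y = stft(y, fs=fs, window='hamming', nperseg=win_len,
               noverlap=win_len - hop)
print(Y.shape[0])             # 129 frequency bins
```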
There are two BLSTM layers with 896 units as for the
extractor of deep embedding features. We set the dimension
of embedding D to 40. Following the embedding layer, a tanh
activation function is utilized. For uPIT separation network,
there is only one layer with 896 units. Therefore, the network of
pre-separation has 3 BLSTM layers in total, which is the same as
the baseline in [12]. As for the mask estimation layer, a Rectified Linear Unit (ReLU) activation function is used to estimate the
mask of each source, which is followed by the uPIT separation
network. The discriminative learning parameter αis set to 0.1.
For each BLSTM layer, a random dropout is applied and
dropout rate is set to 0.5. The batch-size is 16 utterances which is
generated by randomly selecting. The minimum epoch is set to
30. The learning rate is initialized as 0.0005. When the training
loss increases on the validation set, the learning rate is scaled
down by 0.7. When the relative loss improvement is lower than
0.01, the model is early stopped. The models of this stage are
optimized with the Adam algorithm [43].
In this paper, we re-implement uPIT [12] with our experi-
mental setup, which has three BLSTM layers with 896 units.
The other settings are the same as those of our pre-separation stage.
C. The Proposed End-to-End Post-Filter Method
The input waveform is divided into 4-second segments. The
learning rate is initialized as 0.0001. If the training loss increases
in 3 consecutive epochs on the validation set, the learning rate
is halved. Same as the pre-separation stage, the optimizer of
this stage is the Adam algorithm [43]. The maximum number of
epochs is 100. As for the feature extraction, the number of filters in the first 1-D convolutional layer is 256, each with a length of 20 samples ($N = 256$, $L = 20$ in Section III-A). For the other 1-D convolutions, the number of channels is 256. For the convolutional blocks, the number of channels and the kernel size are 512 and 3, respectively. The number of repeats is $R = 4$, and in each repeat the number of convolutional blocks is $M = 8$.
D. Evaluation Metrics
In this work, in order to evaluate the performance of speech
separation results, the models are evaluated on the scale-
invariant source-to-noise ratio (SI-SNR), the signal-to-distortion
ratio (SDR), signal-to-interference ratio (SIR) and signal-to-
artifact ratio (SAR), which are BSS-eval [41] scores, the
perceptual evaluation of speech quality (PESQ) [44] measure
and the short-time objective intelligibility (STOI) measure [45].
E. Comparison With Ideal T-F Masks
In order to compare with the ideal T-F masks, we use the ideal
PSM (IPSM), ideal binary mask (IBM) and ideal ratio mask
(IRM). These masks are calculated by STFT with 32 ms length
hamming window and 16 ms window shift, which is the same
as the pre-separation stage. The IPSM is defined in Eq. 5. The
IBM and IRM of source $s = 1, 2, \ldots, S$ are defined as follows:

$\mathrm{IBM}_s(t, f) = \begin{cases} 1, & |X_s(t, f)| > |X_{j \neq s}(t, f)| \\ 0, & \text{otherwise} \end{cases}$   (22)

$\mathrm{IRM}_s(t, f) = \frac{|X_s(t, f)|}{\sum_{j=1}^{S} |X_j(t, f)|}$   (23)
V. RESULTS
A. Pre-Separation Stage
We firstly evaluate the performance of the pre-separation stage
in the T-F domain. Table I shows the results of SDR, SIR, SAR
and PESQ results of different uPIT-based speech separation methods under the closed (CC) and open (OC) conditions. The deep embedding features are denoted by DEF. In Table I, “Optimal (Opt.) Assign.” means that the outputs use the optimal assignment; in other words, the outputs use the optimal permutation for all of the frames in an utterance. Otherwise, it is the “Default (Def.) Assign.”
TABLE I
THE RESULTS OF SDR, SIR, SAR AND PESQ FOR DIFFERENT SEPARATION METHODS WITH CLOSED (CC) AND OPEN (OC) CONDITION ON THE WSJ0-2MIX DATASET. λ IS THE WEIGHT OF JOINT TRAINING IN EQ. 8. DEF DENOTES THE DEEP EMBEDDING FEATURES. UPIT IS THE BASELINE METHOD, UPIT+DEF AND UPIT+DEF+DL ARE OUR PROPOSED METHODS. UPIT+DEF MEANS WITH NO DISCRIMINATIVE LEARNING
1) Evaluation of Deep Embedding Features: From Table I,
we can find that in all objective measures, uPIT+DEF methods
all outperform the uPIT method regardless of the value of λ. These
results indicate that the uPIT based separation method with deep
embedding features can improve the performance of speaker-
independent speech separation. This is because these deep
embedding features are deep representations for the mixture
amplitude spectrum, which contain the potential information of
each target source so that they can effectively estimate the masks
of target sources. Therefore, these deep embedding features are
discriminative features for speech separation.
2) Evaluation of Discriminative Learning: The aim of dis-
criminative learning is to maximize the distance between differ-
ent sources and minimize the distance between same sources,
simultaneously.
Compared with the uPIT+DEF, uPIT+DEF+DL (discrimi-
native learning is utilized) achieves better performance in the
majority of cases, except for the PESQ measure. These re-
sults indicate that the discriminative learning can improve the
performance of speech separation. Meanwhile, especially for
the BSS-eval evaluation metrics (SDR, SIR and SAR), using
discriminative learning achieves better results. The reason
is that the discriminative learning increases the dissimilarity
between different speakers so that the possibility of remixing
the interferences can be reduced. Although the performance
of uPIT+DEF+DL is slightly worse than uPIT+DEF for PESQ
measure, it is also comparable to the uPIT+DEF and significantly
better than the uPIT.
Fig. 3 shows the MSE over epochs on the WSJ0-2mix with
and without DL training method based on uPIT+DEF. From
Fig. 3 we can find that the DL-based separation converges faster than the method without DL. This result indicates
the effectiveness of DL.
B. Comparison of the Proposed End-to-End Post-Filter
Method With the uPIT Based Methods
Table II shows the results of SI-SNR, SDR, PESQ and
STOI for the proposed method in the time domain and the T-F domain uPIT-based methods. They are all in the default assignment and open condition. In this study, we extend
uPIT+DEF+DL and propose the end-to-end post-filter method
for monaural speech separation with deep attention fusion fea-
tures (uPIT+DEF+DL+E2EPF+attention).
From Table II we can see that when the speech sig-
nals separated by the pre-stage (uPIT+DEF+DL method) are
Fig. 3. MSE over epochs on the WSJ0-2mix with and without DL training
method based on uPIT+DEF.
TABLE II
THE RESULTS OF SI-SNR, SDR, PESQ AND STOI FOR THE PROPOSED METHOD IN THE TIME DOMAIN AND THE T-F DOMAIN BASED METHODS ON THE WSJ0-2MIX DATASET. THEY ARE ALL IN THE DEFAULT ASSIGNMENT AND OPEN CONDITION
processed by the end-to-end post-filter, the performance of
speech separation can be improved significantly. More specifi-
cally, compared with the uPIT+DEF+DL, our proposed speech
separation method uPIT+DEF+DL+E2EPF+attention obtains
6.6 dB increment in SI-SNR, 6.5 dB increment in SDR, 0.7
increment in PESQ and 6.7% increment in STOI. The reason for
the large improvement is that the uPIT+DEF+DL method does
the speech separation in the T-F domain and it only enhances the
amplitude spectrum, while the phase spectrum is left unchanged.
In other words, the uPIT+DEF+DL method utilizes the separated
TABLE III
THE SDR, PESQ AND STOI RESULTS OF DIFFERENT SEPARATION METHODS FOR DIFFERENT GENDER COMBINATIONS ON THE WSJ0-2MIX DATASET. THEY ARE ALL IN THE DEFAULT ASSIGNMENT AND OPEN CONDITION
TABLE IV
COMPARISON WITH OTHER STATE-OF-THE-ART SYSTEMS ON THE WSJ0-2MIX DATASET
magnitude spectrum and the mixture phase spectrum to recon-
struct each source signal by ISTFT. However, the separated
magnitude spectrum and the mixture phase spectrum are mis-
matched, which damages the performance of speech separation.
As for our proposed uPIT+DEF+DL+E2EPF+attention method,
the pre-separation stage does the speech separation in the T-F
domain to separate the mixture preliminarily. At the end-to-end
post-filter stage, in order to improve the performance of speech
separation, it applies the waveform as the input features. The
waveform contains all of the information of the mixture signals,
including magnitude spectrum and phase spectrum. Therefore,
this stage enhances the magnitude spectrum and phase spectrum,
simultaneously. In addition, to reduce the complexity and size
of the end-to-end post-filter model, at the end-to-end post-filter
stage, all structures are CNNs.
C. Evaluation of the Deep Attention Fusion Features
In Table II, Table III and Table IV, uPIT+DEF+DL+E2EPF denotes the model without the deep attention fusion module, while uPIT+DEF+DL+E2EPF+attention denotes the model with it.
From Table II we can find that when the deep atten-
tion fusion features are applied, the performance of speech
separation can be improved. More specifically, compared to the uPIT+DEF+DL+E2EPF method, uPIT+DEF+DL+E2EPF+attention acquires a 0.3 dB increment for both the SI-SNR and SDR evaluation metrics. The reason is that these deep attention fusion features are extracted by the attention module, which computes the similarity between the mixture and the pre-separated signals. Therefore, these deep attention fusion features make the separation model pay more attention to the pre-separated signals, so they help reduce the residual interference and enhance the pre-separated speech, and the performance of speech separation can be
improved. These results prove that deep attention fusion features
are effective for speech separation.
Examples of separated speech for the baseline and our pro-
posed method are available online.1
D. Comparison of the Proposed Method With the Ideal Masks
In order to make a comparison of our proposed method with
the ideal masks, Fig. 4 shows the results of SI-SNR, SDR, PESQ
and STOI for our proposed method and the ideal masks.
From Fig. 4, several observations can be found. Firstly, IPSM
has the best performance compared with the other ideal masks
(IBM and IRM) in all evaluation metrics. This is because
the IPSM is a phase sensitive mask, which makes full use
of the phase information. Therefore, the phase is very im-
portant for speech separation. Secondly, as for SI-SNR and
SDR evaluation metrics as shown in Fig. 4(a) and (b), our
proposed method uPIT+DEF+DL+E2EPF+attention acquires
the best performance compared with the ideal masks. If only the magnitude spectrum is enhanced and the phase spectrum is left unchanged, these ideal masks represent the performance upper bound of speech separation. However, the performance of our proposed
method is better than these ideal masks, which reveals that our
1[Online]. Available: https://github.com/fchest/wave-samples
Fig. 4. The results of SI-SNR, SDR, PESQ and STOI for the proposed method uPIT+DEF+DL+E2EPF+attention and ideal masks on WSJ0-2mix dataset.
(a) The SI-SNR result. (b) The SDR results. (c) The PESQ results. (d) The STOI results.
proposed method can separate the mixture very well. Finally, as
for the PESQ evaluation metric (Fig. 4(c)), although the performance of the proposed method is slightly worse than IPSM, it is still better than IBM and comparable to IRM. As for the STOI evaluation metric (Fig. 4(d)), our proposed method is comparable
to the IPSM and outperforms the IBM and IRM. Therefore,
these results indicate the effectiveness of our proposed method
for speech separation.
E. Comparison With Different Gender Combinations
Table III compares the results of uPIT based speech separation
methods for different gender combinations. Male-female combi-
nations can acquire a better performance than female-female and
male-male combinations for all of speech separation methods
in Table III. This is because, compared with same-gender combinations, different-gender combinations have larger differences in speech features, for example pitch. Therefore, same-gender speech is more difficult to separate.
However, our proposed method can achieve better results than
other methods for all of the gender combinations, especially for
the same gender combinations. These results indicate that our
proposed method is effective for speech separation.
F. Comparison With Other State-of-the-Art Methods
In order to compare the separation results of our proposed
method with previous methods, Table IV shows the performance
of our proposed method uPIT+DEF+DL+E2EPF+attention and
other state-of-the-art methods on the same WSJ0-2mix dataset.
For all methods, the best reported results are listed and they are
all in the default assignment and open condition. Note that the methods in [5], [6], [12], [13], [16], [30], [35], [47] report SDR improvement results. To compare equally, 0.2 dB is added to their final results, although the SDR of the mixture is only about 0.15 dB. In this table, missing values indicate that the results are not reported in the corresponding study.
TABLE V
COMPARISON WITH OTHER STATE-OF-THE-ART SYSTEMS ON THE WSJ0-3MIX DATASET
As for DC [13], DC++ [46], uPIT [12], SDC–MLT-Grid [47]
and CASA-E2E [6], they all do the speech separation in the
T-F domain with no phase enhancement. Their performance is slightly worse than that of the other speech separation methods. TasNet
[30] and Conv-TasNet [35] extend uPIT to the time domain and
use the TCN for separation, which acquire quite good results.
Note that the TasNet [30] does not use the prior knowledge of
the pre-separated speech. From Table IV we can find that our
proposed method acquires the best performance, which indicates the effectiveness of our proposed method. The reason is that our
proposed method can make full use of the prior knowledge of the
pre-separated speech to help reduce the residual interference. In
order to address the mismatch problem of magnitude and phase,
our proposed E2EPF utilizes the waveform as the input feature,
which can enhance the magnitude and phase simultaneously. In
addition, the deep attention fusion features are applied to E2EPF
so that the E2EPF can pay more attention to the pre-separated
speech. Therefore, the E2EPF can enhance the separated speech
very well and the performance of speech separation can be
improved.
Table V shows the results of our proposed method
uPIT+DEF+DL+E2EPF+attention and other state-of-the-art
methods on the same WSJ0-3mix dataset. As the SI-SNR
and SDR of the mixture in the WSJ0-3mix dataset are negative, we use the SI-SNR and SDR as the evaluation met-
rics. From Table V we can find that our proposed method
uPIT+DEF+DL+E2EPF+attention outperforms other separa-
tion systems on the WSJ0-3mix dataset. These results indicate
that our proposed method is effective for speech separation.
VI. DISCUSSIONS
The above experimental results show that our proposed end-
to-end post-filter method with deep attention fusion features is
effective for speaker independent speech separation. We can
make some interesting observations as follows.
Our proposed end-to-end post-filter method can further re-
duce the residual interference and improve the performance of
speech separation. The pre-separation stage (the uPIT+DEF+DL method) outperforms the uPIT method, but its performance still needs to be improved. This is because the separated speech from this stage may still contain residual interference. In
addition, it uses the mismatched mixture phase and the enhanced
magnitude to reconstruct the separated speech, which damages
the separation performance. When the proposed end-to-end
post-filter method is utilized, the separation performance can
be improved. The reason is that the end-to-end post-filter makes
full use of the prior knowledge of pre-separated speech so that it
can reduce the residual interference and improve the separation
performance. Besides, it utilizes the waveform as the input
features, which includes the magnitude and phase. Therefore,
when it enhances the waveform, the amplitude and phase can be
enhanced simultaneously. So our proposed method can address
the mismatch problem of the magnitude and phase.
The deep attention fusion features are conducive to
speech separation. Compared to the uPIT+DEF+DL+E2EPF
method (without deep attention fusion features), the proposed
uPIT+DEF+DL+E2EPF+attention can acquire a better speech
separation result. The reason is that these deep attention fusion
features are extracted by an attention module that computes
the similarity between the mixture and pre-separated signals.
Therefore, the end-to-end post-filter can pay more attention to
the pre-separated signals so that the residual interference can be
reduced and the pre-separated speech can be enhanced further.
In summary, our proposed end-to-end post-filter method can
further reduce the residual interference. Furthermore, the deep
attention fusion features are applied to improve the performance
of speech separation.
VII. CONCLUSION
In this paper, we presented an end-to-end post-filter method
for monaural speech separation, which utilized the deep attention
fusion features. The uPIT+DEF+DL method was applied to sep-
arate the mixture speech preliminarily. In order to further reduce
the interference, the end-to-end post-filter with the deep attention
fusion features was proposed. Our experiments were conducted
on the WSJ0-2mix and WSJ0-3mix datasets. Results showed that the
proposed method was effective for speaker-independent speech
separation. In the future, we will extend the proposed method
to multi-channel speech separation, which could use the spatial
information to improve the performance of speech separation.
REFERENCES
[1] J. A. O'Sullivan et al., “Attentional selection in a cocktail party environ-
ment can be decoded from single-trial EEG,” Cerebral Cortex, vol. 25,
no. 7, pp. 1697–1706, 2015.
[2] C. Fan, B. Liu, J. Tao, J. Yi, and Z. Wen, “Spatial and spectral deep
attention fusion for multi-channel speech separation using deep embedding
features,” 2020, arXiv:2002.01626.
[3] C. Fan, B. Liu, J. Tao, J. Yi, Z. Wen, and Y. Bai, “Noise prior knowledge
learning for speech enhancement via gated convolutional generative ad-
versarial network, in Proc. IEEE Asia-Pacific Signal Inf. Process. Assoc.
Annu. Summit Conf., 2019, pp. 662–666.
[4] D. Wang and J. Chen, “Supervised speech separation based on deep
learning: An overview, IEEE/ACM Trans. Audio, Speech, Lang. Process.,
vol. 26, no. 10, pp. 1702–1726, Oct. 2018.
[5] Z.-Q. Wang, K. Tan, and D. Wang, “Deep learning based phase reconstruc-
tion for speaker separation: A trigonometric perspective, in Proc. IEEE
Int. Conf. Acoust., Speech Signal Process., 2019, pp. 71–75.
[6] Y. Liu and D. Wang, “A CASA approach to deep learning based speaker-
independent co-channel speech separation,” in Proc. IEEE Int. Conf.
Acoust., Speech Signal Process., 2018, pp. 5399–5403.
[7] J. Wang et al., “Deep extractor network for target speaker recovery from
single channel speech mixtures,” in Proc. Interspeech, 2018, pp. 307–311.
[8] Y. Luo and N. Mesgarani, “TasNet: Time-domain audio separation network
for real-time, single-channel speech separation,” in Proc. IEEE Int. Conf.
Acoust., Speech, Signal Process., 2018, pp. 696–700.
[9] C. Xu, W. Rao, X. Xiao, E. S. Chng, and H. Li, “Single channel speech
separation with constrained utterance level permutation invariant training
using grid LSTM,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
Process., 2018, pp. 6–10.
[10] C. Fan, B. Liu, J. Tao, Z. Wen, J. Yi, and Y. Bai, “Utterance-level per-
mutation invariant training with discriminative learning for single channel
speech separation,” in Proc. IEEE Int. Symp. Chin. Spoken Lang. Process.,
2018, pp. 26–30.
[11] K. Wang, F. Song, and X. Lei, “A Pitch-aware approach to single-channel
speech separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Pro-
cess., 2019, pp. 296–300.
[12] M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, “Multitalker speech separation
with utterance-level permutation invariant training of deep recurrent neural
networks,” IEEE/ACM Trans. Audio, Speech Lang. Process., vol. 25,
no. 10, pp. 1901–1913, Oct. 2017.
[13] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering:
Discriminative embeddings for segmentation and separation, in IEEE Int.
Conf. Acoust., Speech Signal Process., 2016, pp. 31–35.
[14] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-
microphone speaker separation,” in Proc. IEEE Int. Conf. Acoust., Speech
Signal Process., 2017, pp. 246–250.
[15] D. Yu, M. Kolbæk, Z. H. Tan, and J. Jensen, “Permutation invariant training
of deep models for speaker-independent multi-talker speech separation,
in IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 241–245.
[16] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Alternative objective functions
for deep clustering,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
Process., 2018, pp. 686–690.
[17] Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani, “Deep
clustering and conventional networks for music separation: Stronger to-
gether, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017,
pp. 61–65.
[18] J. Rouat, “Computational auditory scene analysis: Principles, algorithms,
and applications (Wang, D. and Brown, GJ, eds.; 2006) [book review],
IEEE Trans. Neural Netw., vol. 19, no. 1, Jan. 2008.
[19] E. M. Grais, G. Roma, A. J. R. Simpson, and M. D. Plumbley, “Combining
mask estimates for single channel audio source separation using deep
neural networks,” in Proc. Interspeech, 2016, pp. 3339–3343.
[20] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Singing-
voice separation from monaural recordings using deep recurrent neural
networks,”in Proc. Int. Soc. Music Inf. Retrieval Conf., 2014, pp. 477–482.
[21] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Deep
learning for monaural speech separation,” in Proc. IEEE Int. Conf. Acoust.,
Speech Signal Process., 2014, pp. 1562–1566.
[22] C. Fan, B. Liu, J. Tao, J. Yi, and Z. Wen, “Discriminative learning for
monaural speech separation using deep embedding features,” in Proc.
Interspeech, 2019, pp. 4599–4603.
[23] Y. Wang, A. Narayanan, and D. L. Wang, “On training targets for super-
vised speech separation,” IEEE/ACM Trans. Audio Speech Lang. Process.,
vol. 22, no. 12, pp. 1849–1858, Dec. 2014.
[24] H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, “Phase-sensitive
and recognition-boosted speech separation using deep recurrent neural
networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015,
pp. 708–712.
[25] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
jointly learning to align and translate,” in Proc. 3rd Int. Conf. Learn.
Representations, 2015.
[26] D.Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel,and Y. Bengio, “End-to-
end attention-based large vocabulary speech recognition, in Proc. IEEE
Int. Conf. Acoust., Speech Signal Process., 2016, pp. 4945–4949.
[27] X. Hao, C. Shan, Y. Xu, S. Sun, and L. Xie, “An attention-based neural
network approach for single channel speech enhancement,” in Proc. IEEE
Int. Conf. Acoust., Speech Signal Process., 2019, pp. 6895–6899.
[28] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-
based neural machine translation,” in Proc. Conf. Empirical Methods
Natural Lang. Process., 2015, pp. 1412–1421.
[29] X. Xiao et al., “Single-channel speech extraction using speaker inventory
and attention network,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
Process., 2019, pp. 86–90.
[30] Y. Luo and N. Mesgarani, “TasNet: Surpassing ideal time-frequency
masking for speech separation,” 2018, arXiv:1809.07454.
[31] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic
convolutional and recurrent networks for sequence modeling, 2018,
arXiv:1803.01271.
[32] C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional
networks: A unified approach to action segmentation, in Proc. Eur. Conf.
Comput. Vision, 2016, pp. 47–54.
[33] A. Pandey and D. Wang, “TCNN: Temporal convolutional neural network
for real-time speech enhancement in the time domain,” in Proc. IEEE Int.
Conf. Acoust., Speech Signal Process., 2019, pp. 6875–6879.
[34] Y. Liu and D. Wang, “A CASA approach to deep learning based speaker-
independent co-channel speech separation,” 2018 IEEE Int. Conf. Acous-
tics, Speech Signal Process. (ICASSP), 2019, pp. 5399–5403.
[35] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency
magnitude masking for speech separation,” IEEE/ACM Trans. Audio,
Speech, Lang. Process., vol. 27, no. 8, pp. 1256–1266, Aug. 2019.
[36] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal
convolutional networks for action segmentation and detection, in Proc.
IEEE Conf. Comput. Vision Pattern Recognit., 2017, pp. 156–165.
[37] A. van den Oord et al., “WaveNet: A generative model for raw audio, in
Proc. 9th ISCA Speech Synthesis Workshop.
[38] F. Chollet, “Xception: Deep learning with depthwise separable convo-
lutions,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2017,
pp. 1251–1258.
[39] A. G. Howard et al., “Mobilenets: Efficient convolutional neural networks
for mobile vision applications,” 2017, arXiv:1704.04861.
[40] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
surpassing human-level performance on imagenet classification, in Proc.
IEEE Int. Conf. Comput. vision, 2015, pp. 1026–1034.
[41] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in
blind audio source separation,”IEEE Trans. Audio, Speech, Lang. Process.,
vol. 14, no. 4, pp. 1462–1469, Jul. 2006.
[42] J. Garofalo, D. Graff, D. Paul, and D. Pallett, “Csr-i (WSJ0) Complete,
Linguistic Data Consortium, Philadelphia, 2007.
[43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
Comput. Sci., 2014, arXiv:1412.6980.
[44] A. W. Rix, M. P. Hollier, A. P. Hekstra, and J. G. Beerends, “Perceptual
evaluation of speech quality (PESQ) the new ITU standard for end-to-end
speech quality assessment part I–time-delay compensation,” J. Audio Eng.
Soc., vol. 50, no. 10, pp. 755–764, 2002.
[45] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-
time objective intelligibility measure for time-frequency weighted noisy
speech,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2010,
pp. 4214–4217.
[46] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-
channel multi-speaker separation using deep clustering,” in Proc. Inter-
speech, 2016, pp. 545–549.
[47] C. Xu, W. Rao, E. S. Chng, and H. Li, “A shifted delta coefficient
objective for monaural speech separation using multi-task learning,” in
Proc. Interspeech, 2018, pp. 3479–3483.
[48] K. Wang, F. Soong, and L. Xie, “A pitch-aware approach to single-channel
speech separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Pro-
cess., 2019, pp. 296–300.
[49] Y. Luo, Z. Chen, and N. Mesgarani, “Speaker-independent speech separa-
tion with deep attractor network,” IEEE/ACM Trans. Audio, Speech, Lang.
Process., vol. 26, no. 4, pp. 787–796, Apr. 2018.
Cunhang Fan (Student Member, IEEE) received the
B.S. degree from the Beijing University of Chemical
Technology, Beijing, China, in 2016. He is currently
working toward the Ph.D. degree with the National
Laboratory of Pattern Recognition, Institute of Au-
tomation, Chinese Academy of Sciences, Beijing,
China. His current research interests include speech
separation, speech enhancement, speech recognition
and speech signal processing.
Jianhua Tao (Senior Member, IEEE) received the
M.S. degree from Nanjing University, Nanjing,
China, in 1996, and the Ph.D. degree from Tsinghua
University, Beijing, China, in 2001. He is currently
a Professor with NLPR, Institute of Automation,
Chinese Academy of Sciences, Beijing, China. He
has authored or coauthored more than 200 papers on
major journals and proceedings including the IEEE
TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE
PROCESSING. His current research interests include
speech recognition, speech synthesis, human com-
puter interaction, affective computing, and pattern recognition. He is the Board
Member of ISCA, the Chair or Program Committee Member for several major
conferences, including Interspeech, ICPR, ACII, ICMI, ISCSLP, etc. He was
the Steering Committee Member for the IEEE TRANSACTIONS ON AFFECTIVE
COMPUTING, and is an Associate Editor for Journal on Multimodal User Inter-
face and International Journal on Synthetic Emotions. He was the recipient of
several awards from major conferences, such as Interspeech and
NCMMSC.
Bin Liu (Member, IEEE) received the B.S. degree
and the M.S. degree from the Beijing Institute of
Technology, Beijing, China, in 2007 and 2009, respec-
tively. He received the Ph.D. degree from the Na-
tional Laboratory of Pattern Recognition, Institute of
Automation, Chinese Academy of Sciences, Beijing,
China, in 2015. He is currently an Associate Professor
with the National Laboratory of Pattern Recogni-
tion, Institute of Automation, Chinese Academy of
Sciences, Beijing, China. His current research in-
terests include affective computing and audio signal
processing.
Jiangyan Yi (Member, IEEE) received the M.A. de-
gree from the Graduate School of Chinese Academy
of Social Sciences, Beijing, China, in 2010 and
the Ph.D. degree from the University of Chinese
Academy of Sciences, Beijing, China, in 2018. She
was a Senior R&D Engineer with Alibaba Group
from 2011 to 2014. She is currently an Assis-
tant Professor with the National Laboratory of Pat-
tern Recognition, Institute of Automation, Chinese
Academy of Sciences, Beijing, China. Her current
research interests include speech processing, speech
recognition, distributed computing, deep learning, and transfer learning.
Zhengqi Wen (Member, IEEE) received the B.S. de-
gree from the University of Science and Technology
of China, Hefei, China, in 2008, and the Ph.D. de-
gree from the Chinese Academy of Sciences, Beijing,
China, in 2013. He is currently an Associate Professor
with the National Laboratory of Pattern Recognition,
Institute of Automation, Chinese Academy of Sci-
ences, Beijing, China. His current research interests
include speech processing, speech recognition, and
speech synthesis.
Xuefei Liu (Member, IEEE) received the M.A. de-
gree from Beijing Normal University, Beijing, China,
in 2013 and the Ph.D. degree from the Graduate
School of Chinese Academy of Social Sciences, Bei-
jing, China, in 2016. She is currently an Assistant Pro-
fessor with the National Laboratory of Pattern Recog-
nition, Institute of Automation, Chinese Academy of
Sciences, Beijing, China. Her current research inter-
ests include corpus construction and experimental
phonetics.