End-to-End Post-Filter for Speech Separation
With Deep Attention Fusion Features
Cunhang Fan , Student Member, IEEE, Jianhua Tao , Senior Member, IEEE, Bin Liu , Member, IEEE,
Jiangyan Yi , Member, IEEE, Zhengqi Wen, Member, IEEE, and Xuefei Liu, Member, IEEE
Abstract—In this article, we propose an end-to-end post-filter
method with deep attention fusion features for monaural speaker-
independent speech separation. At first, a time-frequency domain
speech separation method is applied as the pre-separation stage.
The aim of the pre-separation stage is to separate the mixture prelimi-
narily. Although this stage can separate the mixture, it still contains
the residual interference. In order to enhance the pre-separated
speech and improve the separation performance further, the end-
to-end post-filter (E2EPF) with deep attention fusion features is
proposed. The E2EPF can make full use of the prior knowledge of
the pre-separated speech, which contributes to speech separation.
It is a fully convolutional speech separation network and uses the
waveform as the input features. Firstly, the 1-D convolutional layer
is utilized to extract the deep representation features for the mixture
and pre-separated signals in the time domain. Secondly, to pay more
attention to the outputs of the pre-separation stage, an attention
module is applied to acquire deep attention fusion features, which
are extracted by computing the similarity between the mixture and
the pre-separated speech. These deep attention fusion features are
conducive to reducing the interference and enhancing the pre-separated
speech. Finally, these features are sent to the post-filter to estimate
each target signal. Experimental results on the WSJ0-2mix dataset
show that the proposed method outperforms the state-of-the-art
speech separation method. Compared with the pre-separation
method, our proposed method can acquire 64.1%, 60.2%, 25.6%
and 7.5% relative improvements in scale-invariant source-to-noise
ratio (SI-SNR), the signal-to-distortion ratio (SDR), the perceptual
evaluation of speech quality (PESQ) and the short-time objective
intelligibility (STOI) measures, respectively.
Manuscript received September 3, 2019; revised December 20, 2019 and
March 14, 2020; accepted March 16, 2020. Date of publication March 20,
2020; date of current version May 7, 2020. This work was supported in part
by the National Key Research and Development Plan of China under Grant
2017YFC0820602, in part by the National Natural Science Foundation of China
(NSFC) under Grants 61831022, 61771472, 61901473, and 61773379 and in
part by Inria-CAS Joint Research Project under Grants 173211KYSB20170061
and 173211KYSB20190049. The associate editor coordinating the review of
this manuscript and approving it for publication was Prof. Sven Erik Nordholm.
(Corresponding authors: Jianhua Tao; Bin Liu.)
Cunhang Fan is with the National Laboratory of Pattern Recognition, Institute
of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also
with the School of Artificial Intelligence, University of Chinese Academy of
Sciences, Beijing 100190, China (e-mail: cunhang.fan@nlpr.ia.ac.cn).
Jianhua Tao is with the National Laboratory of Pattern Recognition, Institute
of Automation, Chinese Academy of Sciences, Beijing 100190, China, with
the School of Artificial Intelligence, University of Chinese Academy of Sci-
ences, Beijing 100190, China, and also with the CAS Center for Excellence
in Brain Science and Intelligence Technology, Beijing 100190, China (e-mail:
jhtao@nlpr.ia.ac.cn).
Bin Liu, Jiangyan Yi, Zhengqi Wen, and Xuefei Liu are with the Na-
tional Laboratory of Pattern Recognition, Institute of Automation, Chinese
Academy of Sciences, Beijing 100190, China (e-mail: liubin@nlpr.ia.ac.cn;
jiangyan.yi@nlpr.ia.ac.cn; zqwen@nlpr.ia.ac.cn; xuefei.liu@nlpr.ia.ac.cn).
Digital Object Identifier 10.1109/TASLP.2020.2982029
Index Terms—Speech separation, end-to-end post-filter, deep
attention fusion features, deep clustering, permutation invariant
training.
I. INTRODUCTION
SPEECH separation aims to estimate the target sources from
a noisy mixture, which is known as the cocktail party
problem [1]–[3]. Monaural speech separation is a very challenging task because only a single channel is available.
This study focuses on monaural speaker-independent speech
separation.
Recently, deep learning has been applied to address speaker-
independent speech separation, which has obtained impressive
results [4]–[11]. The difficulty of speaker-independent speech
separation is label ambiguity or permutation problem [12], [13].
In order to deal with this problem, deep clustering (DC) [13]
is proposed, which is a state-of-the-art method for speaker-
independent speech separation. DC is usually formulated as a two-step process: embedding learning and embedding clustering.
Firstly, as for embedding learning, a bidirectional long short-term memory (BLSTM) network is trained to project each time-
frequency (T-F) bin of mixture spectrogram into an embedding
vector. The training objective is the Frobenius norm between the
affinity matrices of the embedding vector and the ideal binary
mask. In this way, if the T-F bins belong to the same speaker,
these embedding vectors are grouped closer together. Otherwise,
they become farther apart. Finally, in order to acquire the binary
mask of each source, K-means algorithm is applied to cluster
these embedding vectors, which is the embedding clustering.
Although DC gets good performance, it still has two limitations.
Firstly, the training objective is defined in the embedding vectors,
instead of the real separated sources. These embedding vectors
do not necessarily imply perfect separation of the sources in the
signal space. Secondly, DC applies the unsupervised K-means
clustering algorithm to estimate the binary masks of target
sources. Therefore, the performance of speech separation is
limited by the K-means clustering algorithm. To overcome the
training objective limitation of DC, the deep attractor network
(DANet) [14] method is proposed. Same as DC, the DANet
also maps the mixture spectrogram into a high-dimensional
embedding space. Different from DC, DANet firstly creates
attractor points at the embedding space. Then the similarities
between the embedded points and each attractor are applied
to estimate each source’s mask. However, at the test stage, it
still requires the unsupervised K-means clustering algorithm to
acquire the binary mask.
Frame-level permutation invariant training (PIT) [15] deals
with the permutation problem in a different way. During training,
the frame-level PIT (denoted by tPIT) computes all possible la-
bel permutations for each frame. Then tPIT uses the permutation
with the lowest mean square error (MSE) as the loss to train the
separation model. It can get a good performance for frame-level
separation. However, in the real-world conditions, the frame-
level permutation of separated signals is unknown. It means
that tPIT needs the speaker tracing step during inference. To
address this issue, utterance-level PIT (uPIT) [12] is proposed.
With uPIT, instead of choosing the permutation at frame-level,
the permutation corresponding to the minimum utterance-level
separation error is used for all frames in one utterance. In this
way, uPIT can effectively eliminate the speaker tracing problem.
However, tPIT and uPIT only reduce the distance between the same speakers; they do not increase the distance between different speakers. This may increase the possibility of remixing the separated sources.
In order to use both DC and PIT, the Chimera++ network [16] is applied for speech separation, which follows the Chimera network [17]. The Chimera++ network uses a multi-task learning
architecture to combine the DC and PIT. However, it simply
employs the DC and PIT as two outputs of the separation model
rather than fuses them deeply. Therefore, it does not solve
the limitations of DC and PIT. Computational auditory scene
analysis (CASA) [18] is a traditional speech separation method,
which is inspired by human auditory scene analysis. Deep CASA
[6] is another method to combine the DC and PIT. It adopts
the same divide-and-conquer strategy as CASA. Deep CASA is
a two-stage speech separation method. Firstly, tPIT is used to
estimate each source from the mixture spectrogram. Then, DC
is used as the speaker tracing step. In other words, DC is applied
to estimate the optimized permutation at frame-level. Although
deep CASA acquires good separation performance, it is also
limited by the K-means algorithm.
Motivated by PIT, DC and discriminative learning [2], [10],
[19]–[21], we proposed a discriminative learning method for
speaker-independent speech separation with deep embedding
features (denoted by uPIT+DEF+DL) in our previous work
[22]. uPIT+DEF+DL combines DC and PIT in a deep fusion
method and addresses the limitations of DC and PIT very well.
It utilizes the DC network as the extractor of deep embedding
features. Then instead of using K-means clustering algorithm to
estimate the target sources, uPIT+DEF+DL applies the uPIT
to separate the speech from these deep embedding features.
Although uPIT+DEF+DL can separate the mixture well, it still
has two drawbacks limiting its performance. Firstly, it uses the
separated magnitude and mixture phase to reconstruct target
signals by inverse short-time Fourier transformation (ISTFT),
which is mismatched for magnitude and phase. Secondly, the
separated signals by the uPIT+DEF+DL may still contain the
residual interference signals, which damages the performance
of speech separation.
In this study, in order to address the above issues, we propose
an end-to-end post-filter (E2EPF) method with deep attention
fusion features for monaural speaker-independent speech sepa-
ration. The proposed E2EPF utilizes the time-domain waveform
as the input features. The waveform contains all of the infor-
mation of the raw wave, including the magnitude and phase.
Therefore, separating the speech from waveform can solve the
mismatch problem of magnitude and phase. At first, the
uPIT+DEF+DL is used as the pre-separation stage to preliminar-
ily estimate target sources from the mixture spectrogram through
T-F domain. The separated speech by this stage may still contain
the residual interference. To further enhance the pre-separated
speech, the E2EPF with deep attention fusion features is applied.
The E2EPF can make full use of the prior knowledge of pre-
separated speech to help reduce the residual interference. Firstly,
the mixture and pre-separated signals are processed by the
1-D convolutional layer to extract deep representation features.
Secondly, instead of simply stacking these deep representation
features, an attention module is applied to compute the similarity
between the mixture and the pre-separated speech, which is
used as the extractor of deep attention fusion features. These
features can make the proposed model pay more attention to the
pre-separated signals so that the proposed E2EPF can reduce the
interference more easily and enhance the pre-separated speech.
The main contributions of this paper are two-fold. Firstly, we
propose the E2EPF to further enhance the pre-separated speech
and reduce the residual interference. Secondly, deep attention
fusion features are applied to compute the similarity between
the mixture and the pre-separated speech. Experiments are con-
ducted on WSJ0-2mix and WSJ0-3mix datasets [13]. Experi-
mental results show that our proposed method outperforms the
state-of-the-art speech separation method.
The rest of this paper is organized as follows. Section II
presents discriminative learning for monaural speech separa-
tion using deep embedding features. Section III introduces
the proposed end-to-end post-filter speech separation method.
The experimental setup is stated in Section IV. Section V
shows experimental results. Section VI shows the discussions.
Section VII draws conclusions.
II. DISCRIMINATIVE LEARNING FOR MONAURAL SPEECH
SEPARATION USING DEEP EMBEDDING FEATURES
The objective of monaural speech separation is to estimate target sources from the mixture speech recorded by a single channel:

$y(t) = \sum_{s=1}^{S} x_s(t)$   (1)

where $y(t)$ is the mixture speech, $t$ is the time index, $S$ is the number of sources, and $x_s(t), s = 1, \ldots, S$ are the target sources. The corresponding short-time Fourier transforms (STFT) of $y(t)$ and $x_s(t)$ are $Y(t, f)$ and $X_s(t, f)$.
Speech separation aims to estimate each source signal $x_s(t)$ from $y(t)$ or $Y(t, f)$. In this section, we introduce the discriminative learning method for speech separation with deep embedding features [22], which is based on uPIT. This method is denoted as uPIT+DEF+DL. We use this method as our pre-separation stage and our baseline.
Fig. 1. Schematic diagram of uPIT+DEF+DL speech separation system. DC
loss is the loss of deep clustering.
A. Deep Embedding Features
Fig. 1 shows the schematic diagram of uPIT+DEF+DL speech
separation system. Firstly, a BLSTM network is trained as the
extractor of deep embedding features (DEF). The aim of the
extractor is to project the mixed amplitude spectrum |Y(t, f)|
of each T-F bin into the D-dimensional deep embedding features
V.
$V = \gamma_\theta(|Y(t, f)|) \in \mathbb{R}^{TF \times D}$   (2)

where $TF$ is the number of T-F bins and $\gamma_\theta(\cdot)$ is the BLSTM mapping function. Here we consider a unit-norm embedding, so

$|v_i|^2 = 1, \quad v_i = \{v_{i,d}\}$   (3)

where $v_{i,d}$ is the value of the $d$-th dimension of the embedding for element $i$. We let the embeddings $V$ implicitly represent a $TF \times TF$ estimated affinity matrix $VV^T$.
As for the deep embedding features extractor, the loss function $J_{DC}$ is defined as follows:

$J_{DC} = \|VV^T - BB^T\|_F^2 = \|VV^T\|_F^2 - 2\|V^T B\|_F^2 + \|BB^T\|_F^2$   (4)

where $B \in \mathbb{R}^{TF \times S}$ is a binary matrix representing the source membership of each T-F bin: if the energy of source $s$ is the highest compared with the other sources, $B_{tf,s} = 1$; otherwise, $B_{tf,s} = 0$. $S$ denotes the number of sources and $\|\cdot\|_F^2$ is the squared Frobenius norm.
B. uPIT Based Speech Separation Model With Deep
Embedding Features
As for DC [13], the training objective is not the real sep-
arated sources. Besides, the unsupervised K-means clustering
algorithm is applied to acquire binary masks. Therefore, the
performance is limited by the K-means algorithm. In order
to address these issues, we use the deep embedding vectors
extracted by DC as the input of uPIT to directly learn each
source’s soft masks. In this way, on one hand, we directly use
the real separated sources as the training objective. In other
words, the DC and uPIT can be trained end-to-end. On the other
hand, the performance of speech separation is not limited by the
K-means algorithm.
The phase-sensitive mask (PSM) [23], [24] has been proven effective for speech separation because it makes full use of the phase information [12]. In this paper, we utilize the PSM for speech separation in the T-F domain. The ideal PSM is defined as:

$M_s(t, f) = \frac{|X_s(t, f)| \cos(\theta_y(t, f) - \theta_s(t, f))}{|Y(t, f)|}$   (5)

where $\theta_y(t, f)$ and $\theta_s(t, f)$ are the phases of the mixture speech and of target source $s$.
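For reference, the ideal PSM of Eq. 5 can be computed directly from the complex STFTs of the target source and the mixture. This is a hedged sketch; the array names and the optional clipping range are assumptions (the paper does not state whether the mask is truncated).

```python
import numpy as np

def ideal_psm(X_s, Y, clip=(0.0, 1.0), eps=1e-8):
    """Ideal phase-sensitive mask of Eq. 5.
    X_s, Y: complex STFTs of shape (T, F) for target source s and the mixture."""
    mask = np.abs(X_s) * np.cos(np.angle(Y) - np.angle(X_s)) / (np.abs(Y) + eps)
    # PSM values can fall outside [0, 1]; clipping is a common but optional choice
    return np.clip(mask, *clip) if clip is not None else mask
```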
uPIT computes the MSE for all possible speaker permutations at the utterance level. Then the minimum cost among all permutations $P$ is chosen as the optimal assignment:

$J_{uPIT} = \min_{\theta_s \in P} \sum_{s=1}^{S} \big\| |Y| \odot \hat{M}_s - |X_{\theta_s}| \odot \cos(\theta_y - \theta_{\theta_s}) \big\|_F^2$   (6)

where the number of permutations in $P$ is $N = S!$ ($!$ denotes the factorial). The indices $(t, f)$ are omitted in $\hat{M}_s$, $Y$, $X$, $\theta_y$ and $\theta_s$.
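The permutation search in Eq. 6 simply enumerates the $S!$ utterance-level assignments and keeps the one with the lowest error, which is feasible for small $S$. A minimal NumPy sketch with assumed variable names:

```python
import itertools
import numpy as np

def upit_psm_loss(Y_mag, masks, X_mags, cos_phase):
    """Eq. 6: utterance-level PIT with the phase-sensitive objective.
    Y_mag:     (T, F) mixture magnitude |Y|
    masks:     list of S estimated masks M_s, each (T, F)
    X_mags:    list of S target magnitudes |X_s|, each (T, F)
    cos_phase: list of S arrays cos(theta_y - theta_s), each (T, F)
    Returns the minimum summed squared error over all S! permutations."""
    S = len(masks)
    best = np.inf
    for perm in itertools.permutations(range(S)):
        err = sum(np.sum((Y_mag * masks[s] - X_mags[p] * cos_phase[p]) ** 2)
                  for s, p in enumerate(perm))
        best = min(best, err)
    return best
```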
C. Discriminative Learning
For uPIT, the target of minimizing Eq. 6 is to reduce the distance between the outputs and their corresponding target sources. To decrease the possibility of remixing the separated sources, discriminative learning (DL) is applied to our proposed model. DL not only reduces the distance between the prediction and the corresponding target, but also increases the distance between the prediction and the interfering sources. We assume that $\phi$ is the loss of the chosen permutation (the same as $J_{uPIT}$ in Eq. 6), which has the lowest MSE among all permutations. The discriminative learning loss function can then be defined as:

$J_{DL} = \phi - \sum_{\bar{\phi} \in P,\ \bar{\phi} \neq \phi} \alpha \bar{\phi}$   (7)

where $\bar{\phi}$ is the loss of a permutation from $P$ other than $\phi$, and $\alpha \geq 0$ is the regularization parameter of $\bar{\phi}$. When $\alpha = 0$, the loss function is the same as $J_{uPIT}$ in Eq. 6, i.e., no discriminative learning is applied.
D. Joint Training
To extract embedding features effectively, we apply the joint training framework to the proposed system. The loss function of joint training is defined as follows:

$J = \lambda J_{DC} + (1 - \lambda) J_{DL} = \lambda J_{DC} + (1 - \lambda)\Big(\phi - \sum_{\bar{\phi} \in P,\ \bar{\phi} \neq \phi} \alpha \bar{\phi}\Big)$   (8)

where $\lambda \in [0, 1]$ controls the weights of $J_{DC}$ and $J_{DL}$.
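Putting Eqs. 6-8 together, the pre-separation stage is trained on a weighted sum of the DC loss and the discriminative uPIT loss. The sketch below reuses the dc_loss helper and the permutation enumeration sketched above; the variable names are assumptions and the snippet is illustrative rather than the authors' code.

```python
import itertools
import numpy as np

def joint_dl_dc_loss(Y_mag, masks, X_mags, cos_phase, V, B, alpha=0.1, lam=0.5):
    """Eq. 8: J = lambda * J_DC + (1 - lambda) * J_DL, with J_DL from Eq. 7."""
    S = len(masks)
    # per-permutation phase-sensitive errors, as in Eq. 6
    perm_errors = []
    for perm in itertools.permutations(range(S)):
        err = sum(np.sum((Y_mag * masks[s] - X_mags[p] * cos_phase[p]) ** 2)
                  for s, p in enumerate(perm))
        perm_errors.append(err)
    idx = int(np.argmin(perm_errors))
    phi = perm_errors[idx]                            # chosen permutation's loss
    others = perm_errors[:idx] + perm_errors[idx + 1:]
    j_dl = phi - alpha * sum(others)                  # Eq. 7 (alpha = 0 recovers uPIT)
    j_dc = dc_loss(V, B)                              # Eq. 4, sketched in Section II-A
    return lam * j_dc + (1.0 - lam) * j_dl
```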
Fig. 2. (a): the diagram of the end-to-end post-filter. It contains three parts: feature extraction, deep attention fusion and post-filter. Features are extracted by
the 1-D convolution operation. Then the attention mechanism is leveraged for deep attention fusion. Finally, these features are input to the post-filter for speech separation. (b): the detailed block diagram of the post-filter. The post-filter is composed of 1-D convolution and a temporal convolutional network (TCN). (c): the design
of 1-D convolution block.
III. THE PROPOSED SPEECH SEPARATION METHOD
In this paper, we propose an end-to-end post-filter
(E2EPF) with deep attention fusion features for monaural
speaker-independent speech separation. Firstly, we use the
uPIT+DEF+DL to separate the mixture preliminarily in the T-F
domain, which is used as the pre-separation stage. The separated
speech by this method may still contain the residual interference.
In order to further enhance the separated speech and improve
the performance of speech separation, we utilize the E2EPF
with deep attention fusion features as another stage. The E2EPF
can make full use of the prior knowledge of the pre-separated
speech. The E2EPF is a fully convolutional network and applies
the waveform as the input feature. Besides, in order to make
the separation model pay more attention to the pre-separated
signals, an attention module is utilized to extract deep attention fusion features, which are computed from the similarity between the mixture and the pre-separated signals.
The E2EPF mainly solves two problems. Firstly, in the pre-
separation stage, it only enhances the magnitude and leaves the
phase spectrum unchanged. The mismatched magnitude and
phase are used to reconstruct estimated signals, which dam-
ages the performance of speech separation. The E2EPF does
the speech separation in the time domain so that it can enhance
the magnitude and phase spectrum simultaneously. Secondly, the
separated signals by the pre-separation stage may still contain
the residual interference. The E2EPF makes full use of the
prior knowledge of the pre-separated speech and applies the
deep attention fusion features to further remove the residual
interference and improve the performance of speech separation.
The E2EPF utilizes the waveform as the input features. It consists of three parts: feature extraction, deep attention fusion and post-filter, as shown in Fig. 2(a). In this section, we introduce these three parts in detail.
A. Feature Extraction
The input mixture speech $y(t)$ and the output sources $o_s(t), s = 1, 2, \ldots, S$ of the pre-separation stage can be divided into overlapping segments of length $L$. We denote them as $y_k \in \mathbb{R}^{1 \times L}$ and $o_{s,k} \in \mathbb{R}^{1 \times L}$, where $k = 1, \ldots, \hat{T}$ is the segment index and $\hat{T}$ denotes the total number of segments in $y(t)$ and $o_s(t)$.
The 1-D convolution operation is used to extract deep features from $y$ and $o_s$ (we drop the index $k$ and time $t$ from now on):

$w_y = \mathrm{ReLU}(y U_y)$   (9)

$w_s = \mathrm{ReLU}(o_s U_s), \quad s = 1, 2, \ldots, S$   (10)
where $w_y, w_s \in \mathbb{R}^{1 \times N}$ are the deep features extracted from $y$ and $o_s$, respectively, and $U_y \in \mathbb{R}^{N \times L}$ and $U_s \in \mathbb{R}^{N \times L}$ are the basis functions of the 1-D convolution operation, each containing $N$ vectors of length $L$. $\mathrm{ReLU}(\cdot)$ denotes the rectified linear unit, which is an optional nonlinear function.
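In practice, Eqs. 9 and 10 amount to applying learned 1-D convolutional encoders to the raw waveforms of the mixture and of each pre-separated output. The PyTorch sketch below uses the hyperparameters reported later in Section IV-C ($N = 256$, $L = 20$ samples); the 50% overlap (stride $L/2$) is an assumption, since the paper only states that the segments overlap.

```python
import torch
import torch.nn as nn

class WaveEncoder(nn.Module):
    """1-D convolutional encoder realizing Eqs. 9-10 (assumed hyperparameters)."""
    def __init__(self, num_filters=256, seg_len=20):
        super().__init__()
        # N = 256 basis filters of length L = 20 samples; 50% overlap assumed
        self.conv = nn.Conv1d(1, num_filters, kernel_size=seg_len,
                              stride=seg_len // 2, bias=False)

    def forward(self, wav):
        # wav: (batch, samples) -> deep features of shape (batch, N, num_segments)
        return torch.relu(self.conv(wav.unsqueeze(1)))

# separate encoders for the mixture y and the pre-separated outputs o_s
enc_y, enc_s = WaveEncoder(), WaveEncoder()
y   = torch.randn(1, 32000)          # placeholder: 4 s of 8 kHz mixture
o_1 = torch.randn(1, 32000)          # placeholder: one pre-separated output
w_y, w_1 = enc_y(y), enc_s(o_1)      # deep features of Eqs. 9 and 10
```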
B. Deep Attention Fusion
Recently, attention models have been successfully applied to
the sequence-to-sequence learning tasks [25]–[29]. In this study,
attention mechanism is leveraged to acquire the deep attention
fusion features.
The aim of the attention mechanism is to make the sepa-
ration model pay more attention to the output signals of the
pre-separation stage. It is used to compute the similarity between
the mixture and pre-separated signals. Therefore, the E2EPF can
further reduce the interference signals and improve the perfor-
mance of speech separation. In order to compute the similarity
between the mixture and the pre-separated signals, $w_y$ and $w_s$ are sent to another 1-D convolutional layer:

$w'_y = \mathrm{ReLU}(w_y U'_y)$   (11)

$w'_s = \mathrm{ReLU}(w_s U'_s), \quad s = 1, 2, \ldots, S$   (12)

where $U'_y \in \mathbb{R}^{N \times L}$ and $U'_s \in \mathbb{R}^{N \times L}$ are the basis functions of the 1-D convolution operation.
According to the global attention mechanism [28], the attention weight $\alpha_{t,t'}$ can be learned:

$\alpha_{t,t'} = \frac{\exp(d_{t,t'})}{\sum_{t'} \exp(d_{t,t'})}$   (13)

where $d_{t,t'}$ is the correlation between $w'_y$ and $w'_s$, which measures their similarity. The attention weight $\alpha_{t,t'}$ is the softmax of $d_{t,t'}$ over $t' \in [1, N]$. We follow the dot-based scoring function in [28] for $d_{t,t'}$, which is defined as:

$d_{t,t'} = w'^{T}_y w'_s$   (14)

The context vector $ct_s \in \mathbb{R}^{1 \times N}$ can be calculated as the weighted average of $w'_s$:

$ct_s = \sum_{t'} \alpha_{t,t'} w'_s$   (15)

As shown in Fig. 2(a), the gray area is the deep attention fusion part. Finally, the context vectors $ct_s$ and the mixture deep feature $w'_y$ are used as the deep attention fusion features and passed to the post-filter part.
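Eqs. 13-15 correspond to a dot-product (Luong-style) attention between the projected mixture features and each projected pre-separated stream. The PyTorch sketch below operates on framewise feature matrices of shape (number of segments, N); the names and shapes are assumptions, and the returned context vectors would then be concatenated with $w_s$ and $w'_y$ as in Eq. 16.

```python
import torch

def deep_attention_fusion(w_y_p, w_s_p_list):
    """Dot-product attention of Eqs. 13-15.
    w_y_p:      (T_hat, N) projected mixture features w'_y
    w_s_p_list: list of S tensors (T_hat, N), projected pre-separated features w'_s
    Returns the list of context vectors ct_s, each of shape (T_hat, N)."""
    contexts = []
    for w_s_p in w_s_p_list:
        d = w_y_p @ w_s_p.t()              # Eq. 14: dot-product similarity scores
        alpha = torch.softmax(d, dim=-1)   # Eq. 13: attention weights
        contexts.append(alpha @ w_s_p)     # Eq. 15: weighted average of w'_s
    return contexts                        # later fused with w_s and w'_y (Eq. 16)

# example with two pre-separated sources
T_hat, N = 100, 256
ct = deep_attention_fusion(torch.randn(T_hat, N),
                           [torch.randn(T_hat, N), torch.randn(T_hat, N)])
```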
C. Post-Filter
The detailed block diagram of the post-filter is shown in Fig. 2(b), which adopts a temporal convolutional network (TCN) similar to TasNet [30]. The TCN is leveraged in the end-to-end post-filter and has shown comparable or even better performance than RNNs in various sequence modeling tasks [30]–[36]. The post-
filter is a fully-convolutional module including stacked dilated
1-D convolutional blocks as shown in Fig. 2(c). Compared with
the TasNet [30], there are two main differences. Firstly, our
proposed post-filter makes full use of the prior knowledge of
the pre-separated speech and the post-filter is used as the second
stage to improve the separation performance. Secondly, to pay
more attention to the pre-separated speech, these deep attention
fusion features are applied.
TCNs are used in place of recurrent neural networks (RNNs), having shown comparable or even better performance in various sequence modeling tasks [30]–[35]. For each TCN, the 1-D convolutional blocks have increasing dilation factors ($1, 2, \ldots, 2^{M-1}$, where $M$ is the number of convolutional blocks), as shown in the light brown area of Fig. 2(b). These increasing dilation factors capture a large temporal context. To further increase the receptive field, the $M$ stacked dilated convolutional blocks are repeated $R = 4$ times.
Fig. 2(c) shows the stacked dilated 1-D convolutional block, which follows [37]. To avoid losing input information, a skip connection is utilized between the input and the next block. Depthwise separable convolution has been proven effective for image processing tasks [38], [39], so it is applied here to further decrease the number of parameters. A nonlinear activation function and a normalization operation are added after the first $1 \times 1$ conv block and the D-conv block, respectively. The parametric rectified linear unit (PReLU) [40] is applied because PReLU can improve model fitting with nearly zero extra computational cost and little overfitting risk [40]. The normalization is global layer normalization (gLN), because gLN outperforms all other normalization methods [35].
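A single block of this kind ($1 \times 1$ conv, PReLU, gLN, dilated depthwise conv, PReLU, gLN, $1 \times 1$ conv back, plus a residual connection) can be sketched in PyTorch as below. Channel sizes follow Section IV-C (256 bottleneck and 512 hidden channels, kernel size 3); gLN is approximated with nn.GroupNorm(1, C), which normalizes over channels and time per utterance; and only the residual path is shown, whereas Fig. 2(c) (following [37]) may also carry separate skip-connection outputs.

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Simplified dilated depthwise-separable 1-D conv block (Fig. 2(c) style)."""
    def __init__(self, in_ch=256, hid_ch=512, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2
        self.pointwise_in = nn.Conv1d(in_ch, hid_ch, 1)            # first 1x1 conv
        self.prelu1, self.norm1 = nn.PReLU(), nn.GroupNorm(1, hid_ch)
        self.depthwise = nn.Conv1d(hid_ch, hid_ch, kernel, padding=pad,
                                   dilation=dilation, groups=hid_ch)  # D-conv
        self.prelu2, self.norm2 = nn.PReLU(), nn.GroupNorm(1, hid_ch)
        self.pointwise_out = nn.Conv1d(hid_ch, in_ch, 1)           # back to in_ch

    def forward(self, x):                                          # x: (B, in_ch, T)
        h = self.norm1(self.prelu1(self.pointwise_in(x)))
        h = self.norm2(self.prelu2(self.depthwise(h)))
        return x + self.pointwise_out(h)                           # residual connection

# M = 8 blocks with dilations 1, 2, ..., 2^(M-1), repeated R = 4 times
tcn = nn.Sequential(*[DilatedConvBlock(dilation=2 ** m)
                      for _ in range(4) for m in range(8)])
```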
The output of the stacked dilated 1-D convolutional blocks is fed into a 1-D convolutional layer with a ReLU nonlinearity, and we denote these neural networks as $\gamma(\cdot)$. ReLU is used because we want the network to learn target masks, as in the T-F domain. The output of $\gamma(\cdot)$ is the estimated mask $m_s \in \mathbb{R}^{1 \times N}$ of each source, similar to the pre-separation stage:

$m_s = \gamma([w_s, ct_s; w'_y]), \quad s = 1, 2, \ldots, S$   (16)

Then the separated representation $e_s$ of source $s$ can be estimated as follows:

$e_s = w_y \odot m_s$   (17)

where $\odot$ denotes element-wise multiplication.
Finally, the estimated waveform $\hat{x}_s$ of source $s$ is reconstructed by the transposed 1-D convolution operator:

$\hat{x}_s = e_s U_e$   (18)

where $U_e \in \mathbb{R}^{N \times L}$ denotes the basis function of the transposed 1-D convolution operator.
D. Training Objective
In order to improve the separation performance, the training
objective of the end-to-end post-filter is to maximize the scale-
invariant source-to-noise ratio (SI-SNR) [41]. The SI-SNR is
defined as:

$x_{target} = \frac{\langle \hat{x}, x \rangle x}{\|x\|^2}$   (19)

$e_{noise} = \hat{x} - x_{target}$   (20)

$\mathrm{SI\mbox{-}SNR} = 10 \log_{10} \frac{\|x_{target}\|^2}{\|e_{noise}\|^2}$   (21)

where $\hat{x}$ and $x$ denote the estimated and target sources, respectively, and $\|x\|^2 = \langle x, x \rangle$ is the signal power. In order to solve the permutation problem, uPIT is utilized during training.
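Eqs. 19-21 can be computed directly; a short NumPy sketch is given below. Removing the mean of both signals before the projection is a common convention but is an assumption here, since the paper does not state it explicitly.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR of Eqs. 19-21 for one estimated/target pair."""
    est = est - est.mean()
    ref = ref - ref.mean()                  # zero-mean convention (assumed)
    x_target = np.dot(est, ref) * ref / (np.dot(ref, ref) + eps)    # Eq. 19
    e_noise = est - x_target                                        # Eq. 20
    return 10.0 * np.log10(np.dot(x_target, x_target)
                           / (np.dot(e_noise, e_noise) + eps))      # Eq. 21

# during training, uPIT selects the output permutation maximizing the summed SI-SNR
```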
IV. EXPERIMENTAL SETUP
A. Dataset
The WSJ0-2mix and WSJ0-3mix datasets [13] are used to conduct our experiments; they are derived from the WSJ0 corpus [42]. Each dataset has a training, validation and test set. The training set has 20,000 utterances (about 30 hours), the validation set has 5,000 utterances (about 10 hours), and the test set has 3,000 utterances (about 5 hours). All of the data are generated by randomly selecting utterances from the WSJ0 set and mixing them at signal-to-noise ratios (SNRs) between −5 dB and 5 dB. The training and validation sets are generated from the WSJ0 training set (si_tr_s). The test set is generated from the WSJ0 development set (si_dt_05) and evaluation set (si_et_05). All waveforms are sampled at 8000 Hz.
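For illustration only, a two-speaker mixture at a random SNR in the [−5, 5] dB range could be generated as sketched below; the official WSJ0-2mix scripts handle utterance selection, truncation and level normalization in more detail, so this is an assumption-laden simplification.

```python
import numpy as np

def mix_pair(s1, s2, snr_db):
    """Scale s2 so that the s1-to-s2 power ratio equals snr_db, then sum."""
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    p1, p2 = np.mean(s1 ** 2), np.mean(s2 ** 2)
    gain = np.sqrt(p1 / (p2 * 10.0 ** (snr_db / 10.0) + 1e-12))
    return s1 + gain * s2

snr = np.random.uniform(-5.0, 5.0)   # random SNR between -5 dB and 5 dB
```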
In order to evaluate the separation performance, the validation set is used as the closed condition (CC) and the test set is used as the open condition (OC).
B. Baseline Model
In this paper, we use the uPIT+DEF+DL as our baseline
model. To compute the short-time Fourier transform (STFT), the
Hamming window length is 32 ms and the window shift is 16 ms. Therefore,
the dimension of the spectral magnitude is 129. We use the
normalized amplitude spectrum of the mixture speech as the
input features.
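For reference, at 8 kHz a 32 ms window corresponds to 256 samples and a 16 ms shift to 128 samples, so the one-sided spectrum has 256/2 + 1 = 129 frequency bins, matching the stated feature dimension. The SciPy call below is only an illustrative way to obtain such features, not the authors' pipeline.

```python
import numpy as np
from scipy.signal import stft

fs = 8000
win_len = int(0.032 * fs)     # 256 samples (32 ms Hamming window)
hop = int(0.016 * fs)         # 128 samples (16 ms shift)

y = np.random.randn(4 * fs)   # placeholder 4 s mixture
_, _, Y = stft(y, fs=fs, window='hamming', nperseg=win_len,
               noverlap=win_len - hop)
print(Y.shape[0])             # 129 frequency bins
```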
There are two BLSTM layers with 896 units as for the
extractor of deep embedding features. We set the dimension
of embedding D to 40. Following the embedding layer, a tanh
activation function is utilized. For uPIT separation network,
there is only one layer with 896 units. Therefore, the network of
pre-separation has 3 BLSTM layers in total, which is the same as
the baseline in [12]. As for the mask estimation layer, a Rectified Linear Unit (ReLU) activation function is used to estimate the
mask of each source, which is followed by the uPIT separation
network. The discriminative learning parameter αis set to 0.1.
For each BLSTM layer, a random dropout is applied and
dropout rate is set to 0.5. The batch-size is 16 utterances which is
generated by randomly selecting. The minimum epoch is set to
30. The learning rate is initialized as 0.0005. When the training
loss increases on the validation set, the learning rate is scaled
down by 0.7. When the relative loss improvement is lower than
0.01, the model is early stopped. The models of this stage are
optimized with the Adam algorithm [43].
In this paper, we re-implement uPIT [12] with our experi-
mental setup, which has three BLSTM layers with 896 units.
The other settings are the same as those of our pre-separation stage.
C. The Proposed End-to-End Post-Filter Method
The input waveform is divided into 4-second segments. The
learning rate is initialized as 0.0001. If the training loss increases
in 3 consecutive epochs on the validation set, the learning rate
is halved. Same as the pre-separation stage, the optimizer of
this stage is the Adam algorithm [43]. The maximum number of
epochs is 100. As for the feature extraction, the number of filters in the first 1-D convolutional layer is 256, each with a length of 20 samples ($N = 256$, $L = 20$ in Section III-A). For the other 1-D convolutions, the number of channels is 256. For the convolutional blocks, the number of channels and the kernel size are 512 and 3, respectively. The number of repeats is $R = 4$, and in each repeat the number of convolutional blocks is $M = 8$.
D. Evaluation Metrics
In this work, in order to evaluate the performance of speech
separation results, the models are evaluated on the scale-
invariant source-to-noise ratio (SI-SNR), the signal-to-distortion
ratio (SDR), signal-to-interference ratio (SIR) and signal-to-
artifact ratio (SAR), which are BSS-eval [41] scores, the
perceptual evaluation of speech quality (PESQ) [44] measure
and the short-time objective intelligibility (STOI) measure [45].
E. Comparison With Ideal T-F Masks
In order to compare with the ideal T-F masks, we use the ideal
PSM (IPSM), ideal binary mask (IBM) and ideal ratio mask
(IRM). These masks are calculated by STFT with 32 ms length
hamming window and 16 ms window shift, which is the same
as the pre-separation stage. The IPSM is defined in Eq. 5. The
IBM and IRM of source $s = 1, 2, \ldots, S$ are defined as follows:

$\mathrm{IBM}_s(t, f) = \begin{cases} 1, & |X_s(t, f)| > |X_{j \neq s}(t, f)| \\ 0, & \text{otherwise} \end{cases}$   (22)

$\mathrm{IRM}_s(t, f) = \frac{|X_s(t, f)|}{\sum_{j=1}^{S} |X_j(t, f)|}$   (23)
V. RESULTS
A. Pre-Separation Stage
We firstly evaluate the performance of the pre-separation stage
in the T-F domain. Table I shows the results of SDR, SIR, SAR
and PESQ results of different uPIT-based speech separation methods under the closed (CC) and open (OC) conditions. The deep embedding features are denoted by DEF. In Table I, “Optimal (Opt.) Assign.” means that the outputs use the optimal assignment; in other words, the outputs use the optimal permutation for all of the frames in an utterance. Otherwise, it is the “Default (Def.) Assign.”
TABLE I
THE RESULTS OF SDR, SIR, SAR AND PESQ FOR DIFFERENT SEPARATION METHODS WITH CLOSED (CC) AND OPEN (OC) CONDITION ON THE WSJ0-2MIX DATASET. λ IS THE WEIGHT OF JOINT TRAINING IN EQ. 8. DEF DENOTES THE DEEP EMBEDDING FEATURES. UPIT IS THE BASELINE METHOD, UPIT+DEF AND UPIT+DEF+DL ARE OUR PROPOSED METHODS. UPIT+DEF MEANS WITH NO DISCRIMINATIVE LEARNING
1) Evaluation of Deep Embedding Features: From Table I,
we can find that in all objective measures, uPIT+DEF methods
all outperform the uPIT method regardless of the value of λ. These
results indicate that the uPIT based separation method with deep
embedding features can improve the performance of speaker-
independent speech separation. This is because these deep
embedding features are deep representations for the mixture
amplitude spectrum, which contain the potential information of
each target source so that they can effectively estimate the masks
of target sources. Therefore, these deep embedding features are
discriminative features for speech separation.
2) Evaluation of Discriminative Learning: The aim of dis-
criminative learning is to maximize the distance between differ-
ent sources and minimize the distance between same sources,
simultaneously.
Compared with the uPIT+DEF, uPIT+DEF+DL (discrimi-
native learning is utilized) achieves better performance in the
majority of cases, except for the PESQ measure. These re-
sults indicate that the discriminative learning can improve the
performance of speech separation. Meanwhile, especially for
the BSS-eval evaluation metrics (SDR, SIR and SAR), using
discriminative learning achieves better results. The reason
is that the discriminative learning increases the dissimilarity
between different speakers so that the possibility of remixing
the interferences can be reduced. Although the performance
of uPIT+DEF+DL is slightly worse than uPIT+DEF for PESQ
measure, it is also comparable to the uPIT+DEF and significantly
better than the uPIT.
Fig. 3 shows the MSE over epochs on the WSJ0-2mix with
and without DL training method based on uPIT+DEF. From
Fig. 3 we can find that the DL-based separation converges faster than the method without DL. This result indicates
the effectiveness of DL.
B. Comparison of the Proposed End-to-End Post-Filter
Method With the uPIT Based Methods
Table II shows the results of SI-SNR, SDR, PESQ and
STOI for the proposed method in the time domain and the T-F domain uPIT-based methods. They are all in the default assignment and open condition. In this study, we extend
uPIT+DEF+DL and propose the end-to-end post-filter method
for monaural speech separation with deep attention fusion fea-
tures (uPIT+DEF+DL+E2EPF+attention).
From Table II we can see that when the speech sig-
nals separated by the pre-stage (uPIT+DEF+DL method) are
Fig. 3. MSE over epochs on the WSJ0-2mix with and without DL training
method based on uPIT+DEF.
TABLE II
THE RESULTS OF SI-SNR, SDR, PESQ AND STOI FOR THE PROPOSED METHOD IN THE TIME DOMAIN AND THE T-F DOMAIN BASED METHODS ON THE WSJ0-2MIX DATASET. THEY ARE ALL IN THE DEFAULT ASSIGNMENT AND OPEN CONDITION
processed by the end-to-end post-filter, the performance of
speech separation can be improved significantly. More specifi-
cally, compared with the uPIT+DEF+DL, our proposed speech
separation method uPIT+DEF+DL+E2EPF+attention obtains
6.6 dB increment in SI-SNR, 6.5 dB increment in SDR, 0.7
increment in PESQ and 6.7% increment in STOI. The reason for
the large improvement is that the uPIT+DEF+DL method does
the speech separation in the T-F domain and it only enhances the
amplitude spectrum, while the phase spectrum is left unchanged.
In other words, the uPIT+DEF+DL method utilizes the separated
TABLE III
THE SDR, PESQ AND STOI RESULTS OF DIFFERENT SEPARATION METHODS FOR DIFFERENT GENDER COMBINATIONS ON THE WSJ0-2MIX DATASET. THEY ARE ALL IN THE DEFAULT ASSIGNMENT AND OPEN CONDITION
TABLE IV
COMPARISON WITH OTHER STATE-OF-THE-ART SYSTEMS ON THE WSJ0-2MIX DATASET
magnitude spectrum and the mixture phase spectrum to recon-
struct each source signal by ISTFT. However, the separated
magnitude spectrum and the mixture phase spectrum are mis-
matched, which damages the performance of speech separation.
As for our proposed uPIT+DEF+DL+E2EPF+attention method,
the pre-separation stage does the speech separation in the T-F
domain to separate the mixture preliminarily. At the end-to-end
post-filter stage, in order to improve the performance of speech
separation, it applies the waveform as the input features. The
waveform contains all of the information of the mixture signals,
including magnitude spectrum and phase spectrum. Therefore,
this stage enhances the magnitude spectrum and phase spectrum,
simultaneously. In addition, to reduce the complexity and size
of the end-to-end post-filter model, at the end-to-end post-filter
stage, all structures are CNNs.
C. Evaluation of the Deep Attention Fusion Features
In Table II, Table III and Table IV, uPIT+DEF+DL+E2EPF denotes the model without the deep attention fusion module, while uPIT+DEF+DL+E2EPF+attention denotes the model with it.
From Table II we can find that when the deep atten-
tion fusion features are applied, the performance of speech
separation can be improved. More specifically, compared to the uPIT+DEF+DL+E2EPF method, uPIT+DEF+DL+E2EPF+attention acquires a 0.3 dB increment for both the SI-SNR and SDR evaluation metrics. The reason is that these deep attention fusion features are extracted by the attention module, which computes the similarity between the mixture and the pre-separated signals. Therefore, these deep attention fusion features make the separation model pay more attention to the pre-separated signals, so they help reduce the residual interference and enhance the pre-separated speech, and the performance of speech separation can be
improved. These results prove that deep attention fusion features
are effective for speech separation.
Examples of separated speech for the baseline and our pro-
posed method are available online.1
D. Comparison of the Proposed Method With the Ideal Masks
In order to make a comparison of our proposed method with
the ideal masks, Fig. 4 shows the results of SI-SNR, SDR, PESQ
and STOI for our proposed method and the ideal masks.
From Fig. 4, several observations can be found. Firstly, IPSM
has the best performance compared with the other ideal masks
(IBM and IRM) in all evaluation metrics. This is because
the IPSM is a phase sensitive mask, which makes full use
of the phase information. Therefore, the phase is very im-
portant for speech separation. Secondly, as for SI-SNR and
SDR evaluation metrics as shown in Fig. 4(a) and (b), our
proposed method uPIT+DEF+DL+E2EPF+attention acquires
the best performance compared with the ideal masks. If only the magnitude spectrum is enhanced and the phase spectrum is left unchanged, these ideal masks represent the performance upper bound of speech separation. However, the performance of our proposed
method is better than these ideal masks, which reveals that our
1[Online]. Available: https://github.com/fchest/wave-samples
Fig. 4. The results of SI-SNR, SDR, PESQ and STOI for the proposed method uPIT+DEF+DL+E2EPF+attention and ideal masks on WSJ0-2mix dataset.
(a) The SI-SNR result. (b) The SDR results. (c) The PESQ results. (d) The STOI results.
proposed method can separate the mixture very well. Finally, as
for the PESQ evaluation metric (Fig. 4(c)), although the performance of the proposed method is slightly worse than IPSM, it is still better than IBM and comparable to IRM. As for the STOI evaluation metric (Fig. 4(d)), our proposed method is comparable
to the IPSM and outperforms the IBM and IRM. Therefore,
these results indicate the effectiveness of our proposed method
for speech separation.
E. Comparison With Different Gender Combinations
Table III compares the results of uPIT based speech separation
methods for different gender combinations. Male-female combi-
nations can acquire a better performance than female-female and
male-male combinations for all of speech separation methods
in Table III. This is because, compared with same-gender combinations, different-gender combinations have larger differences in speech features, for example pitch. Therefore, same-gender speech is more difficult to separate.
However, our proposed method can achieve better results than
other methods for all of the gender combinations, especially for
the same gender combinations. These results indicate that our
proposed method is effective for speech separation.
F. Comparison With Other State-of-the-Art Methods
In order to compare the separation results of our proposed
method with previous methods, Table IV shows the performance
of our proposed method uPIT+DEF+DL+E2EPF+attention and
other state-of-the-art methods on the same WSJ0-2mix dataset.
For all methods, the best reported results are listed and they are
all in the default assignment and open condition. Note that the methods in [5], [6], [12], [13], [16], [30], [35], [47] report SDR improvement results. To compare equally, 0.2 dB is added to their final results, although the SDR of the mixture is only about 0.15 dB. In this table, missing values indicate that the results are not reported in the corresponding study.
TABLE V
COMPARISON WITH OTHER STATE-OF-THE-ART SYSTEMS ON THE WSJ0-3MIX DATASET
As for DC [13], DC++ [46], uPIT [12], SDC–MLT-Grid [47]
and CASA-E2E [6], they all do the speech separation in the
T-F domain with no phase enhancement. Their performance is slightly worse than that of the other speech separation methods. TasNet
[30] and Conv-TasNet [35] extend uPIT to the time domain and
use the TCN for separation, which acquire quite good results.
Note that the TasNet [30] does not use the prior knowledge of
the pre-separated speech. From Table IV we can find that our
proposed method acquires the best performance, which indicates the effectiveness of our proposed method. The reason is that our
proposed method can make full use of the prior knowledge of the
pre-separated speech to help reduce the residual interference. In
order to address the mismatch problem of magnitude and phase,
our proposed E2EPF utilizes the waveform as the input feature,
which can enhance the magnitude and phase simultaneously. In
addition, the deep attention fusion features are applied to E2EPF
so that the E2EPF can pay more attention to the pre-separated
speech. Therefore, the E2EPF can enhance the separated speech
very well and the performance of speech separation can be
improved.
Table V shows the results of our proposed method
uPIT+DEF+DL+E2EPF+attention and other state-of-the-art
methods on the same WSJ0-3mix dataset. As the SI-SNR
and SDR of the mixture in the WSJ0-3mix dataset are negative, we use the SI-SNR and SDR as the evaluation met-
rics. From Table V we can find that our proposed method
uPIT+DEF+DL+E2EPF+attention outperforms other separa-
tion systems on the WSJ0-3mix dataset. These results indicate
that our proposed method is effective for speech separation.
VI. DISCUSSIONS
The above experimental results show that our proposed end-
to-end post-filter method with deep attention fusion features is
effective for speaker independent speech separation. We can
make some interesting observations as follows.
Our proposed end-to-end post-filter method can further re-
duce the residual interference and improve the performance of
speech separation. The pre-separation stage (the uPIT+DEF+DL method) outperforms the uPIT method, but its performance still needs to be improved. This is because the separated speech from this stage may still contain residual interference. In
addition, it uses the mismatched mixture phase and the enhanced
magnitude to reconstruct the separated speech, which damages
the separation performance. When the proposed end-to-end
post-filter method is utilized, the separation performance can
be improved. The reason is that the end-to-end post-filter makes
full use of the prior knowledge of pre-separated speech so that it
can reduce the residual interference and improve the separation
performance. Besides, it utilizes the waveform as the input
features, which includes the magnitude and phase. Therefore,
when it enhances the waveform, the amplitude and phase can be
enhanced simultaneously. So our proposed method can address
the mismatch problem of the magnitude and phase.
The deep attention fusion features are conducive to
speech separation. Compared to the uPIT+DEF+DL+E2EPF
method (without deep attention fusion features), the proposed
uPIT+DEF+DL+E2EPF+attention can acquire a better speech
separation result. The reason is that these deep attention fusion
features are extracted by an attention module that computes
the similarity between the mixture and pre-separated signals.
Therefore, the end-to-end post-filter can pay more attention to
the pre-separated signals so that the residual interference can be
reduced and the pre-separated speech can be enhanced further.
In summary, our proposed end-to-end post-filter method can
further reduce the residual interference. Furthermore, the deep
attention fusion features are applied to improve the performance
of speech separation.
VII. CONCLUSION
In this paper, we presented an end-to-end post-filter method
for monaural speech separation, which utilized the deep attention
fusion features. The uPIT+DEF+DL method was applied to sep-
arate the mixture speech preliminarily. In order to further reduce
the interference, the end-to-end post-filter with the deep attention
fusion features was proposed. Our experiments were conducted
on the WSJ0-2mix and WSJ0-3mix datasets. Results showed that the
proposed method was effective for speaker-independent speech
separation. In the future, we will extend the proposed method
to multi-channel speech separation, which could use the spatial
information to improve the performance of speech separation.
REFERENCES
[1] J. A. O'Sullivan et al., “Attentional selection in a cocktail party environ-
ment can be decoded from single-trial EEG,” Cerebral Cortex, vol. 25,
no. 7, pp. 1697–1706, 2015.
[2] C. Fan, B. Liu, J. Tao, J. Yi, and Z. Wen, “Spatial and spectral deep
attention fusion for multi-channel speech separation using deep embedding
features,” 2020, arXiv:2002.01626.
[3] C. Fan, B. Liu, J. Tao, J. Yi, Z. Wen, and Y. Bai, “Noise prior knowledge
learning for speech enhancement via gated convolutional generative ad-
versarial network, in Proc. IEEE Asia-Pacific Signal Inf. Process. Assoc.
Annu. Summit Conf., 2019, pp. 662–666.
[4] D. Wang and J. Chen, “Supervised speech separation based on deep
learning: An overview, IEEE/ACM Trans. Audio, Speech, Lang. Process.,
vol. 26, no. 10, pp. 1702–1726, Oct. 2018.
[5] Z.-Q. Wang, K. Tan, and D. Wang, “Deep learning based phase reconstruc-
tion for speaker separation: A trigonometric perspective, in Proc. IEEE
Int. Conf. Acoust., Speech Signal Process., 2019, pp. 71–75.
[6] Y. Liu and D. Wang, “A CASA approach to deep learning based speaker-
independent co-channel speech separation,” in Proc. IEEE Int. Conf.
Acoust., Speech Signal Process., 2018, pp. 5399–5403.
[7] J. Wang et al., “Deep extractor network for target speaker recovery from
single channel speech mixtures,” in Proc. Interspeech, 2018, pp. 307–311.
[8] Y. Luo and N. Mesgarani, “TasNet: Time-domain audio separation network
for real-time, single-channel speech separation,” in Proc. IEEE Int. Conf.
Acoust., Speech, Signal Process., 2018, pp. 696–700.
[9] C. Xu, W. Rao, X. Xiao, E. S. Chng, and H. Li, “Single channel speech
separation with constrained utterance level permutation invariant training
using grid LSTM,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
Process., 2018, pp. 6–10.
[10] C. Fan, B. Liu, J. Tao, Z. Wen, J. Yi, and Y. Bai, “Utterance-level per-
mutation invariant training with discriminative learning for single channel
speech separation,” in Proc. IEEE Int. Symp. Chin. Spoken Lang. Process.,
2018, pp. 26–30.
[11] K. Wang, F. Song, and X. Lei, “A Pitch-aware approach to single-channel
speech separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Pro-
cess., 2019, pp. 296–300.
[12] M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, “Multitalker speech separation
with utterance-level permutation invariant training of deep recurrent neural
networks,” IEEE/ACM Trans. Audio, Speech Lang. Process., vol. 25,
no. 10, pp. 1901–1913, Oct. 2017.
[13] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering:
Discriminative embeddings for segmentation and separation, in IEEE Int.
Conf. Acoust., Speech Signal Process., 2016, pp. 31–35.
[14] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-
microphone speaker separation,” in Proc. IEEE Int. Conf. Acoust., Speech
Signal Process., 2017, pp. 246–250.
[15] D. Yu, M. Kolbæk, Z. H. Tan, and J. Jensen, “Permutation invariant training
of deep models for speaker-independent multi-talker speech separation,
in IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 241–245.
[16] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Alternative objective functions
for deep clustering,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
Process., 2018, pp. 686–690.
[17] Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani, “Deep
clustering and conventional networks for music separation: Stronger to-
gether, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017,
pp. 61–65.
[18] J. Rouat, “Computational auditory scene analysis: Principles, algorithms,
and applications (Wang, D. and Brown, GJ, eds.; 2006) [book review],
IEEE Trans. Neural Netw., vol. 19, no. 1, Jan. 2008.
[19] E. M. Grais, G. Roma, A. J. R. Simpson, and M. D. Plumbley, “Combining
mask estimates for single channel audio source separation using deep
neural networks,” in Proc. Interspeech, 2016, pp. 3339–3343.
[20] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Singing-
voice separation from monaural recordings using deep recurrent neural
networks,”in Proc. Int. Soc. Music Inf. Retrieval Conf., 2014, pp. 477–482.
[21] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Deep
learning for monaural speech separation,” in Proc. IEEE Int. Conf. Acoust.,
Speech Signal Process., 2014, pp. 1562–1566.
[22] C. Fan, B. Liu, J. Tao, J. Yi, and Z. Wen, “Discriminative learning for
monaural speech separation using deep embedding features,” in Proc.
Interspeech, 2019, pp. 4599–4603.
[23] Y. Wang, A. Narayanan, and D. L. Wang, “On training targets for super-
vised speech separation,” IEEE/ACM Trans. Audio Speech Lang. Process.,
vol. 22, no. 12, pp. 1849–1858, Dec. 2014.
[24] H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, “Phase-sensitive
and recognition-boosted speech separation using deep recurrent neural
networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015,
pp. 708–712.
[25] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
jointly learning to align and translate,” in Proc. 3rd Int. Conf. Learn.
Representations, 2015.
[26] D.Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel,and Y. Bengio, “End-to-
end attention-based large vocabulary speech recognition, in Proc. IEEE
Int. Conf. Acoust., Speech Signal Process., 2016, pp. 4945–4949.
[27] X. Hao, C. Shan, Y. Xu, S. Sun, and L. Xie, “An attention-based neural
network approach for single channel speech enhancement,” in Proc. IEEE
Int. Conf. Acoust., Speech Signal Process., 2019, pp. 6895–6899.
[28] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-
based neural machine translation,” in Proc. Conf. Empirical Methods
Natural Lang. Process., 2015, pp. 1412–1421.
[29] X. Xiao et al., “Single-channel speech extraction using speaker inventory
and attention network,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
Process., 2019, pp. 86–90.
[30] Y. Luo and N. Mesgarani, “TasNet: Surpassing ideal time-frequency
masking for speech separation,” 2018, arXiv:1809.07454.
[31] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic
convolutional and recurrent networks for sequence modeling, 2018,
arXiv:1803.01271.
[32] C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional
networks: A unified approach to action segmentation, in Proc. Eur. Conf.
Comput. Vision, 2016, pp. 47–54.
[33] A. Pandey and D. Wang, “TCNN: Temporal convolutional neural network
for real-time speech enhancement in the time domain,” in Proc. IEEE Int.
Conf. Acoust., Speech Signal Process., 2019, pp. 6875–6879.
[34] Y. Liu and D. Wang, “A CASA approach to deep learning based speaker-
independent co-channel speech separation,” 2018 IEEE Int. Conf. Acous-
tics, Speech Signal Process. (ICASSP), 2019, pp. 5399–5403.
[35] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency
magnitude masking for speech separation,” IEEE/ACM Trans. Audio,
Speech, Lang. Process., vol. 27, no. 8, pp. 1256–1266, Aug. 2019.
[36] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal
convolutional networks for action segmentation and detection, in Proc.
IEEE Conf. Comput. Vision Pattern Recognit., 2017, pp. 156–165.
[37] A. van den Oord et al., “WaveNet: A generative model for raw audio, in
Proc. 9th ISCA Speech Synthesis Workshop.
[38] F. Chollet, “Xception: Deep learning with depthwise separable convo-
lutions,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2017,
pp. 1251–1258.
[39] A. G. Howard et al., “Mobilenets: Efficient convolutional neural networks
for mobile vision applications,” 2017, arXiv:1704.04861.
[40] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
surpassing human-level performance on imagenet classification, in Proc.
IEEE Int. Conf. Comput. vision, 2015, pp. 1026–1034.
[41] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in
blind audio source separation,”IEEE Trans. Audio, Speech, Lang. Process.,
vol. 14, no. 4, pp. 1462–1469, Jul. 2006.
[42] J. Garofalo, D. Graff, D. Paul, and D. Pallett, “Csr-i (WSJ0) Complete,
Linguistic Data Consortium, Philadelphia, 2007.
[43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
Comput. Sci., 2014, arXiv:1412.6980.
[44] A. W. Rix, M. P. Hollier, A. P. Hekstra, and J. G. Beerends, “Perceptual
evaluation of speech quality (PESQ) the new ITU standard for end-to-end
speech quality assessment part I–time-delay compensation,” J. Audio Eng.
Soc., vol. 50, no. 10, pp. 755–764, 2002.
[45] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-
time objective intelligibility measure for time-frequency weighted noisy
speech,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2010,
pp. 4214–4217.
[46] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-
channel multi-speaker separation using deep clustering,” in Proc. Inter-
speech, 2016, pp. 545–549.
[47] C. Xu, W. Rao, E. S. Chng, and H. Li, “A shifted delta coefficient
objective for monaural speech separation using multi-task learning,” in
Proc. Interspeech, 2018, pp. 3479–3483.
[48] K. Wang, F. Soong, and L. Xie, “A pitch-aware approach to single-channel
speech separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Pro-
cess., 2019, pp. 296–300.
[49] Y. Luo, Z. Chen, and N. Mesgarani, “Speaker-independent speech separa-
tion with deep attractor network,” IEEE/ACM Trans. Audio, Speech, Lang.
Process., vol. 26, no. 4, pp. 787–796, Apr. 2018.
Cunhang Fan (Student Member, IEEE) received the
B.S. degree from the Beijing University of Chemical
Technology, Beijing, China, in 2016. He is currently
working toward the Ph.D. degree with the National
Laboratory of Pattern Recognition, Institute of Au-
tomation, Chinese Academy of Sciences, Beijing,
China. His current research interests include speech
separation, speech enhancement, speech recognition
and speech signal processing.
Jianhua Tao (Senior Member, IEEE) received the
M.S. degree from Nanjing University, Nanjing,
China, in 1996, and the Ph.D. degree from Tsinghua
University, Beijing, China, in 2001. He is currently
a Professor with NLPR, Institute of Automation,
Chinese Academy of Sciences, Beijing, China. He
has authored or coauthored more than 200 papers on
major journals and proceedings including the IEEE
TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE
PROCESSING. His current research interests include
speech recognition, speech synthesis, human com-
puter interaction, affective computing, and pattern recognition. He is the Board
Member of ISCA, the Chair or Program Committee Member for several major
conferences, including Interspeech, ICPR, ACII, ICMI, ISCSLP, etc. He was
the Steering Committee Member for the IEEE TRANSACTIONS ON AFFECTIVE
COMPUTING, and is an Associate Editor for Journal on Multimodal User Inter-
face and International Journal on Synthetic Emotions. He was the recipient of
several awards from major conferences, such as Interspeech and
NCMMSC.
Bin Liu (Member, IEEE) received the B.S. degree
and the M.S. degree from the Beijing Institute of
Technology, Beijing, China, in 2007 and 2009, respec-
tively. He received the Ph.D. degree from the Na-
tional Laboratory of Pattern Recognition, Institute of
Automation, Chinese Academy of Sciences, Beijing,
China, in 2015. He is currently an Associate Professor
with the National Laboratory of Pattern Recogni-
tion, Institute of Automation, Chinese Academy of
Sciences, Beijing, China. His current research in-
terests include affective computing and audio signal
processing.
Jiangyan Yi (Member, IEEE) received the M.A. de-
gree from the Graduate School of Chinese Academy
of Social Sciences, Beijing, China, in 2010 and
the Ph.D. degree from the University of Chinese
Academy of Sciences, Beijing, China, in 2018. She
was a Senior R&D Engineer with Alibaba Group
from 2011 to 2014. She is currently an Assis-
tant Professor with the National Laboratory of Pat-
tern Recognition, Institute of Automation, Chinese
Academy of Sciences, Beijing, China. Her current
research interests include speech processing, speech
recognition, distributed computing, deep learning, and transfer learning.
Zhengqi Wen (Member, IEEE) received the B.S. de-
gree from the University of Science and Technology
of China, Hefei, China, in 2008, and the Ph.D. de-
gree from the Chinese Academy of Sciences, Beijing,
China, in 2013. He is currently an Associate Professor
with the National Laboratory of Pattern Recognition,
Institute of Automation, Chinese Academy of Sci-
ences, Beijing, China. His current research interests
include speech processing, speech recognition, and
speech synthesis.
Xuefei Liu (Member, IEEE) received the M.A. de-
gree from Beijing Normal University, Beijing, China,
in 2013 and the Ph.D. degree from the Graduate
School of Chinese Academy of Social Sciences, Bei-
jing, China, in 2016. She is currently an Assistant Pro-
fessor with the National Laboratory of Pattern Recog-
nition, Institute of Automation, Chinese Academy of
Sciences, Beijing, China. Her current research inter-
ests include corpus construction and experimental
phonetics.