Graphical Abstract
Fault Diagnosis for Small Samples Based on Attention Mechanism
Xin Zhang,Chao He,Yanping Lu,Biao Chen,Le Zhu,Li Zhang
[Graphical abstract figure: offline training of DCA-BiGRU (data normalization and partitioning, sliding window sampling, LSR, Meta-ACON, AdamP, early stopping, saving parameters) and online testing on industrial samples (sharing parameters, AdaBN, fine-tune), with accuracy under domain migration (B→A, B→B, B→C, B→D) and G-mean curves for DCNN, DCNN-BiGRU, DCA and DCA-BiGRU.]
Highlights
Fault Diagnosis for Small Samples Based on Attention Mechanism
Xin Zhang,Chao He,Yanping Lu,Biao Chen,Le Zhu,Li Zhang
•A fault diagnosis model based on dual path convolution with attention mechanism and BiGRU is proposed.
•The impact of a low training set ratio on fault diagnosis is discussed.
•The influence of BiGRU and the attention mechanism on small samples is studied.
•The performance of the method has been verified in the bearing and gearbox data sets.
•Different working conditions of the equipment can be dealt with effectively.
Fault Diagnosis for Small Samples Based on Attention Mechanism
Xin Zhang a, Chao He b, Yanping Lu b, Biao Chen b, Le Zhu c and Li Zhang b,∗
a School of Materials Science and Engineering, Northeastern University, Shenyang 110819, China
b School of Information, Liaoning University, Shenyang 110036, China
c School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
∗ Corresponding author: zhang_li@lnu.edu.cn (L. Zhang)
ARTICLE INFO
Keywords:
Convolutional neural network
Bidirectional gated recurrent unit
Attention mechanism
Rolling bearings
Small samples
Fault diagnosis
ABSTRACT
In the application of deep learning to fault diagnosis, mechanical rotating equipment components are prone to failure under complex working environments, and industrial big data suffer from limited labeled samples, different working conditions and noises. To explore the problems above, a small sample fault diagnosis method is proposed based on dual path convolution with attention mechanism (DCA) and bidirectional gated recurrent unit (DCA-BiGRU), whose performance can be effectively mined by the latest regularization training strategies. BiGRU is utilized to realize spatiotemporal feature fusion, where vibration-signal fused features with attention weights are extracted by DCA. Besides, global average pooling (GAP) is applied for dimension reduction and fault diagnosis. Experiments indicate that DCA-BiGRU has exceptional capacities of generalization and robustness, and can effectively carry out diagnosis under various complicated situations.
1. Introduction
With the development of industrial Internet of Things,
the manufacturability, integration and precision of rotating
machinery system are constantly improving, but complexity,
nonlinearity and uncertainty are also significantly enhanced,
which has become a huge challenge [1]. During long-term running, rotating machinery is affected by material degradation, loads, temperature and humidity, easily leading to the breakdown of key components, which depresses plant benefits or even leads to casualties and ecological pollution.
Therefore, it is of great significance to monitor the status of
rotating machinery.
In the past few years, fault diagnosis methods based on
signal analysis, swarm intelligence evolution and machine
learning have continued to emerge [2–4]. However, these methods depend too heavily on expert prior knowledge, and features are extracted manually, which makes it difficult to process big data and learn advanced features. Additionally, swarm intelligence is a heuristic approach whose optimized result is hardly stable because of randomness. Furthermore, related algorithms have quite high time complexity and cannot guarantee finding the global optimum. Finally, in the face of complex and changeable industrial data, it is difficult for vanilla shallow models to achieve ideal results.
In recent years, with the development of deep learning,
it has made remarkable achievements in image classifica-
tion, semantic segmentation, target detection and natural
language processing[5–8]. Similarly, it also provides some
directions of settling the problems encountered above in fault
diagnosis[9]. Accordingly, a series of studies for fault diag-
nosis have set off a research upsurge, which include convo-
lutional neural network, autoencoder, generative adversarial
network, deep belief network, recurrent neural network and
capsule network, etc. [10–16]. Implementing these methods usually requires designing novel and efficient structures or improving deep optimization algorithms. Alternatively, the
distribution features of signals need to be analyzed from multiple perspectives. For example, Zhou et al. [17] added
a data generation and filtering strategy into autoencoder-
generative adversarial networks(AE-GAN) for unbalanced
data, where autoencoder was utilized to learn features of
unbalanced samples, and the discriminator aimed to filter out
unqualified generated samples. Kumar et al.[18] adopted a
Deep CNN model based on AdaGrad, which fused multiple
sensor data to generate images for fault diagnosis.
Furthermore, small sample fault diagnosis has become a
new research focus. Zhang et al. [19] put forward a small-sample method based on a siamese neural network: pairs of same-class or different-class samples were input to compute the L1 distance between feature vectors, training the network to judge whether a pair belongs to the same class; support-set and query-set pairs were then compared by similarity to realize fault diagnosis. On
this basis, Wang et al.[20] proposed a comparison diagnosis
model which applied the full connected layer as the similarity
measure of feature pairs to judge whether they belonged to
a certain type, and meanwhile regularization methods were
added to improve performance. Wu et al. [21] compared small-sample transfer learning among feature transfer, fine-tuning and meta relation networks, and concluded that when samples are few or the similarity between source and target domains is large, meta relation transfer is dominant; otherwise, the advantage of feature transfer gradually becomes obvious. Saufi et al. [22] came up with a small-sample
fault diagnosis method based on spectral kurtosis filtering
and particle swarm optimization stacked sparse autoencoder,
where a high diagnostic accuracy can be achieved when the
number of per fault training samples is 100. Han et al.[23]
applied bidirectional long short-term memory(BiLSTM) and
capsule network to design a small sample fault diagnosis
method, which proved that capsule network had a satisfying
performance after denoising and fusion signals by BiLSTM.
Li et al.[24] developed a conditional Wasserstein generative
adversarial network(CWGAN), where vast similar samples
were generated by training CWGAN with vast source domain
samples, and pre-trained CWGAN was fine-tuned to achieve
transfer learning under target domain with limited samples.
For small samples, they either utilize regularization tech-
nologies and feature extraction advantages of models, or
generate substantial high-quality samples based on the distri-
bution of real samples, or apply emerging machine learning
technologies such as meta-learning and transfer learning.
The design of big convolution kernels helps enhance robustness [25], while deep small convolution kernels effectively extract abstract features. Also, time-step information cannot be ignored in vibration signals; compared with CNN, RNN can just meet this requirement.
To learn temporal and hidden features at different locations, an effective strategy is to employ a gated RNN structure, LSTM or GRU. LSTM has an excellent time modeling capability but many parameters, which easily leads to overfitting under small samples. Similarly, it is inappropriate to assume that signals only propagate information forward, so BiGRU, with performance similar to BiLSTM, fewer parameters and both forward and backward propagation, is a terrific choice.
Zhao et al.[26] put forward a method of combining Manifold
Embedded Distribution Alignment(MEDA) and BiGRU for
fault diagnosis. The noises of original signals were removed
by spectrum information, and BiGRU was utilized to learn
features, then MEDA was used to align auxiliary and unla-
beled samples. However, the method utilizes artificial prior
knowledge for denoising and does not analyze the impact of
small samples and time complexity. Yang et al.[27] proposed
a fault diagnosis method based on BiGRU and attention.
BiGRU was utilized to gain advanced expressions from fea-
tures extracted by CNN, then attention vectors were realized
to diagnose each segment. However, reference [27] does not discuss the influence of small samples, its training means are relatively conventional, and the performance of the model has not been further mined. In addition, the number of training samples of DCA-BiGRU is 60% of that of reference [22], with a more difficult diagnosis task.
Although previous methods have achieved relatively sat-
isfactory results, deep learning models often require plenty
of samples to achieve the ideal generalization. However, due
to the relatively small labeled data, models are often unable
to fully learn the various effective features of the limited
samples and prone to overfitting, which increases learning
difficulties [28]. Besides, the latest activation functions and all sorts of gradient descent back propagation algorithms have not been deeply and comparatively explored for fault diagnosis under small samples. Ultimately, due to the interference of different working conditions, efficiency is difficult to guarantee, which puts forward higher requirements.
Therefore, drawing on regularization technologies and the feature extraction advantages of models, a new fault diagnosis method for small samples based on dual path convolution with attention mechanism and BiGRU is proposed. The convolution layers aim to extract high-low frequency features of signals. Meanwhile, the attention mechanism, which can be regarded as a cost-sensitive learning method [28], values the fused features by allocating weights and selecting sensitive information, pouring attention onto the main spectra. Then, BiGRU captures the hidden information of different time-sequence positions. In addition to strengthening the connection between channels and reducing parameters, GAP and big kernels make the model more robust than capsule networks by increasing receptive fields [29]. Moreover, the latest regularization methods further improve the generalization capacity of DCA-BiGRU: label smoothing regularization (LSR) is introduced to balance the distribution differences between the labeled samples and calibrate DCA-BiGRU; the improved AMSGrad accelerator (AMSGradP) realizes adaptive gradient optimization; 1D-Meta-ACON (activate or not) adaptively activates neurons; and adaptive batch normalization (AdaBN) gives DCA-BiGRU stronger transfer performance.
The main contributions of the paper are as follows:
1. For small sample fault diagnosis, a novel method based
on designed attention mechanism and BiGRU is proposed
from the regularization and model structure, and the
effects of LSR, activation functions and back propagation
algorithms are explored for the first time. Also, the pro-
posed method has a higher test accuracy.
2. The sensitivities of attention mechanism and BiGRU to
the ratio of training samples are discussed, where the pro-
posed attention mechanism can capture the channel and
spatial information of vibration signals. Then, designing
GAP after BiGRU is beneficial for improving diagnostic
performance. Also, visualization techniques are utilized
to gain a better understanding of blocks in DCA-BiGRU.
3. For the noises contained in practical industrial data, a
small sample transfer diagnosis framework based on pre-
training is proposed. The experimental results prove that
it has excellent capacities of generalization, adaptability
and robustness compared to other bearing and gearbox
diagnosis models under complex working conditions.
The rest of this paper is organized as follows. Section 2 is mainly about the basic theoretical models for fault diagnosis. DCA-BiGRU and the latest regularization training strategies are introduced in detail in Section 3. Section 4 presents comparative experiments and analysis to prove the excellent performance of the proposed model. Section 5 draws the conclusion and the prospects for future research.
2. Methodologies
2.1. Convolutional neural network
CNN generally consists of two modules: one filter block
including convolution and pooling and the other classifica-
tion block including full connection. The general CNN in
fault diagnosis is shown in Fig.1.
In signal processing, 1D-CNN is utilized to calculate
delay accumulation of signals with the same kernel. The
output is shown in Eq.(1).
Figure 1: CNN for fault diagnosis

$y = \mathrm{ReLU}\left(\sum_{w=1}^{W} k_w x_{t-w+1} + b_w\right)$ (1)

where $k_w$ and $b_w$ are the weight and bias matrices, respectively, and $x_{t-w+1}$ are the input signals.
Pooling layer selects features and decreases parameters to
accelerate convergence. The reason why maximum pooling
is often utilized in fault diagnosis is that it can filter out
insignificant information, as shown in Eq.(2).
$y_i = \max_{j \in i} x_j$ (2)

where $y_i$ is the pooled representation and $j$ indexes the neurons in the $i$-th pooling region.
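As an illustration of Eq.(1) and Eq.(2), the following is a minimal PyTorch sketch of a 1D convolution with ReLU followed by max pooling; the kernel size, channel counts and input length are illustrative, not the exact DCA-BiGRU settings of Table 1:

import torch
import torch.nn as nn

# Minimal sketch of Eq.(1) and Eq.(2); layer sizes are illustrative only.
conv = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=16, kernel_size=15, stride=2),  # Eq.(1): sum_w k_w x_{t-w+1} + b_w
    nn.ReLU(),                                                            # the ReLU of Eq.(1)
    nn.MaxPool1d(kernel_size=2, stride=2),                                # Eq.(2): y_i = max_{j in i} x_j
)
x = torch.randn(8, 1, 1024)  # a batch of 8 vibration segments of length 1024
print(conv(x).shape)         # torch.Size([8, 16, 252])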
Batch Normalization (BN) not only alleviates internal covariate shift and improves training efficiency, but also acts as a regularization trick because batches are selected randomly, which can enhance generalization instead of Dropout.
Activation functions can enhance learning capacity of
neural network, improving the computational efficiency.
The distributed feature representations of vibration sig-
nals are mapped to the sample label space through full con-
nection layer. Finally, SoftMax is applied for fault diagnosis.
2.2. Bidirectional Gated Recurrent Unit
As shown in Fig.2, a gated recurrent unit (GRU) consists of an update gate $z_t$ and a reset gate $r_t$. $z_t$ controls the extent to which $h_{t-1}$ enters $h_t$: the higher its value, the more information enters $h_t$. $r_t$ controls the extent to which $h_{t-1}$ enters the candidate state $\tilde{h}_t$: the smaller its value, the less information enters $\tilde{h}_t$. At moment $t$, the gates and states are calculated as shown in Eq.(3)~(7):

$r_t = \sigma[W_r \otimes \mathrm{cat}(h_{t-1}, x_t)]$ (3)

$z_t = \sigma[W_z \otimes \mathrm{cat}(h_{t-1}, x_t)]$ (4)

$\tilde{h}_t = \tanh[W_{\tilde{h}} \otimes \mathrm{cat}(r_t \otimes h_{t-1}, x_t)]$ (5)

$h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes \tilde{h}_t$ (6)

$y_t = \sigma(W_o \otimes h_t)$ (7)

where $W_r$, $W_z$, $W_{\tilde{h}}$ and $W_o$ are weight matrices, $\mathrm{cat}(\cdot)$ means that eigenvectors are concatenated, $\sigma$ is the sigmoid function, $\otimes$ means element-wise product, $h$ is the cell hidden state, and $\tilde{h}_t$ is the candidate content in the current state, which controls the degree of receiving new information.
For the bidirectional gated recurrent unit (BiGRU), the forward state $\overrightarrow{h}_t$ and backward state $\overleftarrow{h}_t$ of the signals, which do not share parameters, are connected through different hidden layers and act together on the result $h_t$ to express ampler features, as shown in Eq.(8):

$\overrightarrow{h}_t = \mathrm{GRU}(x_t, \overrightarrow{h}_{t-1}), \quad \overleftarrow{h}_t = \mathrm{GRU}(x_t, \overleftarrow{h}_{t-1}), \quad h_t = w_t \overrightarrow{h}_t + v_t \overleftarrow{h}_t + b_t$ (8)

where $w_t$ and $v_t$ are the weights corresponding to the forward and backward states of BiGRU respectively, and $b_t$ is the bias.
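A minimal PyTorch sketch of Eq.(8) follows; note that nn.GRU with bidirectional=True concatenates the forward and backward states per time step rather than learning the weights $w_t$ and $v_t$ of Eq.(8), which is a simplifying assumption here:

import torch
import torch.nn as nn

# BiGRU sketch: the shapes follow Table 1 ((batch, 30, 124) in, (batch, 30, 128) out).
gru = nn.GRU(input_size=124, hidden_size=64, batch_first=True, bidirectional=True)
x = torch.randn(8, 30, 124)   # 30 fused feature channels treated as time steps
h, _ = gru(x)                 # h: (8, 30, 128) = forward 64 states + backward 64 states
print(h.shape)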
3. The proposed fault diagnosis method
3.1. Fault diagnosis procedure
In intelligent machine fault diagnosis, multiple structures
and deep optimized algorithms can be integrated to achieve
an amazing effect, where CNN-RNN has been applied to
some extent[30,31]. However, as mentioned in Section 1,
under small samples, the performance of CNN-RNN has not
been further discussed, and deep optimization algorithms
Figure 2: The core structures of GRU cell and BiGRU
and training modes are conventional, whose potentiality has
not been further explored.
Besides, in fault diagnosis, BiGRU makes the output state at the current moment determined jointly by the states of the previous and next moments. Usually, the output of the last hidden neuron is taken as the final hidden feature for diagnosis, on the grounds that it has the most abundant features. Nevertheless, this strategy ignores the signal features learned by the other GRU cells.
Therefore, an intelligent fault diagnosis method called
DCA-BiGRU has been proposed, which is composed of data
enhancement, dual path convolution, attention mechanism,
BiGRU, GAP and diagnosis layer, as shown in Fig.4.
As shown in Fig.3, in practical application, the specific
steps of fault diagnosis based on DCA-BiGRU are as follows:
1) Obtain the original signals and realize data segmentation
and standardization.
2) Divide signals into training, verification and test samples.
3) Propose the model structures and diagnostic method.
4) Offline training: use the training set and regularization
strategies to train and save the optimal parameters.
5) Online diagnosis: apply the test set to verify the model performance, or load pre-training parameters and fine-tune the whole model, using parameter-sharing transfer learning to realize timely training and fault diagnosis (a minimal sketch follows this list).
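As a minimal sketch of step 5), assuming the offline stage saved the weights with torch.save(model.state_dict(), ...); the stand-in network, file name and learning rate below are illustrative only, not the real DCA-BiGRU:

import torch
import torch.nn as nn

# Hypothetical stand-in for DCA-BiGRU; the real network follows Fig.4 and Table 1.
model = nn.Sequential(nn.Conv1d(1, 30, 15, 2), nn.ReLU(),
                      nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(30, 10))

torch.save(model.state_dict(), 'pretrained.pt')     # offline training: saving parameters
model.load_state_dict(torch.load('pretrained.pt'))  # online diagnosis: sharing parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # fine-tune the whole model
model.train()  # ...then continue training on the (small) industrial target samples...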
3.2. Dual path convolution and feature fusion
The dual convolution layer adopts two paths to extract
the high-low frequency features of signals. On one path, two
larger convolution kernels are utilized to learn low-frequency
features. As described in Section 2.1, larger convolution
kernels can enhance robustness against noises. On the other
path, small convolution kernels are adopted to deepen neural
network, which integrates four nonlinear activation layers
to promote the discriminant capability. A combination of
both widens the model and extract multiscale features, which
provides a foundation for BiGRU to further learn advanced
features. Finally, features are fused through element-wise
product, where each channel contains abundant features.
To enhance the adaptability of DCA-BiGRU in different domains, AdaBN is leveraged to replace BN, where the statistical information is adjusted from the source domain to the target domain to improve the generalization capacity [32].
Figure 3: Fault diagnosis framework based on sharing parameters
Figure 4: Overall schema for the proposed network architecture of DCA-BiGRU
3.3. The proposed attention mechanism of signals
Attention mechanism and LSR can be regarded as cost-sensitive learning methods, and 1D-Meta-ACON can be seen as a means of meta-learning. For small samples, these regularization methods contribute to the generalization and domain adaptability of the model.
3.3.1. Label smoothing regularization
Cross entropy loss (CE, $l_0$) tends to focus on one direction, leading to poor regulating capability. Consequently, a smoothing coefficient $\varepsilon$ is introduced to raise the confidence of correct diagnoses and reduce that of wrong ones, which counters the overconfidence of models and enhances learning capability. LSR ($l$) can not only upgrade generalization but also calibrate models. It is mostly used in the field of image recognition, but rarely studied in fault diagnosis.
Suppose $p(k)$ is the predicted distribution, $q(k)$ is the real distribution, the real distribution after label smoothing is $q'(k)$ with coefficient $\varepsilon$ and $K$ categories, and the label prior is set to the uniform distribution $\mu(k) = 1/K$. The relationship between $l_0$ and $l$ is succinctly deduced, as shown in Eq.(9):

$l = -\sum_{k=1}^{K} \log(p(k))\,q'(k) = -\sum_{k=1}^{K} \log(p(k))\left[(1-\varepsilon)q(k) + \frac{\varepsilon}{K}\right] = (1-\varepsilon)\,l_0 + \varepsilon\left[-\sum_{k=1}^{K} \frac{\log(p(k))}{K}\right]$ (9)
Learning smoothed labels instead of hard labels alleviates overfitting, so we argue that LSR has potential advantages in dealing with small samples in fault diagnosis.
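A minimal PyTorch sketch of the LSR loss of Eq.(9); the smoothing coefficient ε is a hyperparameter and the value 0.1 below is illustrative, not the paper's setting:

import torch
import torch.nn.functional as F

def lsr_loss(logits, target, eps=0.1):
    # Eq.(9): l = (1 - eps) * l0 + eps * [-(1/K) * sum_k log p(k)]
    logp = F.log_softmax(logits, dim=-1)
    l0 = F.nll_loss(logp, target)        # cross entropy on the hard labels
    uniform = -logp.mean(dim=-1).mean()  # uniform-prior term, batch-averaged
    return (1.0 - eps) * l0 + eps * uniform

logits, target = torch.randn(8, 10), torch.randint(0, 10, (8,))
print(lsr_loss(logits, target))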
3.3.2. The proposed 1D-signal attention mechanism
In Fig.5, a 1D-signal attention mechanism is proposed, which can indicate what the model needs to focus on in the original signals.
To calculate attention between channels, it is indispensable to squeeze the dimension of the input feature matrix, and global pooling is generally adopted. Compared with GAP, which focuses on the overall information, we argue that global max pooling (GMP) provides the crucial pulses ($x_{GMP}$) of the signal characteristic matrix ($x$); in theory, it is the decisive pulses that are regarded as the main distinguishing criterion for fault diagnosis, so GMP is more suitable than GAP for the proposed attention block, which will be verified by experiment.
The GMP of the $c$-th channel is calculated as in Eq.(10):

$x^c_{GMP} = \max_{0 \le j < d} x_c(1, j)$ (10)
Figure 5: The architecture of the proposed Attention Block
Besides, in order to capture the spatial position information, the relationship between $x_{GMP}$ and $x$ must be established, so they are concatenated and sent into a shared 1×6 convolution mapping function $F_1$. The dependency relationship is encoded to yield the intermediate characteristic connection matrix $f$, as shown in Eq.(11):

$f = \delta(F_1[\mathrm{cat}(x, x_{GMP})])$ (11)

where $\delta$ is the 1D-Meta-ACON activation function.
Then, $f$ is split into $x'$ and the rest. Because the transformed original characteristic matrix $x'$ contains not only the information of the critical pulse spectra but also the original signal characteristics $x$, just $x'$ is retained. Another 1×1 convolution mapping function $F_2$ transforms $x'$ to the same number of channels as $x$, as shown in Eq.(12):

$g = \sigma[F_2(f_{x'})]$ (12)

Finally, the output $y_c$ is shown in Eq.(13):

$y_c = x_c \otimes g_c$ (13)
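The following is a minimal PyTorch sketch of the proposed attention block of Fig.5, under several assumptions: 'same' padding in $F_1$ (the paper's exact padding is unspecified), plain BatchNorm1d in place of AdaBN, and ReLU standing in for 1D-Meta-ACON; $C_1$/$C_2$ follow Table 2 (30/6):

import torch
import torch.nn as nn

class SignalAttention1d(nn.Module):
    # Sketch of the 1D-signal attention block (Fig.5); see the assumptions above.
    def __init__(self, c1=30, c2=6, k=6):
        super().__init__()
        self.gmp = nn.AdaptiveMaxPool1d(1)                          # x_GMP, Eq.(10)
        self.f1 = nn.Conv1d(c1, c2, kernel_size=k, padding='same')  # F1, Eq.(11)
        self.bn = nn.BatchNorm1d(c2)                                # stand-in for AdaBN
        self.act = nn.ReLU()                                        # stand-in for 1D-Meta-ACON
        self.f2 = nn.Conv1d(c2, c1, kernel_size=1)                  # F2, Eq.(12)

    def forward(self, x):                        # x: (B, C1, D)
        d = x.size(-1)
        f = torch.cat([x, self.gmp(x)], dim=-1)  # cat(x, x_GMP): (B, C1, D+1)
        f = self.act(self.bn(self.f1(f)))        # (B, C2, D+1)
        g = torch.sigmoid(self.f2(f[..., :d]))   # split, keep x', map back: Eq.(12)
        return x * g                             # Eq.(13): re-weighted features

print(SignalAttention1d()(torch.randn(8, 30, 124)).shape)  # torch.Size([8, 30, 124])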
3.3.3. The improved 1D-Meta-ACON
Aiming at the nonlinearity of vibration signals, a new activation function, Meta-ACON, is applied in the proposed attention block [33]. It is neither ReLU nor Swish, but generalizes both into one general form that can learn whether to activate.
Whether or not to activate neurons is determined by the smoothing coefficient $\beta_c$, so as to dynamically and adaptively eliminate inessential information. This is similar to the idea of the proposed 1D-signal attention mechanism, focusing on the central part of signals, which is conducive to improving the capacities of generalization and transfer performance.
Inspired by this, it is transformed into a $\beta_c$ suitable for 1D signals, 1D-Meta-ACON, as shown in Eq.(14):

$\beta_c = \sigma\left[F_4\left(F_3\left(\frac{1}{D}\sum_{d=1}^{D} x_{c,d}\right)\right)\right]$ (14)

In forward propagation, $\beta_c$ is calculated first: the eigenvector $x$ is averaged over the $D$ dimension, and after the $F_3$, $F_4$ (1×1 convolution) transforms, $\beta_c \in (0, 1)$ is obtained through Sigmoid, which controls whether to activate and the activation degree, where 0 means inactive. Finally, adaptive variables $p_1$ and $p_2$ are set; supposing $p = p_1 - p_2$, the activation output $f_a$ is obtained by Eq.(15), with $p_1$ and $p_2$ adaptively adjusted by back propagation:

$f_a = p \times x_{c,d} \times \sigma[\beta_c \times p \times x_{c,d}] + p_2 \times x_{c,d}$ (15)
1D-Meta-ACON is a general form, which not only solves the dead-neuron problem but also requires only a few parameters to learn whether to activate. This research will explore whether it can make a difference in small sample fault diagnosis.
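A minimal PyTorch sketch of 1D-Meta-ACON, Eq.(14)-(15); the channel reduction ratio r=16 follows Table 2, while the initialization of $p_1$ and $p_2$ is an illustrative assumption:

import torch
import torch.nn as nn

class MetaACON1d(nn.Module):
    # Sketch of 1D-Meta-ACON: beta_c per Eq.(14), activation per Eq.(15).
    def __init__(self, channels, r=16):
        super().__init__()
        hidden = max(channels // r, 1)
        self.f3 = nn.Conv1d(channels, hidden, kernel_size=1)  # F3 (1x1 conv)
        self.f4 = nn.Conv1d(hidden, channels, kernel_size=1)  # F4 (1x1 conv)
        self.p1 = nn.Parameter(torch.randn(1, channels, 1))   # adaptive, learned by BP
        self.p2 = nn.Parameter(torch.randn(1, channels, 1))

    def forward(self, x):  # x: (B, C, D)
        beta = torch.sigmoid(self.f4(self.f3(x.mean(dim=-1, keepdim=True))))  # Eq.(14)
        p = self.p1 - self.p2
        return p * x * torch.sigmoid(beta * p * x) + self.p2 * x              # Eq.(15)

print(MetaACON1d(30)(torch.randn(8, 30, 124)).shape)  # torch.Size([8, 30, 124])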
3.3.4. AMSGradP
AdaBN contributes to the generalization capacity and scale invariance of the model in the same way as BN. However, Heo et al. pointed out that gradient descent with momentum (GDM) causes the effective step size to decrease rapidly during back propagation, resulting in slower convergence or even sharp minimizers, so AdamP [34] was proposed, which alleviates the puzzle by dropping the radial component during the optimization update, regulating the growth of the weight norm and retarding the decay of the effective step size, thus training the model at a barrierless speed.
In this study, small samples easily converge to a local optimum. Unfortunately, the authors did not give a corresponding improvement of the more advanced AMSGrad. Inspired by this, the idea of reference [34] is introduced into AMSGrad, called AMSGradP. Algorithm 1 in the Appendix outlines the pseudocode of AMSGradP.
3.4. BiGRU and GAP in fault diagnosis
LSTM has been described in Section 1. By merging the forget gate and the input gate into an update gate, GRU has a simpler structure and approximately 3/4 of the parameter quantity of LSTM, while achieving similar performance in various tasks [35]. Apparently, GRU is more suitable for small samples. At the same time, the assumption that signals only have a deep correlation in one direction is not appropriate; as mentioned in Section 2.2, BiGRU is more suitable for this research.
An FC layer with many parameters can greatly increase the risk of overfitting, while GAP produces no extra parameters and retains partial spatial coding information from the signals. In addition, as described in Section 2.2, we consider
Table 1
The structures of DCA-BiGRU
Type Kernel/Stride Unit Activation AdaBN Input Output Parameter
Conv1d_1 18/2&10/2 / 1D-Meta-ACON YES (-1,1,1024) (-1,30,248) 19036
Maxpool_1 2/2 / 1D-Meta-ACON / (-1,30,248) (-1,30,124) /
Conv1d_21 6/1&6/1 / 1D-Meta-ACON YES (-1,1,1024) (-1,40,1014) 15816
Maxpool_21 2/2 / 1D-Meta-ACON / (-1,40,1014) (-1,40,507) /
Conv1d_22 6/1&6/2 / 1D-Meta-ACON YES (-1,40,507) (-1,30,249) 14976
Maxpool_22 2/2 / 1D-Meta-ACON / (-1,30,249) (-1,30,124) /
Attention 1/1 / 1D-Meta-ACON YES (-1,30,124) (-1,30,124) 666
BiGRU / 128 Tanh / (-1,30,124) (-1,30,128) 72960
GAP / / / / (-1,30,128) (-1,30,1) /
FC / 10 SoftMax / (-1,30) (-1,10) 310
Total:123764
not only the output of the last GRU cell but also the outputs of all GRU cells, and GAP fulfills this requirement, preserving the features learned by the other GRU cells.
Lastly, we hold the view that the feature matrix has gathered the critical spectra from the original signals, whose global information should be focused on, so GAP is preferred over GMP. The detailed structure of DCA-BiGRU is shown in Table 1, where the small number of parameters facilitates small sample fault diagnosis.
4. Result analysis and discussion
The proportion of each kind of training samples (α%) is regarded as the evaluation criterion. We argue that if α < 0.5, the task can be called small samples [36]. Firstly, the superiority of the proposed new regularization training methods will be verified. Then, when α=0.1~0.5 (around 20~100 training samples), the small sample learning capacity of different models will be verified, and the performance will be evaluated under different working conditions and noises. Finally, parameter sharing applied to small sample transfer learning on a new data set will be discussed, along with visual interpretations of DCA-BiGRU. All experiments are performed under the same random seed, and the experimental settings are shown in Table 2.
Table 2
Description of experimental parameters
Settings Value
Batch_size 32
Maximum epochs 150
Optimizer AMSGradP
Learning rate 0.001
Weight decay(except bias) 0.0001
Early Stopping(patience) 10
AMSGradP(Nesterov) True
1D-Meta-ACON(reduction) 16
Attention Block(𝐶1/𝐶2/𝐷) 30/6/124
The experiments are implemented in PyTorch 1.8.0 and Python 3.8.5, running on an Intel(R) Core i7-6700HQ CPU @3.40GHz (8G RAM) and a GTX970M GPU. The flow chart shown in Fig.3 illustrates the overall framework for fault diagnosis. It has been proved that fine-tuning the model can obtain more accurate diagnosis results at an affordable time cost [36,37]; hence the paper adopts fine-tuning of the whole DCA-BiGRU for the anti-noise experiment.
4.1. Data enhancement
Data enhancement aims to generate more samples from vibration signals and prevent the ANN from learning irrelevant features. As shown in Fig.4, assuming the sliding window is $l$, a sample of length $l$ is generated starting from the $i$-th point, where adjacent samples are set with an overlap. Assuming the sliding step size is $m$ and $N$ is the signal length, the quantity of samples is $n = \frac{N - l}{m} + 1$ ($m = 400$, $l = 1024$).
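A minimal NumPy sketch of this sliding-window data enhancement; the raw record length below is illustrative:

import numpy as np

def sliding_window_samples(signal, l=1024, m=400):
    # Cut one long vibration record into overlapping samples: n = (N - l) / m + 1.
    n = (len(signal) - l) // m + 1
    return np.stack([signal[i * m : i * m + l] for i in range(n)])

x = np.random.randn(120000)             # illustrative raw record
print(sliding_window_samples(x).shape)  # (298, 1024)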
4.2. Model evaluation and metrics
Diagnosis performance can be formulated by a confusion matrix, from which two valuable indicators are derived.
In the multi-class case, the F1-score is the average of the per-class F1-scores, weighted depending on the averaging parameter, where sensitivity (recall) and precision are the key quantities, calculated as in Eq.(16)~(17):

$\mathrm{precision} = \frac{TP}{TP + FP}, \quad \mathrm{sensitivity} = \frac{TP}{TP + FN}$ (16)

$F_\beta = \frac{(1 + \beta^2) \times \mathrm{precision} \times \mathrm{sensitivity}}{\beta^2 \times \mathrm{precision} + \mathrm{sensitivity}} \quad (\beta = 1)$ (17)
where True Positive(TP) is an outcome where the model
correctly predicts the positive class. False Positive(FP) is an
outcome where the model incorrectly predicts the positive
class. False Negative(FN) is an outcome where the model
incorrectly predicts the negative class. Sensitivity is weighted $\beta$ times as much as precision.
The geometric mean (G-mean) tries to maximize the accuracy on each class while keeping these accuracies balanced. For multi-class problems it is the $N$-th root of the product of the per-class sensitivities, as shown in Eq.(18):

$G\text{-}mean = \sqrt[N]{\prod_{n=1}^{N} \mathrm{sensitivity}_n}$ (18)
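A minimal sketch of both metrics using scikit-learn (the library choice is an assumption; the paper does not state its implementation):

import numpy as np
from sklearn.metrics import f1_score, recall_score

def g_mean(y_true, y_pred):
    # Eq.(18): the N-th root of the product of the per-class sensitivities (recalls).
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f1_score(y_true, y_pred, average='weighted'))  # weighted F1, Eq.(16)-(17)
print(g_mean(y_true, y_pred))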
4.3. Case 1: Data from CWRU
4.3.1. Description and division of data
The drive end rolling bearing data provided by Case
Western Reserve University[38] is acquired by the device
as shown in Fig.6, where the single point faults(inner ring,
outer ring, rolling element) are caused by electrical dis-
charge machining(EDM), and the sampling frequency is
12kHz, with 0~3HP loads and three types of damage de-
grees(0.118/0.356/0.533mm). The acceleration sensor that
is located at the drive end of the motor housing collects
acceleration data. According to different loads, signals are
divided into four data sets: A, B, C, D, as shown in Table 3.
Table 3
Partition of CWRU data sets

Data     Loads    Locations  FD(mm)             Label    α%
A/B/C/D  0/1/2/3  N          /                  0        0.1~0.5
                  IF         0.118/0.356/0.533  1/2/3
                  OR         0.118/0.356/0.533  4/5/6
                  BF         0.118/0.356/0.533  7/8/9
Figure 6: Bearing fault diagnosis model test-bed
4.3.2. The discussion of batch_size
A larger batch size can shorten the training time of each epoch, but it may also reduce the generalization capacity, so a balance should be struck between the two. For this reason, under α=0.3 with data set B, only the batch size is varied, and the results are shown in Table 4.
Table 4
Comparison of training results between different batch_size
batch_size Early stopping eval_loss eval_acc Time/s
8 34 0.5532 100% 311.74
16 40 0.5684 99.64% 226.82
32 75 0.5577 99.93% 249.14
64 73 0.5706 99.64% 244.77
80 126 0.5804 99.71% 367.85
100 100 0.5955 99.21% 296.53
128 80 0.6219 98.50% 223.33
As can be seen, the training difficulty differs between batch sizes, resulting in different early-stopping epochs. Apparently, similar performance is achieved at batch size 8 or 32 (100%, 99.93%), but the latter takes less time, so batch_size=32 is chosen.
4.3.3. Ablation comparative experiment
The ablation experiments regarding DCA-BiGRU(M5)
are carried out on four data sets A, B, C and D. The contrast
models are PCA-SVM(M1), DCNN-BiGRU(without atten-
tion, M2), DCNN(without attention and BiGRU, M3) and
DCA(without BiGRU, M4), which take G-mean as the index.
In order to avoid the random influence, each experiment
repeats five times to get error bars as shown in Fig.7, and
A→A represents training set→test set. The X-axis shows the
proportion of the training(𝛼). At the same time, the running
time of different models, different loads in different 𝛼is
recorded until early stopping, as shown in Table 6.
From Fig.7 and Table 6, as α augments, the models learn more features and G-mean increases. Since the original signals are not elaborately processed, SVM cannot effectively deal with high-dimensional signals. Also, comparing M4 and M5 illustrates that BiGRU has advantages in coping with small samples, as it generates hidden features and contributes a performance increase of 21%~36%. From M3 and M4 in Fig.7c, when α=0.3, M4=0.9031 while M3=0.6431, which demonstrates that the attention mechanism also has a promising generalization for small samples, because it guides the model to pour attention onto critical pulses while adding only 666 parameters. With a combination of both, M5=0.9822. On the whole, both are conducive to the performance of the model for small samples. Furthermore, the running time tends to increase with the increase of α, and the more advanced models require more time. In total, DCA-BiGRU has the highest diagnostic efficiency.
4.3.4. Experiment under different loads
In general, the capability to deal with unlabeled samples from other loads is low when training with one data set. Therefore, it is indispensable to evaluate the migration versatility of DCA-BiGRU when the load changes. A, B, C and D have different loads and different signal distributions. In the past, most methods tested generality under α=0.7; this paper explores the generality of the proposed model with small samples. The model is trained with data set B, and the statistical results are displayed in Table 5.
Table 5
G-mean of DCA-BiGRU under different loads
𝛼(%) G-mean
Data set B→A B→B B→C B→D
0.1 95.30 96.78 99.21 93.24
0.2 98.60 99.41 99.09 98.32
0.3 99.48 99.71 99.60 98.56
0.4 99.21 100 99.86 98.83
0.5 99.74 100 100 98.97
As we can see, when G-mean<0.99, the migration versatility improves with the increase of α. For data set D with load 3, although the signal distribution changes comparatively obviously, the performance does not decrease dramatically (average G-mean=0.97). When G-mean>0.99, the performance differs slightly due to random values. In addition, for load 0 with α=0.1, i.e., small samples with inapparent fault pulses, G-mean>0.95. In this case, DCA-BiGRU still achieves high performance, which fully indicates that it has a pleasant migration versatility.
Figure 7: G-mean values of test under different loads. (a) A→A; (b) B→B; (c) C→C; (d) D→D.
Table 6
Time of test under different loads
Models 𝛼(%) Time(s)
A B C D
PCA-SVM
0.1 0.09 0.01 0.14 0.20
0.2 0.16 0.02 0.34 0.15
0.3 0.23 0.61 0.67 0.23
0.4 0.31 0.12 0.56 0.29
0.5 0.33 0.16 1.06 0.48
DCNN-BiGRU
0.1 21.31 19.40 15.44 14.57
0.2 21.42 13.96 27.94 22.90
0.3 30.84 21.02 43.89 54.01
0.4 34.27 17.81 33.15 56.73
0.5 48.68 26.76 63.85 60.34
DCNN
0.1 60.75 171.66 103.42 65.79
0.2 103.07 123.74 193.69 96.90
0.3 221.55 118.44 126.34 264.87
0.4 144.39 244.19 123.14 292.19
0.5 223.74 290.73 162.09 197.97
DCA
0.1 53.91 71.75 55.26 87.30
0.2 148.07 161.38 110.96 126.35
0.3 160.83 113.39 180.93 488.10
0.4 224.96 140.84 332.17 256.32
0.5 212.41 272.37 327.20 394.66
Ours
0.1 171.40 151.00 164.88 132.43
0.2 403.03 270.81 193.33 387.71
0.3 237.93 338.11 301.05 304.21
0.4 249.03 242.73 430.96 317.33
0.5 201.99 410.71 345.82 375.74
4.3.5. Analysis of regularization means
Vibration signals are distributed nonlinearly, while neural network layers are linear computations. In order to avoid vanishing gradients, nonlinear non-saturating activation functions are generally applied. In recent years, some of the latest activation functions have been widely utilized, but their improvement to fault diagnosis methods has not been explored carefully. The 1D-Meta-ACON applied in this paper, with only 1098 parameters, combines the advantages of linear and nonlinear activation functions. A preferred one can be chosen by comparing their performances.
The related activation functions are shown in Fig.8, where the gradient of Mish is smoother than that of ReLU, and Swish has a lower bound without an upper bound, smoothness and non-monotonicity, and can be regarded as a smoothing form between linear and ReLU.
Figure 8: Different activation functions (ReLU, Swish, Mish, ELU, Softplus)
All results are carried out under data set B with α=0.3, and the training loss and the transfer accuracy are obtained, as shown in Fig.9 and Fig.10.

Figure 9: Losses under different activation functions

It can be seen that all
models can converge, and Softplus, with the maximum loss of 0.5856, triggers early stopping earliest. When epoch=114, early stopping is triggered for ELU. Regarding the stability of convergence, except for ReLU and Softplus, the other four functions are relatively stable, with small loss differences in later epochs; the difference in final losses among Swish, ELU and 1D-Meta-ACON is about 0.0003, while 1D-Meta-ACON needs fewer epochs, with the fastest convergence.
In addition, as shown in Fig.10, Mish, Swish and 1D-Meta-ACON have a better migration generality, reaching 97.15%, 98.84% and 99.09% respectively under B→D. Meanwhile, ReLU and Softplus are poor under B→D, and the performance of 1D-Meta-ACON is the best, improving by 0.25%. Where extreme accuracy is not required, one of the three activation functions can be chosen according to the actual situation.
Figure 10: Generality under different activation functions
Similarly, the effects of different adaptive gradient optimization algorithms are compared. For a given neural network, they are utilized to optimize the objective function, continuously updating the parameters in the negative gradient direction until an optimal solution is reached. The closer the solution is to the global optimum, the better the generalization of the neural network.
The final converged losses of the optimizers are: SGDM (0.576), AMSGrad (0.569), AdamW (0.565), AdaBelief (0.561), AdaBound (0.565), AdamP (0.562), Adam (0.579), AMSGradP (0.555). The experimental results of the verification set are shown in Fig.13.
Figure 11: Performance and time under different algorithms
It can be seen that the accuracy of several optimization algorithms reaches more than 99%. Adam has the maximum oscillation amplitude and triggers early stopping at epoch 142. Compared with AMSGrad (99.64%), AMSGradP (99.86%) improves by 0.22%. Also, SGDM reaches 99.86%, yet it requires more epochs. From the point of view of convergence speed and value, SGDM, AdaBelief and Adam converge slowly, but AMSGradP has the fastest convergence speed and the highest validation accuracy, which indicates that by dropping the radial component, AMSGradP retards the reduction of the effective step, so that the algorithm reaches the vicinity of the optimal point with a relatively appropriate effective step and constantly updates nearby, converging to 0.555. Apart from these, the speeds of the other algorithms are not much different. The above analyses fully indicate that dropping radial components and regulating norm growth can effectively improve the results of gradient descent algorithms in fault diagnosis.
Besides, the transfer generality of each algorithm is evaluated in Fig.11, which displays the performance of DCA-BiGRU trained under α=0.3 with data set B.
It can be found that AdaBelief has the worst generalization in the rolling bearing task, which is only 96.76% under B→D. AMSGradP and AdaBound have similar performance, whereas AMSGradP is more stable for migration because of a smaller error, and it has an acceptable training time. Considering these factors comprehensively, AMSGradP is superior. Based on the above analysis, a benchmark model can be trained applying AMSGradP and fine-tuned employing AdamW, which converges fastest.
Eventually, the effects of different optimization strategies on the model are compared (W: AdaBN, GHMC: gradient harmonizing mechanism for classification, FL: Focal Loss, G: GAP, B: BN). As an example, WLSRG denotes DCA-BiGRU with AdaBN, GAP and LSR applied. In the field of NLP, GHMC, FL and LSR have acquired attention for unbalanced distributions, but they have not been contrastively studied in fault diagnosis under small samples.
The influence of different loss functions and optimization strategies is displayed in Fig.12. Obviously, the task named B→D is more difficult.

Figure 12: Accuracy under loss functions and strategies

Initially, among WLSR, WFL, WGHMC and WCE, it can be stated that CE has the shortest training
time, whose accuracy is only 95.37%. GHMC solves the problems of outliers and joint parameter training existing in FL and improves by 0.44%. Compared with these three, LSR has the maximum accuracy of 99.32%. Furthermore, comparing WLSR and BLSR, applying AdaBN brings an improvement of about 1.36%. Ultimately, as mentioned in Section 3.3.2, comparing WLSR and WLSRG, GAP reaches 98.78%, which is 0.54% lower than GMP in the attention block.
In conclusion, the latest training methods contribute to improving the generalization capacity.
4.3.6. Analysis of anti-noise robustness
Signals mostly contain noises in real situations. Hence, the study analyzes the anti-noise robustness under different signal-to-noise ratios (SNR), defined as in Eq.(19):

$SNR_{\mathrm{dB}} = 10 \lg\left(\frac{P_{signal}}{P_{noise}}\right)$ (19)

where $P_{signal} = \frac{1}{N}\sum_{i=1}^{N} x_i^2$ is the original signal power and $P_{noise}$ is the noise power.
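A minimal NumPy sketch of injecting Gaussian white noise at a target SNR per Eq.(19):

import numpy as np

def add_white_noise(signal, snr_db):
    # Eq.(19): SNR_dB = 10*lg(P_signal / P_noise), with P_signal = mean(x_i^2).
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10.0))
    return signal + np.sqrt(p_noise) * np.random.randn(len(signal))

x = np.sin(np.linspace(0, 100, 1024))     # illustrative clean signal
x_noisy = add_white_noise(x, snr_db=-4)   # SNR = -4 dB, as in Section 4.3.6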
Different from previous methods, in which the model is directly trained on noisy signals, the fault diagnosis framework with sharing parameters shown in Fig.3 is applied, which consists of offline pre-training and an online stage. The offline stage utilizes AMSGradP and data set B to obtain the pre-training parameters, while the online stage mainly aims to fine-tune the model to achieve high efficiency, where the training time is cut down because the parameters are close to their optimized values, so that noises can be quickly smoothed away.
In this study, Gaussian white noises with SNR=-4~6dB are added to the original signals. AdamW is applied to fine-tune the whole pre-trained model; other settings remain the same. Previous studies have declared that with the increase of SNR and α, the test accuracy continuously improves. Therefore, a case with SNR=-4dB and α=0.1 is applied to examine performance. Results regarding training time and G-mean are shown in Table 7 and Fig.14.
On one hand, with the increase of α, G-mean increases, but the time cost also increases; however, it is reduced by approximately 2/3 compared with models without loaded pre-training. On the other hand, DCA-BiGRU still has the highest diagnostic accuracy. Taking α=0.3 as an example,
Table 7
Time of different α under SNR=-4dB (Time/s)

Models         α=0.1   α=0.2   α=0.3   α=0.4   α=0.5
DCNN-BiGRU 16.30 17.44 27.12 43.55 75.82
DCNN 31.67 38.71 54.88 60.88 93.28
DCA 28.68 52.36 75.63 82.83 109.25
DCA-BiGRU 26.01 58.44 68.42 99.94 102.45
Table 8
Evaluation under 𝛼=0.1, Data set B
SNR -4 -2 0 2 4 6
G-mean 90.72 92.11 94.84 96.20 98.32 98.32
Time 26.08 60.15 11.09 11.01 11.98 11.00
the four models reach 87.10%, 74.18%, 77.09% and 92.72% in turn, where BiGRU improves the result by 19.92% and the attention mechanism by 5.62%. All in all, the attention mechanism and BiGRU provide a strong capacity of robustness and diagnostic efficiency.
Another random seed is set to further evaluate DCA-BiGRU under conditions with α=0.1 and SNR=-4~6dB, as shown in Table 8. With the increase of SNR, G-mean also increases. In addition, SNR=-4dB or -2dB requires more time, probably because larger noises cause higher learning difficulty and demand more epochs. Furthermore, DCA-BiGRU achieves G-mean>0.9 at all SNRs, manifesting an excellent anti-noise performance.
Finally, the changes of the original outer-ring fault signals in DCA-BiGRU are shown in Fig.15. With the depth of the network, signal features become more abstract, and it becomes easier to realize diagnosis.
4.4. Case 2: Data from University of Connecticut
4.4.1. Description and analysis of data
The data shared by the University of Connecticut is collected from a two-stage gearbox [39,40], where the acquisition device is shown in Fig.17, the acquisition frequency is 20kHz, and the signals are recorded through a dSPACE system (DS1006 processor board, dSPACE Inc.). The specifications of the accelerometer, including frequency range, measuring range and sensitivity, are 0.5Hz-10kHz, ±50 g and 100 mV/g, respectively. Nine different gear conditions are introduced to the pinion on the input shaft, including healthy condition, missing tooth, root crack, spalling, and chipping tip with five different levels of severity; time-domain signals of the nine states are shown in Fig.18.
In the original signals, a total of 104 samples with 3600 points are collected for each gearbox state. To facilitate the experiments, all signals of a certain state are integrated into one column, and the training, verification and test sets are obtained by the acquisition method mentioned in Section 4.1. The labels of the nine states are 0~8, as shown in Fig.18.
4.4.2. Evaluation under different working conditions
In reality, while the gearbox system is recorded at a fixed sampling rate, due to speed variations under load disturbance, geometric tolerance, motor control error, etc., the time-domain signals also reflect the changes of different working conditions.

Figure 13: Accuracy and losses of the verification set. (a) Accuracy; (b) Loss.

Figure 14: Fault diagnosis based on sharing parameters

Figure 15: The signal changes in DCA-BiGRU (Conv1, Conv2, Attention, BiGRU, GAP, SoftMax)

Fig.16 reflects the change curves
of accuracy and loss of training set and verification set at
𝛼=0.3, where DCA-BiGRU has an excellent convergence
performance. When epoch =93, G-mean = 99.37%.
Fig.19 and Fig.20 show the performance and training time of each model as α increases. With the increase of α, G-mean increases on the test set. When α=0.1, DCA-BiGRU has first-class performance with G-mean=96.34%, versus 79.76% for DCNN-BiGRU. When α=0.5, all models are almost always close to 100% except for SVM. Overall analysis displays that when α<0.3, the performance ranking is DCNN < DCA < DCNN-BiGRU < DCA-BiGRU, so the combination of attention mechanism and BiGRU achieves
Figure 16: Training and verification performance

Figure 17: Gearbox system
the optimal performance. Similarly, the cost of high performance is more training time, which calls for loading the pre-training model to save training time.
4.4.3. Visual analysis
In order to further reveal the feature representations,
the T-SNE technology is applied to feature visualization,
where different colors describe different states. By compar-
ing Fig.21 and Fig.22, it can be found that DCNN extracts
features preliminarily and each state is further separated
through the attention mechanism. BiGRU 2 classifies sam-
ples by extracting the hidden features at different positions.
Finally, parameters of the classifier are reduced by GAP.
Figure 18: Vibration signals of nine faults (healthy, missing tooth, root crack, spalling, chipping tip L1-L5)

Figure 19: G-mean in different α

Figure 20: Time in different α
BiGRU 1 only gets the output of the last hidden layer. The comparison between BiGRU 1 and BiGRU 2 shows that GAP pays attention to the outputs of neurons in all hidden layers of BiGRU, which makes fault state separation more obvious and reduces the training pressure of the diagnosis layer. In conclusion, DCA-BiGRU can better separate different states, which indicates a marvelous generalization.
The visualization of the attention mechanism and BiGRU is shown in Fig.23. The brighter the color, the higher the degree of activation. It is observed that the attention mechanism weighs the importance of each channel in the signals. In addition, BiGRU 2 further separates the dimensionality-reduced signals and extracts more vivid and refined features. Different fault types have different neuron activation areas, so the corresponding features can be extracted from the original signals without human intervention.
Figure 21: Feature visualization of different layers (from top to bottom, left to right: conv1, conv2, attention, GAP)

Figure 22: Visualization of BiGRU 1 and BiGRU 2

Figure 23: Weight visualization of attention and BiGRU
Grad-CAM++ is a widely applied visualization method. Its basic idea is that the weight of the feature map corresponding to a certain class can be expressed through gradients, and the global average of the gradient is utilized to calculate the weight. In addition, ReLU and the gradient weights $a^{kc}_i$ are incorporated into the feature-map weights. Only one back propagation is required to calculate the gradient. Grad-CAM++ was originally applied to 2D images, but here it is improved and applied to 1D signals, as shown in Algorithm 2 in the Appendix.
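As a hedged illustration of the idea (plain 1D Grad-CAM with gradient-averaged weights and ReLU, not the full Grad-CAM++ weighting of Algorithm 2), a minimal PyTorch sketch:

import torch
import torch.nn.functional as F

def grad_cam_1d(model, feature_layer, x, class_idx):
    # Weight each 1D feature map by the global average of its gradient w.r.t.
    # the class score, sum over channels, then ReLU (plain Grad-CAM variant).
    store = {}
    h1 = feature_layer.register_forward_hook(lambda m, i, o: store.update(a=o))
    h2 = feature_layer.register_full_backward_hook(lambda m, gi, go: store.update(g=go[0]))
    model.zero_grad()
    model(x)[0, class_idx].backward()          # only one back propagation is required
    h1.remove(); h2.remove()
    w = store['g'].mean(dim=-1, keepdim=True)  # gradient-averaged weights
    cam = F.relu((w * store['a']).sum(dim=1))  # (1, D) class activation map
    return cam / (cam.max() + 1e-8)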
Attention mechanism is further explained, and Class
Activation Mapping(CAM) is calculated by extracting the
convolution kernel feature map of attention mechanism, as
shown in Fig.24.

Figure 24: Visualization of nine fault states under Grad-CAM++

The higher the color level, the higher the CAM value and the higher the feature distinction. The light blue frames circle the higher parts of the CAM. It can be found that the locations activated by CAM differ between fault types, and their amplitudes are not the same, which fully demonstrates that the attention mechanism can distinguish the fault types without manual preprocessing. For example, Missing tooth and Spalling have two distinct areas of class activation. Besides, Chipping tip with different damage de-
gree has different activation areas, where the impact ampli-
tude is more distinct with the deepening of damage degree.
4.4.4. Anti-noise performance for gearbox
For the gearbox fault, the learning rate is 0.0009 because the pre-training model is loaded. AdamW and the fault diagnosis framework shown in Fig.3 are applied, and the other parameters are the same as above.
Under SNR=6dB, the anti-noise capacity of the models under different α is calculated, as shown in Fig.25. Besides, the influence of SNR is recorded in Table 9.
With the improvement of α, G-mean improves, indicating that the robustness of the models is enhanced. Comparison between DCNN and DCNN-BiGRU shows that BiGRU improves performance by 5.31% when α=0.1. For DCA-BiGRU and DCNN-BiGRU, when α=0.3, the attention mechanism makes the model improve by 0.86%. In addition, by comparing whether the pre-training model is loaded or not, it can be found that loading the pre-training model not only improves G-mean but also saves training time. The greater the noise, the more obvious the advantage of loading the pre-training model. For example, at SNR=0dB, the G-mean with loaded pre-training parameters is 85.28%, and that without loading is 78.43%, an increase of 6.85%.
Observing the confusion matrices of both in Fig.26, DCNN-BiGRU, whose sensitivity to Chipping tip L1 and L4 is low, misclassifies part of the healthy samples. On the contrary, DCA-BiGRU correctly distinguishes healthy and fault samples, but misclassifies Missing tooth and Chipping tip L2 and L3. In particular, the sensitivity to Chipping tip L3 is low, which requires effective measures to improve performance under noises.
4.5. Comparison studies of diagnostic method
Finally, the rolling bearing data from CWRU is very popular in machinery fault diagnosis research.

Figure 25: G-mean of models under SNR=6dB

Figure 26: Confusion matrices under SNR=6dB, α=0.3 (DCNN-BiGRU and DCA-BiGRU)

Compared with the methods listed in Table 10, DCA-BiGRU still reaches 99.73% diagnostic accuracy with no human intervention, a lower α and a shorter sampling length; compared with DCA-BiLSTM, DCA-BiGRU improves by 0.17%.
Firstly, the length of sampling points can affect the diagnosis results: the fewer the sampling points, the fewer shock pulses are contained in one sample. Compared with the references listed in Table 10, in this paper one sample collects
Table 9
Anti-noise performance of DCA-BiGRU under α=0.3

Load  Metric   SNR=0dB  2dB     4dB     6dB     8dB     10dB
Y     G-mean   85.28    93.56   95.26   98.84   99.42   99.42
      Time/s   84.47    91.23   81.10   90.89   61.42   81.05
N     G-mean   78.43    87.47   93.01   97.96   98.57   99.13
      Time/s   103.59   131.34  104.74  195.88  232.62  134.38
Furthermore, although references [3] and [22] use fewer sampling points, they apply dimension-reduction and feature-extraction algorithms, most of which contain hyperparameters. Reference [3] adopts an evolutionary algorithm to search for suitable hyperparameters, which has high time complexity, while reference [22] relies on manual experience. Then, regarding the amount of training data, this paper uses both a lower proportion of training samples and a smaller training set; for example, the number of training samples is only 60% of that in reference [22] (60 versus 100). Finally, DCA-BiGRU also achieves a competitive diagnostic result under a harsher experimental environment and higher diagnostic difficulty. In addition, capsule networks also have advantages in small-sample fault diagnosis; however, after reproducing reference [23], the capsule network has about 1.2 million parameters, while DCA-BiGRU has about 120 thousand, which means DCA-BiGRU trains faster and diagnoses more efficiently.
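For reference, the parameter counts compared above can be checked for any PyTorch model with a generic snippet (our illustration; model stands for the instantiated network):

# Count trainable parameters to compare model sizes.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e3:.0f}k trainable parameters")  # ~120k for DCA-BiGRU vs. ~1.2M for the capsule network [23]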
Table 10
Comparison of fault diagnosis of CWRU

Models         | Length | Filtering        | 𝛼        | Accuracy
reference [3]  | 200    | MCKD-RCMDE       | 0.8      | 99.00%
reference [10] | 1200   | /                | 0.8      | 98.36%
reference [14] | 2000   | Wiener filtering | 0.7      | 98.46%
reference [22] | 784    | Fast Kurtogram   | (100)    | 99.00%
ICN-Capsule    | 3000   | Wavelet          | 0.83     | 99.96%
DCA-BiLSTM     | 1024   | /                | 0.3 (60) | 99.56%
Ours           | 1024   | /                | 0.3 (60) | 99.73%

MCKD: Maximum Correlated Kurtosis Deconvolution; RCMDE: Refined Composite Multiscale Dispersion Entropy. Values in parentheses are the number of training samples.
5. Conclusion
A novel DCA-BiGRU model based on the attention mechanism has been proposed to identify the health state of equipment with small samples, where the attention mechanism captures the spatial and channel relations of signals. The sensitivities of the attention mechanism and BiGRU to the proportion of the training set are discussed, and various activation functions and gradient descent algorithms are explored. AMSGradP, 1D-Meta-ACON and other novel techniques are introduced to further improve generalization and robustness. Subsequently, DCA-BiGRU, combined with transfer learning, is verified on two different test rigs: the CWRU motor bearing data sets (Case 1) and the University of Connecticut gearbox data sets (Case 2). A variety of visualization methods are applied to preliminarily reveal the working principle of DCA-BiGRU, showing that it has advantages in diagnostic efficiency under different working conditions with small samples.
It should be noted that the differences between misclassified and correctly classified samples need to be explored further. In addition, it remains intractable for DCA-BiGRU to cope with extremely imbalanced data sets. In the future, machine learning approaches such as meta learning, cost-sensitive learning, ensemble learning, or domain adaptation and generalization in transfer learning will be combined with the attention mechanism or other structures to address more complicated fault diagnosis situations with small samples and imbalanced data, which is worth further study.
CRediT authorship contribution statement
Xin Zhang: Writing original draft, Methodology, Analysis, Funding acquisition. Chao He: Software, Validation, Visualization, Investigation. Yanping Lu: Experiment. Biao Chen: Experiment. Le Zhu: Conceptualization, Software. Li Zhang: Supervision, Proofreading, Project administration.
Declaration of competing interest
The authors declare that they have no known competing
financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgments
The authors are grateful for the support of the National Key R&D Program of China (2018YFB1308700).
References
[1] J. Jiao, M. Zhao, J. Lin, K. Liang, A comprehensive review on convo-
lutional neural network in machine fault diagnosis, Neurocomputing
417 (2020) 36–63.
[2] S. Zhang, S. Zhang, B. Wang, T. G. Habetler, Deep learning algorithms for bearing fault diagnostics—A comprehensive review, IEEE Access 8 (2020) 29857–29881.
[3] H. Luo, C. He, J. Zhou, L. Zhang, Rolling Bearing Sub-Health
Recognition via Extreme Learning Machine Based on Deep Belief
Network Optimized by Improved Fireworks, IEEE Access 9 (2021)
42013–42026.
[4] Y. Ke, C. Yao, E. Song, Q. Dong, L. Yang, An early fault diagnosis
method of common-rail injector based on improved CYCBD and
hierarchical fluctuation dispersion entropy, Digit. Signal Process. 114
(2021) 103049.
[5] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, X. Huang, Pre-trained models
for natural language processing: A survey, Sci. China Technol. Sci.
(2020) 1–26.
[6] S. Hao, Y. Zhou, Y. Guo, A brief survey on semantic segmentation
with deep learning, Neurocomputing 406 (2020) 302–321.
[7] K. Tong, Y. Wu, F. Zhou, Recent advances in small object detection
based on deep learning: A review, Image Vis. Comput. 97 (2020)
103910.
[8] G. Algan, I. Ulusoy, Image classification with deep learning in the
presence of noisy labels: A survey, Knowl. Based. Syst. 215 (2021)
106771.
[9] Z. Zhao, T. Li, J. Wu, C. Sun, S. Wang, R. Yan, X. Chen, Deep
learning algorithms for rotating machinery intelligent diagnosis: An
open source benchmark study, ISA Trans. 107 (2020) 224–255.
[10] J. Li, X. Li, D. He, Y. Qu, Unsupervised rotating machinery fault diag-
nosis method based on integrated SAE–DBN and a binary processor,
J. Intell. Manuf. 31 (8) (2020) 1899–1916.
[11] Y. Wang, G. Sun, Q. Jin, Imbalanced sample fault diagnosis of rotat-
ing machinery using conditional variational auto-encoder generative
adversarial network, Appl. Soft Comput. 92 (2020) 106333.
[12] Z. Wang, Y. Dong, W. Liu, Z. Ma, A novel fault diagnosis approach for
chillers based on 1-D convolutional neural network and gated recurrent
unit, Sensors 20 (9) (2020) 2458.
[13] X. Wang, D. Mao, X. Li, Bearing fault diagnosis based on vibro-
acoustic data fusion and 1D-CNN network, Measurement 173 (2021)
108518.
[14] X. Chen, B. Zhang, D. Gao, Bearing fault diagnosis base on multi-
scale CNN and LSTM model, J. Intell. Manuf. 32 (4) (2021) 971–987.
[15] D. Huang, Y. Fu, N. Qin, S. Gao, Fault diagnosis of high-speed train
bogie based on LSTM neural network, Sci. China Inf. Sci. 64 (1)
(2021) 119203.
[16] X. Li, X. Kong, J. Zhang, Z. Hu, C. Shi, A study on fault diagnosis of
bearing pitting under different speed condition based on an improved
inception capsule network, Measurement 181 (2021) 109656.
[17] F. Zhou, S. Yang, H. Fujita, D. Chen, C. Wen, Deep learning fault
diagnosis method based on global optimization GAN for unbalanced
data, Knowl. Based Syst. 187 (2020) 104837.
[18] P. Kumar, A. S. Hati, Deep convolutional neural network based on
adaptive gradient optimizer for fault detection in SCIM, ISA Trans.
111 (2021) 350–359.
[19] A. Zhang, S. Li, Y. Cui, W. Yang, R. Dong, J. Hu, Limited data rolling
bearing fault diagnosis with few-shot learning, IEEE Access 7 (2019)
110895–110904.
[20] C. Wang, Z. Xu, An intelligent fault diagnosis model based on deep neural network for few-shot fault diagnosis, Neurocomputing, doi:10.1016/j.neucom.2020.11.070.
[21] J. Wu, Z. Zhao, C. Sun, R. Yan, X. Chen, Few-shot transfer learning
for intelligent fault diagnosis of machine, Measurement 166 (2020)
108202.
[22] S. R. Saufi, Z. A. B. Ahmad, M. S. Leong, M. H. Lim, Gearbox fault
diagnosis using a deep learning model with limited data sample, IEEE
Trans. Ind. Inform. 16 (10) (2020) 6263–6271.
[23] T. Han, R. Ma, J. Zheng, Combination bidirectional long short-term
memory and capsule network for rotating machinery fault diagnosis,
Measurement 176 (2021) 109208.
[24] C. Li, K. Yang, H. Tang, P. Wang, J. Li, Q. He, Fault Diagnosis for
Rolling Bearings of a Freight Train under Limited Fault Data: Few-
Shot Learning Method, J. Transp. Eng. Part A Syst. 147 (8) (2021)
04021041.
[25] W. Zhang, G. Peng, C. Li, Y. Chen, Z. Zhang, A new deep learning
model for fault diagnosis with good anti-noise and domain adaptation
ability on raw vibration signals, Sensors 17 (2) (2017) 425.
[26] K. Zhao, H. Jiang, Z. Wu, T. Lu, A novel transfer learning fault diag-
nosis method based on Manifold Embedded Distribution Alignment
with a little labeled data, J. Intell. Manuf. (2020) 1–15.
[27] Z. Yang, J. Zhang, Z. Zhao, Z. Zhai, X. Chen, Interpreting network
knowledge with attention mechanism for bearing fault diagnosis,
Appl. Soft Comput. 97 (2020) 106829.
[28] T. Zhang, J. Chen, F. Li, K. Zhang, H. Lv, S. He, E. Xu, Intelligent fault diagnosis of machines with small & imbalanced data: A state-of-the-art review and possible extensions, ISA Trans., doi:10.1016/j.isatra.2021.02.042.
[29] J. Gu, V. Tresp, H. Hu, Capsule Network is Not More Robust than Convolutional Network, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 14309–14317.
[30] T. Huang, Q. Zhang, X. Tang, S. Zhao, X. Lu, A novel fault diagnosis
method based on CNN and LSTM and its application in fault diagnosis
for complex systems, Artif. Intell. Rev. (2021) 1–27.
[31] M. Jalayer, C. Orsenigo, C. Vercellis, Fault detection and diagnosis
for rotating machinery: A model based on convolutional LSTM, Fast
Fourier and continuous wavelet transforms, Comput. Ind. 125 (2021)
103378.
[32] Y. Li, N. Wang, J. Shi, X. Hou, J. Liu, Adaptive batch normalization
for practical domain adaptation, Pattern Recognit. 80 (2018) 109–117.
[33] N. Ma, X. Zhang, M. Liu, J. Sun, Activate or Not: Learning Customized Activation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8032–8042.
[34] B. Heo, S. Chun, S. J. Oh, D. Han, S. Yun, G. Kim, Y. Uh, J.-W.
Ha, AdamP: Slowing Down the Slowdown for Momentum Optimizers
on Scale-invariant Weights, in: International Conference on Learning
Representations (ICLR), 2021.
[35] A. Shewalkar, D. Nyavanandi, S. A. Ludwig, Performance Evaluation
of Deep Neural Networks Applied to Speech Recognition: RNN,
LSTM and GRU, J. Artif. Intell. Soft Comput. Res. 9 (4) (2019) 235–
245.
[36] Y. Dong, Y. Li, H. Zheng, R. Wang, M. Xu, A new dynamic model and transfer learning based intelligent fault diagnosis framework for rolling element bearings race faults: Solving the small sample problem, ISA Trans., doi:10.1016/j.isatra.2021.03.042.
[37] X. Li, Y. Hu, M. Li, J. Zheng, Fault diagnostics between different type
of components: A transfer learning approach, Appl. Soft Comput. 86
(2020) 105950.
[38] K. A. Loparo, Bearing Data Center, Case Western Reserve University.
[39] P. Cao, S. Zhang, J. Tang, Preprocessing-Free Gear Fault Diagnosis
Using Small Datasets With Deep Convolutional Neural Network-
Based Transfer Learning, IEEE Access 6 (2018) 26241–26253.
[40] P. Cao, S. Zhang, J. Tang, Gear Fault Data, figshare dataset, doi:10.6084/m9.figshare.6127874.v1.
Appendix
Algorithm 1 AMSGradP
Input: learning rate, 𝜂 > 0; momentum, 𝛽1, 𝛽2 ∈ (0, 1); threshold, 𝛿; tolerance, 𝜀 > 0; time step, 𝑡; step size, 𝛼;
Output: resulting parameter, 𝑤𝑡;
1: for 𝑤𝑡 not converged do
2:   𝑔𝑡 ← ∇𝑤 𝑓𝑡(𝑤𝑡)
3:   𝑚𝑡 ← 𝛽1 𝑚𝑡−1 + (1 − 𝛽1) 𝑔𝑡
4:   𝑣𝑡 ← 𝛽2 𝑣𝑡−1 + (1 − 𝛽2) 𝑔𝑡²
5:   𝑣̂𝑡 ← max(𝑣̂𝑡−1, 𝑣𝑡) and 𝑉𝑡 ← diag(𝑣̂𝑡)
6:   𝑝𝑡 ← 𝑚𝑡 ∕ (√𝑣̂𝑡 + 𝜀)
7:   if cos(𝑤𝑡, 𝑔𝑡) < 𝛿 ∕ √dim(𝑤) then
8:     𝑞𝑡 ← Π𝑤𝑡(𝑝𝑡)
9:   else
10:    𝑞𝑡 ← 𝑝𝑡
11:  end if
12:  𝑤𝑡 ← 𝑤𝑡−1 − 𝛼 𝑞𝑡
13: end for
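As an illustration only, one AMSGradP update following Algorithm 1 can be sketched in PyTorch as below; the projection Π onto the tangent space of scale-invariant weights follows AdamP [34], and state (holding 𝑚, 𝑣 and the running maximum 𝑣̂) is assumed initialized to zero tensors.

import torch

def amsgradp_step(w, g, state, alpha=1e-3, betas=(0.9, 0.999), delta=0.1, eps=1e-8):
    """One AMSGradP update (a sketch of Algorithm 1, not a full optimizer)."""
    b1, b2 = betas
    m, v, v_hat = state["m"], state["v"], state["v_hat"]
    m.mul_(b1).add_(g, alpha=1 - b1)             # m_t = b1*m_{t-1} + (1-b1)*g_t
    v.mul_(b2).addcmul_(g, g, value=1 - b2)      # v_t = b2*v_{t-1} + (1-b2)*g_t^2
    torch.maximum(v_hat, v, out=v_hat)           # AMSGrad: running max of v
    p = m / (v_hat.sqrt() + eps)                 # adaptive update direction
    cos = torch.dot(w.flatten(), g.flatten()) / (w.norm() * g.norm() + eps)
    if cos < delta / (w.numel() ** 0.5):         # scale-invariant weight detected
        w_unit = w / (w.norm() + eps)
        p = p - torch.dot(w_unit.flatten(), p.flatten()) * w_unit  # q_t = Pi_w(p_t)
    w.sub_(p, alpha=alpha)                       # w_t = w_{t-1} - alpha * q_t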
Algorithm 2 1D-Grad-CAM++
Input: signal, 𝑥; category weight, 𝑦^𝑐_att; feature map, 𝐴^𝑘_att;
Output: heatmap, ℎ;
1: 𝑔𝑟𝑎𝑑 ← 𝜕𝑦^𝑐_att ∕ 𝜕𝐴^𝑘_att
2: 𝑎^𝑘𝑐_𝑖 ← 𝑔𝑟𝑎𝑑² ∕ (2 𝑔𝑟𝑎𝑑² + Σ𝑖 𝐴^𝑘_att ⋅ 𝑔𝑟𝑎𝑑³)
3: if 𝑔𝑟𝑎𝑑 > 0 then
4:   𝑤𝑒𝑖𝑔ℎ𝑡 ← 𝑔𝑟𝑎𝑑 × 𝑎^𝑘𝑐_𝑖
5: else
6:   𝑤𝑒𝑖𝑔ℎ𝑡 ← 0
7: end if
8: 𝑤𝑒𝑖𝑔ℎ𝑡.size ← 𝑥.size by linear interpolation
9: ℎ ← MinMaxScaler(𝑤𝑒𝑖𝑔ℎ𝑡)
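A NumPy sketch of Algorithm 2 follows (our illustration; the per-channel aggregation step and the (channels, width) shape convention are assumptions, as the pseudocode leaves them implicit, and feature_map and grad are assumed precomputed by a backward pass):

import numpy as np

def grad_cam_pp_1d(feature_map, grad, signal_len):
    """1D-Grad-CAM++ heatmap; feature_map = A^k_att, grad = dy^c_att/dA^k_att."""
    grad2, grad3 = grad ** 2, grad ** 3
    denom = 2.0 * grad2 + np.sum(feature_map * grad3, axis=1, keepdims=True)
    alpha = grad2 / np.where(denom != 0, denom, 1e-8)  # location weights a^kc_i
    weight = np.where(grad > 0, grad * alpha, 0.0)     # keep positive gradients only
    cam = weight.sum(axis=0)                           # aggregate over channels (assumed)
    # stretch to the input length by linear interpolation
    cam = np.interp(np.linspace(0, cam.size - 1, signal_len),
                    np.arange(cam.size), cam)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # MinMaxScaler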