Graphical Abstract
Fault Diagnosis for Small Samples Based on Attention Mechanism
Xin Zhang,Chao He,Yanping Lu,Biao Chen,Le Zhu,Li Zhang
[Graphical abstract figure: offline training of DCA-BiGRU (data normalization and partitioning, sliding window sampling, LSR, Meta-ACON, AdamP, early stopping, saving parameters) and online testing on industrial samples (sharing parameters, AdaBN, fine-tune), with accuracy under domain migration (B→A, B→B, B→C, B→D) and G-mean curves for DCNN, DCNN-BiGRU, DCA and DCA-BiGRU.]
Highlights
Fault Diagnosis for Small Samples Based on Attention Mechanism
Xin Zhang,Chao He,Yanping Lu,Biao Chen,Le Zhu,Li Zhang
•A fault diagnosis model based on dual path convolution with attention mechanism and BiGRU is proposed.
•The impact of a low training set ratio on fault diagnosis is discussed.
•The influence of BiGRU and the attention mechanism on small samples is studied.
•The performance of the method has been verified in the bearing and gearbox data sets.
•Different working conditions of the equipment can be dealt with effectively.
Fault Diagnosis for Small Samples Based on Attention Mechanism
Xin Zhang a, Chao He b, Yanping Lu b, Biao Chen b, Le Zhu c and Li Zhang b,∗
a School of Materials Science and Engineering, Northeastern University, Shenyang 110819, China
b School of Information, Liaoning University, Shenyang 110036, China
c School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
∗ Corresponding author: zhang_li@lnu.edu.cn (L. Zhang)
ARTICLE INFO
Keywords:
Convolutional neural network
Bidirectional gated recurrent unit
Attention mechanism
Rolling bearings
Small samples
Fault diagnosis
ABSTRACT
In the application of deep learning to fault diagnosis, mechanical rotating equipment components are prone to failure under complex working environments, and industrial big data suffer from limited labeled samples, different working conditions and noises. To explore the problems above, a small sample fault diagnosis method is proposed based on dual path convolution with attention mechanism (DCA) and bidirectional gated recurrent unit (DCA-BiGRU), whose performance can be effectively mined by the latest regularization training strategies. BiGRU is utilized to realize spatiotemporal feature fusion, where vibration-signal fused features with attention weights are extracted by DCA. Besides, global average pooling (GAP) is applied for dimension reduction and fault diagnosis. Experiments indicate that DCA-BiGRU has exceptional capacities of generalization and robustness, and can effectively carry out diagnosis under various complicated situations.
1. Introduction
With the development of industrial Internet of Things,
the manufacturability, integration and precision of rotating
machinery system are constantly improving, but complexity,
nonlinearity and uncertainty are also significantly enhanced,
which has become a huge challenge [1]. During long-term running, rotating machinery is affected by material degradation, loads, temperature and humidity, easily leading to the breakdown of key components, which depresses plant benefits or even leads to casualties and ecological pollution.
Therefore, it is of great significance to monitor the status of
rotating machinery.
In the past few years, fault diagnosis methods based on
signal analysis, swarm intelligence evolution and machine
learning have continued to emerge [2–4]. However, these methods depend too heavily on expert prior knowledge, and features are extracted manually, which makes it difficult to process big data and learn advanced features. Additionally, swarm intelligence is a heuristic approach whose optimized result is hardly stable because of randomness. Furthermore, related algorithms have quite high time complexity and cannot guarantee finding the global optimum. Finally, in the face of complex and changeable industrial data, it is difficult for vanilla shallow models to achieve ideal results.
In recent years, with the development of deep learning,
it has made remarkable achievements in image classifica-
tion, semantic segmentation, target detection and natural
language processing[5–8]. Similarly, it also provides some
directions of settling the problems encountered above in fault
diagnosis[9]. Accordingly, a series of studies for fault diag-
nosis have set off a research upsurge, which include convo-
lutional neural network, autoencoder, generative adversarial
network, deep belief network, recurrent neural network and
capsule network, etc. [10–16]. Implementing these methods usually requires designing novel and efficient structures or improving deep optimization algorithms. Alternatively, the
distribution features of signals need to be analyzed from multiple perspectives. For example, Zhou et al. [17] added
a data generation and filtering strategy into autoencoder-
generative adversarial networks(AE-GAN) for unbalanced
data, where autoencoder was utilized to learn features of
unbalanced samples, and the discriminator aimed to filter out
unqualified generated samples. Kumar et al.[18] adopted a
Deep CNN model based on AdaGrad, which fused multiple
sensor data to generate images for fault diagnosis.
Furthermore, small sample fault diagnosis has become a
new research focus. Zhang et al. [19] put forward a small-sample method based on a siamese neural network: pairs of same-class or different-class samples were input to compute the L1 distance between feature vectors, training the network to judge whether a pair belongs to the same class; support-set and query-set pairs were then compared by similarity to realize fault diagnosis. On
this basis, Wang et al.[20] proposed a comparison diagnosis
model which applied the full connected layer as the similarity
measure of feature pairs to judge whether they belonged to
a certain type, and meanwhile regularization methods were
added to improve performance. Wu et al. [21] compared small-sample transfer learning among feature transfer, fine-tuning and meta relation networks, and concluded that when samples are few or the similarity between source and target domains is large, meta relation transfer is dominant; otherwise, the advantage of feature transfer gradually becomes obvious. Saufi et al. [22] came up with a small-sample
fault diagnosis method based on spectral kurtosis filtering
and particle swarm optimization stacked sparse autoencoder,
where a high diagnostic accuracy can be achieved when the
number of per fault training samples is 100. Han et al.[23]
applied bidirectional long short-term memory(BiLSTM) and
capsule network to design a small sample fault diagnosis
method, which proved that capsule network had a satisfying
performance after denoising and fusion signals by BiLSTM.
Li et al.[24] developed a conditional Wasserstein generative
adversarial network(CWGAN), where vast similar samples
were generated by training CWGAN with vast source domain
samples, and pre-trained CWGAN was fine-tuned to achieve
transfer learning under target domain with limited samples.
For small samples, they either utilize regularization tech-
nologies and feature extraction advantages of models, or
generate substantial high-quality samples based on the distri-
bution of real samples, or apply emerging machine learning
technologies such as meta-learning and transfer learning.
The design of big convolution kernels helps enhance robustness [25], while deep small convolution kernels effectively extract abstract features. Also, time-step information cannot be ignored in vibration signals; compared with CNN, RNN can just meet this requirement.
To learn temporal and hidden features at different locations, an effective strategy is to employ a gated RNN structure, LSTM or GRU. LSTM has an excellent time modeling capability but many parameters, which easily leads to overfitting under small samples. Similarly, it is inappropriate to assume that signals only propagate information forward, so BiGRU, with performance similar to BiLSTM, fewer parameters and both forward and backward propagation, is a terrific choice.
Zhao et al.[26] put forward a method of combining Manifold
Embedded Distribution Alignment(MEDA) and BiGRU for
fault diagnosis. The noises of original signals were removed
by spectrum information, and BiGRU was utilized to learn
features, then MEDA was used to align auxiliary and unla-
beled samples. However, the method utilizes artificial prior
knowledge for denoising and does not analyze the impact of
small samples and time complexity. Yang et al.[27] proposed
a fault diagnosis method based on BiGRU and attention.
BiGRU was utilized to gain advanced expressions from fea-
tures extracted by CNN, then attention vectors were realized
to diagnose each segment. However, reference [27] does not discuss the influence of small samples, its training means are relatively conventional, and the performance of the model has not been further mined. In addition, the number of training samples of DCA-BiGRU is 60% of that of reference [22], with a more difficult diagnosis task.
Although previous methods have achieved relatively sat-
isfactory results, deep learning models often require plenty
of samples to achieve the ideal generalization. However, due
to the relatively small labeled data, models are often unable
to fully learn the various effective features of the limited
samples and prone to overfitting, which increases learning
difficulties [28]. Besides, the latest activation functions and all sorts of gradient descent back propagation algorithms have not been deeply and comparatively explored for fault diagnosis under small samples. Ultimately, due to the interference of different working conditions, efficiency is difficult to guarantee, which puts forward higher requirements.
Therefore, drawing on regularization technologies and the feature extraction advantages of models, a new fault diagnosis method for small samples based on dual path convolution with attention mechanism and BiGRU is proposed. The convolution layers aim to extract high-low frequency features of signals. Meanwhile, the attention mechanism, which can be regarded as a cost-sensitive learning method [28], values the fused features by allocating weights and selecting sensitive information, pouring attention onto the main spectra. Then, BiGRU captures the hidden information of different time-sequence positions. In addition to strengthening the connection between channels and reducing parameters, GAP and big kernels make the model more robust than capsule networks by increasing receptive fields [29]. Moreover, the latest regularization methods further improve the generalization capacity of DCA-BiGRU: label smoothing regularization (LSR) is introduced to balance the distribution differences between the labeled samples and calibrate DCA-BiGRU; the improved AMSGrad accelerator (AMSGradP) realizes adaptive gradient optimization; 1D-Meta-ACON (activate or not) adaptively activates neurons; and adaptive batch normalization (AdaBN) gives DCA-BiGRU stronger transfer performance.
The main contributions of the paper are as follows:
1. For small sample fault diagnosis, a novel method based
on designed attention mechanism and BiGRU is proposed
from the regularization and model structure, and the
effects of LSR, activation functions and back propagation
algorithms are explored for the first time. Also, the pro-
posed method has a higher test accuracy.
2. The sensitivities of attention mechanism and BiGRU to
the ratio of training samples are discussed, where the pro-
posed attention mechanism can capture the channel and
spatial information of vibration signals. Then, designing
GAP after BiGRU is beneficial for improving diagnostic
performance. Also, visualization techniques are utilized
to gain a better understanding of blocks in DCA-BiGRU.
3. For the noises contained in practical industrial data, a
small sample transfer diagnosis framework based on pre-
training is proposed. The experimental results prove that
it has excellent capacities of generalization, adaptability
and robustness compared to other bearing and gearbox
diagnosis models under complex working conditions.
The rest of this paper is organized as follows. Section 2 is mainly about the basic theoretical models for fault diagnosis. DCA-BiGRU and the latest regularization training strategies are introduced in detail in Section 3. Section 4 presents comparative experiments and analysis to prove the excellent performance of the proposed model. Section 5 draws the conclusion and the prospects for future research.
2. Methodologies
2.1. Convolutional neural network
CNN generally consists of two modules: one filter block
including convolution and pooling and the other classifica-
tion block including full connection. The general CNN in
fault diagnosis is shown in Fig.1.
In signal processing, 1D-CNN is utilized to calculate
delay accumulation of signals with the same kernel. The
output is shown in Eq.(1).
Figure 1: CNN for fault diagnosis

$y = \mathrm{ReLU}\left(\sum_{w=1}^{W} k_w x_{t-w+1} + b_w\right)$ (1)

where $k_w$ and $b_w$ are the weight and bias matrices, respectively, and $x_{t-w+1}$ are the input signals.
Pooling layer selects features and decreases parameters to
accelerate convergence. The reason why maximum pooling
is often utilized in fault diagnosis is that it can filter out
insignificant information, as shown in Eq.(2).
$y_i = \max_{j \in i} x_j$ (2)

where $y_i$ is the pooled representation and $j$ indexes the neurons in the $i$-th pooling region.
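As an illustration of Eq.(1) and Eq.(2), the following is a minimal PyTorch sketch of a 1D convolution with ReLU followed by max pooling; the kernel size, channel counts and input length are illustrative, not the exact DCA-BiGRU settings of Table 1:

import torch
import torch.nn as nn

# Minimal sketch of Eq.(1) and Eq.(2); layer sizes are illustrative only.
conv = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=16, kernel_size=15, stride=2),  # Eq.(1): sum_w k_w x_{t-w+1} + b_w
    nn.ReLU(),                                                            # the ReLU of Eq.(1)
    nn.MaxPool1d(kernel_size=2, stride=2),                                # Eq.(2): y_i = max_{j in i} x_j
)
x = torch.randn(8, 1, 1024)  # a batch of 8 vibration segments of length 1024
print(conv(x).shape)         # torch.Size([8, 16, 252])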
Batch Normalization (BN) not only alleviates internal covariate shift and improves training efficiency, but also acts as a regularization trick because batches are selected randomly, which can enhance generalization instead of Dropout.
Activation functions can enhance learning capacity of
neural network, improving the computational efficiency.
The distributed feature representations of vibration sig-
nals are mapped to the sample label space through full con-
nection layer. Finally, SoftMax is applied for fault diagnosis.
2.2. Bidirectional Gated Recurrent Unit
As shown in Fig.2, a gated recurrent unit (GRU) consists of an update gate $z_t$ and a reset gate $r_t$. $z_t$ controls the extent to which $h_{t-1}$ enters $h_t$: the higher its value, the more information enters $h_t$. $r_t$ controls the extent to which $h_{t-1}$ enters the candidate state $\tilde{h}_t$: the smaller its value, the less information enters $\tilde{h}_t$. At moment $t$, the gates and states are calculated as shown in Eq.(3)~(7):

$r_t = \sigma[W_r \otimes \mathrm{cat}(h_{t-1}, x_t)]$ (3)

$z_t = \sigma[W_z \otimes \mathrm{cat}(h_{t-1}, x_t)]$ (4)

$\tilde{h}_t = \tanh[W_{\tilde{h}} \otimes \mathrm{cat}(r_t \otimes h_{t-1}, x_t)]$ (5)

$h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes \tilde{h}_t$ (6)

$y_t = \sigma(W_o \otimes h_t)$ (7)

where $W_r$, $W_z$, $W_{\tilde{h}}$ and $W_o$ are weight matrices, $\mathrm{cat}(\cdot)$ means that eigenvectors are concatenated, $\sigma$ is the sigmoid function, $\otimes$ means element-wise product, $h$ is the cell hidden state, and $\tilde{h}_t$ is the candidate content in the current state, which controls the degree of receiving new information.
For the bidirectional gated recurrent unit (BiGRU), the forward state $\overrightarrow{h}_t$ and backward state $\overleftarrow{h}_t$ of the signals, which do not share parameters, are connected through different hidden layers and act together on the result $h_t$ to express ampler features, as shown in Eq.(8):

$\overrightarrow{h}_t = \mathrm{GRU}(x_t, \overrightarrow{h}_{t-1}), \quad \overleftarrow{h}_t = \mathrm{GRU}(x_t, \overleftarrow{h}_{t-1}), \quad h_t = w_t \overrightarrow{h}_t + v_t \overleftarrow{h}_t + b_t$ (8)

where $w_t$ and $v_t$ are the weights corresponding to the forward and backward states of BiGRU respectively, and $b_t$ is the bias.
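A minimal PyTorch sketch of Eq.(8) follows; note that nn.GRU with bidirectional=True concatenates the forward and backward states per time step rather than learning the weights $w_t$ and $v_t$ of Eq.(8), which is a simplifying assumption here:

import torch
import torch.nn as nn

# BiGRU sketch: the shapes follow Table 1 ((batch, 30, 124) in, (batch, 30, 128) out).
gru = nn.GRU(input_size=124, hidden_size=64, batch_first=True, bidirectional=True)
x = torch.randn(8, 30, 124)   # 30 fused feature channels treated as time steps
h, _ = gru(x)                 # h: (8, 30, 128) = forward 64 states + backward 64 states
print(h.shape)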
3. The proposed fault diagnosis method
3.1. Fault diagnosis procedure
In intelligent machine fault diagnosis, multiple structures
and deep optimized algorithms can be integrated to achieve
an amazing effect, where CNN-RNN has been applied to
some extent[30,31]. However, as mentioned in Section 1,
under small samples, the performance of CNN-RNN has not
been further discussed, and deep optimization algorithms
Figure 2: The core structures of GRU cell and BiGRU
and training modes are conventional, whose potentiality has
not been further explored.
Besides, in fault diagnosis, BiGRU makes the output state at the current moment determined jointly by the states of the previous and next moments. Usually, the output of the last hidden neuron is taken as the final hidden feature for diagnosis, on the grounds that it has the most abundant features. Nevertheless, this strategy ignores the signal features learned by the other GRU cells.
Therefore, an intelligent fault diagnosis method called
DCA-BiGRU has been proposed, which is composed of data
enhancement, dual path convolution, attention mechanism,
BiGRU, GAP and diagnosis layer, as shown in Fig.4.
As shown in Fig.3, in practical application, the specific
steps of fault diagnosis based on DCA-BiGRU are as follows:
1) Obtain the original signals and realize data segmentation
and standardization.
2) Divide signals into training, verification and test samples.
3) Propose the model structures and diagnostic method.
4) Offline training: use the training set and regularization
strategies to train and save the optimal parameters.
5) Online diagnosis: apply the test set to verify the model performance, or load pre-training parameters and fine-tune the whole model, using parameter-sharing transfer learning to realize timely training and fault diagnosis (a minimal sketch follows this list).
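As a minimal sketch of step 5), assuming the offline stage saved the weights with torch.save(model.state_dict(), ...); the stand-in network, file name and learning rate below are illustrative only, not the real DCA-BiGRU:

import torch
import torch.nn as nn

# Hypothetical stand-in for DCA-BiGRU; the real network follows Fig.4 and Table 1.
model = nn.Sequential(nn.Conv1d(1, 30, 15, 2), nn.ReLU(),
                      nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(30, 10))

torch.save(model.state_dict(), 'pretrained.pt')     # offline training: saving parameters
model.load_state_dict(torch.load('pretrained.pt'))  # online diagnosis: sharing parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # fine-tune the whole model
model.train()  # ...then continue training on the (small) industrial target samples...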
3.2. Dual path convolution and feature fusion
The dual convolution layer adopts two paths to extract
the high-low frequency features of signals. On one path, two
larger convolution kernels are utilized to learn low-frequency
features. As described in Section 2.1, larger convolution
kernels can enhance robustness against noises. On the other
path, small convolution kernels are adopted to deepen neural
network, which integrates four nonlinear activation layers
to promote the discriminant capability. A combination of
both widens the model and extract multiscale features, which
provides a foundation for BiGRU to further learn advanced
features. Finally, features are fused through element-wise
product, where each channel contains abundant features.
To enhance the adaptability of DCA-BiGRU in different domains, AdaBN is leveraged to replace BN, where the statistical information is adjusted from the source domain to the target domain to improve the generalization capacity [32].
Figure 3: Fault diagnosis framework based on sharing parameters
Figure 4: Overall schema for the proposed network architecture of DCA-BiGRU
3.3. The proposed attention mechanism of signals
Attention mechanism and LSR can be regarded as cost-sensitive learning methods, and 1D-Meta-ACON can be seen as a means of meta-learning. For small samples, these regularization methods contribute to the generalization and domain adaptability of the model.
3.3.1. Label smoothing regularization
Cross entropy loss (CE, $l_0$) tends to focus on one direction, leading to poor regulating capability. Consequently, a smoothing coefficient $\varepsilon$ is introduced to raise the confidence of correct diagnoses and reduce that of wrong ones, which counters the overconfidence of models and enhances learning capability. LSR ($l$) can not only upgrade generalization but also calibrate models. It is mostly used in the field of image recognition, but rarely studied in fault diagnosis.
Suppose $p(k)$ is the predicted distribution, $q(k)$ is the real distribution, the real distribution after label smoothing is $q'(k)$ with coefficient $\varepsilon$ and $K$ categories, and the label prior is set to the uniform distribution $\mu(k) = 1/K$. The relationship between $l_0$ and $l$ is succinctly deduced, as shown in Eq.(9):

$l = -\sum_{k=1}^{K} \log(p(k))\,q'(k) = -\sum_{k=1}^{K} \log(p(k))\left[(1-\varepsilon)q(k) + \frac{\varepsilon}{K}\right] = (1-\varepsilon)\,l_0 + \varepsilon\left[-\sum_{k=1}^{K} \frac{\log(p(k))}{K}\right]$ (9)
Learning smoothed labels instead of hard labels alleviates overfitting, so we argue that LSR has potential advantages in dealing with small samples in fault diagnosis.
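A minimal PyTorch sketch of the LSR loss of Eq.(9); the smoothing coefficient ε is a hyperparameter and the value 0.1 below is illustrative, not the paper's setting:

import torch
import torch.nn.functional as F

def lsr_loss(logits, target, eps=0.1):
    # Eq.(9): l = (1 - eps) * l0 + eps * [-(1/K) * sum_k log p(k)]
    logp = F.log_softmax(logits, dim=-1)
    l0 = F.nll_loss(logp, target)        # cross entropy on the hard labels
    uniform = -logp.mean(dim=-1).mean()  # uniform-prior term, batch-averaged
    return (1.0 - eps) * l0 + eps * uniform

logits, target = torch.randn(8, 10), torch.randint(0, 10, (8,))
print(lsr_loss(logits, target))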
3.3.2. The proposed 1D-signal attention mechanism
In Fig.5, a 1D-signal attention mechanism is proposed, which can indicate what the model needs to focus on in the original signals.
To calculate attention between channels, it is indispensable to squeeze the dimension of the input feature matrix, and global pooling is generally adopted. Compared with GAP, which focuses on the overall information, we argue that global max pooling (GMP) provides the crucial pulses ($x_{GMP}$) of the signal characteristic matrix ($x$); in theory, it is the decisive pulses that are regarded as the main distinguishing criterion for fault diagnosis, so GMP is more suitable than GAP for the proposed attention block, which will be verified by experiment.
The GMP of the $c$-th channel is calculated as in Eq.(10):

$x^c_{GMP} = \max_{0 \le j < d} x_c(1, j)$ (10)
Figure 5: The architecture of the proposed Attention Block
Besides, in order to capture the spatial position information, the relationship between $x_{GMP}$ and $x$ must be established, so they are concatenated and sent into a shared 1×6 convolution mapping function $F_1$. The dependency relationship is encoded to yield the intermediate characteristic connection matrix $f$, as shown in Eq.(11):

$f = \delta(F_1[\mathrm{cat}(x, x_{GMP})])$ (11)

where $\delta$ is the 1D-Meta-ACON activation function.
Then, $f$ is split into $x'$ and the rest. Because the transformed original characteristic matrix $x'$ contains not only the information of the critical pulse spectra but also the original signal characteristics $x$, just $x'$ is retained. Another 1×1 convolution mapping function $F_2$ transforms $x'$ to the same number of channels as $x$, as shown in Eq.(12):

$g = \sigma[F_2(f_{x'})]$ (12)

Finally, the output $y_c$ is shown in Eq.(13):

$y_c = x_c \otimes g_c$ (13)
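The following is a minimal PyTorch sketch of the proposed attention block of Fig.5, under several assumptions: 'same' padding in $F_1$ (the paper's exact padding is unspecified), plain BatchNorm1d in place of AdaBN, and ReLU standing in for 1D-Meta-ACON; $C_1$/$C_2$ follow Table 2 (30/6):

import torch
import torch.nn as nn

class SignalAttention1d(nn.Module):
    # Sketch of the 1D-signal attention block (Fig.5); see the assumptions above.
    def __init__(self, c1=30, c2=6, k=6):
        super().__init__()
        self.gmp = nn.AdaptiveMaxPool1d(1)                          # x_GMP, Eq.(10)
        self.f1 = nn.Conv1d(c1, c2, kernel_size=k, padding='same')  # F1, Eq.(11)
        self.bn = nn.BatchNorm1d(c2)                                # stand-in for AdaBN
        self.act = nn.ReLU()                                        # stand-in for 1D-Meta-ACON
        self.f2 = nn.Conv1d(c2, c1, kernel_size=1)                  # F2, Eq.(12)

    def forward(self, x):                        # x: (B, C1, D)
        d = x.size(-1)
        f = torch.cat([x, self.gmp(x)], dim=-1)  # cat(x, x_GMP): (B, C1, D+1)
        f = self.act(self.bn(self.f1(f)))        # (B, C2, D+1)
        g = torch.sigmoid(self.f2(f[..., :d]))   # split, keep x', map back: Eq.(12)
        return x * g                             # Eq.(13): re-weighted features

print(SignalAttention1d()(torch.randn(8, 30, 124)).shape)  # torch.Size([8, 30, 124])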
3.3.3. The improved 1D-Meta-ACON
Aiming at the nonlinearity of vibration signals, a new activation function, Meta-ACON, is applied in the proposed attention block [33]. It is neither ReLU nor Swish, but generalizes both into one general form that can learn whether to activate.
Whether or not to activate neurons is determined by the smoothing coefficient $\beta_c$, so as to dynamically and adaptively eliminate inessential information. This is similar to the idea of the proposed 1D-signal attention mechanism, focusing on the central part of signals, which is conducive to improving the capacities of generalization and transfer performance.
Inspired by this, it is transformed into a $\beta_c$ suitable for 1D signals, 1D-Meta-ACON, as shown in Eq.(14):

$\beta_c = \sigma\left[F_4\left(F_3\left(\frac{1}{D}\sum_{d=1}^{D} x_{c,d}\right)\right)\right]$ (14)

In forward propagation, $\beta_c$ is calculated first: the eigenvector $x$ is averaged over the $D$ dimension, and after the $F_3$, $F_4$ (1×1 convolution) transforms, $\beta_c \in (0, 1)$ is obtained through Sigmoid, which controls whether to activate and the activation degree, where 0 means inactive. Finally, adaptive variables $p_1$ and $p_2$ are set; supposing $p = p_1 - p_2$, the activation output $f_a$ is obtained by Eq.(15), with $p_1$ and $p_2$ adaptively adjusted by back propagation:

$f_a = p \times x_{c,d} \times \sigma[\beta_c \times p \times x_{c,d}] + p_2 \times x_{c,d}$ (15)
1D-Meta-ACON is a general form, which not only solves the dead-neuron problem but also requires only a few parameters to learn whether to activate. This research will explore whether it can make a difference in small sample fault diagnosis.
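A minimal PyTorch sketch of 1D-Meta-ACON, Eq.(14)-(15); the channel reduction ratio r=16 follows Table 2, while the initialization of $p_1$ and $p_2$ is an illustrative assumption:

import torch
import torch.nn as nn

class MetaACON1d(nn.Module):
    # Sketch of 1D-Meta-ACON: beta_c per Eq.(14), activation per Eq.(15).
    def __init__(self, channels, r=16):
        super().__init__()
        hidden = max(channels // r, 1)
        self.f3 = nn.Conv1d(channels, hidden, kernel_size=1)  # F3 (1x1 conv)
        self.f4 = nn.Conv1d(hidden, channels, kernel_size=1)  # F4 (1x1 conv)
        self.p1 = nn.Parameter(torch.randn(1, channels, 1))   # adaptive, learned by BP
        self.p2 = nn.Parameter(torch.randn(1, channels, 1))

    def forward(self, x):  # x: (B, C, D)
        beta = torch.sigmoid(self.f4(self.f3(x.mean(dim=-1, keepdim=True))))  # Eq.(14)
        p = self.p1 - self.p2
        return p * x * torch.sigmoid(beta * p * x) + self.p2 * x              # Eq.(15)

print(MetaACON1d(30)(torch.randn(8, 30, 124)).shape)  # torch.Size([8, 30, 124])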
3.3.4. AMSGradP
AdaBN contributes to the generalization capacity and scale invariance of the model in the same way as BN. However, Heo et al. pointed out that gradient descent with momentum (GDM) causes the effective step size to decrease rapidly during back propagation, resulting in slower convergence or even sharp minimizers, so AdamP [34] was proposed, which alleviates the puzzle by dropping the radial component during the optimization update, regulating the growth of the weight norm and retarding the decay of the effective step size, thus training the model at a barrierless speed.
In this study, small samples easily converge to a local optimum. Unfortunately, the authors did not give a corresponding improvement of the more advanced AMSGrad. Inspired by this, the idea of reference [34] is introduced into AMSGrad, called AMSGradP. Algorithm 1 in the Appendix outlines the pseudocode of AMSGradP.
3.4. BiGRU and GAP in fault diagnosis
LSTM has been described in Section 1. By merging the forget gate and the input gate into an update gate, GRU has a simpler structure and approximately 3/4 of the parameter quantity of LSTM, while achieving similar performance in various tasks [35]. Apparently, GRU is more suitable for small samples. At the same time, the assumption that signals only have a deep correlation in one direction is not appropriate; as mentioned in Section 2.2, BiGRU is more suitable for this research.
An FC layer with many parameters can greatly increase the risk of overfitting, while GAP produces no extra parameters and retains partial spatial coding information from the signals. In addition, as described in Section 2.2, we consider
Table 1
The structures of DCA-BiGRU
Type Kernel/Stride Unit Activation AdaBN Input Output Parameter
Conv1d_1 18/2&10/2 / 1D-Meta-ACON YES (-1,1,1024) (-1,30,248) 19036
Maxpool_1 2/2 / 1D-Meta-ACON / (-1,30,248) (-1,30,124) /
Conv1d_21 6/1&6/1 / 1D-Meta-ACON YES (-1,1,1024) (-1,40,1014) 15816
Maxpool_21 2/2 / 1D-Meta-ACON / (-1,40,1014) (-1,40,507) /
Conv1d_22 6/1&6/2 / 1D-Meta-ACON YES (-1,40,507) (-1,30,249) 14976
Maxpool_22 2/2 / 1D-Meta-ACON / (-1,30,249) (-1,30,124) /
Attention 1/1 / 1D-Meta-ACON YES (-1,30,124) (-1,30,124) 666
BiGRU / 128 Tanh / (-1,30,124) (-1,30,128) 72960
GAP / / / / (-1,30,128) (-1,30,1) /
FC / 10 SoftMax / (-1,30) (-1,10) 310
Total:123764
not only the output of the last GRU cell but also the outputs of all GRU cells, and GAP fulfills this requirement, preserving the features learned by the other GRU cells.
Lastly, we hold the view that the feature matrix has gathered the critical spectra from the original signals, whose global information should be focused on, so GAP is preferred over GMP. The detailed structure of DCA-BiGRU is shown in Table 1, where the small number of parameters facilitates small sample fault diagnosis.
4. Result analysis and discussion
The proportion of each kind of training samples (α%) is regarded as the evaluation criterion. We argue that if α < 0.5, the task can be called small samples [36]. Firstly, the superiority of the proposed new regularization training methods will be verified. Then, when α=0.1~0.5 (around 20~100 training samples), the small sample learning capacity of different models will be verified, and the performance will be evaluated under different working conditions and noises. Finally, parameter sharing applied to small sample transfer learning on a new data set will be discussed, along with visual interpretations of DCA-BiGRU. All experiments are performed under the same random seed, and the experimental settings are shown in Table 2.
Table 2
Description of experimental parameters
Settings Value
Batch_size 32
Maximum epochs 150
Optimizer AMSGradP
Learning rate 0.001
Weight decay(except bias) 0.0001
Early Stopping(patience) 10
AMSGradP(Nesterov) True
1D-Meta-ACON(reduction) 16
Attention Block(𝐶1/𝐶2/𝐷) 30/6/124
The experiments are implemented in PyTorch 1.8.0 and Python 3.8.5, running on an Intel(R) Core i7-6700HQ CPU @3.40GHz (8G RAM) and a GTX970M GPU. The flow chart shown in Fig.3 illustrates the overall framework for fault diagnosis. It has been proved that fine-tuning the model can obtain more accurate diagnosis results at an affordable time cost [36,37]; hence the paper adopts fine-tuning of the whole DCA-BiGRU for the anti-noise experiment.
4.1. Data enhancement
Data enhancement aims to generate more samples from vibration signals and prevent the ANN from learning irrelevant features. As shown in Fig.4, assuming the sliding window is $l$, a sample of length $l$ is generated starting from the $i$-th point, where adjacent samples are set with an overlap. Assuming the sliding step size is $m$ and $N$ is the signal length, the quantity of samples is $n = \frac{N - l}{m} + 1$ ($m = 400$, $l = 1024$).
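A minimal NumPy sketch of this sliding-window data enhancement; the raw record length below is illustrative:

import numpy as np

def sliding_window_samples(signal, l=1024, m=400):
    # Cut one long vibration record into overlapping samples: n = (N - l) / m + 1.
    n = (len(signal) - l) // m + 1
    return np.stack([signal[i * m : i * m + l] for i in range(n)])

x = np.random.randn(120000)             # illustrative raw record
print(sliding_window_samples(x).shape)  # (298, 1024)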
4.2. Model evaluation and metrics
Diagnosis performance can be formulated by a confusion matrix, from which two valuable indicators are derived.
In the multi-class case, the F1-score is the average of the per-class F1-scores, weighted depending on the averaging parameter, where sensitivity (recall) and precision are the key quantities, calculated as in Eq.(16)~(17):

$\mathrm{precision} = \frac{TP}{TP + FP}, \quad \mathrm{sensitivity} = \frac{TP}{TP + FN}$ (16)

$F_\beta = \frac{(1 + \beta^2) \times \mathrm{precision} \times \mathrm{sensitivity}}{\beta^2 \times \mathrm{precision} + \mathrm{sensitivity}} \quad (\beta = 1)$ (17)
where True Positive(TP) is an outcome where the model
correctly predicts the positive class. False Positive(FP) is an
outcome where the model incorrectly predicts the positive
class. False Negative(FN) is an outcome where the model
incorrectly predicts the negative class. Sensitivity is weighted $\beta$ times as much as precision.
The geometric mean (G-mean) tries to maximize the accuracy on each class while keeping these accuracies balanced. For multi-class problems it is the $N$-th root of the product of the per-class sensitivities, as shown in Eq.(18):

$G\text{-}mean = \sqrt[N]{\prod_{n=1}^{N} \mathrm{sensitivity}_n}$ (18)
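A minimal sketch of both metrics using scikit-learn (the library choice is an assumption; the paper does not state its implementation):

import numpy as np
from sklearn.metrics import f1_score, recall_score

def g_mean(y_true, y_pred):
    # Eq.(18): the N-th root of the product of the per-class sensitivities (recalls).
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f1_score(y_true, y_pred, average='weighted'))  # weighted F1, Eq.(16)-(17)
print(g_mean(y_true, y_pred))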
4.3. Case 1: Data from CWRU
4.3.1. Description and division of data
The drive end rolling bearing data provided by Case
Western Reserve University[38] is acquired by the device
as shown in Fig.6, where the single point faults(inner ring,
outer ring, rolling element) are caused by electrical dis-
charge machining(EDM), and the sampling frequency is
12kHz, with 0~3HP loads and three types of damage de-
grees(0.118/0.356/0.533mm). The acceleration sensor that
is located at the drive end of the motor housing collects
acceleration data. According to different loads, signals are
divided into four data sets: A, B, C, D, as shown in Table 3.
Table 3
Partition of CWRU data sets

Data     Loads    Locations  FD(mm)             Label    α%
A/B/C/D  0/1/2/3  N          /                  0        0.1~0.5
                  IF         0.118/0.356/0.533  1/2/3
                  OR         0.118/0.356/0.533  4/5/6
                  BF         0.118/0.356/0.533  7/8/9
Figure 6: Bearing fault diagnosis model test-bed
4.3.2. The discussion of batch_size
A larger batch size can shorten the training time of each epoch, but it may also reduce the generalization capacity, so a balance should be struck between the two. For this reason, under α=0.3 with data set B, only the batch size is varied, and the results are shown in Table 4.
Table 4
Comparison of training results between different batch_size
batch_size Early stopping eval_loss eval_acc Time/s
8 34 0.5532 100% 311.74
16 40 0.5684 99.64% 226.82
32 75 0.5577 99.93% 249.14
64 73 0.5706 99.64% 244.77
80 126 0.5804 99.71% 367.85
100 100 0.5955 99.21% 296.53
128 80 0.6219 98.50% 223.33
As can be seen, the training difficulty differs between batch sizes, resulting in different early-stopping epochs. Apparently, similar performance is achieved at batch size 8 or 32 (100%, 99.93%), but the latter takes less time, so batch_size=32 is chosen.
4.3.3. Ablation comparative experiment
The ablation experiments regarding DCA-BiGRU(M5)
are carried out on four data sets A, B, C and D. The contrast
models are PCA-SVM(M1), DCNN-BiGRU(without atten-
tion, M2), DCNN(without attention and BiGRU, M3) and
DCA(without BiGRU, M4), which take G-mean as the index.
In order to avoid the random influence, each experiment
repeats five times to get error bars as shown in Fig.7, and
A→A represents training set→test set. The X-axis shows the
proportion of the training(𝛼). At the same time, the running
time of different models, different loads in different 𝛼is
recorded until early stopping, as shown in Table 6.
From Fig.7 and Table 6, as α augments, the models learn more features and G-mean increases. Since the original signals are not elaborately processed, SVM cannot effectively deal with high-dimensional signals. Also, comparing M4 and M5 illustrates that BiGRU has advantages in coping with small samples, as it generates hidden features and contributes a performance increase of 21%~36%. From M3 and M4 in Fig.7c, when α=0.3, M4=0.9031 while M3=0.6431, which demonstrates that the attention mechanism also has a promising generalization for small samples, because it guides the model to pour attention onto critical pulses while adding only 666 parameters. With a combination of both, M5=0.9822. On the whole, both are conducive to the performance of the model for small samples. Furthermore, the running time tends to increase with the increase of α, and the more advanced models require more time. In total, DCA-BiGRU has the highest diagnostic efficiency.
4.3.4. Experiment under different loads
In general, the capability to deal with unlabeled samples from other loads is low when training with one data set. Therefore, it is indispensable to evaluate the migration versatility of DCA-BiGRU when the load changes. A, B, C and D have different loads and different signal distributions. In the past, most methods tested generality under α=0.7; this paper explores the generality of the proposed model with small samples. The model is trained with data set B, and the statistical results are displayed in Table 5.
Table 5
G-mean of DCA-BiGRU under different loads
𝛼(%) G-mean
Data set B→A B→B B→C B→D
0.1 95.30 96.78 99.21 93.24
0.2 98.60 99.41 99.09 98.32
0.3 99.48 99.71 99.60 98.56
0.4 99.21 100 99.86 98.83
0.5 99.74 100 100 98.97
As we can see, when G-mean<0.99, the migration versatility improves with the increase of α. For data set D with load 3, although the signal distribution changes comparatively obviously, the performance does not decrease dramatically (average G-mean=0.97). When G-mean>0.99, the performance differs slightly due to random values. In addition, for load 0 with α=0.1, i.e., small samples with inapparent fault pulses, G-mean>0.95. In this case, DCA-BiGRU still achieves high performance, which fully indicates that it has a pleasant migration versatility.
Figure 7: G-mean values of test under different loads. (a) A→A; (b) B→B; (c) C→C; (d) D→D.
Table 6
Time of test under different loads
Models 𝛼(%) Time(s)
A B C D
PCA-SVM
0.1 0.09 0.01 0.14 0.20
0.2 0.16 0.02 0.34 0.15
0.3 0.23 0.61 0.67 0.23
0.4 0.31 0.12 0.56 0.29
0.5 0.33 0.16 1.06 0.48
DCNN-BiGRU
0.1 21.31 19.40 15.44 14.57
0.2 21.42 13.96 27.94 22.90
0.3 30.84 21.02 43.89 54.01
0.4 34.27 17.81 33.15 56.73
0.5 48.68 26.76 63.85 60.34
DCNN
0.1 60.75 171.66 103.42 65.79
0.2 103.07 123.74 193.69 96.90
0.3 221.55 118.44 126.34 264.87
0.4 144.39 244.19 123.14 292.19
0.5 223.74 290.73 162.09 197.97
DCA
0.1 53.91 71.75 55.26 87.30
0.2 148.07 161.38 110.96 126.35
0.3 160.83 113.39 180.93 488.10
0.4 224.96 140.84 332.17 256.32
0.5 212.41 272.37 327.20 394.66
Ours
0.1 171.40 151.00 164.88 132.43
0.2 403.03 270.81 193.33 387.71
0.3 237.93 338.11 301.05 304.21
0.4 249.03 242.73 430.96 317.33
0.5 201.99 410.71 345.82 375.74
4.3.5. Analysis of regularization means
Vibration signals are distributed nonlinearly, while neural network layers are linear computations. In order to avoid vanishing gradients, nonlinear non-saturating activation functions are generally applied. In recent years, some of the latest activation functions have been widely utilized, but their improvement to fault diagnosis methods has not been explored carefully. The 1D-Meta-ACON applied in this paper, with only 1098 parameters, combines the advantages of linear and nonlinear activation functions. A preferred one can be chosen by comparing their performances.
The related activation functions are shown in Fig.8, where the gradient of Mish is smoother than that of ReLU, and Swish has a lower bound without an upper bound, smoothness and non-monotonicity, and can be regarded as a smoothing form between linear and ReLU.
Figure 8: Different activation functions (ReLU, Swish, Mish, ELU, Softplus)
All results are carried out under data set B with α=0.3, and the training loss and the transfer accuracy are obtained, as shown in Fig.9 and Fig.10.

Figure 9: Losses under different activation functions

It can be seen that all
models can converge, and Softplus, with the maximum loss of 0.5856, triggers early stopping earliest. When epoch=114, early stopping is triggered for ELU. Regarding the stability of convergence, except for ReLU and Softplus, the other four functions are relatively stable, with small loss differences in later epochs; the difference in final losses among Swish, ELU and 1D-Meta-ACON is about 0.0003, while 1D-Meta-ACON needs fewer epochs, with the fastest convergence.
In addition, as shown in Fig.10, Mish, Swish and 1D-Meta-ACON have a better migration generality, reaching 97.15%, 98.84% and 99.09% respectively under B→D. Meanwhile, ReLU and Softplus are poor under B→D, and the performance of 1D-Meta-ACON is the best, improving by 0.25%. Where extreme accuracy is not required, one of the three activation functions can be chosen according to the actual situation.
Figure 10: Generality under different activation functions
Similarly, the effects of different adaptive gradient optimization algorithms are compared. For a given neural network, they are utilized to optimize the objective function, continuously updating the parameters in the negative gradient direction until an optimal solution is reached. The closer the solution is to the global optimum, the better the generalization of the neural network.
The final converged losses of the optimizers are: SGDM (0.576), AMSGrad (0.569), AdamW (0.565), AdaBelief (0.561), AdaBound (0.565), AdamP (0.562), Adam (0.579), AMSGradP (0.555). The experimental results of the verification set are shown in Fig.13.
Figure 11: Performance and time under different algorithms
It can be seen that the accuracy of several optimization algorithms reaches more than 99%. Adam has the maximum oscillation amplitude and triggers early stopping at epoch 142. Compared with AMSGrad (99.64%), AMSGradP (99.86%) improves by 0.22%. Also, SGDM reaches 99.86%, yet it requires more epochs. From the point of view of convergence speed and value, SGDM, AdaBelief and Adam converge slowly, but AMSGradP has the fastest convergence speed and the highest validation accuracy, which indicates that by dropping the radial component, AMSGradP retards the reduction of the effective step, so that the algorithm reaches the vicinity of the optimal point with a relatively appropriate effective step and constantly updates nearby, converging to 0.555. Apart from these, the speeds of the other algorithms are not much different. The above analyses fully indicate that dropping radial components and regulating norm growth can effectively improve the results of gradient descent algorithms in fault diagnosis.
Besides, the transfer generality of each algorithm is evaluated in Fig.11, which displays the performance of DCA-BiGRU trained under α=0.3 with data set B.
It can be found that AdaBelief has the worst generalization in the rolling bearing task, which is only 96.76% under B→D. AMSGradP and AdaBound have similar performance, whereas AMSGradP is more stable for migration because of a smaller error, and it has an acceptable training time. Considering these factors comprehensively, AMSGradP is superior. Based on the above analysis, a benchmark model can be trained applying AMSGradP and fine-tuned employing AdamW, which converges fastest.
Eventually, the effects of different optimization strategies on the model are compared (W: AdaBN, GHMC: gradient harmonizing mechanism for classification, FL: Focal Loss, G: GAP, B: BN). As an example, WLSRG denotes DCA-BiGRU with AdaBN, GAP and LSR applied. In the field of NLP, GHMC, FL and LSR have acquired attention for unbalanced distributions, but they have not been contrastively studied in fault diagnosis under small samples.
The influence of different loss functions and optimization strategies is displayed in Fig.12. Obviously, the task named B→D is more difficult.

Figure 12: Accuracy under loss functions and strategies

Initially, among WLSR, WFL, WGHMC and WCE, it can be stated that CE has the shortest training
time, whose accuracy is only 95.37%. GHMC solves the problems of outliers and joint parameter training existing in FL and improves by 0.44%. Compared with these three, LSR has the maximum accuracy of 99.32%. Furthermore, comparing WLSR and BLSR, applying AdaBN brings an improvement of about 1.36%. Ultimately, as mentioned in Section 3.3.2, comparing WLSR and WLSRG, GAP reaches 98.78%, which is 0.54% lower than GMP in the attention block.
In conclusion, the latest training methods contribute to improving the generalization capacity.
4.3.6. Analysis of anti-noise robustness
Signals mostly contain noises in real situations. Hence, the study analyzes the anti-noise robustness under different signal-to-noise ratios (SNR), defined as in Eq.(19):

$SNR_{\mathrm{dB}} = 10 \lg\left(\frac{P_{signal}}{P_{noise}}\right)$ (19)

where $P_{signal} = \frac{1}{N}\sum_{i=1}^{N} x_i^2$ is the original signal power and $P_{noise}$ is the noise power.
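A minimal NumPy sketch of injecting Gaussian white noise at a target SNR per Eq.(19):

import numpy as np

def add_white_noise(signal, snr_db):
    # Eq.(19): SNR_dB = 10*lg(P_signal / P_noise), with P_signal = mean(x_i^2).
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10.0))
    return signal + np.sqrt(p_noise) * np.random.randn(len(signal))

x = np.sin(np.linspace(0, 100, 1024))     # illustrative clean signal
x_noisy = add_white_noise(x, snr_db=-4)   # SNR = -4 dB, as in Section 4.3.6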
Different from previous methods, in which the model is directly trained on noisy signals, the fault diagnosis framework with sharing parameters shown in Fig.3 is applied, which consists of offline pre-training and an online stage. The offline stage utilizes AMSGradP and data set B to obtain the pre-training parameters, while the online stage mainly aims to fine-tune the model to achieve high efficiency, where the training time is cut down because the parameters are close to their optimized values, so that noises can be quickly smoothed away.
In this study, Gaussian white noises with SNR=-4~6dB are added to the original signals. AdamW is applied to fine-tune the whole pre-trained model; other settings remain the same. Previous studies have declared that with the increase of SNR and α, the test accuracy continuously improves. Therefore, a case with SNR=-4dB and α=0.1 is applied to examine performance. Results regarding training time and G-mean are shown in Table 7 and Fig.14.
On one hand, with the increase of α, G-mean increases, but the time cost also increases; however, it is reduced by approximately 2/3 compared with models without loaded pre-training. On the other hand, DCA-BiGRU still has the highest diagnostic accuracy. Taking α=0.3 as an example,
Table 7
Time of different α under SNR=-4dB (Time/s)

Models         α=0.1   α=0.2   α=0.3   α=0.4   α=0.5
DCNN-BiGRU 16.30 17.44 27.12 43.55 75.82
DCNN 31.67 38.71 54.88 60.88 93.28
DCA 28.68 52.36 75.63 82.83 109.25
DCA-BiGRU 26.01 58.44 68.42 99.94 102.45
Table 8
Evaluation under 𝛼=0.1, Data set B
SNR -4 -2 0 2 4 6
G-mean 90.72 92.11 94.84 96.20 98.32 98.32
Time 26.08 60.15 11.09 11.01 11.98 11.00
the four models reach 87.10%, 74.18%, 77.09% and 92.72% in turn, where BiGRU improves the result by 19.92% and the attention mechanism by 5.62%. All in all, the attention mechanism and BiGRU provide a strong capacity of robustness and diagnostic efficiency.
Another random seed is set to further evaluate DCA-BiGRU under conditions with α=0.1 and SNR=-4~6dB, as shown in Table 8. With the increase of SNR, G-mean also increases. In addition, SNR=-4dB or -2dB requires more time, probably because larger noises cause higher learning difficulty and demand more epochs. Furthermore, DCA-BiGRU achieves G-mean>0.9 at all SNRs, manifesting an excellent anti-noise performance.
Finally, the changes of the original outer-ring fault signals in DCA-BiGRU are shown in Fig.15. With the depth of the network, signal features become more abstract, and it becomes easier to realize diagnosis.
4.4. Case 2: Data from University of Connecticut
4.4.1. Description and analysis of data
The data shared by the University of Connecticut is collected from a two-stage gearbox [39,40], where the acquisition device is shown in Fig.17, the acquisition frequency is 20kHz, and the signals are recorded through a dSPACE system (DS1006 processor board, dSPACE Inc.). The specifications of the accelerometer, including frequency range, measuring range and sensitivity, are 0.5Hz-10kHz, ±50 g and 100 mV/g, respectively. Nine different gear conditions are introduced to the pinion on the input shaft, including healthy condition, missing tooth, root crack, spalling, and chipping tip with five different levels of severity; time-domain signals of the nine states are shown in Fig.18.
In the original signals, a total of 104 samples with 3600 points are collected for each gearbox state. To facilitate the experiments, all signals of a certain state are integrated into one column, and the training, verification and test sets are obtained by the acquisition method mentioned in Section 4.1. The labels of the nine states are 0~8, as shown in Fig.18.
4.4.2. Evaluation under different working conditions
In reality, while the gearbox system is recorded at a fixed sampling rate, due to speed variations under load disturbance, geometric tolerance, motor control error, etc., the time-domain signals also reflect the changes of different working conditions.

Figure 13: Accuracy and losses of the verification set. (a) Accuracy; (b) Loss.

Figure 14: Fault diagnosis based on sharing parameters

Figure 15: The signal changes in DCA-BiGRU (Conv1, Conv2, Attention, BiGRU, GAP, SoftMax)

Fig.16 reflects the change curves
of accuracy and loss of training set and verification set at
𝛼=0.3, where DCA-BiGRU has an excellent convergence
performance. When epoch =93, G-mean = 99.37%.
Fig.19 and Fig.20 show the performance and training time of each model as α increases. With the increase of α, G-mean increases on the test set. When α=0.1, DCA-BiGRU has first-class performance with G-mean=96.34%, versus 79.76% for DCNN-BiGRU. When α=0.5, all models are almost always close to 100% except for SVM. Overall analysis displays that when α<0.3, the performance ranking is DCNN < DCA < DCNN-BiGRU < DCA-BiGRU, so the combination of attention mechanism and BiGRU achieves
Figure 16: Training and verification performance

Figure 17: Gearbox system
the optimal performance. Similarly, the cost of high performance is more training time, which calls for loading the pre-training model to save training time.
4.4.3. Visual analysis
In order to further reveal the feature representations,
the T-SNE technology is applied to feature visualization,
where different colors describe different states. By compar-
ing Fig.21 and Fig.22, it can be found that DCNN extracts
features preliminarily and each state is further separated
through the attention mechanism. BiGRU 2 classifies sam-
ples by extracting the hidden features at different positions.
Finally, parameters of the classifier are reduced by GAP.
Figure 18: Vibration signals of nine faults (healthy, missing tooth, root crack, spalling, chipping tip L1-L5)

Figure 19: G-mean in different α

Figure 20: Time in different α
BiGRU 1 only gets the output of the last hidden layer. The comparison between BiGRU 1 and BiGRU 2 shows that GAP pays attention to the outputs of neurons in all hidden layers of BiGRU, which makes fault state separation more obvious and reduces the training pressure of the diagnosis layer. In conclusion, DCA-BiGRU can better separate different states, which indicates a marvelous generalization.
The visualization of the attention mechanism and BiGRU is shown in Fig.23. The brighter the color, the higher the degree of activation. It is observed that the attention mechanism weighs the importance of each channel in the signals. In addition, BiGRU 2 further separates the dimensionality-reduced signals and extracts more vivid and refined features. Different fault types have different neuron activation areas, so the corresponding features can be extracted from the original signals without human intervention.
Figure 21: Feature visualization of different layers (from top to bottom, left to right: conv1, conv2, attention, GAP)

Figure 22: Visualization of BiGRU 1 and BiGRU 2

Figure 23: Weight visualization of attention and BiGRU
Grad-CAM++ is a widely applied visualization method. Its basic idea is that the weight of the feature map corresponding to a certain class can be expressed through gradients, and the global average of the gradient is utilized to calculate the weight. In addition, ReLU and the gradient weights $a^{kc}_i$ are incorporated into the feature-map weights. Only one back propagation is required to calculate the gradient. Grad-CAM++ was originally applied to 2D images, but here it is improved and applied to 1D signals, as shown in Algorithm 2 in the Appendix.
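As a hedged illustration of the idea (plain 1D Grad-CAM with gradient-averaged weights and ReLU, not the full Grad-CAM++ weighting of Algorithm 2), a minimal PyTorch sketch:

import torch
import torch.nn.functional as F

def grad_cam_1d(model, feature_layer, x, class_idx):
    # Weight each 1D feature map by the global average of its gradient w.r.t.
    # the class score, sum over channels, then ReLU (plain Grad-CAM variant).
    store = {}
    h1 = feature_layer.register_forward_hook(lambda m, i, o: store.update(a=o))
    h2 = feature_layer.register_full_backward_hook(lambda m, gi, go: store.update(g=go[0]))
    model.zero_grad()
    model(x)[0, class_idx].backward()          # only one back propagation is required
    h1.remove(); h2.remove()
    w = store['g'].mean(dim=-1, keepdim=True)  # gradient-averaged weights
    cam = F.relu((w * store['a']).sum(dim=1))  # (1, D) class activation map
    return cam / (cam.max() + 1e-8)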
Attention mechanism is further explained, and Class
Activation Mapping(CAM) is calculated by extracting the
convolution kernel feature map of attention mechanism, as
shown in Fig.24.

Figure 24: Visualization of nine fault states under Grad-CAM++

The higher the color level, the higher the CAM value and the higher the feature distinction. The light blue frames circle the higher parts of the CAM. It can be found that the locations activated by CAM differ between fault types, and their amplitudes are not the same, which fully demonstrates that the attention mechanism can distinguish the fault types without manual preprocessing. For example, Missing tooth and Spalling have two distinct areas of class activation. Besides, Chipping tip with different damage de-
gree has different activation areas, where the impact ampli-
tude is more distinct with the deepening of damage degree.
4.4.4. Anti-noise performance for gearbox
For the gearbox fault, the learning rate is 0.0009 because the pre-training model is loaded. AdamW and the fault diagnosis framework shown in Fig.3 are applied, and the other parameters are the same as above.
Under SNR=6dB, the anti-noise capacity of the models under different α is calculated, as shown in Fig.25. Besides, the influence of SNR is recorded in Table 9.
With the improvement of α, G-mean improves, indicating that the robustness of the models is enhanced. Comparison between DCNN and DCNN-BiGRU shows that BiGRU improves performance by 5.31% when α=0.1. For DCA-BiGRU and DCNN-BiGRU, when α=0.3, the attention mechanism makes the model improve by 0.86%. In addition, by comparing whether the pre-training model is loaded or not, it can be found that loading the pre-training model not only improves G-mean but also saves training time. The greater the noise, the more obvious the advantage of loading the pre-training model. For example, at SNR=0dB, the G-mean with loaded pre-training parameters is 85.28%, and that without loading is 78.43%, an increase of 6.85%.
Observing the confusion matrices of both in Fig.26, DCNN-BiGRU, whose sensitivity to Chipping tip L1 and L4 is low, misclassifies part of the healthy samples. On the contrary, DCA-BiGRU correctly distinguishes healthy and fault samples, but misclassifies Missing tooth and Chipping tip L2 and L3. In particular, the sensitivity to Chipping tip L3 is low, which requires effective measures to improve performance under noises.
4.5. Comparison studies of diagnostic method
Finally, the rolling bearing data from CWRU is very popular in machinery fault diagnosis research.

Figure 25: G-mean of models under SNR=6dB

Figure 26: Confusion matrices under SNR=6dB, α=0.3 (DCNN-BiGRU and DCA-BiGRU)

Compared with the methods listed in Table 10, DCA-BiGRU still reaches 99.73% diagnostic accuracy with no human intervention, a lower α and a shorter sampling length; compared with DCA-BiLSTM, DCA-BiGRU improves by 0.17%.
Firstly, the length of sampling points can affect the diagnosis results: the fewer the sampling points, the fewer shock pulses are contained in one sample. Compared with the references listed in Table 10, in this paper one sample collects
Table 9
Anti-noise performance of DCA-BiGRU under α=0.3

Load  Metric   SNR=0dB  2dB     4dB     6dB     8dB     10dB
Y     G-mean   85.28    93.56   95.26   98.84   99.42   99.42
      Time/s   84.47    91.23   81.10   90.89   61.42   81.05
N     G-mean   78.43    87.47   93.01   97.96   98.57   99.13
      Time/s   103.59   131.34  104.74  195.88  232.62  134.38
Furthermore, although references [3] and [22] use fewer sampling points, they apply dimension-reduction and feature-extraction algorithms, most of which contain hyperparameters. Reference [3] adopts an evolutionary algorithm to search for suitable hyperparameters, which has high time complexity, while reference [22] relies on manual experience. Then, regarding the amount of training data, this paper uses both a lower proportion of training samples and a smaller training set; for example, the number of training samples is only 60% of that in reference [22] (60 versus 100). Finally, DCA-BiGRU also achieves a competitive diagnostic result under a harsher experimental environment and higher diagnostic difficulty. In addition, capsule networks also have advantages in small-sample fault diagnosis; however, after reproducing reference [23], the capsule network has about 1.2 million parameters, while DCA-BiGRU has about 120 thousand, which means DCA-BiGRU trains faster and diagnoses more efficiently.
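For reference, the parameter counts compared above can be checked for any PyTorch model with a generic snippet (our illustration; model stands for the instantiated network):

# Count trainable parameters to compare model sizes.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e3:.0f}k trainable parameters")  # ~120k for DCA-BiGRU vs. ~1.2M for the capsule network [23]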
Table 10
Comparison of fault diagnosis of CWRU

Models         | Length | Filtering        | 𝛼        | Accuracy
reference [3]  | 200    | MCKD-RCMDE       | 0.8      | 99.00%
reference [10] | 1200   | /                | 0.8      | 98.36%
reference [14] | 2000   | Wiener filtering | 0.7      | 98.46%
reference [22] | 784    | Fast Kurtogram   | (100)    | 99.00%
ICN-Capsule    | 3000   | Wavelet          | 0.83     | 99.96%
DCA-BiLSTM     | 1024   | /                | 0.3 (60) | 99.56%
Ours           | 1024   | /                | 0.3 (60) | 99.73%

MCKD: Maximum Correlated Kurtosis Deconvolution; RCMDE: Refined Composite Multiscale Dispersion Entropy. Values in parentheses are the number of training samples.
5. Conclusion
A novel DCA-BiGRU model based on the attention mechanism has been proposed to identify the health state of equipment with small samples, where the attention mechanism captures the spatial and channel relations of signals. The sensitivities of the attention mechanism and BiGRU to the proportion of the training set are discussed, and various activation functions and gradient descent algorithms are explored. AMSGradP, 1D-Meta-ACON and other novel techniques are introduced to further improve generalization and robustness. Subsequently, DCA-BiGRU, combined with transfer learning, is verified on two different test rigs: the CWRU motor bearing data sets (Case 1) and the University of Connecticut gearbox data sets (Case 2). A variety of visualization methods are applied to preliminarily reveal the working principle of DCA-BiGRU, showing that it has advantages in diagnostic efficiency under different working conditions with small samples.
It should be noted that the differences between misclassified and correctly classified samples need to be explored further. In addition, it remains intractable for DCA-BiGRU to cope with extremely imbalanced data sets. In the future, machine learning approaches such as meta learning, cost-sensitive learning, ensemble learning, or domain adaptation and generalization in transfer learning will be combined with the attention mechanism or other structures to address more complicated fault diagnosis situations with small samples and imbalanced data, which is worth further study.
CRediT authorship contribution statement
Xin Zhang: Writing original draft, Methodology, Analysis, Funding acquisition. Chao He: Software, Validation, Visualization, Investigation. Yanping Lu: Experiment. Biao Chen: Experiment. Le Zhu: Conceptualization, Software. Li Zhang: Supervision, Proofreading, Project administration.
Declaration of competing interest
The authors declare that they have no known competing
financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgments
The authors are grateful for the support of the National Key R&D Program of China (2018YFB1308700).
References
[1] J. Jiao, M. Zhao, J. Lin, K. Liang, A comprehensive review on convo-
lutional neural network in machine fault diagnosis, Neurocomputing
417 (2020) 36–63.
[2] S. Zhang, S. Zhang, B. Wang, T. G. Habetler, Deep learning algorithms for bearing fault diagnostics—A comprehensive review, IEEE Access 8 (2020) 29857–29881.
[3] H. Luo, C. He, J. Zhou, L. Zhang, Rolling Bearing Sub-Health
Recognition via Extreme Learning Machine Based on Deep Belief
Network Optimized by Improved Fireworks, IEEE Access 9 (2021)
42013–42026.
[4] Y. Ke, C. Yao, E. Song, Q. Dong, L. Yang, An early fault diagnosis
method of common-rail injector based on improved CYCBD and
hierarchical fluctuation dispersion entropy, Digit. Signal Process. 114
(2021) 103049.
[5] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, X. Huang, Pre-trained models
for natural language processing: A survey, Sci. China Technol. Sci.
(2020) 1–26.
[6] S. Hao, Y. Zhou, Y. Guo, A brief survey on semantic segmentation
with deep learning, Neurocomputing 406 (2020) 302–321.
[7] K. Tong, Y. Wu, F. Zhou, Recent advances in small object detection
based on deep learning: A review, Image Vis. Comput. 97 (2020)
103910.
[8] G. Algan, I. Ulusoy, Image classification with deep learning in the
presence of noisy labels: A survey, Knowl. Based. Syst. 215 (2021)
106771.
[9] Z. Zhao, T. Li, J. Wu, C. Sun, S. Wang, R. Yan, X. Chen, Deep
learning algorithms for rotating machinery intelligent diagnosis: An
open source benchmark study, ISA Trans. 107 (2020) 224–255.
[10] J. Li, X. Li, D. He, Y. Qu, Unsupervised rotating machinery fault diag-
nosis method based on integrated SAE–DBN and a binary processor,
J. Intell. Manuf. 31 (8) (2020) 1899–1916.
[11] Y. Wang, G. Sun, Q. Jin, Imbalanced sample fault diagnosis of rotat-
ing machinery using conditional variational auto-encoder generative
adversarial network, Appl. Soft Comput. 92 (2020) 106333.
[12] Z. Wang, Y. Dong, W. Liu, Z. Ma, A novel fault diagnosis approach for
chillers based on 1-D convolutional neural network and gated recurrent
unit, Sensors 20 (9) (2020) 2458.
[13] X. Wang, D. Mao, X. Li, Bearing fault diagnosis based on vibro-
acoustic data fusion and 1D-CNN network, Measurement 173 (2021)
108518.
[14] X. Chen, B. Zhang, D. Gao, Bearing fault diagnosis base on multi-
scale CNN and LSTM model, J. Intell. Manuf. 32 (4) (2021) 971–987.
[15] D. Huang, Y. Fu, N. Qin, S. Gao, Fault diagnosis of high-speed train
bogie based on LSTM neural network, Sci. China Inf. Sci. 64 (1)
(2021) 119203.
[16] X. Li, X. Kong, J. Zhang, Z. Hu, C. Shi, A study on fault diagnosis of
bearing pitting under different speed condition based on an improved
inception capsule network, Measurement 181 (2021) 109656.
[17] F. Zhou, S. Yang, H. Fujita, D. Chen, C. Wen, Deep learning fault
diagnosis method based on global optimization GAN for unbalanced
data, Knowl. Based Syst. 187 (2020) 104837.
[18] P. Kumar, A. S. Hati, Deep convolutional neural network based on
adaptive gradient optimizer for fault detection in SCIM, ISA Trans.
111 (2021) 350–359.
[19] A. Zhang, S. Li, Y. Cui, W. Yang, R. Dong, J. Hu, Limited data rolling
bearing fault diagnosis with few-shot learning, IEEE Access 7 (2019)
110895–110904.
[20] C. Wang, Z. Xu, An intelligent fault diagnosis model based on deep neural network for few-shot fault diagnosis, Neurocomputing, doi:10.1016/j.neucom.2020.11.070.
[21] J. Wu, Z. Zhao, C. Sun, R. Yan, X. Chen, Few-shot transfer learning
for intelligent fault diagnosis of machine, Measurement 166 (2020)
108202.
[22] S. R. Saufi, Z. A. B. Ahmad, M. S. Leong, M. H. Lim, Gearbox fault
diagnosis using a deep learning model with limited data sample, IEEE
Trans. Ind. Inform. 16 (10) (2020) 6263–6271.
[23] T. Han, R. Ma, J. Zheng, Combination bidirectional long short-term
memory and capsule network for rotating machinery fault diagnosis,
Measurement 176 (2021) 109208.
[24] C. Li, K. Yang, H. Tang, P. Wang, J. Li, Q. He, Fault Diagnosis for
Rolling Bearings of a Freight Train under Limited Fault Data: Few-
Shot Learning Method, J. Transp. Eng. Part A Syst. 147 (8) (2021)
04021041.
[25] W. Zhang, G. Peng, C. Li, Y. Chen, Z. Zhang, A new deep learning
model for fault diagnosis with good anti-noise and domain adaptation
ability on raw vibration signals, Sensors 17 (2) (2017) 425.
[26] K. Zhao, H. Jiang, Z. Wu, T. Lu, A novel transfer learning fault diag-
nosis method based on Manifold Embedded Distribution Alignment
with a little labeled data, J. Intell. Manuf. (2020) 1–15.
[27] Z. Yang, J. Zhang, Z. Zhao, Z. Zhai, X. Chen, Interpreting network
knowledge with attention mechanism for bearing fault diagnosis,
Appl. Soft Comput. 97 (2020) 106829.
[28] T. Zhang, J. Chen, F. Li, K. Zhang, H. Lv, S. He, E. Xu, Intelligent fault diagnosis of machines with small & imbalanced data: A state-of-the-art review and possible extensions, ISA Trans., doi:10.1016/j.isatra.2021.02.042.
[29] J. Gu, V. Tresp, H. Hu, Capsule Network is Not More Robust than Convolutional Network, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 14309–14317.
[30] T. Huang, Q. Zhang, X. Tang, S. Zhao, X. Lu, A novel fault diagnosis
method based on CNN and LSTM and its application in fault diagnosis
for complex systems, Artif. Intell. Rev. (2021) 1–27.
[31] M. Jalayer, C. Orsenigo, C. Vercellis, Fault detection and diagnosis
for rotating machinery: A model based on convolutional LSTM, Fast
Fourier and continuous wavelet transforms, Comput. Ind. 125 (2021)
103378.
[32] Y. Li, N. Wang, J. Shi, X. Hou, J. Liu, Adaptive batch normalization
for practical domain adaptation, Pattern Recognit. 80 (2018) 109–117.
[33] N. Ma, X. Zhang, M. Liu, J. Sun, Activate or Not: Learning Customized Activation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8032–8042.
[34] B. Heo, S. Chun, S. J. Oh, D. Han, S. Yun, G. Kim, Y. Uh, J.-W.
Ha, AdamP: Slowing Down the Slowdown for Momentum Optimizers
on Scale-invariant Weights, in: International Conference on Learning
Representations (ICLR), 2021.
[35] A. Shewalkar, D. Nyavanandi, S. A. Ludwig, Performance Evaluation
of Deep Neural Networks Applied to Speech Recognition: RNN,
LSTM and GRU, J. Artif. Intell. Soft Comput. Res. 9 (4) (2019) 235–
245.
[36] Y. Dong, Y. Li, H. Zheng, R. Wang, M. Xu, A new dynamic model and transfer learning based intelligent fault diagnosis framework for rolling element bearings race faults: Solving the small sample problem, ISA Trans., doi:10.1016/j.isatra.2021.03.042.
[37] X. Li, Y. Hu, M. Li, J. Zheng, Fault diagnostics between different type
of components: A transfer learning approach, Appl. Soft Comput. 86
(2020) 105950.
[38] K. A. Loparo, Bearing Data Center, Case Western Reserve University.
[39] P. Cao, S. Zhang, J. Tang, Preprocessing-Free Gear Fault Diagnosis
Using Small Datasets With Deep Convolutional Neural Network-
Based Transfer Learning, IEEE Access 6 (2018) 26241–26253.
[40] P. Cao, S. Zhang, J. Tang, Gear Fault Data, figshare dataset, doi:10.6084/m9.figshare.6127874.v1.
Appendix
Algorithm 1 AMSGradP
Input: learning rate, 𝜂 > 0; momentum, 𝛽1, 𝛽2 ∈ (0, 1); threshold, 𝛿; tolerance, 𝜀 > 0; time step, 𝑡; step size, 𝛼;
Output: resulting parameter, 𝑤𝑡;
1: for 𝑤𝑡 not converged do
2:   𝑔𝑡 ← ∇𝑤 𝑓𝑡(𝑤𝑡)
3:   𝑚𝑡 ← 𝛽1 𝑚𝑡−1 + (1 − 𝛽1) 𝑔𝑡
4:   𝑣𝑡 ← 𝛽2 𝑣𝑡−1 + (1 − 𝛽2) 𝑔𝑡²
5:   𝑣̂𝑡 ← max(𝑣̂𝑡−1, 𝑣𝑡) and 𝑉𝑡 ← diag(𝑣̂𝑡)
6:   𝑝𝑡 ← 𝑚𝑡 ∕ (√𝑣̂𝑡 + 𝜀)
7:   if cos(𝑤𝑡, 𝑔𝑡) < 𝛿 ∕ √dim(𝑤) then
8:     𝑞𝑡 ← Π𝑤𝑡(𝑝𝑡)
9:   else
10:    𝑞𝑡 ← 𝑝𝑡
11:  end if
12:  𝑤𝑡 ← 𝑤𝑡−1 − 𝛼 𝑞𝑡
13: end for
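As an illustration only, one AMSGradP update following Algorithm 1 can be sketched in PyTorch as below; the projection Π onto the tangent space of scale-invariant weights follows AdamP [34], and state (holding 𝑚, 𝑣 and the running maximum 𝑣̂) is assumed initialized to zero tensors.

import torch

def amsgradp_step(w, g, state, alpha=1e-3, betas=(0.9, 0.999), delta=0.1, eps=1e-8):
    """One AMSGradP update (a sketch of Algorithm 1, not a full optimizer)."""
    b1, b2 = betas
    m, v, v_hat = state["m"], state["v"], state["v_hat"]
    m.mul_(b1).add_(g, alpha=1 - b1)             # m_t = b1*m_{t-1} + (1-b1)*g_t
    v.mul_(b2).addcmul_(g, g, value=1 - b2)      # v_t = b2*v_{t-1} + (1-b2)*g_t^2
    torch.maximum(v_hat, v, out=v_hat)           # AMSGrad: running max of v
    p = m / (v_hat.sqrt() + eps)                 # adaptive update direction
    cos = torch.dot(w.flatten(), g.flatten()) / (w.norm() * g.norm() + eps)
    if cos < delta / (w.numel() ** 0.5):         # scale-invariant weight detected
        w_unit = w / (w.norm() + eps)
        p = p - torch.dot(w_unit.flatten(), p.flatten()) * w_unit  # q_t = Pi_w(p_t)
    w.sub_(p, alpha=alpha)                       # w_t = w_{t-1} - alpha * q_t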
Algorithm 2 1D-Grad-CAM++
Input: signal, 𝑥; category weight, 𝑦^𝑐_att; feature map, 𝐴^𝑘_att;
Output: heatmap, ℎ;
1: 𝑔𝑟𝑎𝑑 ← 𝜕𝑦^𝑐_att ∕ 𝜕𝐴^𝑘_att
2: 𝑎^𝑘𝑐_𝑖 ← 𝑔𝑟𝑎𝑑² ∕ (2 𝑔𝑟𝑎𝑑² + Σ𝑖 𝐴^𝑘_att ⋅ 𝑔𝑟𝑎𝑑³)
3: if 𝑔𝑟𝑎𝑑 > 0 then
4:   𝑤𝑒𝑖𝑔ℎ𝑡 ← 𝑔𝑟𝑎𝑑 × 𝑎^𝑘𝑐_𝑖
5: else
6:   𝑤𝑒𝑖𝑔ℎ𝑡 ← 0
7: end if
8: 𝑤𝑒𝑖𝑔ℎ𝑡.size ← 𝑥.size by linear interpolation
9: ℎ ← MinMaxScaler(𝑤𝑒𝑖𝑔ℎ𝑡)
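A NumPy sketch of Algorithm 2 follows (our illustration; the per-channel aggregation step and the (channels, width) shape convention are assumptions, as the pseudocode leaves them implicit, and feature_map and grad are assumed precomputed by a backward pass):

import numpy as np

def grad_cam_pp_1d(feature_map, grad, signal_len):
    """1D-Grad-CAM++ heatmap; feature_map = A^k_att, grad = dy^c_att/dA^k_att."""
    grad2, grad3 = grad ** 2, grad ** 3
    denom = 2.0 * grad2 + np.sum(feature_map * grad3, axis=1, keepdims=True)
    alpha = grad2 / np.where(denom != 0, denom, 1e-8)  # location weights a^kc_i
    weight = np.where(grad > 0, grad * alpha, 0.0)     # keep positive gradients only
    cam = weight.sum(axis=0)                           # aggregate over channels (assumed)
    # stretch to the input length by linear interpolation
    cam = np.interp(np.linspace(0, cam.size - 1, signal_len),
                    np.arange(cam.size), cam)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # MinMaxScaler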