Fault diagnosis for small samples based on attention mechanism

Abstract

Deep learning is increasingly applied to fault diagnosis, yet the components of rotating machinery are prone to failure in complex working environments, and industrial big data suffer from limited labeled samples, varying working conditions and noise. To address these problems, a small sample fault diagnosis method is proposed based on dual path convolution with attention mechanism (DCA) and a bidirectional gated recurrent unit (BiGRU), named DCA-BiGRU, whose performance is further exploited by the latest regularization training strategies. DCA extracts fused features of the vibration signals with attention weights, and BiGRU then realizes spatiotemporal feature fusion. In addition, global average pooling (GAP) is applied for dimension reduction and fault diagnosis. Experiments indicate that DCA-BiGRU has exceptional generalization and robustness, and can effectively carry out diagnosis under various complicated situations.
Graphical Abstract
Fault Diagnosis for Small Samples Based on Attention Mechanism
Xin Zhang,Chao He,Yanping Lu,Biao Chen,Le Zhu,Li Zhang
[Graphical abstract: fault diagnosis framework of DCA-BiGRU with offline training (data processing by sliding window sampling, data normalization and data partitioning; training with LSR, Meta-ACON, AdamP, back propagation and early stopping; saving parameters), online testing, and online application (sharing parameters, AdaBN, fine-tuning on industrial samples). Inset plots: acc(%) for domain migration B→A, B→B, B→C and B→D; training times of WLSR, WGHMC, WFL, WCE, WLSRG and BLSR; G-mean versus α (0.1 to 0.5) for DCNN-BiGRU, DCNN, DCA and DCA-BiGRU.]
Highlights
A fault diagnosis model based on dual path convolution with attention mechanism and BiGRU is proposed.
The impact of a low training set ratio on fault diagnosis is discussed.
The influence of BiGRU and the attention mechanism on small samples is studied.
The performance of the method is verified on bearing and gearbox data sets.
Different working conditions of the equipment can be dealt with effectively.
Fault Diagnosis for Small Samples Based on Attention Mechanism
Xin Zhang (a), Chao He (b), Yanping Lu (b), Biao Chen (b), Le Zhu (c) and Li Zhang (b)
(a) School of Materials Science and Engineering, Northeastern University, Shenyang 110819, China
(b) School of Information, Liaoning University, Shenyang 110036, China
(c) School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
ARTICLE INFO
Keywords:
Convolutional neural network
Bidirectional gated recurrent unit
Attention mechanism
Rolling bearings
Small samples
Fault diagnosis
1. Introduction
With the development of the industrial Internet of Things, the manufacturability, integration and precision of rotating machinery systems are constantly improving, but their complexity, nonlinearity and uncertainty have also increased significantly, which has become a huge challenge[1]. During long-term operation, rotating machinery is affected by material degradation, loads, temperature and humidity, which easily leads to the breakdown of key components; this reduces plant benefits and may even cause casualties and ecological pollution. Therefore, it is of great significance to monitor the status of rotating machinery.
In the past few years, fault diagnosis methods based on signal analysis, swarm intelligence evolution and machine learning have continued to emerge[2-4]. However, such methods depend too heavily on the prior knowledge of experts, and features are extracted manually, which makes it difficult to process big data and learn advanced features. Additionally, swarm intelligence is a heuristic approach whose optimized results are unstable because of randomness. Furthermore, related algorithms with quite high time complexity cannot guarantee finding the global optimum. Finally, in the face of complex and changeable industrial data, it is difficult for vanilla shallow models to achieve ideal results.
In recent years, deep learning has made remarkable achievements in image classification, semantic segmentation, target detection and natural language processing[5-8]. It also provides directions for settling the problems encountered above in fault diagnosis[9]. Accordingly, a series of deep models have set off a research upsurge in fault diagnosis, including convolutional neural networks, autoencoders, generative adversarial networks, deep belief networks, recurrent neural networks and capsule networks[10-16]. Implementing these methods usually requires designing novel and efficient structures or improving deep optimization algorithms. Alternatively, the distribution features of signals need to be analyzed from multiple perspectives. For example, Zhou et al.[17] added a data generation and filtering strategy to autoencoder-generative adversarial networks (AE-GAN) for unbalanced data, where the autoencoder was utilized to learn features of unbalanced samples, and the discriminator aimed to filter out unqualified generated samples. Kumar et al.[18] adopted a deep CNN model based on AdaGrad, which fused data from multiple sensors to generate images for fault diagnosis.
Corresponding author: zhang_li@lnu.edu.cn (L. Zhang)
Furthermore, small sample fault diagnosis has become a new research focus. Zhang et al.[19] put forward a method for small samples based on a siamese neural network: pairs of samples from the same or different classes were input, the $L_1$ distance between their feature vectors was calculated to judge whether they belonged to the same class during training, and the similarity between support sets and query sets was then calculated to realize fault diagnosis. On this basis, Wang et al.[20] proposed a comparison diagnosis model which applied a fully connected layer as the similarity measure of feature pairs to judge whether they belonged to a certain type, with regularization methods added to improve performance. Wu et al.[21] compared feature transfer, fine-tuning and the meta relation network for small sample transfer learning, and concluded that meta relation transfer was dominant when samples were few or the similarity between the source and target domains was large; otherwise, the advantage of feature transfer gradually became obvious. Saufi et al.[22] came up with a small sample fault diagnosis method based on spectral kurtosis filtering and a stacked sparse autoencoder optimized by particle swarm optimization, which achieves a high diagnostic accuracy when the number of training samples per fault is 100. Han et al.[23] applied bidirectional long short-term memory (BiLSTM) and a capsule network to design a small sample fault diagnosis method, which showed that the capsule network performed satisfactorily after the signals were denoised and fused by BiLSTM. Li et al.[24] developed a conditional Wasserstein generative adversarial network (CWGAN), where vast similar samples were generated by training CWGAN with vast source domain samples, and the pre-trained CWGAN was fine-tuned to achieve transfer learning in the target domain with limited samples. In summary, methods for small samples either utilize regularization technologies and the feature extraction advantages of models, or generate substantial high-quality samples based on the distribution of real samples, or apply emerging machine learning technologies such as meta-learning and transfer learning.
Preprint submitted to Elsevier
Large convolution kernels are beneficial for enhancing robustness[25], while deep stacks of small convolution kernels effectively extract abstract features. Also, the time-step information in vibration signals cannot be ignored, and unlike a CNN, an RNN is designed to model it.
To learn temporal and hidden features in different locations, an effective strategy is to employ a gated RNN structure such as LSTM or GRU. LSTM has an excellent time modeling capability but many parameters, which easily leads to overfitting under small samples. Moreover, it is inappropriate to assume that signals only propagate information forward, so BiGRU, which performs similarly to BiLSTM with fewer parameters and propagates information in both directions, is a suitable choice. Zhao et al.[26] put forward a method combining Manifold Embedded Distribution Alignment (MEDA) and BiGRU for fault diagnosis. The noise of the original signals was removed using spectrum information, BiGRU was utilized to learn features, and MEDA was then used to align auxiliary and unlabeled samples. However, the method relies on artificial prior knowledge for denoising and does not analyze the impact of small samples or time complexity. Yang et al.[27] proposed a fault diagnosis method based on BiGRU and attention. BiGRU was utilized to gain advanced expressions from features extracted by CNN, and attention vectors were then used to diagnose each segment. However, reference [27] does not discuss the influence of small samples, its training procedure is relatively conventional, and the performance of the model has not been further exploited. In addition, DCA-BiGRU uses only 60% as many training samples as reference [22] while tackling a more difficult diagnosis task.
Although previous methods have achieved relatively satisfactory results, deep learning models often require plenty of samples to achieve ideal generalization. Because labeled data are relatively scarce, models are often unable to fully learn the various effective features of the limited samples and are prone to overfitting, which increases learning difficulties[28]. Besides, the latest activation functions and the various gradient descent back propagation algorithms have not been compared in depth for fault diagnosis under small samples. Finally, the interference of different working conditions makes it difficult to guarantee efficiency, which puts forward higher requirements.
Therefore, focusing on regularization technologies and the feature extraction advantages of models, a new fault diagnosis method for small samples based on dual path convolution with attention mechanism and BiGRU is proposed. The convolution layers aim to extract high- and low-frequency features of signals. Meanwhile, the attention mechanism, which can be regarded as a cost-sensitive learning method[28], weights the fused features through allocated weights and sensitive information selection, directing attention to the main spectra. Then, BiGRU captures the hidden information at different time-sequence positions. In addition to strengthening the connection between channels and reducing parameters, GAP and big kernels make the model more robust than a capsule network by increasing receptive fields[29]. Moreover, the latest regularization methods further improve the generalization of DCA-BiGRU: label smoothing regularization (LSR) is introduced to balance the distribution differences between the labeled samples and calibrate DCA-BiGRU; the improved AMSGrad optimizer (AMSGradP) realizes adaptive gradient optimization; 1D-Meta-ACON (activate or not) adaptively activates neurons; and adaptive batch normalization (AdaBN) gives DCA-BiGRU stronger transfer performance.
The main contributions of the paper are as follows:
1. For small sample fault diagnosis, a novel method based on the designed attention mechanism and BiGRU is proposed from the perspectives of regularization and model structure, and the effects of LSR, activation functions and back propagation algorithms are explored for the first time. The proposed method also achieves a higher test accuracy.
2. The sensitivities of the attention mechanism and BiGRU to the ratio of training samples are discussed, where the proposed attention mechanism can capture the channel and spatial information of vibration signals. Placing GAP after BiGRU is shown to be beneficial for diagnostic performance. Also, visualization techniques are utilized to gain a better understanding of the blocks in DCA-BiGRU.
3. For the noise contained in practical industrial data, a small sample transfer diagnosis framework based on pre-training is proposed. The experimental results show that it has excellent generalization, adaptability and robustness compared to other bearing and gearbox diagnosis models under complex working conditions.
The rest of this paper is organized as follows. Section 2 introduces the basic theoretical models for fault diagnosis. DCA-BiGRU and the latest regularization training strategies are introduced in detail in Section 3. Section 4 presents comparative experiments and analysis to demonstrate the performance of the proposed model. Section 5 draws the conclusion and discusses prospects for future research.
2. Methodologies
2.1. Convolutional neural network
A CNN generally consists of two modules: a filter block including convolution and pooling, and a classification block including full connection. The general CNN for fault diagnosis is shown in Fig.1.
Figure 1: CNN for fault diagnosis
In signal processing, a 1D-CNN calculates the delayed accumulation of signals with a shared kernel. The output is shown in Eq.(1).
$$y = \mathrm{ReLU}\Big(\sum_{w=1}^{W} k_w x_{t-w+1} + b_w\Big) \tag{1}$$
where $k_w$ and $b_w$ are the weight and bias matrices, respectively, and $x_{t-w+1}$ are the input signals.
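The delayed accumulation in Eq.(1) can be illustrated with a minimal pure-Python sketch. The function name `conv1d_relu`, the use of a single scalar bias `b` (Eq.(1) writes it as $b_w$) and the "valid" boundary handling are assumptions made for illustration, not details taken from the paper.

```python
def conv1d_relu(x, k, b=0.0):
    """Valid 1D convolution of Eq.(1) followed by ReLU.

    x: input signal (list of floats); k: kernel weights k_1..k_W;
    b: one scalar bias (an assumption made for this sketch).
    """
    W = len(k)
    out = []
    # t runs over the positions where the whole kernel fits inside x (0-based).
    for t in range(W - 1, len(x)):
        # sum_{w=1..W} k_w * x_{t-w+1}; with 0-based lists k[w] pairs with x[t-w].
        s = sum(k[w] * x[t - w] for w in range(W)) + b
        out.append(max(0.0, s))  # ReLU
    return out


y = conv1d_relu([1.0, 2.0, 3.0, 4.0], [1.0, -1.0])
print(y)
```

With the kernel `[1.0, -1.0]` the sketch computes first differences of the signal, and ReLU then zeroes any negative response.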
The pooling layer selects features and decreases parameters to accelerate convergence. Maximum pooling is often utilized in fault diagnosis because it filters out insignificant information, as shown in Eq.(2).
$$y_i = \max_{j \in i} x_j \tag{2}$$
where $y_i$ is the pooled representation, and $j$ indexes the neurons in the $i$-th pooling region.
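As a small illustration of Eq.(2), the sketch below pools a 1D sequence with a window of 2 and a stride of 2. The function name `max_pool1d` and its defaults are assumptions chosen for this example.

```python
def max_pool1d(x, size=2, stride=2):
    """Max pooling over a 1D feature sequence (non-overlapping by default)."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]


print(max_pool1d([1.0, 5.0, 2.0, 3.0, 9.0, 0.0]))
```

Each output keeps only the largest value of its region, so small fluctuations inside a region are discarded; an odd trailing element that does not fill a window is dropped.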
Batch Normalization (BN) not only alleviates internal covariate shift and improves training efficiency, but also acts as a regularizer because batches are selected randomly, so it can enhance generalization in place of Dropout.
Activation functions enhance the learning capacity of the neural network and improve computational efficiency.
The distributed feature representations of vibration signals are mapped to the sample label space through the fully connected layer. Finally, SoftMax is applied for fault diagnosis.
2.2. Bidirectional Gated Recurrent Unit
As shown in Fig.2, a gated recurrent unit (GRU) consists of an update gate $z_t$ and a reset gate $r_t$. $z_t$ controls the extent to which $h_{t-1}$ enters $h_t$: the higher its value, the more new information enters $h_t$. $r_t$ controls the extent to which $h_{t-1}$ enters $\tilde{h}_t$: the smaller its value, the less information enters $\tilde{h}_t$. $z_t$ and $r_t$ at moment $t$ are calculated as shown in Eqs.(3)-(7).
$$r_t = \sigma[W_r \cdot \mathrm{cat}(h_{t-1}, x_t)] \tag{3}$$
$$z_t = \sigma[W_z \cdot \mathrm{cat}(h_{t-1}, x_t)] \tag{4}$$
$$\tilde{h}_t = \tanh[W_{\tilde{h}} \cdot \mathrm{cat}(r_t \odot h_{t-1}, x_t)] \tag{5}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{6}$$
$$y_t = \sigma(W_o \cdot h_t) \tag{7}$$
where $W_r$, $W_z$, $W_{\tilde{h}}$ and $W_o$ are weight matrices; $\mathrm{cat}()$ means that feature vectors are concatenated; $\sigma$ is the sigmoid function; $\odot$ means element-wise product; $h$ is the hidden state of the cell; and $\tilde{h}_t$ is the candidate content in the current state, which controls the degree of receiving new information.
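The gate equations can be traced step by step with a minimal pure-Python sketch of one GRU update, covering the reset gate, update gate, candidate state and new hidden state. The helper names (`sigmoid`, `matvec`, `gru_step`) and the list-of-lists weight layout are assumptions for illustration; biases are omitted because the equations in this section do not include them.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step; each weight matrix has shape H x (H + I)."""
    cat = h_prev + x_t                                   # cat(h_{t-1}, x_t)
    r = [sigmoid(a) for a in matvec(W_r, cat)]           # reset gate
    z = [sigmoid(a) for a in matvec(W_z, cat)]           # update gate
    gated = [r_i * h_i for r_i, h_i in zip(r, h_prev)] + x_t
    h_cand = [math.tanh(a) for a in matvec(W_h, gated)]  # candidate state
    # New hidden state: blend of the old state and the candidate.
    return [(1 - z_i) * h_i + z_i * c_i
            for z_i, h_i, c_i in zip(z, h_prev, h_cand)]


zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]               # H = 2, I = 1
print(gru_step([0.3], [1.0, -2.0], zeros, zeros, zeros))
```

With all-zero weights both gates equal 0.5 and the candidate is 0, so the step simply halves the previous hidden state, which makes the blending role of the update gate easy to check by hand.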
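For the bidirectional gated recurrent unit (BiGRU), the forward state $\overrightarrow{h}_t$ and the backward state $\overleftarrow{h}_t$ of the signals, which do not share parameters, are connected through different hidden layers and act together on the result $h_t$ to express richer features, as shown in Eq.(8).
$$\begin{aligned} \overrightarrow{h}_t &= GRU(x_t, \overrightarrow{h}_{t-1}), \\ \overleftarrow{h}_t &= GRU(x_t, \overleftarrow{h}_{t-1}), \\ h_t &= w_t \overrightarrow{h}_t + v_t \overleftarrow{h}_t + b_t \end{aligned} \tag{8}$$
where $w_t$ and $v_t$ are the weights corresponding to the forward and backward states of BiGRU, respectively, and $b_t$ is the bias.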
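The combination in Eq.(8) can be sketched as two independent passes over the sequence whose states are merged per time step. In the sketch below, `step_fwd` and `step_bwd` are placeholders for two GRU cells with separate parameters, and the scalar `w`, `v`, `b` are a simplification of $w_t$, $v_t$, $b_t$; all of these names are assumptions. The backward pass reads the sequence in reverse, so its "previous" state is the one from the following time step.

```python
def bigru(xs, step_fwd, step_bwd, h0, w=1.0, v=1.0, b=0.0):
    """Combine forward and backward recurrent passes as in Eq.(8).

    step_fwd / step_bwd: callables (x_t, h) -> new h with separate parameters.
    """
    fwd, h = [], h0
    for x in xs:                      # left-to-right pass
        h = step_fwd(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(xs):            # right-to-left pass
        h = step_bwd(x, h)
        bwd.append(h)
    bwd.reverse()                     # align backward states with time index t
    return [[w * f + v * g + b for f, g in zip(ft, bt)]
            for ft, bt in zip(fwd, bwd)]


def running_sum(x, h):
    """Toy stand-in for a GRU cell: accumulates the inputs."""
    return [hi + xi for hi, xi in zip(h, x)]


print(bigru([[1.0], [2.0], [3.0]], running_sum, running_sum, [0.0]))
```

With the toy cell, every output position sees both what came before it and what comes after it, which is the property the bidirectional structure is meant to provide.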
3. The proposed fault diagnosis method
3.1. Fault diagnosis procedure
In intelligent machine fault diagnosis, multiple structures and deep optimization algorithms can be integrated to achieve good results, and CNN-RNN hybrids have been applied to some extent[30,31]. However, as mentioned in Section 1, the performance of CNN-RNN under small samples has not been further discussed, and the deep optimization algorithms and training modes used are conventional, so their potential has not been fully explored.
Figure 2: The core structures of GRU cell and BiGRU
Besides, in fault diagnosis, BiGRU makes the output state at the current moment jointly determined by the states of the previous and next moments. The output of the last hidden neuron is generally taken as the final hidden feature for diagnosis, because it carries the most abundant features. Nevertheless, this strategy ignores the signal features learned by the other GRU cells.
Therefore, an intelligent fault diagnosis method called DCA-BiGRU is proposed, which is composed of data enhancement, dual path convolution, attention mechanism, BiGRU, GAP and a diagnosis layer, as shown in Fig.4.
As shown in Fig.3, in practical application, the specific steps of fault diagnosis based on DCA-BiGRU are as follows:
1) Obtain the original signals and perform data segmentation and standardization.
2) Divide the signals into training, validation and test samples.
3) Build the model structure and diagnostic method.
4) Offline training: use the training set and regularization strategies to train the model and save the optimal parameters.
5) Online diagnosis: apply the test set to verify the model performance, or load the pre-trained parameters and fine-tune the whole model, using parameter-sharing transfer learning to realize timely training and fault diagnosis.
3.2. Dual path convolution and feature fusion
The dual convolution layer adopts two paths to extract the high- and low-frequency features of signals. On one path, two larger convolution kernels are utilized to learn low-frequency features. As described in Section 1, larger convolution kernels can enhance robustness against noise. On the other path, small convolution kernels are adopted to deepen the network, which integrates four nonlinear activation layers to promote the discriminant capability. The combination of both widens the model and extracts multiscale features, which provides a foundation for BiGRU to further learn advanced features. Finally, the features are fused through element-wise product, where each channel contains abundant features.
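Element-wise fusion requires both paths to end with the same sequence length. The sketch below checks this with the usual output-length formula, using the kernel/stride pairs listed in Table 1 and assuming no padding and no dilation (the paper does not state the padding, so this is an assumption). The function names are hypothetical.

```python
def conv_out_len(n, kernel, stride=1, padding=0):
    """Output length of a 1D convolution or pooling layer without dilation."""
    return (n + 2 * padding - kernel) // stride + 1


def path_len(n, layers):
    """Apply a list of (kernel, stride) layers to an input of length n."""
    for kernel, stride in layers:
        n = conv_out_len(n, kernel, stride)
    return n


# Kernel/stride pairs read from Table 1, assuming no padding.
large_path = [(18, 2), (10, 2), (2, 2)]                        # Conv1d_1, Maxpool_1
small_path = [(6, 1), (6, 1), (2, 2), (6, 1), (6, 2), (2, 2)]  # Conv1d_21/22 and pools
print(path_len(1024, large_path), path_len(1024, small_path))
```

Under these assumptions both paths reduce a 1024-point segment to 124 positions, which agrees with the output shapes reported in Table 1 and makes the element-wise product well defined.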
To enhance the adaptability of DCA-BiGRU to different domains, AdaBN is leveraged to replace BN, where the statistical information is adjusted from the source domain to the target domain to improve generalization[32].
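The idea of adjusting the statistics can be sketched for a single channel: the learned scale and shift are kept, while the mean and variance are re-estimated on target-domain data. The numbers and function names below are illustrative assumptions and not the authors' implementation.

```python
def channel_stats(batch):
    """Mean and (biased) variance of one channel over a batch of values."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n
    return mean, var


def batch_norm(batch, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in batch]


source = [0.0, 1.0, 2.0, 3.0]       # activations seen during training
target = [10.0, 11.0, 12.0, 13.0]   # same channel under a new working condition

src_mean, src_var = channel_stats(source)
tgt_mean, tgt_var = channel_stats(target)

# Plain BN at test time reuses the source statistics, so the output is shifted.
plain = batch_norm(target, src_mean, src_var)
# AdaBN re-estimates the statistics on the target domain; gamma and beta are kept.
adapted = batch_norm(target, tgt_mean, tgt_var)
print(plain)
print(adapted)
```

After adaptation the target activations are centered again, so the following layers receive inputs in the range they were trained on.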
Figure 3: Fault diagnosis framework based on sharing parameters
[Figure 3 content: offline training (signal processing by sliding window sampling, signal normalization and signal partitioning; training DCA-BiGRU with LSR, 1D-Meta-ACON, AMSGradP, back propagation and early stopping; saving parameters), online testing, and online application (sharing parameters, AdaBN, fine-tuning on industrial samples), each ending in a diagnosis result.]
Figure 4: Overall schema for the proposed network architecture of DCA-BiGRU
[Figure 4 content: dual path convolutional layer with AdaBN, feature fusion layer, attention mechanism, bidirectional GRU (hide=64), global average pooling and FC layer with SoftMax. Kernel/stride labels: (15, 2), (10, 2); (6, 1), (6, 1), (6, 1), (6, 2); (2, 2) pooling; (1, 1), (1, 1) in the attention block.]
3.3. The proposed attention mechanism of signals
The attention mechanism and LSR can be regarded as cost-sensitive learning methods, and 1D-Meta-ACON can be seen as a means of meta-learning. For small samples, these regularization methods contribute to the generalization and domain adaptability of the model.
3.3.1. Label smoothing regularization
Cross entropy loss (CE, $l_0$) tends to focus on one direction, leading to poor regulating capability. Consequently, a smoothing coefficient $\varepsilon$ is added to soften the labels, which counters the overconfidence of models and enhances learning capability. LSR ($l$) can not only improve generalization, but also calibrate models. It is mostly used in the field of image recognition, but rarely studied in fault diagnosis.
Suppose $p(k)$ is the predicted distribution, $q(k)$ is the real distribution, $q'(k)$ is the real distribution after label smoothing with coefficient $\varepsilon$ and $K$ categories, and the label distribution is set to the uniform distribution $\mu(k) = 1/K$. The relationship between $l_0$ and $l$ is deduced as shown in Eq.(9).
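$$\begin{aligned} l &= -\sum_{k=1}^{K} \log(p(k))\, q'(k) \\ &= -\sum_{k=1}^{K} \log(p(k)) \Big[(1-\varepsilon)\, q(k) + \frac{\varepsilon}{K}\Big] \\ &= (1-\varepsilon)\Big[-\sum_{k=1}^{K} \log(p(k))\, q(k)\Big] + \varepsilon\Big[-\sum_{k=1}^{K} \frac{\log(p(k))}{K}\Big] \\ &= (1-\varepsilon)\, l_0 + \varepsilon\Big[-\sum_{k=1}^{K} \frac{\log(p(k))}{K}\Big] \end{aligned} \tag{9}$$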
Learning smooth labels instead of real labels alleviates overfitting, so we argue that LSR has potential advantages in dealing with small samples in fault diagnosis.
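The decomposition of the LSR loss into a scaled cross entropy term plus a uniform term can be checked numerically. The sketch below uses made-up probabilities and $\varepsilon = 0.1$; the function names and values are assumptions for illustration only.

```python
import math


def cross_entropy(p, q):
    """l0 = -sum_k q(k) * log p(k)."""
    return -sum(qk * math.log(pk) for pk, qk in zip(p, q))


def smooth_labels(q, eps):
    """q'(k) = (1 - eps) * q(k) + eps / K."""
    K = len(q)
    return [(1 - eps) * qk + eps / K for qk in q]


p = [0.7, 0.2, 0.1]   # predicted distribution (illustrative values)
q = [1.0, 0.0, 0.0]   # one-hot real distribution
eps = 0.1

lsr = cross_entropy(p, smooth_labels(q, eps))
uniform_term = -sum(math.log(pk) for pk in p) / len(p)
# Decomposed form: l = (1 - eps) * l0 + eps * uniform_term
decomposed = (1 - eps) * cross_entropy(p, q) + eps * uniform_term
print(lsr, decomposed)
```

The two values agree up to rounding, and the smoothed loss is larger than the plain cross entropy here because the uniform term penalizes the very small probabilities assigned to the other classes.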
3.3.2. The proposed 1D-signal attention mechanism
In Fig.5, a 1D-signal attention mechanism is proposed, which indicates what the model should focus on in the original signals.
To calculate attention between channels, it is necessary to squeeze the dimension of the input feature matrix, for which global pooling is generally adopted. Compared with GAP, which focuses on the overall information, we argue that global max pooling (GMP) provides the crucial pulses ($x_{GMP}$) of the signal characteristic matrix ($x$). In theory, these decisive pulses are the main distinguishing criterion for fault diagnosis, so GMP is more suitable than GAP for the proposed attention block, which will be verified by experiment.
The GMP of the $c$-th channel is calculated as in Eq.(10).
$$x^{c}_{GMP} = \max_{0 \le j < d} x_c(1, j) \tag{10}$$
Figure 5: The architecture of the proposed Attention Block
[Figure 5 content: input (vibration signal features after dual conv and feature fusion, B×C1×D); blocks AdaptiveMaxPool1d, Cat, Conv1d, AdaBN, Meta-ACON, Split, Conv1d, Sigmoid and Re-weight; intermediate shapes B×C1×1, B×C2×(D+1), B×C2×D and B×C1×D; output (feature with weight, B×C1×D).]
Besides, in order to capture the spatial position information, the relationship between $x_{GMP}$ and $x$ should be established, so they are concatenated together and sent into a convolution mapping function $F_1$ that shares 1 × 6. The dependency relationship is encoded to yield the intermediate characteristic connection matrix $f$, as shown in Eq.(11).
$$f = \delta(F_1[\mathrm{cat}(x, x_{GMP})]) \tag{11}$$
where $\delta$ is the 1D-Meta-ACON activation function.
Then, $f$ is split into $x'$ and the remainder. Because the transformed characteristic matrix $x'$ contains not only the information of the critical pulse spectra but also the original signal characteristics $x$, only $x'$ is retained. Another 1 × 1 convolution mapping function $F_2$ transforms $x'$ to the same number of channels as $x$, as shown in Eq.(12).
$$g = \sigma[F_2(f_{x'})] \tag{12}$$
Finally, the output $y_c$ is shown in Eq.(13).
$$y_c = x_c \odot g_c \tag{13}$$
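The data flow of the attention block (global max pooling, concatenation, a first mapping with activation, split, a second mapping with sigmoid, and re-weighting) can be sketched in pure Python. Several simplifications are assumptions of this sketch rather than the paper's design: both mappings are written as pointwise (1 × 1) convolutions without bias or AdaBN, and a Swish function stands in for 1D-Meta-ACON. All names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def pointwise_conv(x, W):
    """1x1 convolution: mixes channels independently at every position."""
    length = len(x[0])
    return [[sum(W[o][c] * x[c][j] for c in range(len(x))) for j in range(length)]
            for o in range(len(W))]


def attention_block(x, W1, W2, act):
    """x: C1 x D features; W1: C2 x C1; W2: C1 x C2; act: activation function."""
    x_gmp = [max(row) for row in x]                        # global max pooling per channel
    cat = [row + [m] for row, m in zip(x, x_gmp)]          # C1 x (D + 1)
    f = [[act(v) for v in row] for row in pointwise_conv(cat, W1)]
    f_x = [row[:-1] for row in f]                          # split: keep the D signal positions
    g = [[sigmoid(v) for v in row] for row in pointwise_conv(f_x, W2)]
    return [[xv * gv for xv, gv in zip(xr, gr)]            # re-weight the input
            for xr, gr in zip(x, g)]


def swish(v):
    """Stand-in activation for this sketch."""
    return v * sigmoid(v)


x = [[1.0, -2.0, 3.0], [0.5, 0.5, -1.0]]                   # C1 = 2, D = 3
y = attention_block(x, [[0.0, 0.0]], [[0.0], [0.0]], swish)
print(y)
```

With zero weights every attention value equals 0.5, so the output is the input scaled by one half; trained weights would instead raise or lower individual positions and channels.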
3.3.3. The improved 1D-Meta-ACON
Aiming at the nonlinearity of vibration signals, a new activation function, Meta-ACON[33], is applied in the proposed attention block. It is neither ReLU nor Swish; rather, both are considered and generalized into a general form that can learn whether to activate.
Whether or not to activate neurons is determined by the smoothing coefficient $\beta_c$, so as to dynamically and adaptively eliminate inessential information. This is similar to the idea of the proposed 1D-signal attention mechanism, which focuses on the central part of signals, and it helps to improve generalization and transfer performance. Inspired by this, $\beta_c$ is adapted to 1D signals, giving 1D-Meta-ACON, as shown in Eq.(14).
$$\beta_c = \sigma\Big[F_4\Big(F_3\Big(\frac{1}{D}\sum_{d=1}^{D} x_{c,d}\Big)\Big)\Big] \tag{14}$$
In forward propagation, $\beta_c$ is calculated first: the mean value of the feature $x$ is calculated along the $D$ dimension, and after the transforms $F_3$ and $F_4$ (1 × 1 convolutions), $\beta_c$ in (0, 1) is obtained through Sigmoid. It controls whether, and to what degree, to activate, where 0 means inactive. Finally, adaptive variables $p_1$ and $p_2$ are set, and supposing $p = p_1 - p_2$, the activation output ($f_a$) is obtained by Eq.(15), where $p_1$ and $p_2$ are adaptively adjusted by back propagation.
$$f_a = p \times x_{c,d} \times \sigma[\beta_c \times p \times x_{c,d}] + p_2 \times x_{c,d} \tag{15}$$
1D-Meta-ACON is a general form, which not only solves the dead neuron problem, but also requires only a few parameters to learn whether to activate. This research explores whether it makes a difference in small sample fault diagnosis.
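The 1D-Meta-ACON computation can be written out directly: a per-channel mean, a mapping to $\beta_c$, and the activation built from $p_1$, $p_2$ and $\beta_c$. In the sketch below, `beta_fn` stands in for the two 1 × 1 convolutions followed by Sigmoid, and the per-channel lists `p1`, `p2` are plain numbers rather than trained parameters; these are assumptions for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def meta_acon_1d(x, p1, p2, beta_fn):
    """x: C x D features; p1, p2: per-channel scalars (lists of length C).

    beta_fn maps the per-channel means (length C) to per-channel beta values;
    it is a placeholder for sigmoid(F4(F3(.))).
    """
    means = [sum(row) / len(row) for row in x]    # mean over the D dimension
    beta = beta_fn(means)
    out = []
    for c, row in enumerate(x):
        p = p1[c] - p2[c]
        out.append([p * v * sigmoid(beta[c] * p * v) + p2[c] * v for v in row])
    return out


x = [[0.0, 2.0]]
# With p1 = 1, p2 = 0 and beta fixed to 1, the form reduces to Swish: x * sigmoid(x).
swish_like = meta_acon_1d(x, [1.0], [0.0], lambda m: [1.0] * len(m))
# With p1 == p2 the gated term vanishes and the output is the linear map p2 * x.
linear = meta_acon_1d(x, [0.5], [0.5], lambda m: [sigmoid(v) for v in m])
print(swish_like, linear)
```

The two special cases show how a single parameterized form covers both a smooth gated activation and a purely linear response.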
3.3.4. AMSGradP
AdaBN contributes to improving the generalization and scale invariance of the model in the same way as BN. However, Heo et al. pointed out that gradient descent with momentum (GDM) causes the effective step size to decrease rapidly during back propagation, resulting in slower convergence or even sharp minimizers. AdamP[34] was therefore proposed, which alleviates the problem by dropping the radial component during the optimization update, regulating the growth of the weight norm and retarding the decay of the effective step size, thus training the model at an unobstructed speed.
In this study, small samples easily lead to convergence to a local optimum. However, the authors of [34] did not provide a corresponding improvement of the more advanced AMSGrad. Inspired by this, the ideas of reference [34] are introduced into AMSGrad, and the result is called AMSGradP. In the Appendix, Algorithm 1 outlines the pseudocode of AMSGradP.
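Since Algorithm 1 is given only in the Appendix, the following is a rough sketch of how an AMSGrad update could be combined with an AdamP-style projection. It is not the authors' algorithm: bias correction, weight decay and Nesterov momentum are omitted, the threshold `delta` and the cosine test follow the general idea of AdamP, and all names are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out_radial(w, u):
    """Remove from u its component along w."""
    scale = dot(w, u) / dot(w, w)
    return [ui - scale * wi for wi, ui in zip(w, u)]


def amsgradp_step(w, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, delta=0.1):
    """One AMSGrad step with a projection of the update (no bias correction)."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(state["m"], g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(state["v"], g)]
    v_hat = [max(a, b) for a, b in zip(state["v_hat"], v)]  # AMSGrad: never decreases
    u = [mi / (math.sqrt(vh) + eps) for mi, vh in zip(m, v_hat)]
    norm_w, norm_g = math.sqrt(dot(w, w)), math.sqrt(dot(g, g))
    if norm_w > 0 and norm_g > 0:
        cos = abs(dot(w, g)) / (norm_w * norm_g)
        if cos < delta / math.sqrt(len(w)):   # gradient nearly orthogonal to w
            u = project_out_radial(w, u)      # drop the radial component
    state.update(m=m, v=v, v_hat=v_hat)
    return [wi - lr * ui for wi, ui in zip(w, u)]


state = {"m": [0.0, 0.0], "v": [0.0, 0.0], "v_hat": [0.0, 0.0]}
w = amsgradp_step([1.0, 0.0], [0.0, 1.0], state)
print(w, state["v_hat"])
```

The projection only triggers when the gradient is almost orthogonal to the weights, which is the situation that arises for scale-invariant weights such as those followed by a normalization layer.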
3.4. BiGRU and GAP in fault diagnosis
LSTM has been described in Section 1. By merging the forget gate and the input gate into the update gate, GRU has a simpler structure with approximately 3/4 of the parameters of LSTM, while performing similarly to LSTM in various tasks[35]. GRU is therefore more suitable for small samples. At the same time, assuming that signals only have a deep correlation in one direction is not appropriate. As mentioned in Section 2.2, BiGRU is more suitable for this research.
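The 3/4 ratio follows from counting weight blocks: a GRU has three (reset, update, candidate) where an LSTM has four. The sketch below assumes the PyTorch parameter layout (input weights, hidden weights and two bias vectors per block), an input size of 124 and a hidden size of 64 per direction; the function name is hypothetical.

```python
def rnn_params(input_size, hidden_size, gates):
    """Parameters of one recurrent direction with `gates` weight blocks.

    Assumes input weights, hidden weights and two bias vectors per block.
    """
    per_gate = hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size
    return gates * per_gate


gru = rnn_params(124, 64, gates=3)    # GRU: reset, update, candidate
lstm = rnn_params(124, 64, gates=4)   # LSTM: input, forget, cell, output
print(2 * gru, gru / lstm)            # bidirectional GRU total and GRU/LSTM ratio
```

Under these assumptions the bidirectional GRU has 72960 parameters, which agrees with the BiGRU entry in Table 1, and the ratio to an LSTM of the same size is exactly 0.75.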
An FC layer with many parameters can greatly increase the risk of overfitting, while GAP produces no extra parameters and retains part of the spatial coding information of the signals. In addition, as described in Section 3.1, we consider not only the output of the last GRU cell but also the outputs of all GRU cells, and GAP fulfills this requirement by preserving the features learned by the other GRU cells.
Lastly, we hold the view that the feature matrix has gathered the critical spectra of the original signals, whose global information should be focused on, so GAP is preferred to GMP. The detailed structure of DCA-BiGRU is shown in Table 1, where a smaller number of parameters facilitates small sample fault diagnosis.
Table 1
The structures of DCA-BiGRU
Type        Kernel/Stride  Unit  Activation    AdaBN  Input         Output        Parameters
Conv1d_1    18/2 & 10/2    /     1D-Meta-ACON  YES    (-1,1,1024)   (-1,30,248)   19036
Maxpool_1   2/2            /     1D-Meta-ACON  /      (-1,30,248)   (-1,30,124)   /
Conv1d_21   6/1 & 6/1      /     1D-Meta-ACON  YES    (-1,1,1024)   (-1,40,1014)  15816
Maxpool_21  2/2            /     1D-Meta-ACON  /      (-1,40,1014)  (-1,40,507)   /
Conv1d_22   6/1 & 6/2      /     1D-Meta-ACON  YES    (-1,40,507)   (-1,30,249)   14976
Maxpool_22  2/2            /     1D-Meta-ACON  /      (-1,30,249)   (-1,30,124)   /
Attention   1/1            /     1D-Meta-ACON  YES    (-1,30,124)   (-1,30,124)   666
BiGRU       /              128   Tanh          /      (-1,30,124)   (-1,30,128)   72960
GAP         /              /     /             /      (-1,30,128)   (-1,30,1)     /
FC          /              10    SoftMax       /      (-1,30)       (-1,10)       310
Total: 123764
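The GAP step reduces each row of BiGRU outputs to a single mean, so every output feature contributes to the value passed to the FC layer and no parameters are added. The toy sizes below are assumptions chosen for readability; Table 1 reports an input of 30 × 128 and an output of 30 × 1.

```python
def global_average_pool(h):
    """h: rows x features matrix of BiGRU outputs; returns one mean per row."""
    return [sum(row) / len(row) for row in h]


# Toy example with 3 rows of 4 output features.
h = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 0.0, 0.0, 8.0],
     [-1.0, 1.0, -1.0, 1.0]]
pooled = global_average_pool(h)
print(pooled)
```

Taking only the last feature of each row would give 4.0, 8.0 and 1.0 here, whereas the averages reflect the whole row; this is the contrast with using the last GRU cell alone.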
4. Result analysis and discussion
The proportion of training samples of each class ($\alpha$) is regarded as the evaluation criterion. We argue that if $\alpha$<0.5, the data can be called small samples[36]. Firstly, the superiority of the newly proposed regularization training methods will be verified. Then, for $\alpha$=0.1~0.5 (around 20~100 training samples), the small sample learning capacity of different models will be verified, and the performance will be evaluated under different working conditions and noise. Finally, parameter sharing applied to small sample transfer learning on a new data set will be discussed, together with visual interpretations of DCA-BiGRU. All experiments are performed under the same random seed, and the experimental settings are shown in Table 2.
Table 2
Description of experimental parameters
Settings                     Value
Batch_size                   32
Maximum epochs               150
Optimizer                    AMSGradP
Learning rate                0.001
Weight decay (except bias)   0.0001
Early Stopping (patience)    10
AMSGradP (Nesterov)          True
1D-Meta-ACON (reduction)     16
Attention Block (C1/C2/D)    30/6/124
The experiments are implemented in PyTorch 1.8.0 and Python 3.8.5, running on an Intel(R) Core i7-6700HQ CPU @3.40GHz (8G RAM) with a GTX970M GPU. The flow chart in Fig.3 illustrates the overall framework for fault diagnosis. It has been shown that fine-tuning the model can obtain more accurate diagnosis results at an affordable time cost[36,37]; hence the paper fine-tunes the whole DCA-BiGRU for the anti-noise experiment.
4.1. Data enhancement
Data enhancement aims to generate more samples from vibration signals and to prevent the network from learning irrelevant features. As shown in Fig.4, assuming that the sliding window length is $l$, a sample of length $l$ is generated starting from the $i$-th point, and adjacent samples are set with an overlap.
Assuming the sliding step size is $m$ and $N$ is the signal length, the quantity of samples is $n = \lfloor (N - l)/m \rfloor + 1$ ($m$=400, $l$=1024).
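The overlapping segmentation and the sample count can be sketched as follows, with a window of 1024 points and a step of 400 points. The record length of 120000 points is a hypothetical value used only to exercise the formula, and the function names are assumptions.

```python
def sliding_windows(signal, length=1024, step=400):
    """Overlapping segments of `length` points taken every `step` points."""
    return [signal[i:i + length] for i in range(0, len(signal) - length + 1, step)]


def n_samples(N, length=1024, step=400):
    """Number of full windows that fit into a signal of N points."""
    return (N - length) // step + 1


signal = list(range(120000))   # hypothetical record length
segments = sliding_windows(signal)
print(len(segments), n_samples(len(signal)))
```

Consecutive segments share 624 points (1024 minus 400), which is what allows many samples to be drawn from a single record.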
4.2. Model evaluation and metrics method
Diagnosis performance can be summarized by a confusion matrix, from which two valuable indicators are derived.
In the multi-class case, the F1-score is the average of the F1-scores of each class, with weighting depending on the averaging parameter; sensitivity (recall) and precision are the key quantities, calculated as in Eqs.(16)-(17).
$$precision = \frac{TP}{TP + FP}, \quad sensitivity = \frac{TP}{TP + FN} \tag{16}$$
$$F_\beta = \frac{(1 + \beta^2)(precision \times sensitivity)}{\beta^2 \times precision + sensitivity} \quad (\beta = 1) \tag{17}$$
where True Positive (TP) is an outcome where the model correctly predicts the positive class, False Positive (FP) is an outcome where the model incorrectly predicts the positive class, and False Negative (FN) is an outcome where the model incorrectly predicts the negative class. The weight of sensitivity is $\beta$ times that of precision.
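Precision, sensitivity and the F-score can be computed from a confusion matrix as in the sketch below, which uses the standard F-beta definition with the product of precision and sensitivity in the numerator. The 2 × 2 matrix is made-up data and the function names are assumptions.

```python
def precision_sensitivity(conf, cls):
    """conf[i][j]: number of samples of true class i predicted as class j."""
    tp = conf[cls][cls]
    fp = sum(conf[i][cls] for i in range(len(conf))) - tp
    fn = sum(conf[cls]) - tp
    return tp / (tp + fp), tp / (tp + fn)


def f_beta(precision, sensitivity, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * precision * sensitivity / (b2 * precision + sensitivity)


conf = [[8, 2],
        [1, 9]]
p, s = precision_sensitivity(conf, 0)
print(p, s, f_beta(p, s))
```

For a multi-class result the same functions can be applied per class and the scores averaged; a class that is never predicted would need a guard against division by zero, which this sketch leaves out.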
The geometric mean (G-mean) tries to maximize the accuracy on each of the classes while keeping these accuracies balanced. For multi-class problems it is a higher-order root of the product of the sensitivities of all classes, as shown in Eq.(18).
$$G\text{-}mean = \sqrt[N]{\prod_{n=1}^{N} sensitivity_n} \tag{18}$$
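A minimal sketch of the G-mean, computed from a confusion matrix whose rows are true classes, is given below. The matrices are made-up data and the function name is an assumption; `math.prod` requires Python 3.8 or later.

```python
import math


def g_mean(conf):
    """N-th root of the product of the per-class sensitivities."""
    sens = [conf[i][i] / sum(conf[i]) for i in range(len(conf))]
    return math.prod(sens) ** (1.0 / len(sens))


balanced = [[8, 2],
            [1, 9]]
collapsed = [[10, 0],
             [10, 0]]   # the second class is never recognized
print(g_mean(balanced), g_mean(collapsed))
```

The second matrix has 50% overall accuracy but a G-mean of zero, which shows why the index is informative when some classes are diagnosed much worse than others.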
4.3. Case 1: Data from CWRU
4.3.1. Description and division of data
The drive end rolling bearing data provided by Case
Western Reserve University[38] is acquired by the device
as shown in Fig.6, where the single point faults(inner ring,
et al.: Preprint submitted to Elsevier Page 6 of 15
outer ring, rolling element) are caused by electrical dis-
charge machining(EDM), and the sampling frequency is
12kHz, with 0~3HP loads and three types of damage de-
grees(0.118/0.356/0.533mm). The acceleration sensor that
is located at the drive end of the motor housing collects
acceleration data. According to different loads, signals are
divided into four data sets: A, B, C, D, as shown in Table 3.
Table 3
Partition of CWRU data sets
Data     Loads    Locations  FD(mm)             Label  𝛼%
A/B/C/D  0/1/2/3  N          /                  0      0.1~0.5
                  IF         0.118/0.356/0.533  1/2/3
                  OR         0.118/0.356/0.533  4/5/6
                  BF         0.118/0.356/0.533  7/8/9
Figure 6: Bearing fault diagnosis model test-bed (labeled parts: fan end, induction motor, drive end, torque transducer, load motor)
4.3.2. The discussion of batch_size
A larger batch_size can shorten the training time of each epoch, but it may also reduce the capacity of generalization, so a balance should be struck between the two. For this reason, with data set B and 𝛼=0.3, only batch_size is changed, and the results are shown in Table 4.
Table 4
Comparison of training results between different batch_size
batch_size  Early stopping (epoch)  eval_loss  eval_acc  Time/s
8           34                      0.5532     100%      311.74
16          40                      0.5684     99.64%    226.82
32          75                      0.5577     99.93%    249.14
64          73                      0.5706     99.64%    244.77
80          126                     0.5804     99.71%    367.85
100         100                     0.5955     99.21%    296.53
128         80                      0.6219     98.50%    223.33
As can be seen, the training difficulty differs between batch sizes, so early stopping is triggered at different epochs. batch_size=8 and batch_size=32 achieve similar accuracy (100% and 99.93%), but the latter takes less time, so batch_size=32 is selected.
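Since the epoch at which training ends is decided by early stopping throughout this section, a generic patience-based rule is sketched below. The stopping criterion and patience used by the authors are not stated in this section, so the class `EarlyStopping`, its `patience` and `min_delta` parameters, and the loss sequence are all assumptions for illustration.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=20, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.59]
stopper = EarlyStopping(patience=3)
stop_epoch = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stop_epoch = epoch
        break
```

In this toy run the loss stops improving after epoch 3, so training halts at epoch 6 and the last value is never reached; a noisier loss curve would trigger the rule at a different epoch, which is consistent with the varying stopping epochs in Table 4.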
4.3.3. Ablation comparative experiment
The ablation experiments regarding DCA-BiGRU (M5) are carried out on the four data sets A, B, C and D. The contrast models are PCA-SVM (M1), DCNN-BiGRU (without attention, M2), DCNN (without attention and BiGRU, M3) and DCA (without BiGRU, M4), with G-mean as the index. To reduce the influence of randomness, each experiment is repeated five times to obtain the error bars shown in Fig.7, where A→A denotes training set→test set. The X-axis shows the proportion of training samples (𝛼). In addition, the running time until early stopping is recorded for each model, load and 𝛼, as shown in Table 6.
From Fig.7 and Table 6, as 𝛼 increases, the models learn more features and G-mean increases. Because the original signals are not elaborately preprocessed, SVM cannot deal effectively with the high-dimensional signals. Comparing M4 and M5 shows that BiGRU, which generates hidden features, has advantages in coping with small samples and raises the performance of the model by 21%~36%. From M3 and M4 in Fig.7c, when 𝛼=0.3, M4=0.9031 while M3=0.6431, which demonstrates that the attention mechanism also generalizes well for small samples, because it guides the model to focus on critical pulses while adding only 666 parameters. With a combination of both, M5=0.9822. On the whole, both are conducive to the performance of the model for small samples. Furthermore, the running time tends to increase with 𝛼, and the more advanced models require more time. In total, DCA-BiGRU has the highest diagnostic efficiency.
4.3.4. Experiment under different loads
In general, a model trained on one data set handles unlabeled samples from other loads poorly. Therefore, it is indispensable to evaluate the migration versatility of DCA-BiGRU in fault diagnosis when the load changes. A, B, C and D have different loads and different signal distributions. Most previous methods test generality under 𝛼=0.7, whereas this paper explores the generality of the proposed model with small samples. The model is trained on Data set B and tested on each data set, and the statistical results are displayed in Table 5.
Table 5
G-mean of DCA-BiGRU under different loads
𝛼(%)  G-mean by data set
      B→A    B→B    B→C    B→D
0.1   95.30  96.78  99.21  93.24
0.2   98.60  99.41  99.09  98.32
0.3   99.48  99.71  99.60  98.56
0.4   99.21  100    99.86  98.83
0.5   99.74  100    100    98.97
As can be seen, when G-mean<0.99, the migration versatility improves as 𝛼 increases. For Data set D with load 3, although the signal distribution changes considerably, the performance does not decrease dramatically (average G-mean=0.97). When G-mean>0.99, the small differences in performance are due to randomness. In addition, for load 0 with 𝛼=0.1, that is, small samples with inapparent fault pulses, G-mean>0.95. DCA-BiGRU still achieves high performance in this case, which indicates that it has good migration versatility.
Figure 7: G-mean values of test under different loads. Panels: (a) A→A, (b) B→B, (c) C→C, (d) D→D. X-axis: 𝛼; Y-axis: G-mean(%); curves: PCA-SVM, DCNN-BiGRU, DCNN, DCA, DCA-BiGRU.
Table 6
Time of test under different loads
Models       𝛼(%)  Time(s)
                   A       B       C       D
PCA-SVM      0.1   0.09    0.01    0.14    0.20
             0.2   0.16    0.02    0.34    0.15
             0.3   0.23    0.61    0.67    0.23
             0.4   0.31    0.12    0.56    0.29
             0.5   0.33    0.16    1.06    0.48
DCNN-BiGRU   0.1   21.31   19.40   15.44   14.57
             0.2   21.42   13.96   27.94   22.90
             0.3   30.84   21.02   43.89   54.01
             0.4   34.27   17.81   33.15   56.73
             0.5   48.68   26.76   63.85   60.34
DCNN         0.1   60.75   171.66  103.42  65.79
             0.2   103.07  123.74  193.69  96.90
             0.3   221.55  118.44  126.34  264.87
             0.4   144.39  244.19  123.14  292.19
             0.5   223.74  290.73  162.09  197.97
DCA          0.1   53.91   71.75   55.26   87.30
             0.2   148.07  161.38  110.96  126.35
             0.3   160.83  113.39  180.93  488.10
             0.4   224.96  140.84  332.17  256.32
             0.5   212.41  272.37  327.20  394.66
Ours         0.1   171.40  151.00  164.88  132.43
             0.2   403.03  270.81  193.33  387.71
             0.3   237.93  338.11  301.05  304.21
             0.4   249.03  242.73  430.96  317.33
             0.5   201.99  410.71  345.82  375.74
4.3.5. Analysis of regularization means
Vibration signals are distributed nonlinearly, while the layer-wise computations of neural networks are linear. In order to avoid vanishing gradients, nonlinear non-saturating activation functions are generally applied. In recent years, several new activation functions have been widely used, but their benefit to fault diagnosis methods has not been explored carefully. The 1D-Meta-ACON applied in this paper, with only 1098 parameters, combines the advantages of linear and nonlinear activation functions. A suitable function can be selected by comparing their performance.
The related activation functions are shown in Fig.8, where the gradient of Mish is smoother than that of ReLU, and Swish is bounded below, unbounded above, smooth and non-monotonic, so it can be regarded as a smooth form between the linear function and ReLU.
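For reference, the five activation functions compared in Fig.8 can be written in a few lines using their standard definitions. This sketch is for illustration only; the ELU parameter `alpha=1.0` is an assumed default, and 1D-Meta-ACON is not reproduced here because its learnable parameters are defined elsewhere in the paper.

```python
import math


def relu(x):
    return max(0.0, x)


def softplus(x):
    # ln(1 + e^x), written in a numerically stable form
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    # x * sigmoid(x): bounded below, unbounded above, non-monotonic
    return x * sigmoid(x)


def mish(x):
    # x * tanh(softplus(x))
    return x * math.tanh(softplus(x))


def elu(x, alpha=1.0):
    # identity for x > 0, alpha * (e^x - 1) otherwise
    return x if x > 0 else alpha * math.expm1(x)


# Evaluate each function on the plotting range of Fig.8 (-5 to 5).
grid = [i / 2 for i in range(-10, 11)]
curves = {f.__name__: [f(x) for x in grid] for f in (relu, swish, mish, elu, softplus)}
```

Evaluating `swish` at -5 and -1 gives about -0.033 and -0.269, so the function first decreases and then rises toward zero on the negative axis, which is the non-monotonic behaviour mentioned above.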
Figure 8: Different activation functions (ReLU, Swish, Mish, ELU, Softplus; X-axis from -5 to 5)
All experiments are carried out on Data set B with 𝛼=0.3, and the training loss and the transfer accuracy are shown in Fig.9 and Fig.10. It can be seen that all models converge, and Softplus, with the maximum loss (0.5856), triggers early stopping earliest. Early stopping is triggered on ELU at epoch 114. In terms of convergence stability, except for ReLU and Softplus, the other four functions are relatively stable, with small differences in loss in later epochs. The final losses of Swish, ELU and 1D-Meta-ACON differ by about 0.0003, while 1D-Meta-ACON needs fewer epochs and converges fastest.
Figure 9: Losses under different activation functions. Annotated points (epoch, loss): (139, 0.5469), (102, 0.5486), (114, 0.5466), (85, 0.5486), (29, 0.5856), (91, 0.5472).
In addition, as shown in Fig.10, Mish, Swish and 1D-Meta-ACON have better migration generality, reaching 97.15%, 98.84% and 99.09% respectively under B→D. Meanwhile, ReLU and Softplus perform poorly under B→D. 1D-Meta-ACON performs best, improving on Swish by 0.25%. If the highest accuracy is not required, any of the three activation functions can be chosen according to the practical situation.
Figure 10: Generality under different activation functions. X-axis: activation function; Y-axis: accuracy(%). Bar labels:
Task  ReLU   ELU    Softplus  Mish   Swish  Meta-ACON
B→A   99.13  98.64  98.61     99.74  99.62  99.62
B→B   100    100    99.73     100    100    100
B→C   99.87  98.64  98.98     99.74  99.62  100
B→D   92.97  97.66  92.91     97.15  98.84  99.09
Similarly, the effects of different adaptive gradient optimization algorithms are compared. For a given neural network, they are used to optimize the objective function, and the parameters are updated continuously in the negative gradient direction until an optimal solution is reached. The closer the solution is to the global optimum, the better the generalization of the neural network. The optimizers, with the converged validation loss in parentheses, are: SGDM (0.576), AMSGrad (0.569), AdamW (0.565), AdaBelief (0.561), AdaBound (0.565), AdamP (0.562), Adam (0.579) and AMSGradP (0.555). The experimental results on the verification set are shown in Fig.13. It can be seen that the accuracy of several optimization algorithms exceeds 99%. Adam has the maximum oscillation amplitude and triggers early stopping at epoch 142. Compared with AMSGrad (99.64%), AMSGradP (99.86%) improves by 0.22%. SGDM also reaches 99.86%, yet it requires more epochs. In terms of convergence speed and value, SGDM, AdaBelief and Adam converge slowly, whereas AMSGradP has the fastest convergence speed and the highest validation accuracy. This indicates that, by projecting out the radial component of the update, AMSGradP retards the reduction of the effective step, so that the algorithm reaches the vicinity of the optimal point with a relatively appropriate effective step and keeps updating nearby, converging to 0.555. The speeds of the other algorithms do not differ much. The above analyses indicate that handling the radial component and controlling norm growth can effectively improve gradient descent algorithms in fault diagnosis.
Figure 11: Performance and time under different algorithms. X-axis: algorithms (SGDM, AMSGrad, AdamW, AdaBelief, AdaBound, Adam, AdamP, AMSGradP); Y-axis: acc(%) under B→D.
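The role of the radial component can be illustrated with the projection step that AdamP-style optimizers apply to an update. The sketch below is not the authors' AMSGradP implementation: it shows only the projection itself, and it omits the moment estimates as well as the test that AdamP uses to decide when the projection is applied. The function name `remove_radial_component` and the 2-D example are hypothetical.

```python
def remove_radial_component(w, update):
    """Project `update` onto the plane orthogonal to the weight vector `w`.

    The radial part of an update (the part parallel to w) only changes the
    weight norm; removing it limits norm growth, which in turn keeps the
    effective step size from shrinking too quickly.
    """
    ww = sum(wi * wi for wi in w)
    if ww == 0.0:
        return list(update)
    coef = sum(wi * ui for wi, ui in zip(w, update)) / ww
    return [ui - coef * wi for wi, ui in zip(w, update)]


w = [3.0, 4.0]   # hypothetical weight vector
u = [1.0, 1.0]   # hypothetical raw optimizer update
p = remove_radial_component(w, u)
radial_part = sum(wi * pi for wi, pi in zip(w, p))  # ~0 after projection
```

After the projection the update is tangential to the weight vector, so a step along it changes the weight norm only to second order in the step size.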
Besides, the transfer generality of each algorithm is evaluated in Fig.11, which displays the performance of DCA-BiGRU trained on Data set B with 𝛼=0.3.
It can be found that AdaBelief has the worst generalization in the rolling bearing task, reaching only 96.76% under B→D. AMSGradP and AdaBound have similar performance, but AMSGradP is more stable for migration because of its smaller error, and its training time is acceptable. Considering all factors, AMSGradP is superior. Based on the above arguments, a benchmark model can be trained with AMSGradP and fine-tuned with AdamW for fast convergence.
Finally, the effects of different optimization strategies on the model are compared (W: AdaBN, GHMC: gradient harmonizing mechanism for classification, FL: Focal Loss, G: GAP, B: BN). For example, WLSRG denotes DCA-BiGRU with AdaBN, LSR and GAP. In the field of NLP, GHMC, FL and LSR have received more attention for unbalanced distributions, but they have not been compared in fault diagnosis under small samples.
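Two of the compared loss functions can be sketched for a single sample as follows. The sketch assumes that LSR denotes label smoothing regularization in its common form (a fraction `eps` of the target mass spread uniformly over all classes) and uses the standard Focal Loss without class weighting. The values `eps=0.1` and `gamma=2.0` are illustrative defaults, not the settings of the paper, and GHMC is omitted.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def label_smoothing_ce(logits, target, eps=0.1):
    """Cross-entropy against a smoothed target distribution."""
    k = len(logits)
    logp = log_softmax(logits)
    # smoothed target: (1 - eps) on the true class plus eps / k on every class
    return -sum(((1 - eps) * (i == target) + eps / k) * lp
                for i, lp in enumerate(logp))


def focal_loss(logits, target, gamma=2.0):
    """Cross-entropy scaled by (1 - p_target)^gamma to down-weight easy samples."""
    logp = log_softmax(logits)[target]
    p = math.exp(logp)
    return -((1 - p) ** gamma) * logp


logits = [2.0, 0.0, 0.0]                 # hypothetical network output, true class 0
ce = focal_loss(logits, 0, gamma=0.0)    # gamma = 0 recovers plain cross-entropy
lsr = label_smoothing_ce(logits, 0)
fl = focal_loss(logits, 0)
```

With label smoothing the loss cannot reach zero even for a perfectly classified sample, because part of the target mass always lies on the other classes.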
Fig.12 displays the influence of different loss functions and optimization strategies. Obviously, the B→D task is the most difficult. First, among WLSR, WFL, WGHMC and WCE, CE has the shortest training time, but its accuracy is only 95.37%. GHMC addresses the outlier and joint parameter training problems of FL and improves by 0.44%. Compared with the other three, LSR has the maximum accuracy, 99.32%. Furthermore, comparing WLSR and BLSR shows an improvement of about 1.36% from applying AdaBN. Finally, as mentioned in 3.3.2, comparing WLSR and WLSRG shows that GAP reaches 98.78%, which is 0.54% lower than GMP in the attention block.
In conclusion, the latest training methods contribute to improving the capacity of generalization.
Figure 12: Accuracy under loss functions and strategies. X-axis: domain migration (B→A, B→B, B→C, B→D); Y-axis: acc(%), 95.0~100.0; legend: WLSR, WGHMC, WFL, WCE, WLSRG, BLSR; annotated training times: 179.55s, 413.83s, 133.12s, 170.51s, 227.01s, 300.43s.
4.3.6. Analysis of anti-noise robustness
Signals mostly contain noise in real situations. Hence, the anti-noise robustness is analyzed under different signal-to-noise ratios (SNR), defined as in Eq.(19).
$$SNR_{\mathrm{dB}} = 10 \lg\left(\frac{P_{signal}}{P_{noise}}\right) \tag{19}$$
where $P_{signal} = \frac{1}{N}\sum_{i=1}^{N} x_{i}^{2}$ is the original signal power and $P_{noise}$ is the noise power.
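Eq.(19) can be inverted to generate Gaussian white noise at a prescribed SNR, as in the sketch below. The function names, the fixed random seed and the synthetic 50 Hz sine sampled at 12kHz are assumptions for illustration; only the definition of SNR and of signal power comes from the text.

```python
import math
import random


def signal_power(x):
    """Mean of the squared samples, as in the definition of P_signal."""
    return sum(v * v for v in x) / len(x)


def add_gaussian_noise(x, snr_db, rng):
    """Return (noisy signal, noise) with the noise scaled to the target SNR in dB."""
    noise_power = signal_power(x) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    noise = [rng.gauss(0.0, sigma) for _ in x]
    return [v + n for v, n in zip(x, noise)], noise


rng = random.Random(0)
# Hypothetical signal: one second of a 50 Hz sine sampled at 12 kHz.
x = [math.sin(2 * math.pi * 50 * t / 12000) for t in range(12000)]
noisy, noise = add_gaussian_noise(x, snr_db=-4, rng=rng)
measured_snr = 10 * math.log10(signal_power(x) / signal_power(noise))
```

At SNR=-4dB the noise power is about 2.5 times the signal power, which gives a sense of how heavily the fault pulses are masked in the hardest case studied here.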
Unlike previous methods, in which the model is trained directly on noisy signals, the fault diagnosis framework with shared parameters shown in Fig.3 is applied, which consists of offline pre-training and online fine-tuning. The offline stage uses AMSGradP and Data set B to obtain the pre-trained parameters, while the online stage fine-tunes the model for high efficiency. Because the parameters are already close to their optimal values, the training time is cut down and the noise can be quickly smoothed away.
In this study, Gaussian white noise with SNR=-4~6dB is added to the original signals, and AdamW is applied to fine-tune the whole pre-trained model. Other settings are the same. Previous studies have reported that the test accuracy improves continuously as SNR and 𝛼 increase. Therefore, a case with SNR=-4dB and 𝛼=0.1 is applied to examine performance. Results regarding training time and G-mean are shown in Table 7 and Fig.14.
On one hand, as 𝛼 increases, G-mean increases, but so does the time cost. However, the time is reduced by approximately 2/3 compared with models trained without loading the pre-trained parameters. On the other hand, DCA-BiGRU still has the highest diagnostic accuracy. Taking 𝛼=0.3 as an example, the G-means of the four models are 87.10%, 74.18%, 77.09% and 92.72% in turn, where BiGRU improves G-mean by 19.92% and the attention mechanism by 5.62%. All in all, the attention mechanism and BiGRU provide strong robustness and diagnostic efficiency.
Table 7
Time of different 𝛼 under SNR=-4dB
Models       Time/s
𝛼(%)         0.1    0.2    0.3    0.4    0.5
DCNN-BiGRU   16.30  17.44  27.12  43.55  75.82
DCNN         31.67  38.71  54.88  60.88  93.28
DCA          28.68  52.36  75.63  82.83  109.25
DCA-BiGRU    26.01  58.44  68.42  99.94  102.45
Table 8
Evaluation under 𝛼=0.1, Data set B
SNR     -4     -2     0      2      4      6
G-mean  90.72  92.11  94.84  96.20  98.32  98.32
Time    26.08  60.15  11.09  11.01  11.98  11.00
Another random seed is set to further evaluate DCA-BiGRU under 𝛼=0.1 and SNR=-4~6dB, as shown in Table 8. G-mean increases with SNR. In addition, SNR=-4dB and -2dB require more time, probably because larger noise makes learning more difficult and demands more epochs. Furthermore, DCA-BiGRU achieves G-mean>0.9 at all tested SNRs, showing that it has excellent anti-noise performance.
Finally, the changes of an original outer ring fault signal through DCA-BiGRU are shown in Fig.15. As the network deepens, the signal features become more abstract, and diagnosis becomes easier.
4.4. Case 2: Data from University of Connecticut
4.4.1. Description and analysis of data
The data shared by the University of Connecticut is collected from a two-stage gearbox [39,40], where the acquisition device is shown in Fig.17 and the acquisition frequency is 20kHz. The signals are recorded through a dSPACE system (DS1006 processor board, dSPACE Inc.). The frequency range, measurement range and sensitivity of the accelerometer are 0.5Hz-10kHz, ±50g and 100mV/g, respectively. Nine different gear conditions are introduced to the pinion on the input shaft, including healthy condition, missing tooth, root crack, spalling, and chipping tip with five different levels of severity. The time-domain signals of the nine states are shown in Fig.18.
In the original data, a total of 104 samples with 3600 points each are collected for each gearbox state. To facilitate the experiments, all signals of a given state are concatenated into one column, and the training set, verification set and test set are obtained by the sampling method described in Section 4.1.
The label of each of the nine states is 0~8, as shown in Fig.18.
4.4.2. Evaluation under different working conditions
In reality, although the gearbox system is recorded at a fixed sampling rate, the time-domain signals also reflect changes of working conditions caused by speed variations under load disturbance, geometric tolerance, motor control error, etc. Fig.16 shows the accuracy and loss curves of the training set and verification set at 𝛼=0.3, where DCA-BiGRU has an excellent convergence performance. When epoch=93, G-mean=99.37%.
Fig.19 and Fig.20 show the performance and training time of each model as 𝛼 increases. As 𝛼 increases, G-mean on the test set also increases. When 𝛼=0.1, DCA-BiGRU performs best with G-mean=96.34%, compared with 79.76% for DCNN-BiGRU. When 𝛼=0.5, all models except SVM are close to 100%. Overall, when 𝛼<0.3, DCA-BiGRU>DCNN-BiGRU>DCA>DCNN, so only the combination of the attention mechanism and BiGRU achieves the optimal performance. Similarly, the cost of high performance is more training time, which motivates loading the pre-trained model to save training time.
Figure 13: Accuracy and losses of verification set. Panels: (a) Accuracy, (b) Loss. X-axis: epoch (0~150); Y-axis of (b): eval_loss; curves: SGDM, AMSGrad, AdamW, AdaBelief, AdaBound, AdamP, Adam, AMSGradP.
Figure 14: Fault diagnosis based on sharing parameters. X-axis: 𝛼 (0.1~0.5); Y-axis: G-mean; curves: DCNN-BiGRU, DCNN, DCA, DCA-BiGRU.
Figure 15: The signal changes in DCA-BiGRU. Stages shown: Conv1, Conv2, Attention, BiGRU, GAP, SoftMax.
Figure 16: Training and verification performance. X-axis: epoch (0~100); left Y-axis: acc; right Y-axis: loss; curves: acc, eval_acc, loss, eval_loss.
Figure 17: Gearbox system. Labeled parts: motor controller, gearbox, brake, accelerometer, data collection systems.
4.4.3. Visual analysis
To further reveal the feature representations, t-SNE is applied for feature visualization, where different colors denote different states. Comparing Fig.21 and Fig.22 shows that DCNN extracts features preliminarily and each state is further separated by the attention mechanism. BiGRU 2 classifies samples by extracting the hidden features at different positions. Finally, the parameters of the classifier are reduced by GAP.
Figure 18: Vibration signals of nine faults. Panels: Healthy, Missing tooth, Root crack, Spalling, Chipping tip L1~L5. X-axis: sampling length; Y-axis: amplitude.
Figure 19: G-mean in different 𝛼. X-axis: 𝛼 (0.1~0.5); Y-axis: G-mean; curves: PCA-SVM, DCNN-BiGRU, DCNN, DCA, DCA-BiGRU.
Figure 20: Time in different 𝛼. X-axis: 𝛼 (0.1~0.5); Y-axis: Time/s (0~500); one curve per compared model.
BiGRU 1 only takes the output of the last hidden layer. The comparison between BiGRU 1 and BiGRU 2 shows that GAP takes into account the outputs of the neurons in all hidden layers of BiGRU, which makes the separation of fault states more obvious and reduces the training pressure on the diagnosis layer. In conclusion, DCA-BiGRU separates different states better and generalizes well.
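The difference between the two read-out strategies can be sketched on a toy sequence of recurrent outputs. The sketch assumes that "all hidden layers" refers to the BiGRU outputs at every position of the sequence and that GAP averages over that axis; the paper's exact tensor layout is not given in this section, so the function names and the 3x2 example are hypothetical.

```python
def last_step_features(h):
    """Keep only the final output of the recurrent layer (BiGRU 1 style)."""
    return list(h[-1])


def global_average_pooling(h):
    """Average every hidden unit over all positions (GAP style)."""
    steps = len(h)
    return [sum(column) / steps for column in zip(*h)]


# h[t][j]: output of hidden unit j at position t (3 positions, 2 units).
h = [[0.0, 2.0],
     [1.0, 2.0],
     [5.0, 2.0]]
last = last_step_features(h)        # [5.0, 2.0]
pooled = global_average_pooling(h)  # [2.0, 2.0]
```

The pooled vector depends on every position, whereas the last-step vector ignores the earlier outputs; neither read-out adds trainable parameters, so the classifier that follows stays small.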
The visualization of the attention mechanism and BiGRU is shown in Fig.23. The brighter the color, the higher the degree of activation. It is observed that the attention mechanism weighs the importance of each channel of the signals. In addition, BiGRU 2 further separates the dimension-reduced signals and extracts more distinct and refined features. Different fault types have different neuron activation areas, so the corresponding features can be extracted from the original signals without human intervention.
Figure 21: Feature visualization of different layers. Panels, from top to bottom and from left to right: conv1, conv2, attention, GAP.
Figure 22: Visualization of different BiGRU. Panels: BiGRU 1, BiGRU 2.
Figure 23: Weight visualization of attention and BiGRU. Panels: Attention, BiGRU 2.
Grad-CAM++ is a widely applied visualization method. Its basic idea is that the weight of the feature map corresponding to a certain class can be expressed through the gradient, and the global average of the gradient is used to calculate the weight. In addition, ReLU and the gradient weighting coefficients $\alpha_{i}^{kc}$ are added to the feature map weights. Only one back propagation is required to calculate the gradient. The method was originally applied to 2D inputs, and it is adapted here to 1D signals, as shown in Algorithm 2 in the Appendix.
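The basic gradient-weighted idea described above can be sketched for 1D feature maps as follows. This is the plain Grad-CAM weighting (global average of the gradient per channel, followed by ReLU), not the authors' Algorithm 2: Grad-CAM++ replaces the uniform average with position-wise coefficients, which are omitted here. The function name `grad_cam_1d` and the toy inputs are hypothetical, and the gradients are assumed to have been obtained by one backward pass through the network.

```python
def grad_cam_1d(feature_maps, gradients):
    """Class activation map for 1D feature maps.

    feature_maps[k][i]: activation of channel k at position i.
    gradients[k][i]: gradient of the class score with respect to that activation.
    """
    length = len(feature_maps[0])
    # channel weight = global average of the gradient over all positions
    weights = [sum(g) / len(g) for g in gradients]
    cam = [sum(w * fm[i] for w, fm in zip(weights, feature_maps))
           for i in range(length)]
    # ReLU keeps only the positions that support the class
    return [max(0.0, v) for v in cam]


feature_maps = [[1.0, 2.0, 0.0],
                [0.0, 1.0, 3.0]]
gradients = [[0.5, 0.5, 0.5],
             [-1.0, 0.0, -0.5]]
cam = grad_cam_1d(feature_maps, gradients)  # [0.5, 0.5, 0.0]
```

Positions with a high CAM value are those where strongly activated channels also have a positive influence on the class score, which is what the highlighted regions in a CAM plot represent.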
The attention mechanism is explained further by calculating the Class Activation Mapping (CAM) from the convolution kernel feature maps of the attention mechanism, as shown in Fig.24. The higher the color level, the higher the CAM and the stronger the feature distinction. The light blue frames circle the regions with higher CAM. The locations activated by CAM differ between fault types, as do their amplitudes, which demonstrates that the attention mechanism can distinguish the fault types without manual preprocessing. For example, Missing tooth and Spalling have two distinct areas of class activation. Besides, Chipping tip has different activation areas for different damage degrees, and the impact amplitude becomes more distinct as the damage deepens.
Figure 24: Visualization of nine fault states under Grad-CAM++. Panels: Healthy, Missing tooth, Root crack, Spalling, Chipping tip L1~L5.
4.4.4. Anti-noise performance for gearbox
For the gearbox fault, the learning rate is set to 0.0009 because the pre-trained model is loaded. AdamW and the fault diagnosis framework shown in Fig.3 are applied, and other parameters are the same as above.
Under SNR=6dB, the anti-noise capacity of the models under different 𝛼 is calculated, as shown in Fig.25. Besides, the influence of SNR is recorded in Table 9.
In Fig.25, G-mean improves as 𝛼 increases, indicating that the robustness of the models is enhanced. The comparison between DCNN and DCNN-BiGRU shows that BiGRU improves performance by 5.31% when 𝛼=0.1. For DCA-BiGRU and DCNN-BiGRU, when 𝛼=0.3, the attention mechanism increases G-mean by 0.86%.
In addition, comparing the results with and without the pre-trained model shows that loading it not only improves G-mean but also saves training time. The greater the noise, the more obvious the advantage of loading the pre-trained model. For example, at SNR=0dB, G-mean is 85.28% with the pre-trained parameters and 78.43% without them, an increase of 6.85%.
According to the confusion matrices shown in Fig.26, DCNN-BiGRU, whose sensitivity to Chipping tip L1 and L4 is low, misclassifies part of the healthy samples. In contrast, DCA-BiGRU correctly distinguishes healthy and fault samples, but misclassifies some Missing tooth, Chipping tip L2 and L3 samples. In particular, its sensitivity to Chipping tip L3 is low, so effective measures are required to improve performance under noise.
Figure 25: G-mean of models under SNR=6dB. X-axis: 𝛼 (0.1~0.5); Y-axis: G-mean (85~100); curves: DCNN, DCNN-BiGRU, DCA, DCA-BiGRU.
Figure 26: Confusion matrix under SNR=6dB, 𝛼=0.3. Panels: DCNN-BiGRU, DCA-BiGRU.
Table 9
Anti-noise performance of DCA-BiGRU under 𝛼=0.3
Load  Metric  SNR/dB
              0       2       4       6       8       10
Y     G-mean  85.28   93.56   95.26   98.84   99.42   99.42
      Time    84.47   91.23   81.10   90.89   61.42   81.05
N     G-mean  78.43   87.47   93.01   97.96   98.57   99.13
      Time    103.59  131.34  104.74  195.88  232.62  134.38
Load: whether the pre-training model is loaded (Y) or not (N).
4.5. Comparison studies of diagnostic method
Finally, since the rolling bearing data from CWRU is very popular in machinery fault diagnosis research, DCA-BiGRU is compared with the methods listed in Table 10. DCA-BiGRU still reaches 99.73% diagnostic accuracy with no human intervention, a lower 𝛼 and a shorter sampling length, and it improves on DCA-BiLSTM by 0.17%.
Firstly, the number of sampling points affects the diagnosis results: the fewer the sampling points, the fewer shock pulses are contained in one sample. In this paper, one sample contains 1024 points, which is shorter than several of the references listed in Table 10. Furthermore, although references [3] and [22] use fewer sampling points, they apply dimension reduction and feature extraction algorithms, most of which contain hyperparameters. In reference [3], a smart evolution algorithm with high time complexity is adopted to search for suitable hyperparameters, while in reference [22] they are determined by manual experience. Then, regarding the number of training samples, this paper uses a lower proportion and a lower number of training samples. For example, the number of training samples is 60% of that in reference [22]. Finally, DCA-BiGRU achieves a competitive diagnostic result under a harsher experimental environment and higher diagnostic difficulty. In addition, the capsule network also has advantages in small sample fault diagnosis. However, after reproducing literature [23], the capsule network has about 1.2 million parameters, while DCA-BiGRU has about 120 thousand, which means that DCA-BiGRU trains faster and diagnoses more efficiently.
Table 10
Comparison of fault diagnosis of CWRU
Models         Length  Filtering         𝛼        Accuracy
reference[3]   200     MCKD-RCMDE        0.8      99.00%
reference[10]  1200    /                 0.8      98.36%
reference[14]  2000    Wiener filtering  0.7      98.46%
reference[22]  784     Fast Kurtogram    (100)    99.00%
ICN-Capsule    3000    Wavelet           0.83     99.96%
DCA-BiLSTM     1024    /                 0.3(60)  99.56%
Ours           1024    /                 0.3(60)  99.73%
MCKD: Maximum Correlated Kurtosis Deconvolution
RCMDE: Refined Composite Multiscale Dispersion Entropy
5. Conclusion
A novel DCA-BiGRU model based on the attention mechanism has been proposed to identify the health state of equipment under small samples, where the attention mechanism captures the spatial and channel relations of signals. The sensitivities of the attention mechanism and BiGRU to the proportion of the training set are discussed, and a variety of activation functions and gradient descent algorithms have been explored. AMSGradP, 1D-Meta-ACON and other novel technologies are introduced to further improve the capacities of generalization and robustness. Subsequently, DCA-BiGRU based on transfer learning is verified on two different test rigs, the CWRU motor bearing data sets (Case 1) and the University of Connecticut gearbox data sets (Case 2). A variety of visualization means are applied to give an initial view of the working principle of DCA-BiGRU, and the results show that DCA-BiGRU has advantages in diagnostic efficiency under different working conditions for small samples.
The differences between misclassified samples and other samples need to be explored further. In addition, it is difficult for DCA-BiGRU to cope with extremely imbalanced data sets. In the future, machine learning methods such as meta learning, active and cost-sensitive learning, ensemble learning, or domain adaptation and generalization in transfer learning will be combined with the attention mechanism or other structures to address more complicated fault diagnosis situations with small samples and imbalanced data, which is worth further study.
CRediT authorship contribution statement
Xin Zhang: Writing - original draft, Methodology, Analysis, Funding acquisition. Chao He: Software, Validation, Visualization, Investigation. Yanping Lu: Experiment. Biao Chen: Experiment. Le Zhu: Conceptualization, Software. Li Zhang: Supervision, Proofreading, Project administration.
Declaration of competing interest
The authors declare that they have no known competing
financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgments
The authors are grateful for the support of the National Key R&D Program of China (2018YFB1308700).
References
[1] J. Jiao, M. Zhao, J. Lin, K. Liang, A comprehensive review on convo-
lutional neural network in machine fault diagnosis, Neurocomputing
417 (2020) 36–63.
[2] S. Zhang, S. Zhang, B. Wang, T. G. Habetler, Deep learning algorithms for bearing fault diagnostics - A comprehensive review, IEEE Access 8 (2020) 29857–29881.
[3] H. Luo, C. He, J. Zhou, L. Zhang, Rolling Bearing Sub-Health
Recognition via Extreme Learning Machine Based on Deep Belief
Network Optimized by Improved Fireworks, IEEE Access 9 (2021)
42013–42026.
[4] Y. Ke, C. Yao, E. Song, Q. Dong, L. Yang, An early fault diagnosis
method of common-rail injector based on improved CYCBD and
hierarchical fluctuation dispersion entropy, Digit. Signal Process. 114
(2021) 103049.
[5] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, X. Huang, Pre-trained models
for natural language processing: A survey, Sci. China Technol. Sci.
(2020) 1–26.
[6] S. Hao, Y. Zhou, Y. Guo, A brief survey on semantic segmentation
with deep learning, Neurocomputing 406 (2020) 302–321.
[7] K. Tong, Y. Wu, F. Zhou, Recent advances in small object detection
based on deep learning: A review, Image Vis. Comput. 97 (2020)
103910.
[8] G. Algan, I. Ulusoy, Image classification with deep learning in the
presence of noisy labels: A survey, Knowl. Based. Syst. 215 (2021)
106771.
[9] Z. Zhao, T. Li, J. Wu, C. Sun, S. Wang, R. Yan, X. Chen, Deep
learning algorithms for rotating machinery intelligent diagnosis: An
open source benchmark study, ISA Trans. 107 (2020) 224–255.
[10] J. Li, X. Li, D. He, Y. Qu, Unsupervised rotating machinery fault diag-
nosis method based on integrated SAE–DBN and a binary processor,
J. Intell. Manuf. 31 (8) (2020) 1899–1916.
[11] Y. Wang, G. Sun, Q. Jin, Imbalanced sample fault diagnosis of rotat-
ing machinery using conditional variational auto-encoder generative
adversarial network, Appl. Soft Comput. 92 (2020) 106333.
[12] Z. Wang, Y. Dong, W. Liu, Z. Ma, A novel fault diagnosis approach for
chillers based on 1-D convolutional neural network and gated recurrent
unit, Sensors 20 (9) (2020) 2458.
[13] X. Wang, D. Mao, X. Li, Bearing fault diagnosis based on vibro-
acoustic data fusion and 1D-CNN network, Measurement 173 (2021)
108518.
[14] X. Chen, B. Zhang, D. Gao, Bearing fault diagnosis base on multi-
scale CNN and LSTM model, J. Intell. Manuf. 32 (4) (2021) 971–987.
[15] D. Huang, Y. Fu, N. Qin, S. Gao, Fault diagnosis of high-speed train
bogie based on LSTM neural network, Sci. China Inf. Sci. 64 (1)
(2021) 119203.
[16] X. Li, X. Kong, J. Zhang, Z. Hu, C. Shi, A study on fault diagnosis of
bearing pitting under different speed condition based on an improved
inception capsule network, Measurement 181 (2021) 109656.
[17] F. Zhou, S. Yang, H. Fujita, D. Chen, C. Wen, Deep learning fault
diagnosis method based on global optimization GAN for unbalanced
data, Knowl. Based Syst. 187 (2020) 104837.
[18] P. Kumar, A. S. Hati, Deep convolutional neural network based on
adaptive gradient optimizer for fault detection in SCIM, ISA Trans.
111 (2021) 350–359.
[19] A. Zhang, S. Li, Y. Cui, W. Yang, R. Dong, J. Hu, Limited data rolling
bearing fault diagnosis with few-shot learning, IEEE Access 7 (2019)
110895–110904.
[20] C. Wang, Z. Xu, An intelligent fault diagnosis model based on
deep neural network for few-shot fault diagnosis, Neurocomputing
Doi:10.1016/j.neucom.2020.11.070.
[21] J. Wu, Z. Zhao, C. Sun, R. Yan, X. Chen, Few-shot transfer learning
for intelligent fault diagnosis of machine, Measurement 166 (2020)
108202.
[22] S. R. Saufi, Z. A. B. Ahmad, M. S. Leong, M. H. Lim, Gearbox fault
diagnosis using a deep learning model with limited data sample, IEEE
Trans. Ind. Inform. 16 (10) (2020) 6263–6271.
[23] T. Han, R. Ma, J. Zheng, Combination bidirectional long short-term
memory and capsule network for rotating machinery fault diagnosis,
Measurement 176 (2021) 109208.
[24] C. Li, K. Yang, H. Tang, P. Wang, J. Li, Q. He, Fault Diagnosis for
Rolling Bearings of a Freight Train under Limited Fault Data: Few-
Shot Learning Method, J. Transp. Eng. Part A Syst. 147 (8) (2021)
04021041.
[25] W. Zhang, G. Peng, C. Li, Y. Chen, Z. Zhang, A new deep learning
model for fault diagnosis with good anti-noise and domain adaptation
ability on raw vibration signals, Sensors 17 (2) (2017) 425.
[26] K. Zhao, H. Jiang, Z. Wu, T. Lu, A novel transfer learning fault diag-
nosis method based on Manifold Embedded Distribution Alignment
with a little labeled data, J. Intell. Manuf. (2020) 1–15.
[27] Z. Yang, J. Zhang, Z. Zhao, Z. Zhai, X. Chen, Interpreting network
knowledge with attention mechanism for bearing fault diagnosis,
Appl. Soft Comput. 97 (2020) 106829.
[28] T. Zhang, J. Chen, F. Li, K. Zhang, H. Lv, S. He, E. Xu, In-
telligent fault diagnosis of machines with small & imbalanced
data: A state-of-the-art review and possible extensions, ISA Trans.
Doi:10.1016/j.isatra.2021.02.042.
[29] J. Gu, V. Tresp, H. Hu, Capsule Network is Not More Robust than
Convolutional Network, in: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognit (CVPR), 14309–14317,
2021.
[30] T. Huang, Q. Zhang, X. Tang, S. Zhao, X. Lu, A novel fault diagnosis
method based on CNN and LSTM and its application in fault diagnosis
for complex systems, Artif. Intell. Rev. (2021) 1–27.
[31] M. Jalayer, C. Orsenigo, C. Vercellis, Fault detection and diagnosis
for rotating machinery: A model based on convolutional LSTM, Fast
Fourier and continuous wavelet transforms, Comput. Ind. 125 (2021)
103378.
[32] Y. Li, N. Wang, J. Shi, X. Hou, J. Liu, Adaptive batch normalization
for practical domain adaptation, Pattern Recognit. 80 (2018) 109–117.
[33] N. Ma, X. Zhang, M. Liu, J. Sun, Activate or Not: Learning Cus-
tomized Activation, in: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognit (CVPR), 8032–8042, 2021.
[34] B. Heo, S. Chun, S. J. Oh, D. Han, S. Yun, G. Kim, Y. Uh, J.-W.
Ha, AdamP: Slowing Down the Slowdown for Momentum Optimizers
on Scale-invariant Weights, in: International Conference on Learning
Representations (ICLR), 2021.
[35] A. Shewalkar, D. Nyavanandi, S. A. Ludwig, Performance Evaluation
of Deep Neural Networks Applied to Speech Recognition: RNN,
LSTM and GRU, J. Artif. Intell. Soft Comput. Res. 9 (4) (2019) 235–
245.
[36] Y. Dong, Y. Li, H. Zheng, R. Wang, M. Xu, A new dynamic model and
transfer learning based intelligent fault diagnosis framework for rolling
element bearings race faults: Solving the small sample problem, ISA
Trans. Doi:10.1016/j.isatra.2021.03.042.
[37] X. Li, Y. Hu, M. Li, J. Zheng, Fault diagnostics between different type
of components: A transfer learning approach, Appl. Soft Comput. 86
(2020) 105950.
[38] K. A. Loparo, Bearing data center, Case Western Reserve University.
[39] P. Cao, S. Zhang, J. Tang, Preprocessing-Free Gear Fault Diagnosis
Using Small Datasets With Deep Convolutional Neural Network-
Based Transfer Learning, IEEE Access 6 (2018) 26241–26253.
[40] P. Cao, S. Zhang, J. Tang, Gear Fault Data. figshare. Dataset.
Doi:10.6084/m9.figshare.6127874.v1.
Appendix
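Algorithm 1 AMSGradP
Input: learning rate, $\eta > 1$; momentum, $\beta_1, \beta_2 \in (0, 1)$;
critical value, $\delta, \varepsilon > 0$; time step, $t$; step size, $\alpha$;
Output: Resulting parameter, $w_t$;
1: for $w_t$ not converged do
2:   $g_t \leftarrow \nabla_w f_t(w_t)$
3:   $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t$
4:   $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
5:   $\hat{v}_t \leftarrow \max(\hat{v}_{t-1}, v_t)$ and $\hat{V}_t \leftarrow \mathrm{diag}(\hat{v}_t)$
6:   $p_t \leftarrow m_t / (\sqrt{\hat{v}_t} + \varepsilon)$
7:   if $\cos(w_t, g_t) < \delta / \sqrt{\dim(w)}$ then
8:     $q_t = \Pi_{w_t}(p_t)$
9:   else
10:    $q_t = p_t$
11:  end if
12:  $w_t \leftarrow w_{t-1} - \alpha q_t$
13: end for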
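The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the usual AMSGrad running maximum of the second moment, a square root in the denominator of the update direction, the absolute cosine similarity in the projection test, and a projection that removes the component of the update parallel to the weights, as in the AdamP paper [34]. The names `init_state` and `amsgradp_step` and the default hyperparameter values are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def init_state(n):
    """Zero-initialised moments for a weight vector of length n (hypothetical helper)."""
    return {"m": [0.0] * n, "v": [0.0] * n, "v_hat": [0.0] * n}


def amsgradp_step(w, g, state, alpha=1e-3, beta1=0.9, beta2=0.999,
                  delta=0.1, eps=1e-8):
    """One update of a flat weight vector w given its gradient g (steps 3 to 12)."""
    # Steps 3-4: exponential moving averages of the gradient and its square.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(state["m"], g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(state["v"], g)]
    # Step 5: AMSGrad keeps the element-wise maximum of the second moment.
    v_hat = [max(old, new) for old, new in zip(state["v_hat"], v)]
    # Step 6: raw update direction (no bias correction, as in the listing).
    p = [mi / (math.sqrt(vh) + eps) for mi, vh in zip(m, v_hat)]

    # Step 7: absolute cosine similarity between weights and gradient (assumption).
    w_norm = math.sqrt(_dot(w, w))
    g_norm = math.sqrt(_dot(g, g))
    cos = abs(_dot(w, g)) / (w_norm * g_norm) if w_norm > 0 and g_norm > 0 else 1.0

    if cos < delta / math.sqrt(len(w)):
        # Step 8: project p onto the plane orthogonal to w.
        scale = _dot(w, p) / (w_norm * w_norm)
        q = [pi - scale * wi for pi, wi in zip(p, w)]
    else:
        # Step 10: keep the unprojected direction.
        q = p

    # Step 12: parameter update with step size alpha.
    new_w = [wi - alpha * qi for wi, qi in zip(w, q)]
    state.update(m=m, v=v, v_hat=v_hat)
    return new_w
```

When the gradient is nearly orthogonal to the weights, the projection removes the radial part of the step, so the update does not inflate the weight norm. Otherwise the step reduces to a plain AMSGrad update.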
Algorithm 2 1D-Grad-CAM++
Input: signal, $x$; category weight, $y^c_{att}$; feature map, $A^k_{att}$;
Output: heatmap;
1: $grad \leftarrow \partial y^c_{att} / \partial A^k_{att}$
2: $a^{kc}_i \leftarrow grad^2 / (2\,grad^2 + \sum_i A^k_{att}\,grad^3)$
3: if $grad > 0$ then
4:   $weight \leftarrow grad \times a^{kc}_i$
5: else
6:   $weight \leftarrow 0$
7: end if
8: $weight$.size $\leftarrow x$.size by linear interpolation
9: MinMaxScaler($weight$)
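The steps of Algorithm 2 can be illustrated for a single feature map with the short sketch below. It is a hedged reading of the listing, not the authors' code: the gradient of the class score with respect to the feature map is assumed to be given, the sum in step 2 is taken over all positions of that feature map, the interpolation aligns the two end points, and how several feature maps are combined is left out because the listing does not specify it. The function names are hypothetical.

```python
def linear_resize(values, new_len):
    """Resample a 1-D list to new_len points by end-point-aligned linear interpolation."""
    if new_len == 1 or len(values) == 1:
        return [values[0]] * new_len
    step = (len(values) - 1) / (new_len - 1)
    out = []
    for j in range(new_len):
        pos = j * step
        lo = min(int(pos), len(values) - 2)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


def min_max_scale(values):
    """Scale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def grad_cam_pp_1d(feature_map, grad, signal_len):
    """Heatmap over the input signal for one feature map (steps 2 to 9).

    feature_map: activations A^k at each position.
    grad: d y^c / d A^k at each position (step 1, assumed precomputed).
    """
    total_activation = sum(feature_map)  # sum over positions of A^k
    weight = []
    for g in grad:
        # Step 2: per-position coefficient a_i^{kc}.
        denom = 2.0 * g ** 2 + total_activation * g ** 3
        alpha = g ** 2 / denom if denom != 0.0 else 0.0
        # Steps 3-7: keep only positions with a positive gradient.
        weight.append(g * alpha if g > 0 else 0.0)
    # Step 8: stretch the weights to the length of the input signal.
    resized = linear_resize(weight, signal_len)
    # Step 9: normalise to [0, 1].
    return min_max_scale(resized)
```

For example, `grad_cam_pp_1d([1.0, 2.0, 1.0], [0.5, -0.2, 1.0], 5)` returns five values in [0, 1], with the value 0 at the centre position, where the gradient is negative.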
... Ren et al [22] applied CNNs-LSTM for FD in nuclear power plants and achieved an identification accuracy of more than 99%. Zhang et al [23] designed a 1DCNN with an attention mechanism to extract the hidden features first, and the extracted features are fed into a BiGRU (DCA-BiGRU) for FD. We give a table enumerating classic deep learning-based FD methods, as shown in table 1. ...
... We performed a comparative analysis of our approach and several prominent methods, including WDCNN [12], MSCNN [14], CNNs-LSTM [22], DCA-BiGRU [23] (Framework of 1DCNN, attention mechanism, and BiGRU), and QCNN [13], on three datasets to illustrate its effectiveness. For each method, we perform training and testing strictly according to the parameters given in its paper, and the specific structural parameters are shown in table 2. The data of this experiment are shown in table 3, where Acc refers to the accuracy of the model on the corresponding data set, Mean Acc is its average accuracy on the three data sets, and the number of parameters (Params) of the model is also calculated. ...
... Among the three data sets, it has the highest average accuracy for MFPT (100%) and Ottawa (99.21%). For CWRU data, its performance is slightly inferior to that of DCA-BiGRU [23]. We compute the mean accuracy over the three data sets for each method, and we determine that the proposed approach is exceptionally efficient, boasting an average accuracy rate of 99.73%. ...
In recent years, deep learning has led to significant advances in bearing fault diagnosis (FD). Most techniques aim to achieve greater accuracy. However, they are sensitive to noise and lack robustness, resulting in insufficient domain adaptation and anti-noise ability. A comparison of studies reveals that giving equal attention to all features does not differentiate their significance. In this work, we propose a novel FD model by integrating multi-scale quaternion convolutional neural network (MQCNN), bidirectional gated recurrent unit (BiGRU), and cross self-attention feature fusion (CSAFF). We have developed innovative designs in two modules, namely MQCNN and CSAFF. Firstly, MQCNN applies quaternion convolution to multi-scale architecture for the first time, aiming to extract the rich hidden features of the original signal from multiple scales. Then, the extracted multi-scale information is input into CSAFF for feature fusion, where CSAFF innovatively incorporates cross self-attention mechanism to enhance discriminative interaction representation within features. Finally, BiGRU captures temporal dependencies while a softmax layer is employed for fault classification, achieving accurate FD. To assess the efficacy of our approach, we experiment on three public datasets (CWRU, MFPT, and Ottawa) and compare it with other excellent methods. The results confirm its state-of-the-art performance, with average accuracies of up to 99.99%, 100%, and 99.21% on the CWRU, MFPT, and Ottawa datasets, respectively. Moreover, we perform practical tests and ablation experiments to validate the efficacy and robustness of the proposed approach. Code is available at https://github.com/mubai011/MQCCAF.
... Supervising the health status of bearings is crucial for maintaining uninterrupted and efficient production. During prolonged operation, factors like material degradation, varying loads, and environmental conditions accelerate bearing deterioration [1]. This deterioration not only affects financial outcomes but also leads to ...
... Two convolutional paths of different depths are designed to capture high and low frequency features of the signal, respectively. Designing larger convolution kernels is advantageous for enhancing robustness, while the design of deeper, smaller convolution kernels is effective for extracting abstract features [1]. In one path, two big convolutional kernels are used to extract global features. ...
The advancement of deep transfer learning has motivated research into the realization of intelligent fault diagnosis schemes for rolling bearing. Nevertheless, existing research rarely provides further insight into the importance of statistical distance metric-based methods and adversarial learning-based methods in domain adaptation, and the commonly used feature extractors are more difficult to extract features suitable for domain transformation. In this paper, a dynamic fusion of statistical metric and adversarial learning for domain adaptation network is proposed to achieve a dynamic measure of the importance of different domain adaptation methods. This new model utilizes a local maximum mean discrepancy metric to adjust the conditional distribution and adversarial training to adjust the marginal distribution between domains. Meanwhile, to assess the importance of the two distributions, a dynamic adaptation factor is introduced for dynamic evaluation. In addition, to extract features that are more suitable for domain transformation, the model incorporates a dual depth convolutional path with an attention mechanism as a feature extractor, enabling multi-scale feature extraction. Experimental results demonstrate the model’s superior generalization capability and robustness, enabling effective cross-domain fault diagnosis in diverse scenarios.
... In recent years, with the wide application of big data and sensor technology in machinery, a large amount of data on the working status of equipment has been generated [5][6][7][8]. These data not only record the working status of the equipment, but also bring opportunities for fault diagnosis by analyzing the status information of the devices under working conditions to ensure their safe operation [9][10][11][12][13]. ...
Mechanical equipment functioning in intricate surroundings is prone to malfunctions, which can lead to accidents and significant financial losses. A key component of machinery health management, fault diagnosis creates a connection between equipment state and health data monitoring. This paper presents a second-level sequencing meta-learning approach to tackle the constraints of insufficient fault data and cross-dataset issues, which might impair the accuracy of intelligent fault detection models under normal operating situations. By utilizing the Model-Agnostic Meta-Learning (MAML) core, this technique effectively addresses the problem of limited sample size. Second-level sequencing is implemented for cross-dataset fault diagnostics. The experimental findings, using Paderborn University bearing datasets and University of Connecticut gear datasets, demonstrate the superiority of the suggested Second-Level Sequencing Meta-Learning (SSML) model. SSML demonstrates superior performance compared to other models, with a 95.1% accuracy rate for bearing datasets and a 97.0% accuracy rate for gear datasets. This makes it very useful for diagnosing faults in complicated situations with limited data samples and across different datasets. The importance of sequencing in improving model stability and attaining high accuracy across datasets is emphasized in the study.
... Currently, DPRS failures primarily manifest in two ways: thread contact under load, leading to plastic deformation and permanent impressions on thread surfaces, and fatigue pitting after certain stress cycles. These failures collectively degrade transmission quality, cause vibrations [5] and noise [6], and, in severe cases, lead to jamming and product failure [7]. ...
This study introduces an innovative condition monitoring test rig for the differential planetary roller screws (DPRS), focusing on their performance in extreme conditions. It underscores the importance of anti-jamming and real-time monitoring for the DPRS reliability. The research presents a dynamic friction model using the Lagrange method, enhancing understanding of DPRS operations. Advanced signal processing techniques, including discrete wavelet transforms and a convolutional neural network, are implemented for effective feature extraction. A DWTC-BiGRU network is utilized to capture temporal dependencies, vital for monitoring DPRS under varying conditions. Experimental validation is conducted under thermal stress and load variations, demonstrating the system's durability and reliability. The study compares its method with existing algorithms, showing superior accuracy and robustness by combining mechanical modeling with computational techniques for real-time industrial monitoring. The dataset is publicly available at GitHub-haomjc/HealthMoni.
... In addition to generating a substantial amount of synthetic data, some scholars have provided similar working conditions and equipment data for comparative diagnostic purposes. Zhang et al [15] introduced the attention mechanism into a deep neural network and proposed a dual-path convolution model with an attention mechanism and bidirectional gated recurrent unit, which significantly enhanced the accuracy and robustness of fault diagnosis in small sample scenarios. Su et al [16] presented an innovative data reconstruction hierarchical recurrent metalearning approach to address the challenge of fault diagnosis across varying working conditions in a limited sample setting. ...
Obtaining a substantial number of actual samples for rotating machinery in an industrial setting can be challenging, particularly when faulty samples are acquired under hazardous working conditions. The issue of insufficient samples hinders the effective training of reliable fault diagnosis models, impeding the industrial implementation of advanced intelligent methods. This study proposes an innovative dynamic simulation-assisted Gaussian mixture alignment model (DSGMA) to address the challenge of applying fault diagnosis technologies, with its performance mined by advanced transfer algorithms. Specifically, we establish a fault dynamics model for rotating machinery and acquire a substantial amount of simulated data as the source domain to facilitate the training of the deep neural network model. Subsequently, we propose a Gaussian mixture-guided domain alignment approach that assigns a domain-independent Gaussian distribution to each category as prior knowledge, with the parameters calculated using limited actual samples. Diagnostic knowledge is transferred from the source domain to the target domain by minimizing the Kullback–Leibler divergence between the features of the simulated samples and the Gaussian mixture priors. Furthermore, the DSGMA model incorporates Gaussian clustering loss to augment the clustering capability of samples belonging to the same category from real devices and enhances the computational stability of the parameters in the Gaussian mixture model. The efficacy of the DSGMA method is validated using three publicly available datasets and compared against five widely adopted methods. The experimental findings illustrate that DSGMA exhibits superior diagnostic and robust capabilities, facilitating efficient fault diagnosis under scenarios of small samples.
... For example, the LSTM model has a high computational complexity, requires more parameters, and suffers from the problems of gradient vanishing or gradient explosion during training. Therefore, Zhang et al [24] constructed a CNN with an AM and BGRU (CAM-BGRU) and applied it to small-sample fault diagnosis of bearings. The application of two sets of data demonstrated that the CAM-BGRU model has better noise resistance and that AM can focus on the fault features. ...
Conventional convolutional neural networks (CNNs) predominantly emphasize spatial features of signals and often fall short in prioritizing sequential features. As the number of layers increases, they are prone to issues such as vanishing or exploding gradients, leading to training instability and subsequent erratic fluctuations in loss values and recognition rates. To address this issue, a novel hybrid model, termed one-dimensional (1D) residual network with attention mechanism and bidirectional gated recurrent unit (BGRU) is developed for rotating machinery fault classification. First, a novel 1D residual network with optimized structure is constructed to obtain spatial features and mitigate the gradient vanishing or exploding. Second, the attention mechanism (AM) is designed to catch important impact characteristics for fault samples. Next, temporal features are mined through the BGRU. Finally, feature information is summarized through global average pooling, and the fully connected layer is utilized to output the final classification result for rotating machinery fault diagnosis. The developed technique which is tested on one set of planetary gear data and three different sets of bearing data, has achieved classification accuracy of 98.5%, 100%, 100%, and 100%, respectively. Compared with other methods, including CNN, CNN-BGRU, CNN-AM, and CNN with an AM-BGRU, the proposed technique has the highest recognition rate and stable diagnostic performance.
Introduction: Blackheart is one of the most common physiological diseases in potatoes during storage. In the initial stage, black spots only occur in tissues near the potato core and cannot be detected from the outward appearance. If not identified and removed in time, the disease will seriously undermine the quality and sale of the entire batch of potatoes. There is an urgent need to develop a method for early detection of blackheart in potatoes. Methods: This paper used visible-near infrared (Vis/NIR) spectroscopy to conduct online discriminant analysis on potatoes with varying degrees of blackheart and healthy potatoes to achieve real-time detection. An efficient and lightweight detection model was developed for detecting different degrees of blackheart in potatoes by introducing the depthwise convolution, pointwise convolution, and efficient channel attention modules into the ResNet model. Two discriminative models, the support vector machine (SVM) and the ResNet model, were compared with the modified ResNet model. Results and discussion: The prediction accuracy for blackheart and healthy potatoes test sets reached 0.971 using the original spectrum combined with a modified ResNet model. Moreover, the modified ResNet model significantly reduced the number of parameters to 1434052, achieving a substantial 62.71% reduction in model complexity. Meanwhile, its performance was evidenced by a 4.18% improvement in accuracy. The Grad-CAM++ visualizations provided a qualitative assessment of the model’s focus across different severity grades of blackheart condition, highlighting the importance of different wavelengths in the analysis. In these visualizations, the most significant features were predominantly found in the 650–750 nm range, with a notable peak near 700 nm. This peak was speculated to be associated with the vibrational activities of the C-H bond, specifically the fourth overtone of the C-H functional group, within the molecular structure of the potato components.
This research demonstrated that the modified ResNet model combined with Vis/NIR could assist in the detection of different degrees of blackheart in potatoes.
Rotating machinery is advancing in the direction of high efficiency, high rotary speed, enhanced automation, and widespread application with the quickening growth of intelligent manufacturing. However, in the real operation process, it will inevitably incur wear, corrosion, fracture and other phenomena due to many negative factors such as vibration, impact, unsuitable lubrication and long-term abnormal usage. First, the review and succinct analysis are undertaken to aid in understanding condition monitoring and fault diagnosis of bearings. Then, this review examines identification, monitoring, categorization, and diagnostic procedures and illustrates how bearings’ geometrical tolerance and form profile are sensitive to failure. Therefore, a number of strategies, including artificial intelligence (AI) and traditional diagnosis methods are explored. The upcoming digital twin and AI technologies are also introduced and compared. Finally, by evaluating the current state of condition monitoring and fault diagnosis in industrial applications, future technical trends are predicted, and unresolved concerns are emphasized.
Fault diagnosis plays an important role in actual production activities. As large amounts of data can be collected efficiently and economically, data-driven methods based on deep learning have achieved remarkable results of fault diagnosis of complex systems due to their superiority in feature extraction. However, existing techniques rarely consider time delay of occurrence of faults, which affects the performance of fault diagnosis. In this paper, by synthetically considering feature extraction and time delay of occurrence of faults, we propose a novel fault diagnosis method that consists of two parts, namely, sliding window processing and CNN-LSTM model based on a combination of Convolutional Neural Network (CNN) and Long Short-Term Memory Network (LSTM). Firstly, samples obtained from multivariate time series by the sliding window processing integrates feature information and time delay information. Then, the obtained samples are fed into the proposed CNN-LSTM model including CNN layers and LSTM layers. The CNN layers perform feature learning without relying on prior knowledge. Time delay information is captured with the use of the LSTM layers. The fault diagnosis of the Tennessee Eastman chemical process is addressed, and it is verified that the predictive accuracy and noise sensitivity of fault diagnosis can be greatly improved when the proposed method is applied. Comparisons with five existing fault diagnosis methods show the superiority of the proposed method.
Rolling bearings, as the main components of large industrial rotating equipment, usually work under complex conditions and are prone to breaking down. Analyzing incipient weak signals can provide a theoretical basis for identifying the sub-health state of industrial equipment. Thus, a sub-health recognition offline algorithm based on Refined Composite Multiscale Dispersion Entropy (RCMDE) and Deep Belief Network-Extreme Learning Machine (DBN-ELM) optimized by Improved Firework Algorithm (IFWA) is proposed. First of all, because exploding fireworks in the Firework Algorithm (FWA) easily fall into local optima and cross the boundary, Cauchy mutation and an adaptive dynamic explosion radius factor coefficient are introduced into IFWA. Secondly, Maximum Correlation Kurtosis Deconvolution (MCKD) with optimized parameters is used to process the nonlinear, nonstationary incipient vibration signals, and IFWA is used to adaptively adjust the period T and the filter length L in MCKD (IFWA-MCKD). Then, the RCMDE feature is extracted from each signal sequence to enrich sample diversity. Finally, by combining the powerful unsupervised learning capability of DBN and the generalization capability of ELM, DBN-ELM is established. Moreover, in order to avoid human interference with the parameters, IFWA is used to optimize the number of hidden nodes in DBN-ELM, and the IFWA-DBN-ELM is established. The results show that the algorithm has higher sub-health recognition accuracy, better robustness and generalization, and a better industrial application prospect.
The research on intelligent fault diagnosis has yielded remarkable achievements based on artificial intelligence-related technologies. In engineering scenarios, machines usually work in a normal condition, which means limited fault data can be collected. Intelligent fault diagnosis with small & imbalanced data (S&I-IFD), which refers to building intelligent diagnosis models using limited machine faulty samples to achieve accurate fault identification, has been attracting the attention of researchers. Nowadays, the research on S&I-IFD has achieved fruitful results, but a review of the latest achievements is still lacking, and the future research directions are not clear enough. To address this, we review the research results on S&I-IFD and provide some future perspectives in this paper. The existing research results are divided into three categories: the data augmentation-based, the feature learning-based, and the classifier design-based. Data augmentation-based strategy improves the performance of diagnosis models by augmenting training data. Feature learning-based strategy identifies faults accurately by extracting features from small & imbalanced data. Classifier design-based strategy achieves high diagnosis accuracy by constructing classifiers suitable for small & imbalanced data. Finally, this paper points out the research challenges faced by S&I-IFD and provides some directions that may bring breakthroughs, including meta-learning and zero-shot learning.
In recent years, many machine learning-based methods have emerged to detect faulty bearings. However, most of these methods may not be practical due to the need to collect a large number of fault samples for training. This paper developed a novel few-shot learning framework for the fault diagnosis of freight train rolling bearings. The proposed method has the capability to transfer the learning outcome from one bearing fault diagnosis model to another different but related task for which very limited training data are available. The authors established a single-wheelset platform to collect acceleration signals of different types of bearing faults. The authors preprocessed the data through data segmentation and frequency domain transformation, and divided the data into training and test sets according to a certain ratio. A one-dimensional convolutional neural network (1D-CNN) was established to automatically extract the features of the bearing vibration signals and classify the fault types. The authors implemented two few-shot learning methods through parameter fine-tuning and a conditional Wasserstein generative adversarial network (C-WGAN). A case study demonstrated the classification performance of the proposed models. The results showed that the diagnosis capability of the 1D-CNN in the frequency domain is significantly superior to that in the time domain. However, when the amount of data is small, the 1D-CNN model does not work. In contrast, the few-shot learning of bearing faults works well for both the fine-tuned CNN and C-WGAN models. Furthermore, the classification performance of the C-WGAN is better than that of the fine-tuned CNN when the training data are extremely limited.
Bearing fault diagnosis is essential for the maintenance and reliability of rotating machinery. Bearing pitting is one of the most common fault types of rotating machinery. However, due to the complex working conditions of bearings, it is challenging to diagnose the pitting faults in bearing inner and outer rings at different speeds. In this paper, an improved one-dimensional inception capsule network (IICN) is proposed to solve the problem of bearing pitting fault diagnosis under complex working conditions. Firstly, the raw bearing vibration signal is processed using the improved Inception network. The function of this stage is to approximate an optimal local sparse structure with a simple dense substructure for bearing health state feature extraction. The concatenated features are then input to the primary capsule layer and the routing capsule layer. The inputs are mapped to feature vector space and weighted by the dynamic routing algorithm. The dynamic routing algorithm encodes the significant spatial relationship between low-level features and upper-level features. The Euclidean length of each capsule vector is the probability of belonging to the corresponding bearing health condition. In order to validate the effectiveness of the IICN method, bearing pitting experiments at different speeds were designed. The raw bearing vibration signal data under six different health conditions are collected, and the effectiveness of the IICN method is verified. Experimental results show that the IICN method can effectively distinguish different degrees of bearing pitting fault at different speeds, and its diagnostic accuracy is superior to other advanced deep learning methods.
Intelligent fault diagnosis of rolling element bearings has gained increasing attention in recent years due to the promising development of artificial intelligence technology. Many intelligent diagnosis methods work well but require massive historical data of the diagnosed object. However, it is hard to get sufficient fault data in advance in a real diagnosis scenario, and a diagnosis model constructed on such a small dataset suffers from serious overfitting and loses the ability to generalize, which is described as the small sample problem in this paper. Focusing on the small sample problem, this paper proposes a new intelligent fault diagnosis framework based on dynamic model and transfer learning for rolling element bearing race faults. In the proposed framework, a dynamic model of the bearing is utilized to generate massive and various simulation data, then the diagnosis knowledge learned from simulation data is transferred to the real scenario based on convolutional neural network (CNN) and parameter transfer strategies. The effectiveness of the proposed method is verified and discussed based on three fault diagnosis cases in detail. The results show that based on the simulation data and parameter transfer strategies in CNN, the proposed method can learn more transferable features and reduce the feature distribution discrepancy, contributing to enhancing the fault identification performance significantly.
Article
Early fault diagnosis of common rail injectors is essential to reduce diesel engine testing and maintenance costs. Therefore, this paper proposes a new early fault diagnosis method for common rail injectors, which combines Maximum Second-order Cyclostationary Blind Deconvolution (CYCBD) optimized by the Seagull Optimization Algorithm (SOA) with Hierarchical Fluctuation Dispersion Entropy (HFDE). First, SOA is used to adaptively seek the optimal filter length of CYCBD, and the optimized CYCBD is used to filter the fuel pressure signal of the high-pressure fuel pipe. Then, to compensate for the fact that Multi-scale Fluctuation Dispersion Entropy (MFDE) ignores high-frequency component information, this paper proposes HFDE to extract the fault characteristics after filtering. Finally, the fault characteristics are input into Least Squares Support Vector Machines (LSSVM) for classification and recognition. Analysis of experimental data shows that the proposed method can effectively identify the early failure state of the common rail injector. Compared with existing methods, the proposed method has a higher fault recognition rate.
Article
When deep learning is applied to fault diagnosis, its recognition accuracy is limited by the size and quality of the training samples, for example by small sample sizes, low signal-to-noise ratios and varying working conditions. To address these problems, a novel fault classification method is proposed based on a Bidirectional Long Short-Term Memory (Bi-LSTM) network and a Capsule Network with a convolutional neural network (BLC-CNN). The Bi-LSTM is used to denoise and fuse the features extracted by the CNN. Fault diagnosis with insufficient training samples is carried out by the capsule network. The influence of sample size on the method is discussed in detail. The effectiveness and superiority of the proposed method are validated by analyzing bearing and gear data collected under different working conditions and noise levels. The results indicate that the proposed method performs well and is robust to noise.