Appl Intell
DOI 10.1007/s10489-015-0725-3

Hierarchical feature selection based on relative dependency for gear fault diagnosis

Mariela Cerrada 1,2 · René-Vinicio Sánchez 2 · Fannia Pacheco 2 · Diego Cabrera 2 · Grover Zurita 2 · Chuan Li 2,3

© Springer Science+Business Media New York 2015
Abstract Feature selection is an important aspect under study in machine learning based diagnosis; it aims to remove irrelevant features so that the diagnostic system reaches good performance. The behaviour of diagnostic models can be sensitive to the number of features, and a set of significant features can represent the problem better than the entire set. Consequently, algorithms that identify these features are valuable contributions. This work deals with the feature selection problem through attribute clustering. The proposed algorithm is inspired by existing approaches, where the relative dependency between attributes is used to calculate dissimilarity values. The centroids of the created clusters are selected as representative attributes. The selection algorithm uses a random process for proposing centroid candidates; in this way, the exploration inherent in random search is included. A hierarchical procedure is proposed for implementing this algorithm. In each level of the hierarchy, the entire set of available attributes is split into disjoint sets and the selection process is applied on each subset. Once the significant attributes are proposed for each subset, a new set of available attributes is created and the selection process runs again in the next level. The hierarchical implementation aims to refine the search space in each level on a reduced set of selected attributes, while the computational time consumption is also improved. The approach is tested with real data collected from a test bed; results show that the diagnosis precision by using a Random Forest based classifier is over 98 % with only 12 % of the attributes from the available set.

Mariela Cerrada
cerradam@ula.ve

1 Control Systems Department, Universidad de Los Andes, Mérida, Venezuela
2 Mechanical Engineering Department, Universidad Politécnica Salesiana, Cuenca, Ecuador
3 Research Center of System Health Maintenance, Chongqing Technology and Business University, Chongqing, China
Keywords Feature selection · Attribute clustering · Rough sets · Relative dependency · Gear fault diagnosis
1 Introduction
Most industrial environments demand the continuous operation of rotating machinery. Diagnostic systems are designed to support this requirement and increase the availability of the industrial processes. Contributions to build fault diagnostic systems with accuracy, reliability and adequate computational complexity are highly valuable. In machine learning applications for data-driven diagnosis of rotating machinery, the inputs of the classifier are frequently condition parameters that are extracted from signals such as vibration and noise, among others; these parameters are defined on the time, frequency and time-frequency domains, in order to enhance the data that are processed by the diagnostic algorithms. Taking into account the number of available parameters as feature candidates for fault diagnosis, there are two ways to address the diagnoser design: one is to develop more complex classifiers that can deal with the high dimensionality of the feature vector, and the second is to deal with dimensionality reduction through feature selection. In feature selection, the problem is the identification of the significant features that allow keeping the classification precision. Furthermore, feature selection can help discover relevant features for a particular application.
Feature selection in fault diagnosis of rotating machinery is still an open research field. Each failure mode can affect the condition parameters in a different way and, on the other hand, in the case of incipient failures, the best condition parameters providing good diagnostic information are not clearly identifiable. In most cases, feature selection has been treated as a dimensionality reduction problem where several techniques are used, such as Principal Component Analysis, Multidimensional Scaling, Factor Analysis, Projection Pursuit, Kernel Fisher Discriminant Analysis, Data Fusion, and other linear and non-linear techniques [18, 19, 23, 25]. However, these types of techniques commonly create artificial features with lower dimension than the original set; consequently, these reduced features lack physical meaning. Some efforts have been made in [1] to develop dimensionality reduction techniques that find the best subset of the original features, by using multivariate linear regression and variable shrinkage.
In machine learning based fault diagnosis, feature selection aims to remove irrelevant features; wrapper, filtering or embedded methods can be used for this purpose [8]. The computational burden of the selection algorithm may be high for analysing the entire feature space in wrapper approaches. This is the case when classic Genetic Algorithms (GA) are used for dimensionality reduction [36]. However, efforts for integrating some techniques into wrapper approaches are described in [22], and more recently in [2, 6, 44]. In filtering approaches, the data characteristics are analysed to rank the attributes by means of certain criteria; this is a data treatment prior to the learning phase. Several works using filtering approaches have been reported, e.g., in [40], where a matrix factorization formalism is used. The work in [24] proposes a measure based on cosine similarity to estimate within-class and between-class separability; thereafter, it uses a sequential backward search strategy for ranking the features. In [21], a global geometric model and a similarity measure are combined to filter disjoint feature subsets. The similarity between the subsets and the original set is evaluated, and the subset with the best similarity is selected.
In particular, this work is interested in Rough Set (RS) theory based approaches. The main idea behind the use of this theory is to find equivalence classes based on the concepts of indiscernibility, reduct, lower and upper approximations, and dependency degree [9, 16, 37]. Equivalence classes are used in Qin et al. [33] for attribute clustering; the authors also use soft set theory to build a soft model that is defined on equivalence classes, and this model is applied to obtain approximate sets. Qin et al. [33] propose the computation of the dependency degree through the cardinality of the lower approximation for each attribute; this is obtained after processing the tabular representation of the soft sets, for all attributes. The attributes with high dependency degrees define attribute clusters.
In [26], an unsupervised fuzzy-rough feature selection algorithm is presented, based on the concept of dependency between attributes. The dependency degree is determined by a fuzzy-rough measure or a fuzzy discernibility, in order to eliminate irrelevant features. This is achieved by substituting the decision feature of the supervised approach by the features that are being evaluated; the decision feature is the class in classification problems. The procedure aims to find the set of attributes $Q$ that depends on a set of attributes $P$, that is, whether all the values of the attributes in $Q$ are determined by the values of the attributes in $P$. The algorithm performs an exhaustive search because a measure denoted as $M(R, a)$ must be calculated for each attribute $a \in A$, where $R = A - \{a\}$ and $A$ is the entire set of attributes. The measure $M(R, a)$ can be either a measure of the boundary region, or of the discernibility [26].
The work in [10] proposes a Granular Neural Network (GNN) based algorithm for feature selection which uses Fuzzy Rough Sets (FRS) to calculate the initial weights of the network; a granulation structure is needed to build groups of patterns by using similarity measures from fuzzy implications. The input and target vectors for the GNN are obtained from the mean matrix associated with each group. FRS concepts are used to calculate the lower approximation and the positive region, and thereafter the dependency degree of each attribute is computed with regard to the decision attribute. Finally, a three-layer architecture is used to perform the feature selection by minimizing a feature evaluation index with respect to the connection weights between the nodes in the hidden layer and the output layer.
The approach in [9] proposes the Expectation-Maximization (EM) algorithm to create attribute clusters; lower and upper approximations are calculated for each cluster to determine the dependency degree. Then, the classic Quick Reduct (QR) algorithm is executed. In [15], an algorithm using Rough Set based Particle Swarm Optimization Quick Reduct (RS-PSO-QR) is proposed to deal with the problem of the exhaustive search for the attributes with the best dependency degree.
RS theory has been used for feature selection in a few recent works on fault diagnosis of rotating machinery. In [38], decision trees and RS-based methods are compared as feature selectors; however, these methods have been applied in a supervised environment by taking into account the decision variables. In the case of decision tree methods, the variables in each node are selected as the important variables; in RS methods, feature selection is performed by constructing reducts from the discernibility matrix. A reduct is the minimal set of attributes for which the knowledge regarding the decision variable is retained. A similar approach is presented in [34] for fault diagnosis of gearboxes. An RS-based GA approach is used in [35] for selecting the best input features to reduce the computational burden in gear fault identification; the QR algorithm is implemented with GA in order to quickly determine the ranking of the attributes based on the high dependency degree. In [45], the concept of Kernel Neighbourhood Rough Sets (KNRS) is used to deal directly with numerical data. A kernel method is combined with neighbourhood rough sets to map the fault data of rolling bearings to a high-dimensional feature space. The radius of the hypersphere in the feature space is the neighbour value that is used for calculating lower and upper approximations of the decision variable, with respect to the set of attributes.
The common characteristic of the previous works is that they operate in a wrapper or supervised environment; thus, feature selection is strongly linked with the decision variables. This is not the case in real applications for fault diagnosis, where not all the fault labels are known. The literature review has shown that there are few recent works using RS theory in an unsupervised environment. Besides the work in [26], the work in [14] proposes an unsupervised algorithm to create attribute clusters and select the best representative attribute.
In this work, feature selection is developed from the unsupervised algorithm in [14], due to its ability to find representative features from 'reducts' that define attribute clusters. In our approach, the algorithm in [14] has been slightly modified to include a random search on the feature space, and a hierarchical approach is proposed for implementing the algorithm with low computational time consumption. In each level of the hierarchy, the entire set of available features, or attributes, is split into disjoint subsets and the selection algorithm is applied on each subset. After selecting a set of representative attributes for each subset, a new aggregate set of attributes is created and the selection process is performed again in the next level on a reduced attribute space. This decomposition is repeated until reaching the hierarchy level where a reduced set of feature candidates is obtained, according to the stop criteria. Besides reducing the time consumption, random selection and hierarchical decomposition both help to improve the exhaustive search process of the original algorithm.
Random Forest (RF) and Support Vector Machine (SVM) based classifiers are used as diagnostic models to show the performance of our selection algorithm. The rationale for using these techniques is their versatility in real environments. RF has good performance in classification tasks; the algorithm leads to interpretable models and works well with small sets of samples and large numbers of attributes. SVM has been shown to be a good classifier with low computational burden in the training and testing phases, even for on-line operation. Both techniques are available in several computational environments, and their use has been studied recently for fault diagnosis of rotating machinery in [3, 5, 7, 11, 17, 31, 43], among other works.
The performance of feature selection with our hierarchical approach is compared with the results from the one-step run of the original selection algorithm, for fault diagnosis in gears. Our results are also compared with other feature selection techniques, such as Non-negative Matrix Factorization and the entropy-based ranking of variable importance provided by the RF algorithm.

This paper is organized as follows. Section 2 presents the theoretical background that supports the proposed approach. Section 3 details the experimental procedure to collect the data, and the feature extraction process from vibration signals. Section 4 develops our approach for hierarchical feature selection based on relative dependency. Section 5 analyses the performance of our approach for fault diagnosis in gears and compares it with other feature selection techniques and classifiers. Finally, Section 6 presents the conclusions and future works.
2 Theoretical background
This section briefly presents the conceptual foundations of the techniques that are used in this work. We focus on attribute clustering as a feature selection method in Section 2.1, and on support vector machine and tree-based classifiers in Sections 2.2 and 2.3, respectively. More details of this basic review can be found in the references.
2.1 Relative dependency based attribute clustering
RS-based attribute clustering has recently been used for feature selection. The main objective is to find approximate reducts, that is, subsets of features from an information system that are able to keep the original indiscernibility. Formally, the indiscernibility is defined as follows [32]:
Definition 1 Let $I = (U, A)$ be an information system, where $U = \{x_1, x_2, \ldots, x_n\}$ is a finite set of objects (samples) and $A$ is a finite set of attributes (features). For any object $x_i \in U$, $f_a(x_i)$ denotes the value of the attribute $a \in A$ in the object $x_i$. The indiscernibility $IND$ of a subset $B \subseteq A$ is given in (1):

$$IND(B) = \{(x, y) \in U \times U \mid \forall a \in B, \; f_a(x) = f_a(y)\} \quad (1)$$

If $IND(A) = IND(B)$ then $B$ is a reduct of $A$. $B$ is a minimal reduct if $IND(B) = IND(A)$ and $IND(B') \neq IND(A)$, $\forall B' \subset B$, are satisfied.
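As a hedged illustration (not from the original paper), the equivalence classes induced by $IND(B)$ can be computed by grouping objects with identical rows on the attributes of $B$; the function name below is chosen for this example only.

```python
import numpy as np

def indiscernibility_classes(U: np.ndarray, B: list) -> np.ndarray:
    """Label each object so that objects sharing the same values on every
    attribute in B (i.e., pairs (x, y) in IND(B)) receive the same label."""
    _, labels = np.unique(U[:, B], axis=0, return_inverse=True)
    return labels

# Toy information system: 3 objects, 3 categorical attributes
U = np.array([[0, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])
print(indiscernibility_classes(U, [0, 1]))  # [0 0 1]: objects 0 and 1 merge
```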
An algorithm for feature selection has been proposed in [14] that aims to improve the process of finding approximate reducts, based on the relative dependency concept defined in [12]. The relative dependency between two attributes is a similarity measure, and it is defined as follows [14]:
Definition 2 Let $A_i, A_j$ be two attributes; the relative dependency degree $Dep(A_i, A_j)$ is defined in (2):

$$Dep(A_i, A_j) = \frac{|\Pi_{A_i}(U)|}{|\Pi_{(A_i, A_j)}(U)|} \quad (2)$$

where $\Pi_{A_i}(U)$ is the projection of $U$ on $A_i$, and $\Pi_{(A_i, A_j)}(U)$ is the projection of $U$ on $(A_i, A_j)$.

The projection $\Pi_B(U)$ is computed in two steps: (i) define the set $C = A - B$; (ii) merge the objects (samples) that are indiscernible for all the attributes $c_i \in C$, according to (1).

The equality $Dep(A_i, A_j) = Dep(A_j, A_i)$ is not always true; consequently, the average is proposed to represent the similarity between two attributes. Finally, the dissimilarity measure $d$ between two attributes is stated in (3):

$$d(A_i, A_j) = 1 - \mathrm{Avg}\big(Dep(A_i, A_j),\, Dep(A_j, A_i)\big) \quad (3)$$
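To make (2) and (3) concrete, the following sketch computes both measures for categorical attributes stored as columns of a pandas DataFrame. Here the projection $\Pi_B(U)$ is read as the set of distinct value tuples on $B$; all function names are illustrative assumptions, not the authors' code.

```python
import pandas as pd

def projection_size(df: pd.DataFrame, cols: list) -> int:
    """|Pi_B(U)|: the number of distinct value tuples on the attributes in cols."""
    return len(df[cols].drop_duplicates())

def dep(df: pd.DataFrame, ai: str, aj: str) -> float:
    """Relative dependency degree Dep(Ai, Aj) as in (2)."""
    return projection_size(df, [ai]) / projection_size(df, [ai, aj])

def dissimilarity(df: pd.DataFrame, ai: str, aj: str) -> float:
    """Dissimilarity d(Ai, Aj) as in (3): one minus the two-way average."""
    return 1.0 - 0.5 * (dep(df, ai, aj) + dep(df, aj, ai))

# Toy information system: 4 samples, 2 categorical attributes
df = pd.DataFrame({"A1": ["a", "a", "b", "b"], "A2": ["x", "y", "x", "x"]})
print(dissimilarity(df, "A1", "A2"))  # 1 - avg(2/3, 2/3) = 1/3
```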
The work in [14] creates clusters of attributes by using the dissimilarity measure defined in (3). In this context, the centroid of a cluster is the attribute that represents similar attributes in a region, and this centroid is selected as the most representative attribute. The clustering strategy is called Most Neighbours First (MNF), and it is assumed that the algorithm can cluster the attributes into a defined number of clusters; a k-medoids approach is used with a different search strategy to find the centroids.
Let $I = (U, A)$ be an information system and $K$ a number of desired clusters; the MNF algorithm is summarized in the following steps [14]:

Step 1: Randomly select $K$ attributes $A_t$ as the initial representative attributes; $A^c_t \in A^c \subseteq A$ denotes the center of the $t$-th cluster $C_t$, $t = 1, \ldots, K$.

Step 2: Compute the dissimilarity $d(A_i, A^c_t)$ as in (3), for each non-representative attribute $A_i$, $A_i \in A - A^c$, $i = 1, \ldots, n_t$, $t = 1, \ldots, K$.

Step 3: Allocate all non-center attributes $A_i$ to their nearest center according to the distances in Step 2.

Step 4: Calculate the distances $d(A_{t,i}, A_{t,j})$, $i \neq j$, $\forall (A_{t,i}, A_{t,j}) \in C_t$.

Step 5: Calculate the radius $r_t$ of each cluster $C_t$ as in (4):

$$r_t = \frac{\sum_{i \neq j} d(A_{t,i}, A_{t,j})}{C^{n_t}_2} \quad (4)$$

where $C^{n_t}_2 = \frac{n_t(n_t - 1)}{2}$ is the number of attribute pairs in $C_t$.

Step 6: Find the set $Near(A_{t,m})$ as proposed in (5):

$$Near(A_{t,m}) = \{A_{t,i} \mid A_{t,i} \in C_t, \; d(A_{t,m}, A_{t,i}) \leq r_t\} \quad (5)$$

Step 7: Find the attribute $A_{t,l}$, $\forall C_t$, such that $|Near(A_{t,l})| > |Near(A_{t,j})|$, $\forall j \neq l$, and set the attribute $A_{t,l}$ as the new center $A^c_t$.

Step 8: Repeat Steps 2–7 until the stop criterion is met.

Step 9: Select the attributes $A^c_t$, $\forall C_t$, as the most representative attributes.
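The loop below is a condensed, hedged sketch of Steps 1–9 for a precomputed dissimilarity matrix D built with (3). Tie-breaking and degenerate clusters are handled naively here (Section 4 modifies exactly these points), and the function name is illustrative.

```python
import numpy as np

def mnf(D: np.ndarray, K: int, n_iter: int = 20, seed: int = 0) -> list:
    """MNF attribute clustering over an n x n dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    centers = list(rng.choice(n, size=K, replace=False))    # Step 1
    for _ in range(n_iter):                                  # Step 8
        # Steps 2-3: allocate each non-center attribute to its nearest center
        members = {c: [c] for c in centers}
        for a in range(n):
            if a not in centers:
                nearest = min(centers, key=lambda c: D[a, c])
                members[nearest].append(a)
        new_centers = []
        for c, attrs in members.items():
            if len(attrs) < 2:          # degenerate cluster: keep old center
                new_centers.append(c)
                continue
            # Steps 4-5: radius = mean pairwise distance over C(nt, 2) pairs
            pairs = [(i, j) for i in attrs for j in attrs if i < j]
            r = sum(D[i, j] for i, j in pairs) / len(pairs)
            # Steps 6-7: new center = attribute with most neighbours within r
            near = {m: sum(D[m, i] <= r for i in attrs if i != m) for m in attrs}
            new_centers.append(max(near, key=near.get))
        centers = new_centers
    return centers                                           # Step 9
```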
The number of iterations is usually the stop criterion. On the other hand, this algorithm only works on categorical data; therefore, a domain discretization is needed to implement the algorithm when numerical data are used. Examples with low-dimensional feature spaces are presented in [14] to show the capability of the algorithm; however, this is not the case in real applications such as fault diagnosis in rotating machinery. In our approach, the above algorithm was slightly modified to deal with high-dimensional attribute sets. These modifications are explained in Section 4.
2.2 Support vector machines
Support Vector Machine (SVM) is a non-probabilistic
binary linear classifier that has reported good performance
in classification problems. SVM-based classification is
stated as follows [13]:
The training dataset is composed of $m$ samples of a couple $(x_i, y_i)$, where $i = 1, \ldots, m$. The variable $x_i$ is a vector of $n$ attributes, i.e., $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, and $y_i$ is the classification variable that can take binary values, $y_i \in \{-1, 1\}$. The vector $x_i$ belongs to class $C_1$ or $C_2$: if $x_i$ belongs to class $C_1$ then $y_i = 1$, and $y_i = -1$ otherwise. On the other hand, a hyperplane is defined as in (6):

$$\{x : f(x) = x^T\theta + \theta_0 = 0\} \quad (6)$$

If the classes are separable, there exists a function $x^T\theta + \theta_0$ with $y_i f(x_i) > 0$, $\forall i$, and we are able to find the hyperplane that creates the best margin between samples of classes $C_1$ and $C_2$. This problem is summarized as the optimization problem in (7):

$$\min_{\theta, \theta_0} \frac{1}{2}\|\theta\|^2 \quad \text{subject to} \quad y_i(x_i^T\theta + \theta_0) \geq 1, \; \forall i \quad (7)$$
If the classes overlap in the feature space, the optimization problem is stated as in (8):

$$\min_{\theta, \theta_0} \frac{1}{2}\|\theta\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i(x_i^T\theta + \theta_0) \geq 1 - \xi_i, \;\; \xi_i \geq 0, \; \forall i \quad (8)$$

where $\xi = (\xi_1, \xi_2, \ldots, \xi_m)$ are slack variables and the cost parameter $C$ is a tuning parameter. The optimization problem in (8) is rewritten in the unconstrained form in (9). Once the solution is obtained for $\hat{\theta}$ and $\hat{\theta}_0$, the predicted class $\hat{y}$ for a new sample $x$ is given by the decision function in (10), where $y \in \{-1, 1\}$:

$$\min_{\theta, \theta_0} \frac{1}{2}\|\theta\|^2 + C\sum_i \max\big(1 - y_i(x_i^T\theta + \theta_0),\, 0\big) \quad (9)$$

$$\hat{y} = \arg\max_{y} \; y(x^T\theta + \theta_0) \quad (10)$$

The described binary classification is generalized to multi-class classification by using the one-vs-all strategy. Let $O_k(x)$ be the output of the $k$-th SVM model that classifies the class $k$ for the new sample $x$. The predicted class $\hat{y}$ is given by (11):

$$\hat{y} = \arg\max_k O_k(x) \quad (11)$$
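As a brief, hedged illustration of (9) and the one-vs-all scheme (11), the sketch below uses scikit-learn's LinearSVC, which solves a closely related regularized hinge-loss problem, wrapped in an explicit one-vs-rest strategy; the dataset is synthetic.

```python
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import make_classification

# Synthetic 3-class problem standing in for the diagnosis data
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# One linear SVM O_k per class; prediction is argmax_k O_k(x) as in (11)
clf = OneVsRestClassifier(LinearSVC(C=0.1))
clf.fit(X, y)
print(clf.predict(X[:5]))
```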
2.3 Decision trees and random forest
Decision trees (DT) allow splitting the attribute space
through an iterative procedure of binary partition. It is
a powerful method for some classification and prediction
problems, providing an easily interpretable model [13,28,
41].
Consider the training dataset composed of the couples $(x_i, y_i)$ as described in Section 2.2. For multi-class classification, there is a collection of classes $c$, $c = 1, \ldots, C$; the variable $y_i$ is the logistic classification variable, where $y_i = c$ indicates that the attribute vector $x_i$ is associated with the class $c$. From all the data, the DT algorithm selects an attribute (variable) $j$ and a partition point $s$, and it sets the pair of half-planes $R_1$ and $R_2$ described in (12), following some guidelines to obtain a model with tree topology [13]:

$$R_1(j, s) = \{X \mid X_j \leq s\}, \qquad R_2(j, s) = \{X \mid X_j > s\} \quad (12)$$

Then, the algorithm seeks the best selection $(j, s)$ that solves the maximization problem in (13):

$$\arg\max_c \hat{p}_{kc} \quad (13)$$

where $\hat{p}_{kc} = (1/N_k)\sum_{x_i \in R_k} I(y_i = c)$ indicates the proportion of observations of class $c$ in the region $R_k$, regarding the total of $N_k$ observations in this region; $I$ is a membership indicator of the attribute vector to that region. The expression $\hat{p}_{kc}$ is a homogeneity measure of the child node, also called the 'impurity function'.

The iterative procedure splits the attribute space into $r$ disjoint regions $R$, until the stop criterion is achieved. The class $c$ is assigned to the node $k$ of the tree, which represents the region $R_k$, through the expression $C(k) = \arg\max_c \hat{p}_{kc}$. This procedure searches throughout all possible values of all attributes among the samples, and the resulting model is a binary tree where the leaves are the best representation of the regions $R_k$. Figure 1 shows an example of the result after applying this iterative procedure.
One of the problems related to tree-based techniques is their high variance. Therefore, the bootstrap aggregating technique, also called bagging, is applied to mitigate this issue [13]. Basically, the bagged classifier is composed of a set of classifiers that are trained with random subsets of the available data. Each classifier proposes a class for the input $x_i$, and the estimated class $\hat{C}_{bag}(x_i)$ is assigned according to (14), where $\hat{f}_{bag}$ is a vector of values $p_c(x_i)$ that indicates the proportion of classifiers that propose the class $c$:

$$\hat{C}_{bag}(x_i) = \arg\max_c \hat{f}_{bag} \quad (14)$$

Fig. 1 (a) Binary division of the attribute space. (b) Model representation as a binary tree [13]
The Random Forest (RF) algorithm is a modified approach of the bagging technique that builds a collection of non-correlated trees with low bias (low error on the training data) and low variance (low error on the test data) [4, 39, 46]. The RF algorithm for classification problems is fully developed in the literature, and it is summarized in [13]. Let $X$ and $X_b$ be the dataset of size $m$ and a training dataset randomly selected of size $m_b$, $m_b \leq m$, respectively; the random dataset is used to build each tree $T_b$ of the forest. Figure 2 illustrates the structure of a RF, which is composed of a collection of trees $T_b$; the decision for classifying a new sample $x_i$ is made according to (15):

$$\hat{C}^B_{rf}(x_i) = \text{majority vote}\,\{\hat{C}_b(x_i)\}_1^B \quad (15)$$

where $\hat{C}_b(x_i)$ is the class estimated by the tree $T_b$, and the assigned class $\hat{C}^B_{rf}(x_i)$ is the class with the largest number of 'votes', after considering the class proposed by each tree $T_b$.

The complement set $OOB_b = m - m_b$ of each tree $T_b$ is the out-of-bag sample, and it is used as the cross-validation set for the tree during the training process. The OOB-error is a performance measure of the RF; it is defined as the average of the classification errors of the trees $T_b$ using the $OOB_b$ samples.

Fig. 2 RF structure
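A minimal sketch of an RF with out-of-bag validation follows, assuming scikit-learn; `oob_score_` reports accuracy on the OOB samples, so the OOB error discussed above is its complement.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic multi-class data standing in for the gearbox dataset
X, y = make_classification(n_samples=900, n_features=50, n_classes=4,
                           n_informative=10, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB error:", 1.0 - rf.oob_score_)  # average misclassification on OOB samples
print(rf.predict(X[:3]))                  # majority vote over the 500 trees, as in (15)
```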
3 Measurement procedure and feature extraction

This section presents the experimental set-up to collect the data related to the fault conditions in gearboxes. A data matrix with $m$ samples (rows) and $n$ available attributes is created; each row represents a machinery condition. The collected data form the dataset that will be processed by our hierarchical unsupervised feature selection approach.

Fig. 3 Vibration analysis laboratory at the Salesian Polytechnic University, Cuenca-Ecuador
All the experiments were carried out on the experimental test bed shown in Fig. 3. The rotation of the equipment is generated by a 1.1 kW motor powered by a three-phase 220 V supply at 60 Hz, with a nominal speed of 1650 rpm. The torque is transmitted into a gearbox, where several gear fault configurations are assembled. The torque is then transmitted from the gearbox shaft to a pulley, which is part of the magnetic brake system. The magnetic brake applies different loads according to the experimental protocol. A variable-frequency drive was used to generate different speeds. The data acquisition system is formed by the NI CompactDAQ-9191 from National Instruments and the NI 9234 module, which is inserted in the DAQ slot. This device has a maximum sampling frequency of 51.2 kS/s, anti-aliasing filtering, 24-bit resolution, IEPE signal coupling, and Ethernet communication. The data acquisition software was developed in our laboratory in the NI LabVIEW environment.

The PCB IEPE accelerometer, with a sensitivity of 100 mV/g, was vertically mounted on the gearbox in order to record the vibration signals from the spur gears.
Table 1 Experimental settings

Parameter                              Value
Sampling frequency                     50 kHz
Length of each sample                  10 s
Number of tests                        5
Rotation frequency (constant speed)    8 Hz, 12 Hz, 15 Hz
Frequency range (variable speed)       5–12 Hz, 12–18 Hz, 8–15 Hz
Load                                   No load, 10 V, 30 V
Table 2 Gear fault conditions

Label   Description
f0      Healthy pinion, healthy gear
f1      Pinion tooth chaffing, healthy gear
f2      Pinion tooth wear, healthy gear
f3      25 % pinion tooth breakage, healthy gear
f4      50 % pinion tooth breakage, healthy gear
f5      100 % pinion tooth breakage, healthy gear
f6      Healthy pinion, 25 % gear crack
f7      Healthy pinion, 100 % gear crack
f8      Healthy pinion, 50 % gear chaffing
f9      25 % pinion tooth breakage, 25 % gear crack
Experimental settings for the signal measurements are shown in Table 1. The gearbox was configured with ten different fault modes, including the healthy condition (see Table 2). Fault f0 is the healthy (normal) condition, faults f1 and f2 are incipient faults, f3, f4, f6 and f8 are moderate faults, f5 and f7 are severe faults, and f9 is a multiple fault. An incipient fault is a fault that is just beginning to show symptoms; this is an important condition to be diagnosed in industrial applications. Figure 4 shows the real gear conditions.

Fig. 4 Real gear damages

According to the parameter values, we have 900 signal samples. The next sections show the feature extraction for each sample, using statistical parameters on the time domain, frequency domain and time-frequency domain. All data processing was performed in Matlab©.
3.1 Condition parameters on time and frequency domains

Seven classical condition parameters were obtained by statistical analysis on the time domain: root mean square (RMS), energy, crest factor, mean, standard deviation, variance and skewness. These parameters were calculated over the entire signal length of each sample, as shown in Fig. 5. In the frequency domain, four condition parameters (RMS, mean, standard deviation and kurtosis) were calculated on eighty equal-sized frequency bands and fifteen octave frequency bands, as illustrated in Fig. 6. Frequency bands are used for extracting condition parameters because a fault can generate clear changes in the vibration amplitude on specific bands in which this amplitude is not significant under normal conditions.

Fig. 5 Feature extraction in time domain

Fig. 6 Feature extraction in frequency domain
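As an illustrative sketch (not the authors' Matlab code), the seven time-domain parameters can be computed per sample as follows:

```python
import numpy as np
from scipy.stats import skew

def time_domain_features(x: np.ndarray) -> dict:
    """The seven classical time-domain condition parameters named above."""
    rms = np.sqrt(np.mean(x ** 2))
    return {
        "rms": rms,
        "energy": np.sum(x ** 2),
        "crest_factor": np.max(np.abs(x)) / rms,
        "mean": np.mean(x),
        "std": np.std(x),
        "variance": np.var(x),
        "skewness": skew(x),
    }

x = np.random.randn(500_000)  # stand-in for a 10 s sample at 50 kHz
print(time_domain_features(x))
```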
3.2 Condition parameters on time-frequency domain

Raw signals on the time domain, for each sample, were used as input data to the wavelet packet analysis. The Wavelet Transform (WT) is a powerful mathematical tool for signal analysis that has attracted great attention in several engineering fields. The use of wavelets has been proven successful for fault detection and diagnosis [3, 27, 30, 42]. The analysis presented in [42] provides an extensive overview of some of the latest efforts in the development and application of WT for fault diagnosis in rotating machinery. In the Wavelet Packet Transform (WPT) framework, the compression and denoising ideas are exactly the same as those developed in the WT framework. WPT offers a flexible analysis: the signal details as well as the signal approximations are split in a tree decomposition [29, 42]. In the following, the Wavelet Packet Decomposition (WPD) is briefly summarized.
Let $\psi$ and $\phi$ be the selected wavelet function and its corresponding scaling function, given by (16) and (17), respectively, where $k$ is the shift index, $g(k)$ is the impulse response of the low-pass filter associated with $\phi$, and $h(k)$ is the impulse response of the high-pass filter associated with $\psi$; these filters are also known as Quadrature Mirror Filters (QMF) [29]:

$$\phi(t) = \sqrt{2}\sum_{k} g(k)\,\phi(2t - k) \quad (16)$$

$$\psi(t) = \sqrt{2}\sum_{k} h(k)\,\phi(2t - k) \quad (17)$$

Let $V_0$ be a vector space generated by the scaling function $\phi(t)$ and its corresponding translations $\phi(t - k)$. The vector space $V_1$ is such that $V_1 \subset V_0$, and its corresponding scaling and translation functions are $\phi(2t)$ and $\phi(2t - k)$, respectively. The vectorial operation in (18) permits moving from $V_0$ to $V_1$, and in general from $V_j$ to $V_{j-1}$, without loss of information; $W_j$ is the orthogonal vector space called the 'wavelet space', which is generated from the wavelet function $\psi(t)$ and its corresponding translations $\psi(t - k)$:

$$V_{j-1} = V_j \oplus W_j \quad (18)$$

Equation (18) states that a function defined on $V_{j-1}$ can be decomposed into a function that belongs to $V_j$ and another function that belongs to $W_j$.

Let $x(t)$ be the discrete-time signal; this signal is decomposed into both the Low Frequency Approximation (LFA) and the High Frequency Detail (HFD). The LFA is obtained with the filter $\phi(t)$ by using (19), and the HFD with the filter $\psi(t)$ by using (20), where $\downarrow 2$ is the sub-sampling operation for the displacement from $V_j$ to $V_{j-1}$:

$$LFA = (x * \phi) \downarrow 2 \quad (19)$$

$$HFD = (x * \psi) \downarrow 2 \quad (20)$$
In WPD, this process is repeated recursively on each resulting LFA and HFD signal, until the required level of decomposition is obtained. As a result, the raw time-domain signal is decomposed into a binary tree of LFA and HFD signals. In the current work, five mother wavelets are used in the analysis: Daubechies (db7), Symlet (sym), Coiflet (coif4), Biorthogonal (bior6.8) and Reverse Biorthogonal (rbior6.8). The rationale for using several wavelets is to collect as much information as possible for our application. WPD was performed down to four levels for each mother wavelet; $2^4$ terminal coefficient sets are thus obtained for each one, and eighty features in total are extracted by using the energy operator, as shown in Fig. 7. In this figure, only the left branch of the tree is illustrated for one mother wavelet.

Fig. 7 Wavelet packet decomposition

Finally, we have calculated 817 condition parameters for each sample, and the whole dataset for the following machine learning application was arranged in a matrix representation with 900 samples (rows) and 817 attributes (columns).
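A hedged sketch of this feature extraction using the PyWavelets package is shown below: a 4-level WPD per mother wavelet yields $2^4 = 16$ terminal nodes, whose energies give $5 \times 16 = 80$ features per signal. The Symlet order and the PyWavelets name 'rbio6.8' for the reverse biorthogonal wavelet are assumptions, since the paper does not state them exactly.

```python
import numpy as np
import pywt

# Wavelet names assumed; the paper gives "sym" without an order
WAVELETS = ["db7", "sym4", "coif4", "bior6.8", "rbio6.8"]

def wpd_energy_features(x: np.ndarray, level: int = 4) -> np.ndarray:
    """80 energy features: 16 terminal-node energies per mother wavelet."""
    feats = []
    for w in WAVELETS:
        wp = pywt.WaveletPacket(data=x, wavelet=w, mode="symmetric",
                                maxlevel=level)
        for node in wp.get_level(level, order="natural"):
            feats.append(np.sum(node.data ** 2))  # energy operator
    return np.asarray(feats)

x = np.random.randn(4096)                 # stand-in vibration signal
print(wpd_energy_features(x).shape)       # (80,)
```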
4 Hierarchical unsupervised feature selection

This section presents our Hierarchical UnSupervised Feature Selection (HUSFS) approach, which is inspired by the algorithm in [14] (see Section 2.1). In the proposal of [14], two aspects were noted:

1. In Step 3, the randomly selected centroids $A^c_t$ do not necessarily have nearby attributes regarding the dissimilarity metric calculated in Step 2. This means that not all the random centroids define a cluster. More formally, Step 3 states the following. Let $C_t$ be the cluster defined by the centroid candidate $A^c_t$, $t = 1, \ldots, K$, as described in (21):

$$C_t = \{A_i \mid A_i = \arg\min_{A} d(A_i, A^c_t)\} \quad (21)$$

If $C_t = \emptyset$, the centroid candidate $A^c_t$ does not define a cluster, and there are only $k$ clusters, with $k < K$. As a consequence, in Step 7, $Near(A_{t,m}) = \emptyset$ for that centroid candidate $A^c_t$.

2. In Step 7, the inequality $|Near(A_{t,l})| > |Near(A_{t,j})|$, $\forall j \neq l$, is not always true, and the cluster $C_t$ can have more than one $A_{t,l}$.

On the other hand, for a high-dimensional feature space, there are two other aspects that can lead to a high computational burden and time consumption when implementing the algorithm [14]:

3. An exhaustive search is performed over the entire feature space, regarding the value of the relative dependency degree between features.

4. The number of given clusters $K$ could need to be large to explore the entire feature space.
In order to deal with the previous items, the algorithm in [14] was slightly modified in our work in two ways: (i) by rewriting Step 7 of the original algorithm to provide a new random selection over the centroid candidates that can define clusters with the best density, and (ii) by proposing a hierarchical implementation to reinforce the search process over the subspaces of solutions, with improved computation time.

In a wide sense, exploration is the creation of new solutions by searching over the entire solution space; exploitation is the refinement of the current solutions. In our approach, exploration is strengthened by providing several disjoint search subspaces; the modified selection algorithm is executed on each subspace, fixing a number of clusters according to the cardinality of the feature subset. Exploitation is enhanced by including a new random selection in each iteration, through the modification of Step 7, and by filtering the current solutions in the next level.
Step 7 has been rewritten as follows. Let $NewA^c_t$ be the set of new centroid candidates defined in (22):

$$NewA^c_t = \{A_{t,l} \mid |Near(A_{t,l})| \geq |Near(A_{t,j})|, \; \forall j \neq l\} \quad (22)$$

and let $\min|Near(A_{t,l})|$ denote the centroid candidates with the lowest number of attributes in the cluster (low density).

In this sense, our work proposes a new random selection in each iteration of the selection algorithm, by substituting the attributes $A^c_t$ that do not define a cluster, giving priority to those attributes $A^c_t$ that define clusters with the best density. Additionally, because of the random nature of the centroid assignment, we evaluate the number of times that an attribute is selected as a significant one. This information is used to finally select the best attributes after reaching the number of iterations $N_{it}$. In particular, we chose the attributes that have been selected at least once over all the iterations.
Moreover, a hierarchical approach is proposed in order to improve the search procedure over the entire attribute space. In each level, $p$ disjoint subsets $A_i$ are defined over the set of available attributes $A$, that is, $A = \dot{\cup}_i A_i$, and the attribute clustering is performed for each subset $A_i$. The stop condition for the clustering algorithm is the number of iterations $N_{it}$; after reaching this stop condition for each analysed disjoint subset, the centroids are proposed as the set of selected attributes $NA_l$, with $|NA_l| < |A|$. The new set of selected attributes is processed again in the next level, as described above.

This procedure can be executed $N_l$ times, $N_l$ being the number of levels in the hierarchy, or until a certain number of attributes $N_a$ is obtained. Once one of these conditions is met, $L$ subsets of attributes are proposed and the final selection of the best subset is decided according to its performance in a classification model. In this hierarchical approach, the parameters $N_{it}$, $A_i$ and $K$ are adjustable in each level $l$; the parameters $N_a$ and $N_l$ are fixed at the beginning of the process. This procedure is illustrated in Fig. 8 and summarized as follows:
Fig. 8 Hierarchical procedure for feature selection

Input: categorical matrix CM from the current numerical data matrix with $m$ samples (rows) and $n$ attributes (columns).

1. Set the values of $N_l$ and $N_a$, $l = 1, \ldots, N_l$.
2. For each $l$:
   (a) Assign the available attributes to the set $A$.
   (b) Create the disjoint sets¹ of attributes $A_i$, $i = 1, \ldots, p$, from the set $A$.
   (c) Create the data matrix $DM_i$ from the categorical matrix CM, with $m$ samples and the attributes in $A_i$.
   (d) For each $DM_i$: set the values of $K$ and $N_{it}$, and run the HUSFS algorithm until reaching $N_{it}$.
   (e) Evaluate each set $NewA^c_t$ for each $A_i$, and propose the set $\hat{A}_i \subseteq A_i$ with the attributes that have been selected at least once.
   (f) Create the new subset of available attributes $NA_l$ such that $NA_l = \cup_{i=1}^{p} \hat{A}_i$.
   Repeat until reaching $l = N_l$ or $|NA_l| \leq N_a$ for some $l$.
3. For each $l$:
   (a) Create the new numerical data matrix $NDM_l$ with $m$ samples and the attributes in $NA_l$.
   (b) Train a machine learning based diagnosis model for each data matrix $NDM_l$.
   (c) Evaluate the performance of each $NDM_l$ with the obtained diagnosis model.
4. Select the set of attributes $NA_l$ of the $NDM_l$ with the best performance.

Output: numerical data matrix NM with $m$ samples (rows) and the $NA_l$ best attributes (columns), and its corresponding trained diagnosis model.
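The following Python skeleton mirrors the structure of this procedure as a sketch; `husfs_select` is a hypothetical stand-in for one run of the modified clustering algorithm on a disjoint subset, returning the attribute indices selected at least once over the $N_{it}$ iterations.

```python
import numpy as np

def hierarchical_selection(CM: np.ndarray, husfs_select, Nl: int = 6,
                           Na: int = 30, subset_size: int = 30, seed: int = 0):
    """Levels of the HUSFS procedure over a categorical matrix CM (m x n)."""
    rng = np.random.default_rng(seed)
    available = np.arange(CM.shape[1])                  # step 2(a)
    levels = []
    for _ in range(Nl):
        rng.shuffle(available)                          # random disjoint sets
        subsets = np.array_split(
            available, max(1, len(available) // subset_size))   # step 2(b)
        selected = []
        for Ai in subsets:                              # steps 2(c)-(e)
            K = max(1, len(Ai) // 2)                    # K = floor(|Ai| / 2)
            selected.extend(husfs_select(CM[:, Ai], Ai, K))
        available = np.unique(selected)                 # step 2(f): NA_l
        levels.append(available.copy())
        if len(available) <= Na:                        # stop condition
            break
    return levels  # step 3: evaluate each NA_l with a classifier
```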
Based on the search process of the algorithm in [14] and our proposed modification, the hierarchical approach has the following characteristics, which aim to preserve the most significant attributes over the entire set of attributes:

1. The modified algorithm runs $N_{it}$ times on a subspace of solutions in each level; then, an exhaustive search is locally performed over an important number of clusters $K$. We propose to use $K = \lfloor |A_i|/2 \rfloor$.

2. As a consequence of the exhaustive local search, the non-included features in each subspace are represented by centroids (representative features), which will be analysed in the next level of the hierarchy.

3. All features that have been selected at least once in the $N_{it}$ iterations are considered for the next level of the hierarchy; this enriches the number of centroid candidates in the next level.

¹For simplicity, in this work the disjoint sets have been randomly selected and uniformly sized regarding the cardinality of the set $A$. However, other criteria could be applied to decompose the set $A$, e.g., grouping features with related meaning.

Table 3 RF-based diagnoser with low correlation data

Performance measure   All data   Reduced attributes
                      811        237      258      330      409
Precision             0.9816     0.9860   0.9929   0.9964   0.9893
Sensitivity           0.9815     0.9852   0.9926   0.9363   0.9889
F-score               0.9814     0.9852   0.9926   0.9963   0.9889
RF and SVM based classifiers are used as classification models. These models are used because of their versatility in real environments: SVM has been shown to be a good classifier with low computational burden in the training and test phases, even for on-line operation. The optimization algorithms to reach good SVM classifiers are well known, and $C$ is the only parameter to be adjusted in (9). However, finding the adequate number of attributes for obtaining good performance in a SVM-based classifier could lead to an exhaustive search, as is widely discussed in the literature. On the other hand, the good performance of RF-based models for classification tasks is also well known, even with small sets of samples and high numbers of attributes; however, a large number of trees could be needed in order to reach a good precision. Both techniques are available in several computational environments, and they are easy to implement in industrial applications. Even though the above models can work with a large number of attributes, our study addresses the analysis of the effect of feature selection, through our HUSFS approach, to obtain simple models that could be implemented in industrial environments with less computational effort. The classification precision is the measure used for evaluating the model performance.
5 Results and analysis in gear fault diagnosis

This section presents the results of selecting attributes with the approach of Section 4. The data matrix was split into 70 % of the samples for the training phase and 30 % for the test phase (performance evaluation), for building each diagnosis model in all the experiments. This rate has been selected following classical suggestions in machine learning applications, to get an adequate balance between accuracy and precision in both the training and test phases. The initial data matrix is taken from Section 3; a cleaning process was executed over the matrix to delete non-adequate data (NaN and zero values) and, as a result, the dataset is a matrix with 811 attributes (columns) and 900 samples (rows). Data normalization was computed on the interval [−1, 1], and a preliminary attribute selection has been performed on the previous data matrix by using correlation analysis.

Table 4 SVM-based diagnoser with low correlation data

Performance measure   All data   Reduced attributes
                      811        237      258      330      409
Precision             0.9348     0.7845   0.8889   0.9391   0.9559
Sensitivity           0.9148     0.7741   0.8982   0.9333   0.9556
F-score               0.9194     0.7735   0.8914   0.9340   0.9333

Table 5 Hierarchical unsupervised attribute selection, RF-based diagnoser

Performance measure   Low correlation data   Reduced attributes
                      330                    144     82      62      54      42      24
Precision             0.9964                 0.9820  0.9824  0.9788  0.9854  0.9890  0.9616
Sensitivity           0.9963                 0.9815  0.9815  0.9778  0.9852  0.9889  0.9556
F-score               0.9963                 0.9814  0.9814  0.9778  0.9852  0.9889  0.9546

Table 6 Hierarchical unsupervised attribute selection, SVM-based diagnoser

Performance measure   Low correlation data   Reduced attributes
                      409                    221     181     128     87
Precision             0.9559                 0.9570  0.9524  0.9360  0.9113
Sensitivity           0.9556                 0.9556  0.9444  0.9296  0.8963
F-score               0.9553                 0.9552  0.9443  0.9301  0.8990

Fig. 9 Results with HUSFS and RF-based classifier
Correlation analysis is a very useful statistical technique that aims to find dependence relationships between variables. In feature selection, correlation can be applied as a previous step in order to identify those attributes which are closely related in the sense of the correlation measure; then, only one attribute is selected as the representative one. Attributes with a correlation higher than 85, 90, 95 and 99 % were deleted; as a result, four reduced datasets have been obtained with 237, 258, 330 and 409 attributes. Tables 3 and 4 show the performance of each classifier for each dataset. RF is composed of 500 random trees; SVM uses Sequential Minimal Optimization (SMO), with C = 0.1, as the optimization method.
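A hedged sketch of this preliminary filter, assuming pandas: for each pair of attributes with absolute correlation above the threshold, only one representative is kept.

```python
import numpy as np
import pandas as pd

def correlation_filter(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one attribute from every pair correlated above the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return X.drop(columns=drop)

X = pd.DataFrame(np.random.randn(900, 20))
X[20] = X[0] * 0.99 + 0.01 * np.random.randn(900)  # near-duplicate attribute
print(correlation_filter(X, 0.95).shape)           # duplicate removed
```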
The challenge is to reduce the previous low correlation data with the HUSFS algorithm based on relative dependency, while maintaining an adequate accuracy. The classic discretization method by binning was applied on the normalized data matrix, with 40 bins. The next sections present the results with the RF and SVM classifiers. For both scenarios, the algorithm parameters were selected as follows: $N_{it} = 200$, $p$ was adjusted according to the cardinality of the initial set $A$ for each level, the size of $A_i$ was no more than 30 attributes, and $K$ is the greatest integer function (floor) applied to $\frac{|A_i|}{2}$.
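As an illustrative sketch of the discretization step, equal-width binning of the normalized data into 40 categorical levels can be written as follows (the exact binning variant used by the authors is not specified):

```python
import numpy as np

def discretize(X: np.ndarray, n_bins: int = 40) -> np.ndarray:
    """Equal-width binning of data normalized to [-1, 1] into n_bins levels."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    # digitize against interior edges gives bin indices 0..n_bins-1
    return np.clip(np.digitize(X, edges[1:-1]), 0, n_bins - 1)

X = np.random.uniform(-1, 1, size=(900, 330))
CM = discretize(X)         # categorical matrix fed to the HUSFS algorithm
print(CM.min(), CM.max())  # bin indices in 0..39
```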
For the RF-based classifier, the subset with 330 attributes was selected as the initial set because it has the best result in Table 3. The number of levels was $N_l = 6$. After running the HUSFS algorithm, the sizes of the sets of attributes for each level are $|NA_1| = 144$, $|NA_2| = 82$, $|NA_3| = 62$, $|NA_4| = 54$, $|NA_5| = 42$ and $|NA_6| = 24$. Table 5 shows the diagnoser performance in the test phase.

The set with $|NA_5| = 42$ has a good performance compared with the results from the original set of 330 attributes: precision, sensitivity and F-score decrease by only 0.74 %, while using 12 % of the attributes. On the other hand, with only 24 attributes, around 7 % of the original set, the performance measures decrease by 3.49, 4.08 and 4.18 % for precision, sensitivity and F-score, respectively. Both cases have an adequate performance evaluation for the classifiers. Figure 9 shows the hierarchical selection from 144 attributes down to 24 attributes; the attribute identifier is on the x-axis and the number of times that the attribute was selected is on the y-axis. After level $l = 5$ (cases e and f), centroid candidates have been selected a number of times exceeding 10 % of $N_{it}$.
The HUSFS algorithm was also applied with a SVM-based classifier. The subset with 409 attributes was selected as the initial set according to its performance in Table 4. The number of levels was $N_l = 4$; after running the HUSFS algorithm, the sizes of the sets of selected attributes are $|NA_1| = 221$, $|NA_2| = 181$, $|NA_3| = 128$ and $|NA_4| = 87$. Table 6 shows the diagnoser performance in the test phase.

In this case, the effect of feature selection is different with regard to the RF-based classifier. SVM is more sensitive to the use of a smaller number of attributes; this fact is well known in the literature, and it is a rationale for providing new feature selection algorithms. The performance measures are shown in Table 6: the precision with 128 attributes decreases by 2 % compared with the result from the original set with 409 attributes, while sensitivity and F-score decrease by 2.72 and 2.64 %, respectively. Moreover, with 181 selected attributes, around 44 % of the entire set, the precision value decreases by only 0.35 %; as a consequence, this set could be considered a good selection. However, the set with 221 attributes, around 54 % of the entire set, is the best reduction based on its performance values.
Our hierarchical approach was compared with the modified one-step-run approach inspired by [14], using the dataset with 330 attributes and an RF-based diagnoser. Taking into account the best result of the HUSFS in Table 5, where a reduced set of 42 attributes was selected, we set $K = 50$ and $N_{it} = 500$. After analysing the results, we found three significant sets of attributes, ranked according to the number of times they were selected; we thus have sets with 66, 56 and 48 attributes, selected in over 40, 50 and 60 % of the $N_{it}$ iterations, respectively. The performance measures are presented in Table 7. Comparing with the results in Table 5, we reached better performance with the reduced sets of attributes obtained with our HUSFS algorithm. This low performance occurs because the one-step algorithm may need a high value of $K$ to refine the search over the entire attribute space when a high-dimensional vector of attributes is processed; in this case, the computational burden for running the algorithm is considerably augmented.

Table 7 Unsupervised attribute selection in one-step run, RF-based diagnoser

Performance measure   Reduced attributes
                      48       56       66
Precision             0.5178   0.5541   0.6431
Sensitivity           0.5037   0.5444   0.6333
F-score               0.5005   0.5424   0.6358

Table 8 Classification precision for different feature selection techniques and classifiers

Classifier   Entropy-based selection      NMF      HUSFS
             29       14       9           35       42       24
RF           0.9852   0.9629   0.9555      0.8926   0.9890   0.9616
DT           0.8963   0.8926   0.8962      0.6296   0.8037   0.8111
1-NN         0.9888   0.9741   0.9741      0.5969   0.9666   0.9370
Finally, we have run two additional attribute selection techniques over the dataset with 330 attributes, to compare with our results. The first one is the ranking of variable importance provided by the RF algorithm. This ranking is a supervised approach based on the entropy measure that is calculated from the dataset with random selection of the attributes in the OOB samples [4]. We selected the attributes with entropy values above 40, 50 and 60 %; as a result, three sets with 29, 14 and 9 attributes are defined. The second technique is Non-negative Matrix Factorization (NMF), an unsupervised approach that is widely used in clustering, classification and feature selection [20]. Table 8 shows the precision after applying these selection techniques, and our HUSFS approach, with the following classifiers: RF, Decision Trees (DT) and 1-Nearest Neighbour (1-NN).
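The two baselines can be sketched as follows, under the assumption of scikit-learn implementations; the importance threshold and the NMF-based scoring are illustrative readings of the procedure described above, not the authors' exact code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import NMF

X = np.random.rand(900, 330)           # NMF requires non-negative data
y = np.random.randint(0, 10, 900)      # stand-in fault labels f0..f9

# Supervised baseline: entropy-based RF importance ranking; keep attributes
# whose importance exceeds a fraction of the maximum (threshold assumed)
imp = RandomForestClassifier(n_estimators=500, criterion="entropy",
                             random_state=0).fit(X, y).feature_importances_
keep_rf = np.where(imp > 0.4 * imp.max())[0]

# Unsupervised baseline: score attributes by their largest NMF loading
W = NMF(n_components=35, max_iter=500, random_state=0).fit(X).components_
keep_nmf = np.argsort(W.max(axis=0))[::-1][:35]
print(len(keep_rf), len(keep_nmf))
```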
The diagnosis results are clearly similar when the RF classifier is used with either HUSFS or the entropy-based supervised selection. For the other classifiers, the HUSFS approach is better than the unsupervised NMF-based selection. The precision value with 42 attributes from HUSFS is only improved, by around 2.25 %, by the entropy-based supervised selection with the 1-NN classifier, which is the best result. These results show that our unsupervised hierarchical approach can select attributes as well as supervised approaches do.
In summary, according to the previous results and analysis, the effect of the dimensionality reduction of the feature vector on the underlying RF and SVM algorithms is to build computationally treatable models with an adequate classification precision, compared with the models that can be obtained by using a large set of features. In particular, RF is an algorithm based on search over the attribute space, so the dimensionality reduction helps to improve the complexity regarding the number of variables of each tree. Moreover, SVM is a very sensitive algorithm regarding the attributes that are used for obtaining good precision and generalization capabilities; here the dimensionality reduction aims to discover the adequate features for a good performance of the SVM-based classifier. This hierarchical approach can be applied to other classification algorithms, and the particular effect of the dimensionality reduction should be analysed for each classifier.
6 Conclusion

Unsupervised feature selection is an important aspect under research in machine learning based classification. This paper presents an unsupervised approach for feature selection that is inspired by algorithms based on relative dependency. In particular, our approach improves the proposal in [14] by considering, in each iteration, a random selection that includes the centroid candidates defining the clusters with the best density. Our hierarchical implementation, with a disjoint partition of the available set of features in each level, makes it possible to deal with the size of the search space in the case of a large number of attributes. The HUSFS approach performs a local search on a reduced space of attributes; the best local selections are then aggregated to compose a new reduced space of attributes. In this sense, the search procedure is refined in each level.

For our case study in gear fault diagnosis, the performance of the diagnosers using our HUSFS algorithm is adequate with regard to other supervised or unsupervised techniques for feature selection. We have noted that the execution time of the one-step algorithm is reduced with our hierarchical approach; future work could address the analysis of adequate computational implementations for further improving the computational burden.
Acknowledgments The authors want to express their deep gratitude to the Secretary of Higher Education, Science, Technology and Innovation (SENESCYT) of the Republic of Ecuador and the Prometeo program, for their support of this research work. We also acknowledge the support of the GIDTEC research group of the Universidad Politécnica Salesiana in Cuenca-Ecuador, for the accomplishment of this research.
References
1. Bartkowiak A, Zimroz R (2014) Dimensionality reduction via
variables selection linear and nonlinear approaches with applica-
tion to vibration-based condition monitoring of planetary gearbox.
Appl Acoust 77:169–177
2. Benoît F, van Heeswijk M, Miche Y, Verleysen M, Lendasse A (2013) Feature selection for nonlinear models with extreme learning machines. Neurocomputing 102:111–124. Advances in extreme learning machines (ELM 2011)
3. Bordoloi D, Tiwari R (2014) Support vector machine based opti-
mization of multi-fault classification of gears with evolutionary
algorithms from time frequency vibration data. Measurement
55:1–14
4. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
5. Cabrera D, Sancho F, Sánchez RV, Zurita G, Cerrada M, Li C, Vásquez RE (2015) Fault diagnosis of spur gearbox based on random forest and wavelet packet decomposition. Front Mech Eng. doi:10.1007/s11465-015-0348-8
6. Cerrada M, Sánchez RV, Cabrera D, Zurita G, Li C (2015) Multi-stage feature selection by using genetic algorithms for fault diagnosis in gearboxes based on vibration signal. Sensors 15(9):23903–23926
7. Cerrada M, Zurita G, Cabrera D, Sánchez RV, Artés M, Li C (2015) Fault diagnosis in spur gears based on genetic algorithm and random forest. Mech Syst Signal Process. doi:10.1016/j.ymssp.2015.08.030
8. Chandrashekar G, Sahin F (2014) A survey on feature selection
methods. Comput Electr Eng 40:16–28
9. Fazayeli F, Wang L, Mandziuk J (2008) Feature selection based
on the rough set theory and expectation-maximization clustering
algorithm. In: Chan CC, Grzymala-Busse J, Ziarko W (eds) Rough
sets and current trends in computing. Lecture Notes in Computer
Science, vol 5306, pp 272–282
10. Ganivada A, Ray SS, Pal SK (2013) Fuzzy rough sets, and a granu-
lar neural network for unsupervised feature selection. Neural Netw
48:91–108
11. Gryllias K, Antoniadis I (2012) A support vector machine
approach based on physical model training for rolling element
bearing fault detection in industrial environments. Eng Appl Artif
Intell 25(2):326–344
12. Han J, Hu X, Lin T (2004) Feature subset selection based on relative dependency between attributes. In: Tsumoto S, Słowiński R, Komorowski J, Grzymała-Busse J (eds) Rough sets and current trends in computing. Lecture notes in computer science, vol 3066. Springer, Berlin Heidelberg, pp 176–185
13. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference and prediction. Springer, New York
14. Hong TP, Liou YL, Wang SL, Vo B (2014) Feature selection and
replacement by clustering attributes. Vietnam Journal of Computer
Science 1(1):47–55
15. Inbarani H, Bagyamathi M, Azar A (2015) A novel hybrid feature selection method based on rough set and improved harmony search. Neural Comput Appl 1–22
16. Jensen R, Shen Q (2008) Computational intelligence and features
selection: rough and fuzzy approaches. Wiley, New Jersey
17. Karabadji N, Khelf I, Seridi H, Laouar L (2012) Genetic opti-
mization of decision tree choice for fault diagnosis in an industrial
ventilator. In: Fakhfakh T, Bartelmus W, Chaari F, Zimroz R, Had-
dar M (eds) Condition monitoring of machinery in non-stationary
operations, pp 277–283
18. Li C, Liang M, Wang T (2015) Criterion fusion for spectral segmentation and its application to optimal demodulation of bearing vibration signals. Mech Syst Signal Process 64–65:132–148
19. Li C, Sanchez RV, Zurita G, Cerrada M, Cabrera D, Vasquez RE
(2015) Multimodal deep support vector classification with homol-
ogous features and its application to gearbox fault diagnosis.
Neurocomputing 168:119–127
20. Li Y, Ngom A (2013) The non-negative matrix factorization
toolbox for biological data mining. Source Code Biol Med 8
(10)
21. Liu C, Jiang D, Yang W (2014) Global geometric similarity
scheme for feature selection in fault diagnosis. Expert Syst Appl
41(8):3585–3595
22. Liu H, Yu L (2005) Toward integrating feature selection algo-
rithms for classification and clustering. IEEE Trans Knowl Data
Eng 17(4):491–502
23. Liu Z, Qu J, Zuo M, Hb Xu (2013) Fault level diagnosis for
planetary gearboxes using hybrid kernel feature selection and
kernel fisher discriminant analysis. Int J Adv Manuf Technol
67(5–8):1217–1230
24. Liu Z, Zhao X, Zuo M, Xu H (2014) Feature selection for fault
level diagnosis of planetary gearboxes. ADAC 8(4):377–401
25. van der Maaten L, Postma EO, van den Herik HJ (2009) Dimen-
sionality reduction: a comparative review. Tech. rep., Tilburg
University Technical Report, TiCC-TR 2009–005
26. Mac Parthal´
ain N, Jensen R (2013) Unsupervised fuzzy-rough set-
based dimensionality reduction. Inf Sci 229:106–121
27. Mallat S (2009) A wavelet tour of signal processing: the sparse
way. Elsevier Academic Press, Amsterdam
28. Mitchell T (1997) Machine learning. McGraw-Hill, New York
29. Mitra S (2011) Digital signal processing: a computer-based
approach. McGraw-Hill, New York
30. Muralidharan V, Sugumaran V (2013) Feature extraction using
wavelets and classification through decision tree algorithm for
fault diagnosis of mono-block centrifugal pump. Measurement
46(1):353–359
31. Muralidharan V, Sugumaran V, Indira V (2014) Fault diagnosis of
monoblock centrifugal pump using SVM. Int J Eng Sci Technol
17(3):152–157
32. Pawlak Z (1982) Rough sets. Int J Comput Inform Sci 11(5):341–
356
33. Qin H, Ma X, Zain JM, Herawan T (2012) A novel soft set
approach in selecting clustering attribute. Knowl-Based Syst
36:139–145
34. Rajeswari C, Sathiyabhama B, Devendiran S, Manivannan K
(2013) Fault gear categorization: a comparative study on fea-
ture classification using rough set theory and ID3. Int J Artif
Intell Appl Smart Devices 97:41–64. 12th Global Congress on
Manufacturing and Management (GCMM)-2014
35. Rajeswari C, Sathiyabhama B, Devendiran S, Manivannan K
(2014) A gear fault identification using wavelet transform, rough
set based GA, ANN and C4.5 algorithm. Procedia Eng 97:1831–
1841. 12th Global Congress on Manufacturing and Management
(GCMM)-2014
36. Raymer M, Punch W, Goodman E, Kuhn L, Jain A (2000) Dimen-
sionality reduction using genetic algorithms. IEEE Trans Evol
Comput 4(2):164–171
M. Cerrada et al.
37. Roman S (2001) Rough sets methods in feature reduction and
classification. Int J Appl Math Comput Sci 11:565–582
38. Sakthivel N, Sugumaran V, Nair BB (2010) Comparison of deci-
sion tree-fuzzy and rough set-fuzzy methods for fault categoriza-
tion of mono-block centrifugal pump. Mech Syst Signal Process
24(6):1887–1906
39. Verikas A, Gelzinis A, Bacauskiene M (2011) Mining data with
random forests: a survey and results of new tests. Pattern Recogn.
44(2):330–349
40. Wang S, Pedrycz W, Zhu Q, Zhu W (2015) Unsupervised fea-
ture selection via maximum projection and minimum redundancy.
Knowl-Based Syst 75:19–29
41. Witten IH, Frank E (2005) Data mining: practical machine learn-
ing tools and techniques. Morgan Kaufman, Boston
42. Yan R, Gao RX, Chen X (2014) Wavelets for fault diagnosis of
rotary machines: a review with applications. Signal Process 96:1–
15
43. Yang BS, Di X, Han T (2008) Random forests classifier for
machine fault diagnosis. J Mech Sci Technol 22(9):1716–1725
44. Yoon H, Park CS, Kim JS, Baek JG (2013) Algorithm learning
based neural network integrating feature selection and classifica-
tion. Expert Syst Appl 40(1):231–241
45. Zhu X, Zhang Y, Zhu Y (2012) Intelligent fault diagnosis of
rolling bearing based on kernel neighborhood rough sets and
statistical features. J Mech Sci Technol 26(9):2649–2657
46. Ziegler A, Knig IR (2013) Mining data with random forests: cur-
rent options for real-world applications. Wiley Interdiscip Rev
Data Min Knowl Discov 4(1):55–63
Mariela Cerrada (cerradam@ula.ve) received her Ph.D. degree in Automatic Systems in 2003 from the INSA Toulouse, France. She is currently a full-time titular Professor in the Department of Control Systems and an associate member of the Studies Center on Microcomputers and Distributed Systems (CEMISID) at the Engineering Faculty of the Universidad de Los Andes, Venezuela. She was a Prometeo Researcher at the Universidad Politécnica Salesiana, Ecuador. Her main research areas are fault diagnosis, supervision and intelligent control systems.
René-Vinicio Sánchez (rsanchezl@ups.edu.ec) received his B.S. in Mechanical Engineering in 2004 from the Universidad Politécnica Salesiana (UPS), Ecuador. He received a master's degree in management audit quality in 2008 from the UTPL, Ecuador, and a master's degree in industrial technologies research in 2012 from the UNED, Spain. Currently, he is a Professor in the Department of Mechanical Engineering at UPS. His research interests are in machinery health maintenance, pneumatic and hydraulic systems, artificial intelligence and engineering education.
Fannia Pacheco (fannikaro@gmail.com) received her M.Sc. degree in Computer Science from the Universidad de Los Andes, Venezuela, in 2015. She joined the GIDTEC research team at the Universidad Politécnica Salesiana (UPS), Ecuador. Her research interests cover novelty detection, data analysis and intelligent systems.
Diego Cabrera (dcabrera@ups.edu.ec) received his M.Sc. degree from the University of Sevilla in 2014 and is a Ph.D. candidate in Computer Science at the same university. Currently, he is a Professor with the Department of Mechanical Engineering at the Universidad Politécnica Salesiana (UPS), Ecuador, and a member of the GIDTEC research group at UPS. His research areas are machine learning, complex systems modelling and intelligent systems.
Grover Zurita (gzuritav@ups.edu.ec) received his Ph.D. degree from Luleå University of Technology, Sweden, in 2001. He was a Postdoctoral Fellow at the University of New South Wales, Australia, in 2002. Currently, he is a Professor at the Private University of Bolivia, and he was a Prometeo Researcher at the Universidad Politécnica Salesiana of Ecuador. His research interests are machine diagnosis, optimization and control of internal combustion engines.
Chuan Li (chuanli@21cn.com) received his Ph.D. degree from Chongqing University, China, in 2007. He has been successively a Postdoctoral Fellow with the University of Ottawa, Canada, a Research Professor with Korea University, South Korea, and a Senior Research Associate with the City University of Hong Kong, China. He is currently a Professor with the Chongqing Technology and Business University, China, and a Prometeo Researcher with the Universidad Politécnica Salesiana, Ecuador. His research interests include machinery health maintenance and intelligent systems.
... FS methods have been extensively employed, investigated, and developed in the field of PdM, including a wide range of industrial processes, e.g., machining [4], industrial components, e.g., machining tools [4], gears [17], bearings [18], as well as PdM tasks, e.g., diagnosis [6] and prognosis [19]. Regarding the evaluation of FS methods, the majority of these works only consider the predictive power of the FS, expressed as the predictive performance of the learning model whose input are the features selected by the FS method, e.g., accuracy or similar measures [4]. ...
... In [6], an FS scheme was proposed to reduce the detrimental effect of the data outliers on the accuracy of fault diagnosis. Other general FS techniques were also developed in the PdM area with the aim of improving the diagnosis accuracy, as in [17], [20]. ...
... To evaluate the effectiveness of the FS schemes proposed in the literature, their performance is usually compared with popular FS methods [4], [17], [27], [34], with other related works [6], [17], and/or with the case when no FS is performed (i.e., all the features are used) [6], [10], [27], [35], [36]. The major performance indicator used to evaluate a given FS or to compare different FSs is the predictive performance of the learning model, e.g., accuracy, which was built using the features selected by the corresponding FS method [4], [17], [18], [20], [27], [34]- [39]. ...
Preprint
Full-text available
Feature selection (FS) represents an essential step for many machine learning-based predictive maintenance (PdM) applications, including various industrial processes, components, and monitoring tasks. The selected features not only serve as inputs to the learning models, but also influence further decisions and analysis, e.g., sensor selection and the understandability of the PdM system. Hence, before deploying the PdM system, it is crucial to examine the reproducibility and robustness of the selected features under variations in the input data. This is particularly critical for real-world datasets with a low sample-to-dimension ratio (SDR). However, to the best of our knowledge, the stability of FS methods under data variations has not yet been considered in the field of PdM. This paper addresses this issue with an application to tool condition monitoring in milling, where classifiers based on support vector machines and random forests were employed. We used 5-fold cross-validation to evaluate three popular filter-based FS methods, namely Fisher score, minimum redundancy maximum relevance (mRMR), and ReliefF, in terms of both stability and macro-F1. Further, for each method, we investigated the impact of the homogeneous FS ensemble on both performance indicators. To gain broad insights, we used four (2:2) milling datasets generated from our experiments and NASA's repository, which differ in operating conditions, sensors, SDR, number of classes, etc. Among the findings: 1) Different FS methods can yield comparable macro-F1, yet considerably different FS stability values. 2) Fisher score (single and/or ensemble) is superior in most cases. 3) mRMR's stability is overall the lowest, the most variable over different settings (e.g., sensor(s), subset cardinality), and the one that benefits the most from the ensemble.
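As a rough sketch of the kind of stability evaluation described above, the following fragment scores features on each cross-validation fold and measures the agreement of the selected subsets. The ANOVA F-score stands in for the Fisher score, the Jaccard-based stability measure is one common choice rather than necessarily the authors' metric, and the data are synthetic.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.model_selection import StratifiedKFold

def top_k_features(X, y, k):
    # rank features by ANOVA F-score (a stand-in for the Fisher score)
    scores, _ = f_classif(X, y)
    return set(np.argsort(scores)[::-1][:k])

def mean_jaccard(subsets):
    # mean pairwise Jaccard similarity of the selected subsets, in [0, 1]
    return np.mean([len(a & b) / len(a | b) for a, b in combinations(subsets, 2)])

X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)            # synthetic stand-in data
subsets = [top_k_features(X[tr], y[tr], k=10)
           for tr, _ in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y)]
print("FS stability (mean Jaccard):", round(mean_jaccard(subsets), 3))
```

A stability value near 1 means the same features are chosen on every fold, which is the reproducibility property the abstract argues should be checked before deployment.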
... Within the condition monitoring methods are those based on models (Camps Echevarría et al.; Bernal de Lázaro et al. 2015, 2016; Pang et al. 2014; Sina et al. 2014). In the first approach, the use of models representing the operation of the processes is needed. ...
... It is possible to find many different kernel functions in the scientific literature, and the Gaussian kernel is one of the most popular. In general, the selection of a kernel depends on the application (Bernal de Lázaro et al. 2015, Motai 2015, Nayak et al. 2015, 2016). In this paper, several experiments were performed using various kernel functions such as the Gaussian kernel, the polynomial kernel and the hyper-tangent kernel. ...
... Table I shows the faults considered to evaluate the ... Another problem of the current fuzzy clustering method is related to the correct selection of its parameters, which is decisive in obtaining high performance. Nowadays, these issues of crucial importance are open problems in fault diagnosis applications and in other research fields (Wang et al. 2020a, Filho et al. 2015, 2016). The parameters Number of iterations, , m and were selected according to the experience in previous works (Rodríguez-Ramos et al. 2019, 2018a). ...
Article
Full-text available
Abstract In this paper, a robust approach to improve the performance of a condition monitoring process in industrial plants by using Pythagorean membership grades is presented. The FCM algorithm is modified by using Pythagorean fuzzy sets to obtain a new variant of it, called Pythagorean Fuzzy C-Means (PyFCM). In addition, a kernel version of PyFCM (KPyFCM) is obtained in order to achieve greater separability among classes and reduce classification errors. The proposed approach is validated using experimental datasets and the Tennessee Eastman (TE) process benchmark. The results are compared with those obtained with other algorithms that use standard and non-standard membership grades. The high performance obtained by the proposed approach indicates its feasibility.
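For orientation, the sketch below implements the standard fuzzy c-means baseline that PyFCM modifies; the Pythagorean membership grades and the kernel variant are not reproduced here, so this is only the starting point, with illustrative parameter values and synthetic data.

```python
import numpy as np

def fcm(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    # standard fuzzy c-means; PyFCM replaces these membership grades
    # with Pythagorean fuzzy memberships (not reproduced in this sketch)
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)                  # rows of U sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]          # weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        # u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            return U_new, centers
        U = U_new
    return U, centers

X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(5, 1, (50, 2))])  # two toy clusters
U, centers = fcm(X, n_clusters=2)
print(centers)                                         # approximate cluster centers
```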
... The selection of features is essential in choosing the most suitable subset of features for fault diagnosis. An excess in the quantity and quality of features affects the performance of the classifier and also leads to over-fitting [120,121]. Therefore, the selection of features is a necessary task to select the most suitable feature subsets for the machine learning algorithm. ...
Thesis
Gearboxes offer a wide range of applications in Industry 4.0 due to their versatility in motion and power transmission. Any transmission error or sudden failure within the gearbox increases the noise and vibration level of the whole system. It may lead to fatal harm, industrial breakdown, and massive economic loss. That is why it is a dominant area of research in industry. In the past couple of decades, condition monitoring and fault diagnosis of the gearbox have been explored and developed. Based on the literature survey, two types of approaches are frequently applied in condition monitoring and fault diagnostics: (1) a data-driven approach; and (2) a physical model-based approach. However, there are some major limitations in these models: (1) large dataset requirements, high sampling frequency selection, and use of parametric filters in data-driven models; (2) incorporation of a basic time-varying mesh stiffness (TVMS) model in the existing electro-mechanical (EM) and dynamic models, without considering the effect of misalignment between the base and root circles, an accurate transition curve, the exclusion of non-linear Hertzian contact stiffness, and a revised fillet foundation stiffness considering the influence of structural coupling due to the nearby loaded tooth; (3) incorporation of a basic gear fault model in the system; (4) the requirement of an accurate model for industrial gearboxes, i.e., carburized gear tooth modeling; and (5) a separate modeling approach for coupled EM motor-gearbox systems. These limitations need to be addressed to develop a robust model that produces reliable, quick, and automated results for condition monitoring and fault diagnosis problems. In this research work, these limitations are addressed, and results show that the proposed models (both data-driven and physical) are more realistic, accurate, and better than conventional ones, and successfully depict the faults in the system.
... However, these methods usually produce synthetic properties greater than the original set, so the reduced set properties are not of physical importance [37,40]. Fisher scores, the ReliefF algorithm, Wilcoxon ranks, gain ratios, memetic feature selection, chi-square, and IG are used to select relevant characteristics and improve precision in the diagnosis of mechanical failures [41][42][43]. ...
Article
Full-text available
Wind turbines generate clean and renewable energy for the international market. The most important aspect of wind turbine maintenance is reducing failures, downtime, and operating and maintenance expenses. This study aims to detect multiple faults exhibited by wind turbine blades; failures such as cracks (tip crack, mid-span crack, and crack near the root) were observed in the blades at different locations. The research suggests a new approach incorporating vibration signals and machine learning techniques to identify various failures in wind turbine blades. Feature-ranking techniques such as the ReliefF algorithm, chi-square, and information gain were adopted within a method framework to diagnose several problems in wind turbine blades, such as cracks in different locations. The k-nearest neighbors (KNN), support vector machine, and random forest classifiers are used to classify data based on the measured vibration signals. Eight main time-domain features are calculated from the vibration signals. The proposed methodology was validated using four databases. The results showed good classification accuracy in the four databases, with at least three non-conventional features in each database's top nine features for the three classification techniques. The results also showed that when the ReliefF selection algorithm is applied with the KNN classification algorithm, it generates the highest classification accuracy under all failure conditions, with a value of 97%. Finally, the performance of the proposed classification model is compared with other machine learning classification models, and a promising result is obtained.
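The abstract does not list its eight time-domain features; the sketch below computes a typical set of such condition indicators for a vibration segment, so the particular features chosen here are illustrative rather than the study's exact set.

```python
import numpy as np
from scipy import stats

def time_domain_features(x):
    # common time-domain condition indicators for a vibration segment
    rms = np.sqrt(np.mean(x ** 2))
    peak = np.max(np.abs(x))
    return {
        "mean": np.mean(x),
        "std": np.std(x),
        "rms": rms,
        "peak": peak,
        "skewness": stats.skew(x),
        "kurtosis": stats.kurtosis(x),
        "crest_factor": peak / rms,                  # peak over RMS
        "impulse_factor": peak / np.mean(np.abs(x)),
    }

segment = np.random.default_rng(0).normal(size=4096)  # stand-in vibration segment
print(time_domain_features(segment))
```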
... Bilevel optimization (BLOP) is an important research area of mathematical programming [12]. It has emerged as an important field for progress in handling many real-life problems in different domains, such as classification and machine learning [13,14]. The BLOP is a hierarchy of two optimization tasks (upper-level or leader and lower-level or follower problems). ...
Article
Full-text available
In multi-label classification, each instance can be assigned multiple labels at the same time. In such a situation, the relationships between labels and the class imbalance are two serious issues that should be addressed. Despite the important number of existing multi-label classification methods, the widespread class imbalance among labels has not been adequately addressed. Two main issues should be solved to come up with an effective classifier for imbalanced multi-label data. On the one hand, the imbalance could occur between labels and/or within a label. The "between-labels imbalance" occurs where the imbalance is between labels, whereas the "within-label imbalance" occurs where the imbalance is in the label itself, and it could occur across multiple labels. On the other hand, the labels' processing order heavily influences the quality of a multi-label classifier. To deal with these challenges, we propose in this paper a bi-level evolutionary approach for the optimized induction of multivariate decision trees, where the upper-level role is to design the classifiers while the lower level approximates the optimal labels' ordering for each classifier. Our proposed method, named BIMLC-GA (Bi-level Imbalanced Multi-Label Classification Genetic Algorithm), is compared to several state-of-the-art methods across a variety of imbalanced multi-label datasets from several application fields and then applied to the miRNA-related diseases case study. The statistical analysis of the obtained results shows the merits of our proposal.
Article
Feature selection (FS) is recognized for its role in enhancing the performance of learning algorithms, especially for high-dimensional datasets. In recent times, FS has been framed as a multi-objective optimization problem, leading to the application of various multi-objective evolutionary algorithms (MOEAs) to address it. However, the solution space expands exponentially with the dataset's dimensionality. Simultaneously, the extensive search space often results in numerous local optimal solutions due to a large proportion of unrelated and redundant features. Consequently, existing MOEAs struggle with local optima stagnation, particularly in large-scale multi-objective FS problems (LSMOFSPs). Different LSMOFSPs generally exhibit unique characteristics, yet most existing MOEAs rely on a single candidate solution generation strategy (CSGS), which may be less efficient for diverse LSMOFSPs. Moreover, selecting an appropriate MOEA and determining its corresponding parameter values for a specified LSMOFSP is time-consuming. To address these challenges, a multi-objective self-adaptive particle swarm optimization (MOSaPSO) algorithm is proposed, combined with a rapid non-dominated sorting approach. MOSaPSO employs a self-adaptive mechanism, along with five modified efficient CSGSs, to generate new solutions. Experiments were conducted on ten datasets, and the results demonstrate that the number of features is effectively reduced by MOSaPSO while lowering the classification error rate. Furthermore, superior performance is observed in comparison to its counterparts on both the training and test sets, with advantages becoming increasingly evident as the dimensionality increases.
Article
A spectrum-image based representation of machine vibration signals with a deep convolutional neural network is proposed for machine fault classification, in which the convolution layers are used for automatic feature extraction as an alternative to conventional feature-based methods. Two different forms of spectrum representation are proposed: one based on the short-time Fourier transform of the original signals, and the other based on the short-time Fourier transform of the intrinsic mode functions acquired by empirical mode decomposition. Empirical mode decomposition has its own merits in discriminating non-stationary signals, and the novelty of the work is to use the short-time Fourier transform of intrinsic mode functions with a deep convolutional neural network model. The classification and validation accuracy of the model are investigated with respect to epochs. It is demonstrated that both spectrum-based techniques perform well, with 100% model accuracy in a numerical experiment of binary classification on a bearing dataset that comprises normal and faulty signals. In another experiment using a milling dataset, the representation based on the short-time Fourier transform of intrinsic mode functions performs better, with 100% training accuracy and an F1 score of 0.8933, which is better than using the short-time Fourier transform of raw signals, whose training accuracy is 64% with an F1 score of 0.7486. The numerical study shows that the empirical mode decomposition based spectrum representation delivers the highest accuracy in the learning model, obviating the necessity for independent feature extraction, feature selection, and dimension reduction. The numerical experiment is extended using empirical mode decomposition based spectrums for multiple class classification problems on the bearing dataset. The confusion matrix obtained for 10 classes shows that the validation accuracy is 100% for all classes.
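As a sketch of the spectrum-image idea under stated assumptions (the sampling rate, window length, and overlap here are arbitrary), the fragment below turns a signal into a normalized |STFT| matrix suitable as CNN input; in the EMD-based variant the same transform would be applied to each intrinsic mode function (e.g., as produced by a package such as PyEMD) instead of the raw signal.

```python
import numpy as np
from scipy.signal import stft

fs = 12_000                                    # assumed sampling rate, Hz
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 30 * t) + 0.3 * np.random.default_rng(0).normal(size=fs)

# |STFT| as a 2-D "image"; in the EMD variant, apply this per IMF instead of x
f, frames, Z = stft(x, fs=fs, nperseg=256, noverlap=192)
image = np.abs(Z)                              # shape: (freq_bins, time_frames)
image = (image - image.min()) / (image.max() - image.min())   # scale to [0, 1]
print(image.shape)                             # e.g., input size for the CNN
```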
Article
Full-text available
There are growing demands for condition-based monitoring of gearboxes, and therefore new methods to improve the reliability, effectiveness and accuracy of gear fault detection ought to be evaluated. Feature selection is still an important aspect of machine learning-based diagnosis in order to reach good performance of the diagnostic models. On the other hand, random forest classifiers are suitable models in industrial environments where large data samples are not usually available for training such diagnostic models. The main aim of this research is to build a robust system for multi-class fault diagnosis in spur gears by selecting the best set of condition parameters in the time, frequency and time–frequency domains, which are extracted from vibration signals. The diagnostic system is built using genetic algorithms and a random forest classifier in a supervised environment. The original set of condition parameters is reduced by around 66% of its initial size by using genetic algorithms, while still achieving an acceptable classification precision above 97%. The approach is tested on real vibration signals by considering several fault classes, one of them being an incipient fault, under different running conditions of load and velocity.
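A minimal sketch of such a GA wrapper scheme follows, assuming synthetic data and deliberately simple operators (truncation selection, one-point crossover, bit-flip mutation) rather than the configuration used by the authors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)     # stand-in for condition parameters

def fitness(mask):
    # wrapper fitness: cross-validated accuracy of an RF on the selected columns
    if not mask.any():
        return 0.0
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.random((20, X.shape[1])) < 0.5       # random bit-mask population
for gen in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]    # truncation selection
    cuts = rng.integers(1, X.shape[1], size=10)
    children = np.array([np.concatenate([parents[i][:c], parents[(i + 1) % 10][c:]])
                         for i, c in enumerate(cuts)])   # one-point crossover
    children ^= rng.random(children.shape) < 0.02        # bit-flip mutation
    pop = np.vstack([parents, children])
best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))
```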
Article
Full-text available
This paper addresses the development of a random forest classifier for multi-class fault diagnosis in spur gearboxes. The vibration signal's condition parameters are first extracted by applying the wavelet packet decomposition with multiple mother wavelets, and the coefficients' energy content for the terminal nodes is used as the input feature for the classification problem. Then, a study through the parameter space is performed to find the best values for the number of trees and the number of random features. In this way, the best set of mother wavelets for the application is identified, and the best features are selected through the internal ranking of the random forest classifier. The results show that the proposed method reached 98.68% classification accuracy, with high efficiency and robustness of the models.
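A minimal sketch of the terminal-node energy features with PyWavelets follows; the decomposition level and mother wavelet are assumptions here, whereas the study searched over several mother wavelets.

```python
import numpy as np
import pywt

# energy of each terminal node of a wavelet packet decomposition,
# used as a feature vector for the classifier
x = np.random.default_rng(0).normal(size=4096)     # stand-in vibration signal
wp = pywt.WaveletPacket(data=x, wavelet="db4", mode="symmetric", maxlevel=3)
nodes = wp.get_level(3, order="natural")           # 2**3 = 8 terminal nodes
energies = np.array([np.sum(node.data ** 2) for node in nodes])
features = energies / energies.sum()               # relative energy per node
print(dict(zip([n.path for n in nodes], np.round(features, 4))))
```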
Article
Full-text available
There are growing demands for condition-based monitoring of gearboxes, and techniques to improve the reliability, effectiveness and accuracy of fault diagnosis are considered valuable contributions. Feature selection is still an important aspect of machine learning-based diagnosis in order to reach good performance in the diagnosis system. The main aim of this research is to propose a multi-stage feature selection mechanism for selecting the best set of condition parameters in the time, frequency and time-frequency domains, which are extracted from vibration signals for fault diagnosis purposes in gearboxes. The selection is based on genetic algorithms, proposing in each stage a new subset of the best features regarding the classifier performance in a supervised environment. The selected features are augmented at each stage and used as input for a neural network classifier in the next step, while a new subset of feature candidates is treated by the selection process. As a result, the inherent exploration and exploitation of the genetic algorithms for finding the best solutions of the selection problem are locally focused. The approach is tested on a dataset from a real test bed with several fault classes under different running conditions of load and velocity. The model performance for diagnosis is over 98%.
Article
This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines in selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.
Article
Fault diagnosis of a gearbox is a difficult problem due to the non-stationary vibration signals it generates. Usually, one method of fault diagnosis can only inspect one corresponding fault category. Vibration-based condition monitoring using machine learning methods is gaining momentum. In this paper, rough sets theory is used to diagnose gear faults in a gearbox. Through the analysis of the final reducts generated using rough sets theory, it is shown that this method is effective for diagnosing more than one type of fault in a gear. The performance of the rough set method is compared with that of the ID3 decision tree algorithm, and the results prove that the rough set method has greater capability to bring out the different fault conditions of the gearbox under investigation. The study reveals that the overall classification efficiency of the decision tree is to some extent better than the classification efficiency of the rough sets method.
Article
The paper presents an application of rough sets and statistical methods to feature reduction and pattern recognition. The presented description of rough sets theory emphasizes the role of rough set reducts in feature selection and data reduction in pattern recognition. The overview of methods of feature selection emphasizes feature selection criteria, including rough set-based methods. The paper also contains a description of the algorithm for feature selection and reduction based on the rough sets method proposed jointly with Principal Component Analysis. Finally, the paper presents numerical results of face recognition experiments using the learning vector quantization neural network, with feature selection based on the proposed principal components analysis and rough sets methods.
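Since attribute dependency from rough set theory underlies the dissimilarity measure of the present paper, a small sketch may help: the relative attribute dependency of a decision attribute on a subset of condition attributes can be computed as the ratio of distinct value tuples in two projections of the decision table. The toy table below is hypothetical.

```python
import pandas as pd

def relative_dependency(df, attrs, decision):
    # relative dependency of `decision` on the attribute subset `attrs`:
    # |distinct rows of df[attrs]| / |distinct rows of df[attrs + decision]|;
    # a value of 1 means `attrs` fully determines the decision
    proj_p = df[list(attrs)].drop_duplicates().shape[0]
    proj_pd = df[list(attrs) + [decision]].drop_duplicates().shape[0]
    return proj_p / proj_pd

df = pd.DataFrame({                 # tiny illustrative decision table
    "a1": [0, 0, 1, 1, 2, 2],
    "a2": [0, 1, 0, 1, 0, 1],
    "d":  [0, 0, 1, 1, 1, 1],
})
print(relative_dependency(df, ["a1"], "d"))   # 1.0: a1 alone determines d
print(relative_dependency(df, ["a2"], "d"))   # 0.5: a2 alone does not
```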
Article
Gearboxes are crucial transmission components in mechanical systems. Fault diagnosis is an important tool to maintain gearboxes in healthy conditions. It is challenging to recognize fault existence and, if any, failure patterns in such transmission elements due to their complicated configurations. This paper addresses a multimodal deep support vector classification (MDSVC) approach, which employs separation-fusion based deep learning in order to perform fault diagnosis tasks for gearboxes. Considering that different modalities can be made to describe the same object, multimodal homologous features of the gearbox vibration measurements are first separated into time, frequency and wavelet modalities, respectively. A Gaussian-Bernoulli deep Boltzmann machine (GDBM) without final output is subsequently suggested to learn pattern representations for features in each modality. A support vector classifier is finally applied to fuse the GDBMs in different modalities towards the construction of the MDSVC model. With the present model, "deep" representations from "wide" modalities improve fault diagnosis capabilities. Fault diagnosis experiments were carried out to evaluate the proposed method on both spur and helical gearboxes. The proposed model achieves the best fault classification rate in experiments when compared to representative deep and shallow learning methods. Results indicate that the proposed separation-fusion based deep learning strategy is effective for gearbox fault diagnosis.
Article
Defective bearing signatures can be detected by resonance demodulation of the vibration signals. The decision of the bearing fault detection largely depends on the quality of the identified resonant frequency band. Two key issues in locating the resonance frequency band are the proper segmentation of the frequency spectrum of interest and the criterion used to guide the search for the resonance band. To deal with these two issues, this paper proposes a criterion fusion approach to guide the spectral segmentation process. With the proposed approach, the frequency spectrum of the bearing signal is first divided into initial fine segments, which are then adaptively merged into different subsets using an enhanced bottom-up segmentation technique. To guide the spectral segmentation and merging process, three commonly used criteria, i.e., kurtosis, smoothness index and crest factor, are fused into a synthesized cost function using an entropy-based method. The final frequency band delivered by this approach has good coverage of the resonant band and is then used to demodulate the bearing signals. Both simulated and experimental signals have been employed to evaluate the proposed approach, which has also been compared to single-criterion methods. The comparison indicates that the fused criterion yields better results than those from the single criteria.
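To make the three criteria concrete, the sketch below computes kurtosis, crest factor, and a smoothness index for the envelope of one candidate band. The band edges and filter are arbitrary, the smoothness index is taken as the geometric-to-arithmetic mean ratio of the envelope (one common definition), and the entropy-based fusion itself is not reproduced.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert
from scipy.stats import gmean, kurtosis

def band_criteria(x, fs, lo, hi):
    # band-limit the signal, then score the band with the three criteria
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    xb = filtfilt(b, a, x)
    env = np.abs(hilbert(xb))                  # envelope of the band signal
    return {
        "kurtosis": kurtosis(xb),
        "crest_factor": np.max(np.abs(xb)) / np.sqrt(np.mean(xb ** 2)),
        "smoothness": gmean(env) / np.mean(env),   # one common definition
    }

fs = 20_000                                    # assumed sampling rate, Hz
x = np.random.default_rng(0).normal(size=fs)   # stand-in bearing signal
print(band_criteria(x, fs, lo=2_000, hi=4_000))
```

Scoring each candidate band this way and combining the normalized criteria, for instance with entropy-derived weights, reproduces the spirit of the fusion step described above.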