Appl Intell
DOI 10.1007/s10489-015-0725-3

Hierarchical feature selection based on relative dependency for gear fault diagnosis

Mariela Cerrada 1,2 · René-Vinicio Sánchez 2 · Fannia Pacheco 2 · Diego Cabrera 2 · Grover Zurita 2 · Chuan Li 2,3

© Springer Science+Business Media New York 2015
Abstract Feature selection is an important aspect under study in machine learning based diagnosis; it aims to remove irrelevant features so that the diagnostic system reaches good performance. The behaviour of diagnostic models can be sensitive to the number of features, and a set of significant features can represent the problem better than the entire set. Consequently, algorithms that identify these features are valuable contributions. This work deals with the feature selection problem through attribute clustering. The proposed algorithm is inspired by existing approaches, where the relative dependency between attributes is used to calculate dissimilarity values. The centroids of the created clusters are selected as representative attributes. The selection algorithm uses a random process for proposing centroid candidates; in this way, the exploration inherent in random search is included. A hierarchical procedure is proposed for implementing this algorithm. In each level of the hierarchy, the entire set of available attributes is split into disjoint sets and the selection process is applied on each subset. Once the significant attributes are proposed for each subset, a new set of available attributes is created and the selection process runs again in the next level. The hierarchical implementation aims to refine the search space in each level on a reduced set of selected attributes, while the computational time consumption is also improved. The approach is tested with real data collected from a test bed; results show that the diagnosis precision by using a Random Forest based classifier is over 98 % with only 12 % of the attributes from the available set.

Mariela Cerrada
cerradam@ula.ve

1 Control Systems Department, Universidad de Los Andes, Mérida, Venezuela
2 Mechanical Engineering Department, Universidad Politécnica Salesiana, Cuenca, Ecuador
3 Research Center of System Health Maintenance, Chongqing Technology and Business University, Chongqing, China
Keywords Feature selection · Attribute clustering · Rough sets · Relative dependency · Gear fault diagnosis
1 Introduction
Most industrial environments demand the continuous operation of rotating machinery. Diagnostic systems are designed to support this requirement and increase the availability of the industrial processes. Contributions to build fault diagnostic systems with accuracy, reliability and adequate computational complexity are highly valuable. In machine learning applications for data-driven diagnosis of rotating machinery, the inputs of the classifier are frequently condition parameters that are extracted from signals such as vibration and noise, among others; these parameters are defined on the time, frequency and time-frequency domains, in order to enhance the data that are processed by the diagnostic algorithms. Taking into account the number of available parameters as feature candidates for fault diagnosis, there are two ways to address the diagnoser design: one is to develop more complex classifiers that can deal with the high dimensionality of the feature vector, and the second is to deal with dimensionality reduction through feature selection. In feature selection, the problem is the identification of the significant features that allow keeping the classification precision. Furthermore, feature selection can help discover relevant features for a particular application.
Feature selection in fault diagnosis of rotating machinery is still an open research field. Each failure mode can affect the condition parameters in a different way and, on the other hand, in the case of incipient failures, the best condition parameters providing good diagnostic information are not clearly identifiable. In most cases, feature selection has been treated as a dimensionality reduction problem where several techniques are used, such as Principal Component Analysis, Multidimensional Scaling, Factor Analysis, Projection Pursuit, Kernel Fisher Discriminant Analysis, Data Fusion, and other linear and non-linear techniques [18, 19, 23, 25]. However, these types of techniques commonly create artificial features with lower dimension than the original set; consequently, these reduced features lack physical meaning. Some efforts have been made in [1] to develop dimensionality reduction techniques that find the best subset of the original features, by using multivariate linear regression and variable shrinkage.
In machine learning based fault diagnosis, feature selection aims to remove irrelevant features; wrapper, filtering or embedded methods can be used for this purpose [8]. The computational burden of the selection algorithm may be high for analysing the entire feature space in wrapper approaches. This is the case when classic Genetic Algorithms (GA) are used for dimensionality reduction [36]. However, efforts for integrating some techniques into wrapper approaches are described in [22], and more recently in [2, 6, 44]. In filtering approaches, the data characteristics are analysed to rank the attributes by means of certain criteria; this is a data treatment prior to the learning phase. Several works using filtering approaches have been reported, e.g., in [40], where a matrix factorization formalism is used. The work in [24] proposes a measure based on cosine similarity to estimate within-class and between-class separability; thereafter, it uses a sequential backward search strategy for ranking the features. In [21], a global geometric model and a similarity measure are combined to filter disjoint feature subsets. The similarity between the subsets and the original set is evaluated, and the subset with the best similarity is selected.
In particular, this work is interested in Rough Set (RS) theory based approaches. The main idea behind the use of this theory is to find equivalence classes based on the concepts of indiscernibility, reduct, lower and upper approximations, and dependency degree [9, 16, 37]. Equivalence classes are used in Qin et al. [33] for attribute clustering; the authors also use soft set theory to build a soft model that is defined on equivalence classes, and this model is applied to obtain approximate sets. Qin et al. [33] propose the computation of the dependency degree through the cardinality of the lower approximation for each attribute; this is obtained after processing the tabular representation of the soft sets, for all attributes. The attributes with high dependency degrees define attribute clusters.
In [26], an unsupervised fuzzy-rough feature selection algorithm is presented, based on the concept of dependency between attributes. The dependency degree is determined by a fuzzy-rough measure or a fuzzy discernibility, in order to eliminate irrelevant features. This is achieved by substituting the decision feature of the supervised approach by the features that are being evaluated; the decision feature is the class in classification problems. The procedure aims to find the set of attributes $Q$ that depends on a set of attributes $P$, that is, whether all the values of the attributes in $Q$ are determined by the values of the attributes in $P$. The algorithm performs an exhaustive search because a measure denoted as $M(R, a)$ must be calculated for each attribute $a \in A$, where $R = A - \{a\}$ and $A$ is the entire set of attributes. The measure $M(R, a)$ can be either a measure of the boundary region, or of the discernibility [26].
The work in [10] proposes a Granular Neural Network (GNN) based algorithm for feature selection which uses Fuzzy Rough Sets (FRS) to calculate the initial weights of the network; a granulation structure is needed to build groups of patterns by using similarity measures from fuzzy implications. The input and target vectors for the GNN are obtained from the mean matrix associated with each group. FRS concepts are used to calculate the lower approximation and the positive region, and thereafter the dependency degree of each attribute is computed with regard to the decision attribute. Finally, a three-layer architecture is used to perform the feature selection by minimizing a feature evaluation index with respect to the connection weights between the nodes in the hidden layer and the output layer.
The approach in [9] proposes the Expectation-Maximization (EM) algorithm to create attribute clusters; lower and upper approximations are calculated for each cluster to determine the dependency degree. Then, the classic Quick Reduct (QR) algorithm is executed. In [15], an algorithm using Rough Set based Particle Swarm Optimization Quick Reduct (RS-PSO-QR) is proposed to deal with the problem of the exhaustive search for the attributes with the best dependency degree.
RS theory has been used for feature selection in a few recent works on fault diagnosis of rotating machinery. In [38], decision trees and RS-based methods are compared as feature selectors; however, these methods have been applied in a supervised environment by taking into account the decision variables. In the case of decision tree methods, the variables in each node are selected as the important variables; in RS methods, feature selection is performed by constructing reducts from the discernibility matrix. A reduct is the minimal set of attributes for which the knowledge regarding the decision variable is retained. A similar approach is presented in [34] for fault diagnosis of gearboxes. An RS-based GA approach is used in [35] for selecting the best input features to reduce the computational burden in gear fault identification; the QR algorithm is implemented with GA in order to quickly determine the ranking of the attributes based on the high dependency degree. In [45], the concept of Kernel Neighbourhood Rough Sets (KNRS) is used to deal directly with numerical data. A kernel method is combined with neighbourhood rough sets to map the fault data of rolling bearings to a high-dimensional feature space. The radius of the hypersphere in the feature space is the neighbour value that is used for calculating lower and upper approximations of the decision variable, with respect to the set of attributes.
The common characteristic of the previous works is that they operate in a wrapper or supervised environment; thus, feature selection is strongly linked with the decision variables. This is not the case in real applications for fault diagnosis, where not all the fault labels are known. The literature review has shown that there are few recent works using RS theory in an unsupervised environment. Besides the work in [26], the work in [14] proposes an unsupervised algorithm to create attribute clusters and select the best representative attribute.
In this work, feature selection is developed from the unsupervised algorithm in [14], due to its ability to find representative features from 'reducts' that define attribute clusters. In our approach, the algorithm in [14] has been slightly modified to include a random search on the feature space, and a hierarchical approach is proposed for implementing the algorithm with low computational time consumption. In each level of the hierarchy, the entire set of available features, or attributes, is split into disjoint subsets and the selection algorithm is applied on each subset. After selecting a set of representative attributes for each subset, a new aggregate set of attributes is created and the selection process is performed again in the next level on a reduced attribute space. This decomposition is repeated until reaching the hierarchy level where a reduced set of feature candidates is obtained, according to the stop criteria. Besides reducing the time consumption, random selection and hierarchical decomposition both help to improve the exhaustive search process of the original algorithm.
Random Forest (RF) and Support Vector Machine (SVM) based classifiers are used as diagnostic models to show the performance of our selection algorithm. The rationale for using these techniques is their versatility in real environments. RF has good performance in classification tasks; the algorithm leads to interpretable models and works well with small sets of samples and large numbers of attributes. SVM has been shown to be a good classifier with low computational burden in the training and testing phases, even for on-line operation. Both techniques are available in several computational environments, and their use has been studied recently for fault diagnosis of rotating machinery in [3, 5, 7, 11, 17, 31, 43], among other works.
The performance of feature selection with our hierarchical approach is compared with the results from the one-step run of the original selection algorithm, for fault diagnosis in gears. Our results are also compared with other feature selection techniques, such as Non-negative Matrix Factorization and the entropy-based ranking of variable importance provided by the RF algorithm.

This paper is organized as follows. Section 2 presents the theoretical background that supports the proposed approach. Section 3 details the experimental procedure to collect the data, and the feature extraction process from vibration signals. Section 4 develops our approach for hierarchical feature selection based on relative dependency. Section 5 analyses the performance of our approach for fault diagnosis in gears and compares it with other feature selection techniques and classifiers. Finally, Section 6 presents the conclusions and future works.
2 Theoretical background
This section briefly presents the conceptual foundations of the techniques that are used in this work. We focus on attribute clustering as a feature selection method in Section 2.1, and on support vector machine and tree-based classifiers in Sections 2.2 and 2.3, respectively. More details of this basic review can be found in the references.
2.1 Relative dependency based attribute clustering
RS-based attribute clustering has recently been used for feature selection. The main objective is to find approximate reducts, that is, subsets of features from an information system that are able to keep the original indiscernibility. Formally, the indiscernibility is defined as follows [32]:
Definition 1 Let $I = (U, A)$ be an information system, where $U = \{x_1, x_2, \ldots, x_n\}$ is a finite set of objects (samples) and $A$ is a finite set of attributes (features). For any object $x_i \in U$, $f_a(x_i)$ denotes the value of the attribute $a \in A$ in the object $x_i$. The indiscernibility $IND$ of a subset $B \subseteq A$ is given in (1):

$$IND(B) = \{(x, y) \in U \times U \mid \forall a \in B, \; f_a(x) = f_a(y)\} \quad (1)$$

If $IND(A) = IND(B)$ then $B$ is a reduct of $A$. $B$ is a minimal reduct if $IND(B) = IND(A)$ and $IND(B') \neq IND(A)$, $\forall B' \subset B$, are satisfied.
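As a hedged illustration (not from the original paper), the equivalence classes induced by $IND(B)$ can be computed by grouping objects with identical rows on the attributes of $B$; the function name below is chosen for this example only.

```python
import numpy as np

def indiscernibility_classes(U: np.ndarray, B: list) -> np.ndarray:
    """Label each object so that objects sharing the same values on every
    attribute in B (i.e., pairs (x, y) in IND(B)) receive the same label."""
    _, labels = np.unique(U[:, B], axis=0, return_inverse=True)
    return labels

# Toy information system: 3 objects, 3 categorical attributes
U = np.array([[0, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])
print(indiscernibility_classes(U, [0, 1]))  # [0 0 1]: objects 0 and 1 merge
```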
An algorithm for feature selection has been proposed in [14] that aims to improve the process of finding approximate reducts, based on the relative dependency concept defined in [12]. The relative dependency between two attributes is a similarity measure, and it is defined as follows [14]:
Definition 2 Let $A_i, A_j$ be two attributes; the relative dependency degree $Dep(A_i, A_j)$ is defined in (2):

$$Dep(A_i, A_j) = \frac{|\Pi_{A_i}(U)|}{|\Pi_{(A_i, A_j)}(U)|} \quad (2)$$

where $\Pi_{A_i}(U)$ is the projection of $U$ on $A_i$, and $\Pi_{(A_i, A_j)}(U)$ is the projection of $U$ on $(A_i, A_j)$.

The projection $\Pi_B(U)$ is computed in two steps: (i) define the set $C = A - B$; (ii) merge the objects (samples) that are indiscernible for all the attributes $c_i \in C$, according to (1).

The equality $Dep(A_i, A_j) = Dep(A_j, A_i)$ is not always true; consequently, the average is proposed to represent the similarity between two attributes. Finally, the dissimilarity measure $d$ between two attributes is stated in (3):

$$d(A_i, A_j) = 1 - \mathrm{Avg}\big(Dep(A_i, A_j),\, Dep(A_j, A_i)\big) \quad (3)$$
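To make (2) and (3) concrete, the following sketch computes both measures for categorical attributes stored as columns of a pandas DataFrame. Here the projection $\Pi_B(U)$ is read as the set of distinct value tuples on $B$; all function names are illustrative assumptions, not the authors' code.

```python
import pandas as pd

def projection_size(df: pd.DataFrame, cols: list) -> int:
    """|Pi_B(U)|: the number of distinct value tuples on the attributes in cols."""
    return len(df[cols].drop_duplicates())

def dep(df: pd.DataFrame, ai: str, aj: str) -> float:
    """Relative dependency degree Dep(Ai, Aj) as in (2)."""
    return projection_size(df, [ai]) / projection_size(df, [ai, aj])

def dissimilarity(df: pd.DataFrame, ai: str, aj: str) -> float:
    """Dissimilarity d(Ai, Aj) as in (3): one minus the two-way average."""
    return 1.0 - 0.5 * (dep(df, ai, aj) + dep(df, aj, ai))

# Toy information system: 4 samples, 2 categorical attributes
df = pd.DataFrame({"A1": ["a", "a", "b", "b"], "A2": ["x", "y", "x", "x"]})
print(dissimilarity(df, "A1", "A2"))  # 1 - avg(2/3, 2/3) = 1/3
```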
The work in [14] creates clusters of attributes by using the dissimilarity measure defined in (3). In this context, the centroid of a cluster is the attribute that represents similar attributes in a region, and this centroid is selected as the most representative attribute. The clustering strategy is called Most Neighbours First (MNF), and it is assumed that the algorithm can cluster the attributes into a defined number of clusters; a k-medoids approach is used with a different search strategy to find the centroids.
Let $I = (U, A)$ be an information system and $K$ a number of desired clusters; the MNF algorithm is summarized in the following steps [14]:

Step 1: Randomly select $K$ attributes $A_t$ as the initial representative attributes; $A^c_t \in A^c \subseteq A$ denotes the center of the $t$-th cluster $C_t$, $t = 1, \ldots, K$.

Step 2: Compute the dissimilarity $d(A_i, A^c_t)$ as in (3), for each non-representative attribute $A_i$, $A_i \in A - A^c$, $i = 1, \ldots, n_t$, $t = 1, \ldots, K$.

Step 3: Allocate all non-center attributes $A_i$ to their nearest center according to the distances in Step 2.

Step 4: Calculate the distances $d(A_{t,i}, A_{t,j})$, $i \neq j$, $\forall (A_{t,i}, A_{t,j}) \in C_t$.

Step 5: Calculate the radius $r_t$ of each cluster $C_t$ as in (4):

$$r_t = \frac{\sum_{i \neq j} d(A_{t,i}, A_{t,j})}{C^{n_t}_2} \quad (4)$$

where $C^{n_t}_2 = \frac{n_t(n_t - 1)}{2}$ is the number of attribute pairs in $C_t$.

Step 6: Find the set $Near(A_{t,m})$ as proposed in (5):

$$Near(A_{t,m}) = \{A_{t,i} \mid A_{t,i} \in C_t, \; d(A_{t,m}, A_{t,i}) \leq r_t\} \quad (5)$$

Step 7: Find the attribute $A_{t,l}$, $\forall C_t$, such that $|Near(A_{t,l})| > |Near(A_{t,j})|$, $\forall j \neq l$, and set the attribute $A_{t,l}$ as the new center $A^c_t$.

Step 8: Repeat Steps 2–7 until the stop criterion is met.

Step 9: Select the attributes $A^c_t$, $\forall C_t$, as the most representative attributes.
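The loop below is a condensed, hedged sketch of Steps 1–9 for a precomputed dissimilarity matrix D built with (3). Tie-breaking and degenerate clusters are handled naively here (Section 4 modifies exactly these points), and the function name is illustrative.

```python
import numpy as np

def mnf(D: np.ndarray, K: int, n_iter: int = 20, seed: int = 0) -> list:
    """MNF attribute clustering over an n x n dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    centers = list(rng.choice(n, size=K, replace=False))    # Step 1
    for _ in range(n_iter):                                  # Step 8
        # Steps 2-3: allocate each non-center attribute to its nearest center
        members = {c: [c] for c in centers}
        for a in range(n):
            if a not in centers:
                nearest = min(centers, key=lambda c: D[a, c])
                members[nearest].append(a)
        new_centers = []
        for c, attrs in members.items():
            if len(attrs) < 2:          # degenerate cluster: keep old center
                new_centers.append(c)
                continue
            # Steps 4-5: radius = mean pairwise distance over C(nt, 2) pairs
            pairs = [(i, j) for i in attrs for j in attrs if i < j]
            r = sum(D[i, j] for i, j in pairs) / len(pairs)
            # Steps 6-7: new center = attribute with most neighbours within r
            near = {m: sum(D[m, i] <= r for i in attrs if i != m) for m in attrs}
            new_centers.append(max(near, key=near.get))
        centers = new_centers
    return centers                                           # Step 9
```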
The number of iterations is usually the stop criterion. On the other hand, this algorithm only works on categorical data; therefore, a domain discretization is needed to implement the algorithm when numerical data are used. Examples with low-dimensional feature spaces are presented in [14] to show the capability of the algorithm; however, this is not the case in real applications such as fault diagnosis in rotating machinery. In our approach, the above algorithm was slightly modified to deal with high-dimensional attribute sets. These modifications are explained in Section 4.
2.2 Support vector machines
Support Vector Machine (SVM) is a non-probabilistic
binary linear classifier that has reported good performance
in classification problems. SVM-based classification is
stated as follows [13]:
The training dataset is composed of $m$ samples of a couple $(x_i, y_i)$, where $i = 1, \ldots, m$. The variable $x_i$ is a vector of $n$ attributes, i.e., $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, and $y_i$ is the classification variable that can take binary values, $y_i \in \{-1, 1\}$. The vector $x_i$ belongs to class $C_1$ or $C_2$: if $x_i$ belongs to class $C_1$ then $y_i = 1$, and $y_i = -1$ otherwise. On the other hand, a hyperplane is defined as in (6):

$$\{x : f(x) = x^T\theta + \theta_0 = 0\} \quad (6)$$

If the classes are separable, there exists a function $x^T\theta + \theta_0$ with $y_i f(x_i) > 0$, $\forall i$, and we are able to find the hyperplane that creates the best margin between samples of classes $C_1$ and $C_2$. This problem is summarized as the optimization problem in (7):

$$\min_{\theta, \theta_0} \frac{1}{2}\|\theta\|^2 \quad \text{subject to} \quad y_i(x_i^T\theta + \theta_0) \geq 1, \; \forall i \quad (7)$$
If the classes overlap in the feature space, the optimization problem is stated as in (8):

$$\min_{\theta, \theta_0} \frac{1}{2}\|\theta\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i(x_i^T\theta + \theta_0) \geq 1 - \xi_i, \;\; \xi_i \geq 0, \; \forall i \quad (8)$$

where $\xi = (\xi_1, \xi_2, \ldots, \xi_m)$ are slack variables and the cost parameter $C$ is a tuning parameter. The optimization problem in (8) is rewritten in the unconstrained form in (9). Once the solution is obtained for $\hat{\theta}$ and $\hat{\theta}_0$, the predicted class $\hat{y}$ for a new sample $x$ is given by the decision function in (10), where $y \in \{-1, 1\}$:

$$\min_{\theta, \theta_0} \frac{1}{2}\|\theta\|^2 + C\sum_i \max\big(1 - y_i(x_i^T\theta + \theta_0),\, 0\big) \quad (9)$$

$$\hat{y} = \arg\max_{y} \; y(x^T\theta + \theta_0) \quad (10)$$

The described binary classification is generalized to multi-class classification by using the one-vs-all strategy. Let $O_k(x)$ be the output of the $k$-th SVM model that classifies the class $k$ for the new sample $x$. The predicted class $\hat{y}$ is given by (11):

$$\hat{y} = \arg\max_k O_k(x) \quad (11)$$
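As a brief, hedged illustration of (9) and the one-vs-all scheme (11), the sketch below uses scikit-learn's LinearSVC, which solves a closely related regularized hinge-loss problem, wrapped in an explicit one-vs-rest strategy; the dataset is synthetic.

```python
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import make_classification

# Synthetic 3-class problem standing in for the diagnosis data
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# One linear SVM O_k per class; prediction is argmax_k O_k(x) as in (11)
clf = OneVsRestClassifier(LinearSVC(C=0.1))
clf.fit(X, y)
print(clf.predict(X[:5]))
```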
2.3 Decision trees and random forest
Decision trees (DT) allow splitting the attribute space
through an iterative procedure of binary partition. It is
a powerful method for some classification and prediction
problems, providing an easily interpretable model [13,28,
41].
Consider the training dataset composed of the couples $(x_i, y_i)$ as described in Section 2.2. For multi-class classification, there is a collection of classes $c$, $c = 1, \ldots, C$; the variable $y_i$ is the logistic classification variable, where $y_i = c$ indicates that the attribute vector $x_i$ is associated with the class $c$. From all the data, the DT algorithm selects an attribute (variable) $j$ and a partition point $s$, and it sets the pair of half-planes $R_1$ and $R_2$ described in (12), following some guidelines to obtain a model with tree topology [13]:

$$R_1(j, s) = \{X \mid X_j \leq s\}, \qquad R_2(j, s) = \{X \mid X_j > s\} \quad (12)$$

Then, the algorithm seeks the best selection $(j, s)$ that solves the maximization problem in (13):

$$\arg\max_c \hat{p}_{kc} \quad (13)$$

where $\hat{p}_{kc} = (1/N_k)\sum_{x_i \in R_k} I(y_i = c)$ indicates the proportion of observations of class $c$ in the region $R_k$, regarding the total of $N_k$ observations in this region; $I$ is a membership indicator of the attribute vector to that region. The expression $\hat{p}_{kc}$ is a homogeneity measure of the child node, also called the 'impurity function'.

The iterative procedure splits the attribute space into $r$ disjoint regions $R$, until the stop criterion is achieved. The class $c$ is assigned to the node $k$ of the tree, which represents the region $R_k$, through the expression $C(k) = \arg\max_c \hat{p}_{kc}$. This procedure searches throughout all possible values of all attributes among the samples, and the resulting model is a binary tree where the leaves are the best representation of the regions $R_k$. Figure 1 shows an example of the result after applying this iterative procedure.
One of the problems related to tree-based techniques is their high variance. Therefore, the bootstrap aggregating technique, also called bagging, is applied to mitigate this issue [13]. Basically, the bagged classifier is composed of a set of classifiers that are trained with random subsets of the available data. Each classifier proposes a class for the input $x_i$, and the estimated class $\hat{C}_{bag}(x_i)$ is assigned according to (14), where $\hat{f}_{bag}$ is a vector of values $p_c(x_i)$ that indicates the proportion of classifiers that propose the class $c$:

$$\hat{C}_{bag}(x_i) = \arg\max_c \hat{f}_{bag} \quad (14)$$

Fig. 1 (a) Binary division of the attribute space. (b) Model representation as a binary tree [13]
The Random Forest (RF) algorithm is a modified approach of the bagging technique that builds a collection of non-correlated trees with low bias (low error on the training data) and low variance (low error on the test data) [4, 39, 46]. The RF algorithm for classification problems is fully developed in the literature, and it is summarized in [13]. Let $X$ and $X_b$ be the dataset of size $m$ and a training dataset randomly selected of size $m_b$, $m_b \leq m$, respectively; the random dataset is used to build each tree $T_b$ of the forest. Figure 2 illustrates the structure of a RF, which is composed of a collection of trees $T_b$; the decision for classifying a new sample $x_i$ is made according to (15):

$$\hat{C}^B_{rf}(x_i) = \text{majority vote}\,\{\hat{C}_b(x_i)\}_1^B \quad (15)$$

where $\hat{C}_b(x_i)$ is the class estimated by the tree $T_b$, and the assigned class $\hat{C}^B_{rf}(x_i)$ is the class with the largest number of 'votes', after considering the class proposed by each tree $T_b$.

The complement set $OOB_b = m - m_b$ of each tree $T_b$ is the out-of-bag sample, and it is used as the cross-validation set for the tree during the training process. The OOB-error is a performance measure of the RF; it is defined as the average of the classification errors of the trees $T_b$ using the $OOB_b$ samples.

Fig. 2 RF structure
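A minimal sketch of an RF with out-of-bag validation follows, assuming scikit-learn; `oob_score_` reports accuracy on the OOB samples, so the OOB error discussed above is its complement.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic multi-class data standing in for the gearbox dataset
X, y = make_classification(n_samples=900, n_features=50, n_classes=4,
                           n_informative=10, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB error:", 1.0 - rf.oob_score_)  # average misclassification on OOB samples
print(rf.predict(X[:3]))                  # majority vote over the 500 trees, as in (15)
```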
3 Measurement procedure and feature extraction

This section presents the experimental set-up to collect the data related to the fault conditions in gearboxes. A data matrix with $m$ samples (rows) and $n$ available attributes is created; each row represents a machinery condition. The collected data form the dataset that will be processed by our hierarchical unsupervised feature selection approach.

Fig. 3 Vibration analysis laboratory at the Salesian Polytechnic University, Cuenca-Ecuador
All the experiments were carried out on the experimental test bed shown in Fig. 3. The rotation of the equipment is generated by a 1.1 kW motor powered by a three-phase 220 V supply at 60 Hz, with a nominal speed of 1650 rpm. The torque is transmitted into a gearbox, where several gear fault configurations are assembled. The torque is then transmitted from the gearbox shaft to a pulley, which is part of the magnetic brake system. The magnetic brake applies different loads according to the experimental protocol. A variable-frequency drive was used to generate different speeds. The data acquisition system is formed by the NI CompactDAQ-9191 from National Instruments and the NI 9234 module, which is inserted in the DAQ slot. This device has a maximum sampling frequency of 51.2 kS/s, anti-aliasing filtering, 24-bit resolution, IEPE signal coupling, and Ethernet communication. The data acquisition software was developed in our laboratory in the NI LabVIEW environment.

The PCB IEPE accelerometer, with a sensitivity of 100 mV/g, was vertically mounted on the gearbox in order to record the vibration signals from the spur gears.
Table 1 Experimental settings

Parameter                              Value
Sampling frequency                     50 kHz
Length of each sample                  10 s
Number of tests                        5
Rotation frequency (constant speed)    8 Hz, 12 Hz, 15 Hz
Frequency range (variable speed)       5–12 Hz, 12–18 Hz, 8–15 Hz
Load                                   No load, 10 V, 30 V
Table 2 Gear fault conditions

Label   Description
f0      Healthy pinion, healthy gear
f1      Pinion tooth chaffing, healthy gear
f2      Pinion tooth wear, healthy gear
f3      25 % pinion tooth breakage, healthy gear
f4      50 % pinion tooth breakage, healthy gear
f5      100 % pinion tooth breakage, healthy gear
f6      Healthy pinion, 25 % gear crack
f7      Healthy pinion, 100 % gear crack
f8      Healthy pinion, 50 % gear chaffing
f9      25 % pinion tooth breakage, 25 % gear crack
Experimental settings for the signal measurements are shown in Table 1. The gearbox was configured with ten different fault modes, including the healthy condition (see Table 2). Fault f0 is the healthy (normal) condition, faults f1 and f2 are incipient faults, f3, f4, f6 and f8 are moderate faults, f5 and f7 are severe faults, and f9 is a multiple fault. An incipient fault is a fault that is just beginning to show symptoms; this is an important condition to be diagnosed in industrial applications. Figure 4 shows the real gear conditions.

Fig. 4 Real gear damages

According to the parameter values, we have 900 signal samples. The next sections show the feature extraction for each sample, using statistical parameters on the time domain, frequency domain and time-frequency domain. All data processing was performed in Matlab©.
3.1 Condition parameters on time and frequency domains

Seven classical condition parameters were obtained by statistical analysis on the time domain: root mean square (RMS), energy, crest factor, mean, standard deviation, variance and skewness. These parameters were calculated over the entire signal length of each sample, as shown in Fig. 5. In the frequency domain, four condition parameters (RMS, mean, standard deviation and kurtosis) were calculated on eighty equal-sized frequency bands and fifteen octave frequency bands, as illustrated in Fig. 6. Frequency bands are used for extracting condition parameters because a fault can generate clear changes in the vibration amplitude on specific bands in which this amplitude is not significant under normal conditions.

Fig. 5 Feature extraction in time domain

Fig. 6 Feature extraction in frequency domain
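As an illustrative sketch (not the authors' Matlab code), the seven time-domain parameters can be computed per sample as follows:

```python
import numpy as np
from scipy.stats import skew

def time_domain_features(x: np.ndarray) -> dict:
    """The seven classical time-domain condition parameters named above."""
    rms = np.sqrt(np.mean(x ** 2))
    return {
        "rms": rms,
        "energy": np.sum(x ** 2),
        "crest_factor": np.max(np.abs(x)) / rms,
        "mean": np.mean(x),
        "std": np.std(x),
        "variance": np.var(x),
        "skewness": skew(x),
    }

x = np.random.randn(500_000)  # stand-in for a 10 s sample at 50 kHz
print(time_domain_features(x))
```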
3.2 Condition parameters on time-frequency domain

Raw signals on the time domain, for each sample, were used as input data to the wavelet packet analysis. The Wavelet Transform (WT) is a powerful mathematical tool for signal analysis that has attracted great attention in several engineering fields. The use of wavelets has been proven successful for fault detection and diagnosis [3, 27, 30, 42]. The analysis presented in [42] provides an extensive overview of some of the latest efforts in the development and application of WT for fault diagnosis in rotating machinery. In the Wavelet Packet Transform (WPT) framework, the compression and denoising ideas are exactly the same as those developed in the WT framework. WPT offers a flexible analysis: the signal details as well as the signal approximations are split in a tree decomposition [29, 42]. In the following, the Wavelet Packet Decomposition (WPD) is briefly summarized.
Let $\psi$ and $\phi$ be the selected wavelet function and its corresponding scaling function, given by (16) and (17), respectively, where $k$ is the shift index, $g(k)$ is the impulse response of the low-pass filter associated with $\phi$, and $h(k)$ is the impulse response of the high-pass filter associated with $\psi$; these filters are also known as Quadrature Mirror Filters (QMF) [29]:

$$\phi(t) = \sqrt{2}\sum_{k} g(k)\,\phi(2t - k) \quad (16)$$

$$\psi(t) = \sqrt{2}\sum_{k} h(k)\,\phi(2t - k) \quad (17)$$

Let $V_0$ be a vector space generated by the scaling function $\phi(t)$ and its corresponding translations $\phi(t - k)$. The vector space $V_1$ is such that $V_1 \subset V_0$, and its corresponding scaling and translation functions are $\phi(2t)$ and $\phi(2t - k)$, respectively. The vectorial operation in (18) permits moving from $V_0$ to $V_1$, and in general from $V_j$ to $V_{j-1}$, without loss of information; $W_j$ is the orthogonal vector space called the 'wavelet space', which is generated from the wavelet function $\psi(t)$ and its corresponding translations $\psi(t - k)$:

$$V_{j-1} = V_j \oplus W_j \quad (18)$$

Equation (18) states that a function defined on $V_{j-1}$ can be decomposed into a function that belongs to $V_j$ and another function that belongs to $W_j$.

Let $x(t)$ be the discrete-time signal; this signal is decomposed into both the Low Frequency Approximation (LFA) and the High Frequency Detail (HFD). The LFA is obtained with the filter $\phi(t)$ by using (19), and the HFD with the filter $\psi(t)$ by using (20), where $\downarrow 2$ is the sub-sampling operation for the displacement from $V_j$ to $V_{j-1}$:

$$LFA = (x * \phi) \downarrow 2 \quad (19)$$

$$HFD = (x * \psi) \downarrow 2 \quad (20)$$
In WPD, this process is repeated recursively on each resulting LFA and HFD signal, until the required level of decomposition is obtained. As a result, the raw time-domain signal is decomposed into a binary tree of LFA and HFD signals. In the current work, five mother wavelets are used in the analysis: Daubechies (db7), Symlet (sym), Coiflet (coif4), Biorthogonal (bior6.8) and Reverse Biorthogonal (rbior6.8). The rationale for using several wavelets is to collect as much information as possible for our application. WPD was performed down to four levels for each mother wavelet; $2^4$ terminal coefficient sets are thus obtained for each one, and eighty features in total are extracted by using the energy operator, as shown in Fig. 7. In this figure, only the left branch of the tree is illustrated for one mother wavelet.

Fig. 7 Wavelet packet decomposition

Finally, we have calculated 817 condition parameters for each sample, and the whole dataset for the following machine learning application was arranged in a matrix representation with 900 samples (rows) and 817 attributes (columns).
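A hedged sketch of this feature extraction using the PyWavelets package is shown below: a 4-level WPD per mother wavelet yields $2^4 = 16$ terminal nodes, whose energies give $5 \times 16 = 80$ features per signal. The Symlet order and the PyWavelets name 'rbio6.8' for the reverse biorthogonal wavelet are assumptions, since the paper does not state them exactly.

```python
import numpy as np
import pywt

# Wavelet names assumed; the paper gives "sym" without an order
WAVELETS = ["db7", "sym4", "coif4", "bior6.8", "rbio6.8"]

def wpd_energy_features(x: np.ndarray, level: int = 4) -> np.ndarray:
    """80 energy features: 16 terminal-node energies per mother wavelet."""
    feats = []
    for w in WAVELETS:
        wp = pywt.WaveletPacket(data=x, wavelet=w, mode="symmetric",
                                maxlevel=level)
        for node in wp.get_level(level, order="natural"):
            feats.append(np.sum(node.data ** 2))  # energy operator
    return np.asarray(feats)

x = np.random.randn(4096)                 # stand-in vibration signal
print(wpd_energy_features(x).shape)       # (80,)
```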
4 Hierarchical unsupervised feature selection

This section presents our Hierarchical UnSupervised Feature Selection (HUSFS) approach, which is inspired by the algorithm in [14] (see Section 2.1). In the proposal of [14], two aspects were noted:

1. In Step 3, the randomly selected centroids $A^c_t$ do not necessarily have nearby attributes regarding the dissimilarity metric calculated in Step 2. This means that not all the random centroids define a cluster. More formally, Step 3 states the following. Let $C_t$ be the cluster defined by the centroid candidate $A^c_t$, $t = 1, \ldots, K$, as described in (21):

$$C_t = \{A_i \mid A_i = \arg\min_{A} d(A_i, A^c_t)\} \quad (21)$$

If $C_t = \emptyset$, the centroid candidate $A^c_t$ does not define a cluster, and there are only $k$ clusters, with $k < K$. As a consequence, in Step 7, $Near(A_{t,m}) = \emptyset$ for that centroid candidate $A^c_t$.

2. In Step 7, the inequality $|Near(A_{t,l})| > |Near(A_{t,j})|$, $\forall j \neq l$, is not always true, and the cluster $C_t$ can have more than one $A_{t,l}$.

On the other hand, for a high-dimensional feature space, there are two other aspects that can lead to a high computational burden and time consumption when implementing the algorithm [14]:

3. An exhaustive search is performed over the entire feature space, regarding the value of the relative dependency degree between features.

4. The number of given clusters $K$ could need to be large to explore the entire feature space.
In order to deal with the previous items, the algorithm in [14] was slightly modified in our work in two ways: (i) by rewriting Step 7 of the original algorithm to provide a new random selection over the centroid candidates that can define clusters with the best density, and (ii) by proposing a hierarchical implementation to reinforce the search process over the subspaces of solutions, with improved computation time.

In a wide sense, exploration is the creation of new solutions by searching over the entire solution space; exploitation is the refinement of the current solutions. In our approach, exploration is strengthened by providing several disjoint search subspaces; the modified selection algorithm is executed on each subspace, fixing a number of clusters according to the cardinality of the feature subset. Exploitation is enhanced by including a new random selection in each iteration, through the modification of Step 7, and by filtering the current solutions in the next level.
Step 7 has been rewritten as follows. Let $NewA^c_t$ be the set of new centroid candidates defined in (22):

$$NewA^c_t = \{A_{t,l} \mid |Near(A_{t,l})| \geq |Near(A_{t,j})|, \; \forall j \neq l\} \quad (22)$$

and let $\min|Near(A_{t,l})|$ denote the centroid candidates with the lowest number of attributes in the cluster (low density).

In this sense, our work proposes a new random selection in each iteration of the selection algorithm, by substituting the attributes $A^c_t$ that do not define a cluster, giving priority to those attributes $A^c_t$ that define clusters with the best density. Additionally, because of the random nature of the centroid assignment, we evaluate the number of times that an attribute is selected as a significant one. This information is used to finally select the best attributes after reaching the number of iterations $N_{it}$. In particular, we chose the attributes that have been selected at least once over all the iterations.
Moreover, a hierarchical approach is proposed in order to improve the search procedure over the entire attribute space. In each level, $p$ disjoint subsets $A_i$ are defined over the set of available attributes $A$, that is, $A = \dot{\cup}_i A_i$, and the attribute clustering is performed for each subset $A_i$. The stop condition for the clustering algorithm is the number of iterations $N_{it}$; after reaching this stop condition for each analysed disjoint subset, the centroids are proposed as the set of selected attributes $NA_l$, with $|NA_l| < |A|$. The new set of selected attributes is processed again in the next level, as described above.

This procedure can be executed $N_l$ times, $N_l$ being the number of levels in the hierarchy, or until a certain number of attributes $N_a$ is obtained. Once one of these conditions is met, $L$ subsets of attributes are proposed and the final selection of the best subset is decided according to its performance in a classification model. In this hierarchical approach, the parameters $N_{it}$, $A_i$ and $K$ are adjustable in each level $l$; the parameters $N_a$ and $N_l$ are fixed at the beginning of the process. This procedure is illustrated in Fig. 8 and summarized as follows:
Fig. 8 Hierarchical procedure for feature selection

Input: categorical matrix CM from the current numerical data matrix with $m$ samples (rows) and $n$ attributes (columns).

1. Set the values of $N_l$ and $N_a$, $l = 1, \ldots, N_l$.
2. For each $l$:
   (a) Assign the available attributes to the set $A$.
   (b) Create the disjoint sets¹ of attributes $A_i$, $i = 1, \ldots, p$, from the set $A$.
   (c) Create the data matrix $DM_i$ from the categorical matrix CM, with $m$ samples and the attributes in $A_i$.
   (d) For each $DM_i$: set the values of $K$ and $N_{it}$, and run the HUSFS algorithm until reaching $N_{it}$.
   (e) Evaluate each set $NewA^c_t$ for each $A_i$, and propose the set $\hat{A}_i \subseteq A_i$ with the attributes that have been selected at least once.
   (f) Create the new subset of available attributes $NA_l$ such that $NA_l = \cup_{i=1}^{p} \hat{A}_i$.
   Repeat until reaching $l = N_l$ or $|NA_l| \leq N_a$ for some $l$.
3. For each $l$:
   (a) Create the new numerical data matrix $NDM_l$ with $m$ samples and the attributes in $NA_l$.
   (b) Train a machine learning based diagnosis model for each data matrix $NDM_l$.
   (c) Evaluate the performance of each $NDM_l$ with the obtained diagnosis model.
4. Select the set of attributes $NA_l$ of the $NDM_l$ with the best performance.

Output: numerical data matrix NM with $m$ samples (rows) and the $NA_l$ best attributes (columns), and its corresponding trained diagnosis model.
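The following Python skeleton mirrors the structure of this procedure as a sketch; `husfs_select` is a hypothetical stand-in for one run of the modified clustering algorithm on a disjoint subset, returning the attribute indices selected at least once over the $N_{it}$ iterations.

```python
import numpy as np

def hierarchical_selection(CM: np.ndarray, husfs_select, Nl: int = 6,
                           Na: int = 30, subset_size: int = 30, seed: int = 0):
    """Levels of the HUSFS procedure over a categorical matrix CM (m x n)."""
    rng = np.random.default_rng(seed)
    available = np.arange(CM.shape[1])                  # step 2(a)
    levels = []
    for _ in range(Nl):
        rng.shuffle(available)                          # random disjoint sets
        subsets = np.array_split(
            available, max(1, len(available) // subset_size))   # step 2(b)
        selected = []
        for Ai in subsets:                              # steps 2(c)-(e)
            K = max(1, len(Ai) // 2)                    # K = floor(|Ai| / 2)
            selected.extend(husfs_select(CM[:, Ai], Ai, K))
        available = np.unique(selected)                 # step 2(f): NA_l
        levels.append(available.copy())
        if len(available) <= Na:                        # stop condition
            break
    return levels  # step 3: evaluate each NA_l with a classifier
```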
Based on the search process of the algorithm in [14] and our proposed modification, the hierarchical approach has the following characteristics, which aim to preserve the most significant attributes over the entire set of attributes:

1. The modified algorithm runs $N_{it}$ times on a subspace of solutions in each level; then, an exhaustive search is locally performed over an important number of clusters $K$. We propose to use $K = \lfloor |A_i|/2 \rfloor$.

2. As a consequence of the exhaustive local search, the non-included features in each subspace are represented by centroids (representative features), which will be analysed in the next level of the hierarchy.

3. All features that have been selected at least once in the $N_{it}$ iterations are considered for the next level of the hierarchy; this enriches the number of centroid candidates in the next level.

¹For simplicity, in this work the disjoint sets have been randomly selected and uniformly sized regarding the cardinality of the set $A$. However, other criteria could be applied to decompose the set $A$, e.g., grouping features with related meaning.

Table 3 RF-based diagnoser with low correlation data

Performance measure   All data   Reduced attributes
                      811        237      258      330      409
Precision             0.9816     0.9860   0.9929   0.9964   0.9893
Sensitivity           0.9815     0.9852   0.9926   0.9363   0.9889
F-score               0.9814     0.9852   0.9926   0.9963   0.9889
RF and SVM based classifiers are used as classification models. These models are used because of their versatility in real environments: SVM has been shown to be a good classifier with low computational burden in the training and test phases, even for on-line operation. The optimization algorithms to reach good SVM classifiers are well known, and $C$ is the only parameter to be adjusted in (9). However, finding the adequate number of attributes for obtaining good performance in a SVM-based classifier could lead to an exhaustive search, as is widely discussed in the literature. On the other hand, the good performance of RF-based models for classification tasks is also well known, even with small sets of samples and high numbers of attributes; however, a large number of trees could be needed in order to reach a good precision. Both techniques are available in several computational environments, and they are easy to implement in industrial applications. Even though the above models can work with a large number of attributes, our study addresses the analysis of the effect of feature selection, through our HUSFS approach, to obtain simple models that could be implemented in industrial environments with less computational effort. The classification precision is the measure used for evaluating the model performance.
5 Results and analysis in gear fault diagnosis

This section presents the results of selecting attributes with the approach of Section 4. The data matrix was split into 70 % of the samples for the training phase and 30 % for the test phase (performance evaluation), for building each diagnosis model in all the experiments. This rate has been selected following classical suggestions in machine learning applications, to get an adequate balance between accuracy and precision in both the training and test phases. The initial data matrix is taken from Section 3; a cleaning process was executed over the matrix to delete non-adequate data (NaN and zero values) and, as a result, the dataset is a matrix with 811 attributes (columns) and 900 samples (rows). Data normalization was computed on the interval [−1, 1], and a preliminary attribute selection has been performed on the previous data matrix by using correlation analysis.

Table 4 SVM-based diagnoser with low correlation data

Performance measure   All data   Reduced attributes
                      811        237      258      330      409
Precision             0.9348     0.7845   0.8889   0.9391   0.9559
Sensitivity           0.9148     0.7741   0.8982   0.9333   0.9556
F-score               0.9194     0.7735   0.8914   0.9340   0.9333

Table 5 Hierarchical unsupervised attribute selection, RF-based diagnoser

Performance measure   Low correlation data   Reduced attributes
                      330                    144     82      62      54      42      24
Precision             0.9964                 0.9820  0.9824  0.9788  0.9854  0.9890  0.9616
Sensitivity           0.9963                 0.9815  0.9815  0.9778  0.9852  0.9889  0.9556
F-score               0.9963                 0.9814  0.9814  0.9778  0.9852  0.9889  0.9546

Table 6 Hierarchical unsupervised attribute selection, SVM-based diagnoser

Performance measure   Low correlation data   Reduced attributes
                      409                    221     181     128     87
Precision             0.9559                 0.9570  0.9524  0.9360  0.9113
Sensitivity           0.9556                 0.9556  0.9444  0.9296  0.8963
F-score               0.9553                 0.9552  0.9443  0.9301  0.8990

Fig. 9 Results with HUSFS and RF-based classifier
Correlation analysis is a very useful statistical technique that aims to find dependence relationships between variables. In feature selection, correlation can be applied as a previous step in order to identify those attributes which are closely related in the sense of the correlation measure; then, only one attribute is selected as the representative one. Attributes with a correlation higher than 85, 90, 95 and 99 % were deleted; as a result, four reduced datasets have been obtained with 237, 258, 330 and 409 attributes. Tables 3 and 4 show the performance of each classifier for each dataset. RF is composed of 500 random trees; SVM uses Sequential Minimal Optimization (SMO), with C = 0.1, as the optimization method.
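A hedged sketch of this preliminary filter, assuming pandas: for each pair of attributes with absolute correlation above the threshold, only one representative is kept.

```python
import numpy as np
import pandas as pd

def correlation_filter(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one attribute from every pair correlated above the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return X.drop(columns=drop)

X = pd.DataFrame(np.random.randn(900, 20))
X[20] = X[0] * 0.99 + 0.01 * np.random.randn(900)  # near-duplicate attribute
print(correlation_filter(X, 0.95).shape)           # duplicate removed
```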
The challenge is to reduce the previous low correlation data with the HUSFS algorithm based on relative dependency, while maintaining an adequate accuracy. The classic discretization method by binning was applied on the normalized data matrix, with 40 bins. The next sections present the results with the RF and SVM classifiers. For both scenarios, the algorithm parameters were selected as follows: $N_{it} = 200$, $p$ was adjusted according to the cardinality of the initial set $A$ for each level, the size of $A_i$ was no more than 30 attributes, and $K$ is the greatest integer function (floor) applied to $\frac{|A_i|}{2}$.
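As an illustrative sketch of the discretization step, equal-width binning of the normalized data into 40 categorical levels can be written as follows (the exact binning variant used by the authors is not specified):

```python
import numpy as np

def discretize(X: np.ndarray, n_bins: int = 40) -> np.ndarray:
    """Equal-width binning of data normalized to [-1, 1] into n_bins levels."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    # digitize against interior edges gives bin indices 0..n_bins-1
    return np.clip(np.digitize(X, edges[1:-1]), 0, n_bins - 1)

X = np.random.uniform(-1, 1, size=(900, 330))
CM = discretize(X)         # categorical matrix fed to the HUSFS algorithm
print(CM.min(), CM.max())  # bin indices in 0..39
```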
For the RF-based classifier, the subset with 330 attributes was selected as the initial set because it has the best result in Table 3. The number of levels was $N_l = 6$. After running the HUSFS algorithm, the sizes of the sets of attributes for each level are $|NA_1| = 144$, $|NA_2| = 82$, $|NA_3| = 62$, $|NA_4| = 54$, $|NA_5| = 42$ and $|NA_6| = 24$. Table 5 shows the diagnoser performance in the test phase.

The set with $|NA_5| = 42$ has a good performance compared with the results from the original set of 330 attributes: precision, sensitivity and F-score decrease by only 0.74 %, while using 12 % of the attributes. On the other hand, with only 24 attributes, around 7 % of the original set, the performance measures decrease by 3.49, 4.08 and 4.18 % for precision, sensitivity and F-score, respectively. Both cases have an adequate performance evaluation for the classifiers. Figure 9 shows the hierarchical selection from 144 attributes down to 24 attributes; the attribute identifier is on the x-axis and the number of times that the attribute was selected is on the y-axis. After level $l = 5$ (cases e and f), centroid candidates have been selected a number of times exceeding 10 % of $N_{it}$.
The HUSFS algorithm was also applied with a SVM-based classifier. The subset with 409 attributes was selected as the initial set according to its performance in Table 4. The number of levels was $N_l = 4$; after running the HUSFS algorithm, the sizes of the sets of selected attributes are $|NA_1| = 221$, $|NA_2| = 181$, $|NA_3| = 128$ and $|NA_4| = 87$. Table 6 shows the diagnoser performance in the test phase.

In this case, the effect of feature selection is different with regard to the RF-based classifier. SVM is more sensitive to the use of a smaller number of attributes; this fact is well known in the literature, and it is a rationale for providing new feature selection algorithms. The performance measures are shown in Table 6: the precision with 128 attributes decreases by 2 % compared with the result from the original set with 409 attributes, while sensitivity and F-score decrease by 2.72 and 2.64 %, respectively. Moreover, with 181 selected attributes, around 44 % of the entire set, the precision value decreases by only 0.35 %; as a consequence, this set could be considered a good selection. However, the set with 221 attributes, around 54 % of the entire set, is the best reduction based on its performance values.
Our hierarchical approach was compared with the modified one-step-run approach inspired by [14], using the dataset with 330 attributes and an RF-based diagnoser. Taking into account the best result of the HUSFS in Table 5, where a reduced set of 42 attributes was selected, we set $K = 50$ and $N_{it} = 500$. After analysing the results, we found three significant sets of attributes, ranked according to the number of times they were selected; we thus have sets with 66, 56 and 48 attributes, selected in over 40, 50 and 60 % of the $N_{it}$ iterations, respectively. The performance measures are presented in Table 7. Comparing with the results in Table 5, we reached better performance with the reduced sets of attributes obtained with our HUSFS algorithm. This low performance occurs because the one-step algorithm may need a high value of $K$ to refine the search over the entire attribute space when a high-dimensional vector of attributes is processed; in this case, the computational burden for running the algorithm is considerably augmented.

Table 7 Unsupervised attribute selection in one-step run, RF-based diagnoser

Performance measure   Reduced attributes
                      48       56       66
Precision             0.5178   0.5541   0.6431
Sensitivity           0.5037   0.5444   0.6333
F-score               0.5005   0.5424   0.6358

Table 8 Classification precision for different feature selection techniques and classifiers

Classifier   Entropy-based selection      NMF      HUSFS
             29       14       9           35       42       24
RF           0.9852   0.9629   0.9555      0.8926   0.9890   0.9616
DT           0.8963   0.8926   0.8962      0.6296   0.8037   0.8111
1-NN         0.9888   0.9741   0.9741      0.5969   0.9666   0.9370
Finally, we have run two additional attribute selection techniques over the dataset with 330 attributes, to compare with our results. The first one is the ranking of variable importance provided by the RF algorithm. This ranking is a supervised approach based on the entropy measure that is calculated from the dataset with random selection of the attributes in the OOB samples [4]. We selected the attributes with entropy values above 40, 50 and 60 %; as a result, three sets with 29, 14 and 9 attributes are defined. The second technique is Non-negative Matrix Factorization (NMF), an unsupervised approach that is widely used in clustering, classification and feature selection [20]. Table 8 shows the precision after applying these selection techniques, and our HUSFS approach, with the following classifiers: RF, Decision Trees (DT) and 1-Nearest Neighbour (1-NN).
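The two baselines can be sketched as follows, under the assumption of scikit-learn implementations; the importance threshold and the NMF-based scoring are illustrative readings of the procedure described above, not the authors' exact code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import NMF

X = np.random.rand(900, 330)           # NMF requires non-negative data
y = np.random.randint(0, 10, 900)      # stand-in fault labels f0..f9

# Supervised baseline: entropy-based RF importance ranking; keep attributes
# whose importance exceeds a fraction of the maximum (threshold assumed)
imp = RandomForestClassifier(n_estimators=500, criterion="entropy",
                             random_state=0).fit(X, y).feature_importances_
keep_rf = np.where(imp > 0.4 * imp.max())[0]

# Unsupervised baseline: score attributes by their largest NMF loading
W = NMF(n_components=35, max_iter=500, random_state=0).fit(X).components_
keep_nmf = np.argsort(W.max(axis=0))[::-1][:35]
print(len(keep_rf), len(keep_nmf))
```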
The diagnosis results are clearly similar when the RF classifier is used with either HUSFS or the entropy-based supervised selection. For the other classifiers, the HUSFS approach is better than the unsupervised NMF-based selection. The precision value with 42 attributes from HUSFS is only improved, by around 2.25 %, by the entropy-based supervised selection with the 1-NN classifier, which is the best result. These results show that our unsupervised hierarchical approach can select attributes as well as supervised approaches do.
In summary, according to the previous results and analysis, the effect of the dimensionality reduction of the feature vector on the underlying RF and SVM algorithms is to build computationally treatable models with an adequate classification precision, compared with the models that can be obtained by using a large set of features. In particular, RF is an algorithm based on search over the attribute space, so the dimensionality reduction helps to improve the complexity regarding the number of variables of each tree. Moreover, SVM is a very sensitive algorithm regarding the attributes that are used for obtaining good precision and generalization capabilities; here the dimensionality reduction aims to discover the adequate features for a good performance of the SVM-based classifier. This hierarchical approach can be applied to other classification algorithms, and the particular effect of the dimensionality reduction should be analysed for each classifier.
6 Conclusion

Unsupervised feature selection is an important aspect under research in machine learning based classification. This paper presents an unsupervised approach for feature selection that is inspired by algorithms based on relative dependency. In particular, our approach improves the proposal in [14] by considering, in each iteration, a random selection that includes the centroid candidates defining the clusters with the best density. Our hierarchical implementation, with a disjoint partition of the available set of features in each level, makes it possible to deal with the size of the search space in the case of a large number of attributes. The HUSFS approach performs a local search on a reduced space of attributes; the best local selections are then aggregated to compose a new reduced space of attributes. In this sense, the search procedure is refined in each level.

For our case study in gear fault diagnosis, the performance of the diagnosers using our HUSFS algorithm is adequate with regard to other supervised or unsupervised techniques for feature selection. We have noted that the execution time of the one-step algorithm is reduced with our hierarchical approach; future work could address the analysis of adequate computational implementations for further improving the computational burden.
Acknowledgments The authors want to express their deep gratitude to the Secretary of Higher Education, Science, Technology and Innovation (SENESCYT) of the Republic of Ecuador and the Prometeo program, for their support of this research work. We also acknowledge the support of the GIDTEC research group of the Universidad Politécnica Salesiana in Cuenca-Ecuador, for the accomplishment of this research.
References
1. Bartkowiak A, Zimroz R (2014) Dimensionality reduction via
variables selection linear and nonlinear approaches with applica-
tion to vibration-based condition monitoring of planetary gearbox.
Appl Acoust 77:169–177
2. Benoît F, van Heeswijk M, Miche Y, Verleysen M, Lendasse A (2013) Feature selection for nonlinear models with extreme learning machines. Neurocomputing 102:111–124. Advances in extreme learning machines (ELM 2011)
3. Bordoloi D, Tiwari R (2014) Support vector machine based opti-
mization of multi-fault classification of gears with evolutionary
algorithms from time frequency vibration data. Measurement
55:1–14
4. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
5. Cabrera D, Sancho F, Sánchez RV, Zurita G, Cerrada M, Li C, Vásquez RE (2015) Fault diagnosis of spur gearbox based on random forest and wavelet packet decomposition. Front Mech Eng. doi:10.1007/s11465-015-0348-8
6. Cerrada M, Sánchez RV, Cabrera D, Zurita G, Li C (2015) Multi-stage feature selection by using genetic algorithms for fault diagnosis in gearboxes based on vibration signal. Sensors 15(9):23903–23926
7. Cerrada M, Zurita G, Cabrera D, Sánchez RV, Artés M, Li C (2015) Fault diagnosis in spur gears based on genetic algorithm and random forest. Mech Syst Signal Process. doi:10.1016/j.ymssp.2015.08.030
8. Chandrashekar G, Sahin F (2014) A survey on feature selection
methods. Comput Electr Eng 40:16–28
9. Fazayeli F, Wang L, Mandziuk J (2008) Feature selection based
on the rough set theory and expectation-maximization clustering
algorithm. In: Chan CC, Grzymala-Busse J, Ziarko W (eds) Rough
sets and current trends in computing. Lecture Notes in Computer
Science, vol 5306, pp 272–282
10. Ganivada A, Ray SS, Pal SK (2013) Fuzzy rough sets, and a granu-
lar neural network for unsupervised feature selection. Neural Netw
48:91–108
11. Gryllias K, Antoniadis I (2012) A support vector machine
approach based on physical model training for rolling element
bearing fault detection in industrial environments. Eng Appl Artif
Intell 25(2):326–344
12. Han J, Hu X, Lin T (2004) Feature subset selection based on relative dependency between attributes. In: Tsumoto S, Słowiński R, Komorowski J, Grzymała-Busse J (eds) Rough sets and current trends in computing. Lecture notes in computer science, vol 3066. Springer, Berlin Heidelberg, pp 176–185
13. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference and prediction. Springer, New York
14. Hong TP, Liou YL, Wang SL, Vo B (2014) Feature selection and
replacement by clustering attributes. Vietnam Journal of Computer
Science 1(1):47–55
15. Inbarani H, Bagyamathi M, Azar A (2015) A novel hybrid feature selection method based on rough set and improved harmony search. Neural Comput Appl 1–22
16. Jensen R, Shen Q (2008) Computational intelligence and features
selection: rough and fuzzy approaches. Wiley, New Jersey
17. Karabadji N, Khelf I, Seridi H, Laouar L (2012) Genetic opti-
mization of decision tree choice for fault diagnosis in an industrial
ventilator. In: Fakhfakh T, Bartelmus W, Chaari F, Zimroz R, Had-
dar M (eds) Condition monitoring of machinery in non-stationary
operations, pp 277–283
18. Li C, Liang M, Wang T (2015) Criterion fusion for spectral segmentation and its application to optimal demodulation of bearing vibration signals. Mech Syst Signal Process 64–65:132–148
19. Li C, Sanchez RV, Zurita G, Cerrada M, Cabrera D, Vasquez RE
(2015) Multimodal deep support vector classification with homol-
ogous features and its application to gearbox fault diagnosis.
Neurocomputing 168:119–127
20. Li Y, Ngom A (2013) The non-negative matrix factorization
toolbox for biological data mining. Source Code Biol Med 8
(10)
21. Liu C, Jiang D, Yang W (2014) Global geometric similarity
scheme for feature selection in fault diagnosis. Expert Syst Appl
41(8):3585–3595
22. Liu H, Yu L (2005) Toward integrating feature selection algo-
rithms for classification and clustering. IEEE Trans Knowl Data
Eng 17(4):491–502
23. Liu Z, Qu J, Zuo M, Hb Xu (2013) Fault level diagnosis for
planetary gearboxes using hybrid kernel feature selection and
kernel fisher discriminant analysis. Int J Adv Manuf Technol
67(5–8):1217–1230
24. Liu Z, Zhao X, Zuo M, Xu H (2014) Feature selection for fault
level diagnosis of planetary gearboxes. ADAC 8(4):377–401
25. van der Maaten L, Postma EO, van den Herik HJ (2009) Dimen-
sionality reduction: a comparative review. Tech. rep., Tilburg
University Technical Report, TiCC-TR 2009–005
26. Mac Parthal´
ain N, Jensen R (2013) Unsupervised fuzzy-rough set-
based dimensionality reduction. Inf Sci 229:106–121
27. Mallat S (2009) A wavelet tour of signal processing: the sparse
way. Elsevier Academic Press, Amsterdam
28. Mitchell T (1997) Machine learning. McGraw-Hill, New York
29. Mitra S (2011) Digital signal processing: a computer-based
approach. McGraw-Hill, New York
30. Muralidharan V, Sugumaran V (2013) Feature extraction using
wavelets and classification through decision tree algorithm for
fault diagnosis of mono-block centrifugal pump. Measurement
46(1):353–359
31. Muralidharan V, Sugumaran V, Indira V (2014) Fault diagnosis of
monoblock centrifugal pump using SVM. Int J Eng Sci Technol
17(3):152–157
32. Pawlak Z (1982) Rough sets. Int J Comput Inform Sci 11(5):341–
356
33. Qin H, Ma X, Zain JM, Herawan T (2012) A novel soft set
approach in selecting clustering attribute. Knowl-Based Syst
36:139–145
34. Rajeswari C, Sathiyabhama B, Devendiran S, Manivannan K
(2013) Fault gear categorization: a comparative study on fea-
ture classification using rough set theory and ID3. Int J Artif
Intell Appl Smart Devices 97:41–64. 12th Global Congress on
Manufacturing and Management (GCMM)-2014
35. Rajeswari C, Sathiyabhama B, Devendiran S, Manivannan K
(2014) A gear fault identification using wavelet transform, rough
set based GA, ANN and C4.5 algorithm. Procedia Eng 97:1831–
1841. 12th Global Congress on Manufacturing and Management
(GCMM)-2014
36. Raymer M, Punch W, Goodman E, Kuhn L, Jain A (2000) Dimen-
sionality reduction using genetic algorithms. IEEE Trans Evol
Comput 4(2):164–171
M. Cerrada et al.
37. Roman S (2001) Rough sets methods in feature reduction and
classification. Int J Appl Math Comput Sci 11:565–582
38. Sakthivel N, Sugumaran V, Nair BB (2010) Comparison of deci-
sion tree-fuzzy and rough set-fuzzy methods for fault categoriza-
tion of mono-block centrifugal pump. Mech Syst Signal Process
24(6):1887–1906
39. Verikas A, Gelzinis A, Bacauskiene M (2011) Mining data with
random forests: a survey and results of new tests. Pattern Recogn.
44(2):330–349
40. Wang S, Pedrycz W, Zhu Q, Zhu W (2015) Unsupervised fea-
ture selection via maximum projection and minimum redundancy.
Knowl-Based Syst 75:19–29
41. Witten IH, Frank E (2005) Data mining: practical machine learn-
ing tools and techniques. Morgan Kaufman, Boston
42. Yan R, Gao RX, Chen X (2014) Wavelets for fault diagnosis of
rotary machines: a review with applications. Signal Process 96:1–
15
43. Yang BS, Di X, Han T (2008) Random forests classifier for
machine fault diagnosis. J Mech Sci Technol 22(9):1716–1725
44. Yoon H, Park CS, Kim JS, Baek JG (2013) Algorithm learning
based neural network integrating feature selection and classifica-
tion. Expert Syst Appl 40(1):231–241
45. Zhu X, Zhang Y, Zhu Y (2012) Intelligent fault diagnosis of
rolling bearing based on kernel neighborhood rough sets and
statistical features. J Mech Sci Technol 26(9):2649–2657
46. Ziegler A, Knig IR (2013) Mining data with random forests: cur-
rent options for real-world applications. Wiley Interdiscip Rev
Data Min Knowl Discov 4(1):55–63
Mariela Cerrada (cerradam@ula.ve) received her Ph.D. degree in Automatic Systems in 2003 from the INSA Toulouse, France. She is currently a full-time titular Professor in the Department of Control Systems and an associate member of the Studies Center on Microcomputers and Distributed Systems (CEMISID) at the Engineering Faculty of the Universidad de Los Andes, Venezuela. She was a Prometeo Researcher at the Universidad Politécnica Salesiana, Ecuador. Her main research areas are fault diagnosis, supervision and intelligent control systems.
René-Vinicio Sánchez (rsanchezl@ups.edu.ec) received his B.S. in Mechanical Engineering in 2004 from the Universidad Politécnica Salesiana (UPS), Ecuador. He received a master's degree in management audit quality in 2008 from the UTPL, Ecuador, and a master's degree in industrial technologies research in 2012 from the UNED, Spain. Currently, he is a Professor in the Department of Mechanical Engineering at UPS. His research interests are in machinery health maintenance, pneumatic and hydraulic systems, artificial intelligence and engineering education.
Fannia Pacheco (fannikaro@gmail.com) received her M.Sc. degree in Computer Science from the Universidad de Los Andes, Venezuela, in 2015. She joined the GIDTEC research team at the Universidad Politécnica Salesiana (UPS), Ecuador. Her research interests cover novelty detection, data analysis and intelligent systems.
Diego Cabrera (dcabrera@ups.edu.ec) received his M.Sc. degree from the University of Sevilla in 2014 and is a Ph.D. candidate in Computer Science at the same university. Currently, he is a Professor with the Department of Mechanical Engineering at the Universidad Politécnica Salesiana (UPS), Ecuador, and a member of the GIDTEC research group at UPS. His research areas are machine learning, complex systems modelling and intelligent systems.
Grover Zurita (gzuritav@ups.edu.ec) received his Ph.D. degree from Luleå University of Technology, Sweden, in 2001. He was a Postdoctoral Fellow at the University of New South Wales, Australia, in 2002. Currently, he is a Professor at the Private University of Bolivia, and he was a Prometeo Researcher at the Universidad Politécnica Salesiana of Ecuador. His research interests are machine diagnosis, optimization and control of internal combustion engines.
Chuan Li (chuanli@21cn.com) received his Ph.D. degree from Chongqing University, China, in 2007. He has been successively a Postdoctoral Fellow with the University of Ottawa, Canada, a Research Professor with Korea University, South Korea, and a Senior Research Associate with the City University of Hong Kong, China. He is currently a Professor with the Chongqing Technology and Business University, China, and a Prometeo Researcher with the Universidad Politécnica Salesiana, Ecuador. His research interests include machinery health maintenance and intelligent systems.
... FS methods have been extensively employed, investigated, and developed in the field of PdM, including a wide range of industrial processes, e.g., machining [4], industrial components, e.g., machining tools [4], gears [17], bearings [18], as well as PdM tasks, e.g., diagnosis [6] and prognosis [19]. Regarding the evaluation of FS methods, the majority of these works only consider the predictive power of the FS, expressed as the predictive performance of the learning model whose input are the features selected by the FS method, e.g., accuracy or similar measures [4]. ...
... In [6], an FS scheme was proposed to reduce the detrimental effect of the data outliers on the accuracy of fault diagnosis. Other general FS techniques were also developed in the PdM area with the aim of improving the diagnosis accuracy, as in [17], [20]. ...
... To evaluate the effectiveness of the FS schemes proposed in the literature, their performance is usually compared with popular FS methods [4], [17], [27], [34], with other related works [6], [17], and/or with the case when no FS is performed (i.e., all the features are used) [6], [10], [27], [35], [36]. The major performance indicator used to evaluate a given FS or to compare different FSs is the predictive performance of the learning model, e.g., accuracy, which was built using the features selected by the corresponding FS method [4], [17], [18], [20], [27], [34]- [39]. ...
Preprint
Full-text available
Feature selection (FS) represents an essential step for many machine learning-based predictive maintenance (PdM) applications, including various industrial processes, components, and monitoring tasks. The selected features not only serve as inputs to the learning models, but also influence further decisions and analysis, e.g., sensor selection and the understandability of the PdM system. Hence, before deploying the PdM system, it is crucial to examine the reproducibility and robustness of the selected features under variations in the input data. This is particularly critical for real-world datasets with a low sample-to-dimension ratio (SDR). However, to the best of our knowledge, the stability of FS methods under data variations has not yet been considered in the field of PdM. This paper addresses this issue with an application to tool condition monitoring in milling, where classifiers based on support vector machines and random forests were employed. We used 5-fold cross-validation to evaluate three popular filter-based FS methods, namely Fisher score, minimum redundancy maximum relevance (mRMR), and ReliefF, in terms of both stability and macro-F1. Further, for each method, we investigated the impact of the homogeneous FS ensemble on both performance indicators. To gain broad insights, we used four (2:2) milling datasets generated from our experiments and NASA's repository, which differ in operating conditions, sensors, SDR, number of classes, etc. Among the findings: 1) Different FS methods can yield comparable macro-F1, yet considerably different FS stability values. 2) Fisher score (single and/or ensemble) is superior in most cases. 3) mRMR's stability is overall the lowest, the most variable over different settings (e.g., sensor(s), subset cardinality), and the one that benefits the most from the ensemble.
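As a rough sketch of the kind of stability evaluation described above, the following fragment scores features on each cross-validation fold and measures the agreement of the selected subsets. The ANOVA F-score stands in for the Fisher score, the Jaccard-based stability measure is one common choice rather than necessarily the authors' metric, and the data are synthetic.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.model_selection import StratifiedKFold

def top_k_features(X, y, k):
    # rank features by ANOVA F-score (a stand-in for the Fisher score)
    scores, _ = f_classif(X, y)
    return set(np.argsort(scores)[::-1][:k])

def mean_jaccard(subsets):
    # mean pairwise Jaccard similarity of the selected subsets, in [0, 1]
    return np.mean([len(a & b) / len(a | b) for a, b in combinations(subsets, 2)])

X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)            # synthetic stand-in data
subsets = [top_k_features(X[tr], y[tr], k=10)
           for tr, _ in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y)]
print("FS stability (mean Jaccard):", round(mean_jaccard(subsets), 3))
```

A stability value near 1 means the same features are chosen on every fold, which is the reproducibility property the abstract argues should be checked before deployment.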
... Within the condition monitoring methods are those based on models (Camps Echevarría et al.; Bernal de Lázaro et al. 2015, 2016; Pang et al. 2014; Sina et al. 2014). In the first approach, the use of models representing the operation of the processes is needed. ...
... It is possible to find many different kernel functions in the scientific literature, and the Gaussian kernel is one of the most popular. In general, the selection of a kernel depends on the application (Bernal de Lázaro et al. 2015, Motai 2015, Nayak et al. 2015, 2016). In this paper, several experiments were performed using various kernel functions such as the Gaussian kernel, the polynomial kernel and the hyper-tangent kernel. ...
... Table I shows the faults considered to evaluate the ... Another problem of the current fuzzy clustering method is related to the correct selection of its parameters, which is decisive in obtaining high performance. Nowadays, these issues of crucial importance are open problems in fault diagnosis applications and in other research fields (Wang et al. 2020a, Filho et al. 2015, 2016). The parameters Number of iterations, , m and were selected according to the experience in previous works (Rodríguez-Ramos et al. 2019, 2018a). ...
Article
Full-text available
Abstract In this paper, a robust approach to improve the performance of a condition monitoring process in industrial plants by using Pythagorean membership grades is presented. The FCM algorithm is modified by using Pythagorean fuzzy sets to obtain a new variant of it, called Pythagorean Fuzzy C-Means (PyFCM). In addition, a kernel version of PyFCM (KPyFCM) is obtained in order to achieve greater separability among classes and reduce classification errors. The proposed approach is validated using experimental datasets and the Tennessee Eastman (TE) process benchmark. The results are compared with those obtained with other algorithms that use standard and non-standard membership grades. The high performance obtained by the proposed approach indicates its feasibility.
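For orientation, the sketch below implements the standard fuzzy c-means baseline that PyFCM modifies; the Pythagorean membership grades and the kernel variant are not reproduced here, so this is only the starting point, with illustrative parameter values and synthetic data.

```python
import numpy as np

def fcm(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    # standard fuzzy c-means; PyFCM replaces these membership grades
    # with Pythagorean fuzzy memberships (not reproduced in this sketch)
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)                  # rows of U sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]          # weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        # u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            return U_new, centers
        U = U_new
    return U, centers

X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(5, 1, (50, 2))])  # two toy clusters
U, centers = fcm(X, n_clusters=2)
print(centers)                                         # approximate cluster centers
```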
... The selection of features is essential in choosing the most suitable subset of features for fault diagnosis. An excess in the quantity and quality of features affects the performance of the classifier and also leads to over-fitting [120,121]. Therefore, the selection of features is a necessary task to select the most suitable feature subsets for the machine learning algorithm. ...
Thesis
Gearboxes offer a wide range of applications in Industry 4.0 due to their versatility in motion and power transmission. Any transmission error or sudden failure within the gearbox increases the noise and vibration level of the whole system. It may lead to fatal harm, industrial breakdown, and massive economic loss. That is why it is a dominant area of research in industry. In the past couple of decades, condition monitoring and fault diagnosis of the gearbox have been explored and developed. Based on the literature survey, two types of approaches are frequently applied in condition monitoring and fault diagnostics: (1) a data-driven approach; and (2) a physical model-based approach. However, there are some major limitations in these models: (1) large dataset requirements, high sampling frequency selection, and use of parametric filters in data-driven models; (2) incorporation of a basic time-varying mesh stiffness (TVMS) model in the existing electro-mechanical (EM) and dynamic models, without considering the effect of misalignment between the base and root circles, an accurate transition curve, the exclusion of non-linear Hertzian contact stiffness, and a revised fillet foundation stiffness considering the influence of structural coupling due to the nearby loaded tooth; (3) incorporation of a basic gear fault model in the system; (4) the requirement of an accurate model for industrial gearboxes, i.e., carburized gear tooth modeling; and (5) a separate modeling approach for coupled EM motor-gearbox systems. These limitations need to be addressed to develop a robust model that produces reliable, quick, and automated results for condition monitoring and fault diagnosis problems. In this research work, these limitations are addressed, and results show that the proposed models (both data-driven and physical) are more realistic, accurate, and better than conventional ones, and successfully depict the faults in the system.
... However, these methods usually produce synthetic properties greater than the original set, so the reduced set properties are not of physical importance [37,40]. Fisher scores, the ReliefF algorithm, Wilcoxon ranks, gain ratios, memetic feature selection, chi-square, and IG are used to select relevant characteristics and improve precision in the diagnosis of mechanical failures [41][42][43]. ...
Article
Full-text available
Wind turbines generate clean and renewable energy for the international market. The most important aspect of wind turbine maintenance is reducing failures, downtime, and operating and maintenance expenses. This study aims to detect multiple faults exhibited by wind turbine blades; failures such as cracks (tip crack, mid-span crack, and crack near the root) were observed in the blades at different locations. The research suggests a new approach incorporating vibration signals and machine learning techniques to identify various failures in wind turbine blades. Feature-ranking techniques such as the ReliefF algorithm, chi-square, and information gain were adopted within a method framework to diagnose several problems in wind turbine blades, such as cracks in different locations. The k-nearest neighbors (KNN), support vector machine, and random forest classifiers are used to classify data based on the measured vibration signals. Eight main time-domain features are calculated from the vibration signals. The proposed methodology was validated using four databases. The results showed good classification accuracy in the four databases, with at least three non-conventional features in each database's top nine features for the three classification techniques. The results also showed that when the ReliefF selection algorithm is applied with the KNN classification algorithm, it generates the highest classification accuracy under all failure conditions, with a value of 97%. Finally, the performance of the proposed classification model is compared with other machine learning classification models, and a promising result is obtained.
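The abstract does not list its eight time-domain features; the sketch below computes a typical set of such condition indicators for a vibration segment, so the particular features chosen here are illustrative rather than the study's exact set.

```python
import numpy as np
from scipy import stats

def time_domain_features(x):
    # common time-domain condition indicators for a vibration segment
    rms = np.sqrt(np.mean(x ** 2))
    peak = np.max(np.abs(x))
    return {
        "mean": np.mean(x),
        "std": np.std(x),
        "rms": rms,
        "peak": peak,
        "skewness": stats.skew(x),
        "kurtosis": stats.kurtosis(x),
        "crest_factor": peak / rms,                  # peak over RMS
        "impulse_factor": peak / np.mean(np.abs(x)),
    }

segment = np.random.default_rng(0).normal(size=4096)  # stand-in vibration segment
print(time_domain_features(segment))
```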
... Bilevel optimization (BLOP) is an important research area of mathematical programming [12]. It has emerged as an important field for progress in handling many real-life problems in different domains, such as classification and machine learning [13,14]. The BLOP is a hierarchy of two optimization tasks (upper-level or leader and lower-level or follower problems). ...
Article
Full-text available
In multi-label classification, each instance can be assigned multiple labels at the same time. In such a situation, the relationships between labels and the class imbalance are two serious issues that should be addressed. Despite the important number of existing multi-label classification methods, the widespread class imbalance among labels has not been adequately addressed. Two main issues should be solved to come up with an effective classifier for imbalanced multi-label data. On the one hand, the imbalance could occur between labels and/or within a label. The "between-labels imbalance" occurs where the imbalance is between labels, whereas the "within-label imbalance" occurs where the imbalance is in the label itself, and it could occur across multiple labels. On the other hand, the labels' processing order heavily influences the quality of a multi-label classifier. To deal with these challenges, we propose in this paper a bi-level evolutionary approach for the optimized induction of multivariate decision trees, where the upper-level role is to design the classifiers while the lower level approximates the optimal labels' ordering for each classifier. Our proposed method, named BIMLC-GA (Bi-level Imbalanced Multi-Label Classification Genetic Algorithm), is compared to several state-of-the-art methods across a variety of imbalanced multi-label datasets from several application fields and then applied to the miRNA-related diseases case study. The statistical analysis of the obtained results shows the merits of our proposal.
Article
Feature selection (FS) is recognized for its role in enhancing the performance of learning algorithms, especially for high-dimensional datasets. In recent times, FS has been framed as a multi-objective optimization problem, leading to the application of various multi-objective evolutionary algorithms (MOEAs) to address it. However, the solution space expands exponentially with the dataset's dimensionality. Simultaneously, the extensive search space often results in numerous local optimal solutions due to a large proportion of unrelated and redundant features. Consequently, existing MOEAs struggle with local optima stagnation, particularly in large-scale multi-objective FS problems (LSMOFSPs). Different LSMOFSPs generally exhibit unique characteristics, yet most existing MOEAs rely on a single candidate solution generation strategy (CSGS), which may be less efficient for diverse LSMOFSPs. Moreover, selecting an appropriate MOEA and determining its corresponding parameter values for a specified LSMOFSP is time-consuming. To address these challenges, a multi-objective self-adaptive particle swarm optimization (MOSaPSO) algorithm is proposed, combined with a rapid non-dominated sorting approach. MOSaPSO employs a self-adaptive mechanism, along with five modified efficient CSGSs, to generate new solutions. Experiments were conducted on ten datasets, and the results demonstrate that the number of features is effectively reduced by MOSaPSO while lowering the classification error rate. Furthermore, superior performance is observed in comparison to its counterparts on both the training and test sets, with advantages becoming increasingly evident as the dimensionality increases.
Article
A spectrum-image based representation of machine vibration signals with a deep convolutional neural network is proposed for machine fault classification, in which the convolution layers are used for automatic feature extraction as an alternative to conventional feature-based methods. Two different forms of spectrum representation are proposed: one based on the short-time Fourier transform of the original signals, and the other based on the short-time Fourier transform of the intrinsic mode functions acquired by empirical mode decomposition. Empirical mode decomposition has its own merits in discriminating non-stationary signals, and the novelty of the work is to use the short-time Fourier transform of intrinsic mode functions with a deep convolutional neural network model. The classification and validation accuracy of the model are investigated with respect to epochs. It is demonstrated that both spectrum-based techniques perform well, with 100% model accuracy in a numerical experiment of binary classification on a bearing dataset that comprises normal and faulty signals. In another experiment using a milling dataset, the representation based on the short-time Fourier transform of intrinsic mode functions performs better, with 100% training accuracy and an F1 score of 0.8933, which is better than using the short-time Fourier transform of raw signals, whose training accuracy is 64% with an F1 score of 0.7486. The numerical study shows that the empirical mode decomposition based spectrum representation delivers the highest accuracy in the learning model, obviating the necessity for independent feature extraction, feature selection, and dimension reduction. The numerical experiment is extended using empirical mode decomposition based spectrums for multiple class classification problems on the bearing dataset. The confusion matrix obtained for 10 classes shows that the validation accuracy is 100% for all classes.
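As a sketch of the spectrum-image idea under stated assumptions (the sampling rate, window length, and overlap here are arbitrary), the fragment below turns a signal into a normalized |STFT| matrix suitable as CNN input; in the EMD-based variant the same transform would be applied to each intrinsic mode function (e.g., as produced by a package such as PyEMD) instead of the raw signal.

```python
import numpy as np
from scipy.signal import stft

fs = 12_000                                    # assumed sampling rate, Hz
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 30 * t) + 0.3 * np.random.default_rng(0).normal(size=fs)

# |STFT| as a 2-D "image"; in the EMD variant, apply this per IMF instead of x
f, frames, Z = stft(x, fs=fs, nperseg=256, noverlap=192)
image = np.abs(Z)                              # shape: (freq_bins, time_frames)
image = (image - image.min()) / (image.max() - image.min())   # scale to [0, 1]
print(image.shape)                             # e.g., input size for the CNN
```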
Article
Full-text available
There are growing demands for condition-based monitoring of gearboxes, and therefore new methods to improve the reliability, effectiveness and accuracy of gear fault detection ought to be evaluated. Feature selection is still an important aspect of machine learning-based diagnosis in order to reach good performance of the diagnostic models. On the other hand, random forest classifiers are suitable models in industrial environments where large data samples are not usually available for training such diagnostic models. The main aim of this research is to build a robust system for multi-class fault diagnosis in spur gears by selecting the best set of condition parameters in the time, frequency and time–frequency domains, which are extracted from vibration signals. The diagnostic system is built using genetic algorithms and a random forest classifier in a supervised environment. The original set of condition parameters is reduced by around 66% of its initial size by using genetic algorithms, while still achieving an acceptable classification precision above 97%. The approach is tested on real vibration signals by considering several fault classes, one of them being an incipient fault, under different running conditions of load and velocity.
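A minimal sketch of such a GA wrapper scheme follows, assuming synthetic data and deliberately simple operators (truncation selection, one-point crossover, bit-flip mutation) rather than the configuration used by the authors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)     # stand-in for condition parameters

def fitness(mask):
    # wrapper fitness: cross-validated accuracy of an RF on the selected columns
    if not mask.any():
        return 0.0
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.random((20, X.shape[1])) < 0.5       # random bit-mask population
for gen in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]    # truncation selection
    cuts = rng.integers(1, X.shape[1], size=10)
    children = np.array([np.concatenate([parents[i][:c], parents[(i + 1) % 10][c:]])
                         for i, c in enumerate(cuts)])   # one-point crossover
    children ^= rng.random(children.shape) < 0.02        # bit-flip mutation
    pop = np.vstack([parents, children])
best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))
```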
Article
Full-text available
This paper addresses the development of a random forest classifier for multi-class fault diagnosis in spur gearboxes. The vibration signal's condition parameters are first extracted by applying the wavelet packet decomposition with multiple mother wavelets, and the coefficients' energy content for the terminal nodes is used as the input feature for the classification problem. Then, a study through the parameter space is performed to find the best values for the number of trees and the number of random features. In this way, the best set of mother wavelets for the application is identified, and the best features are selected through the internal ranking of the random forest classifier. The results show that the proposed method reached 98.68% classification accuracy, with high efficiency and robustness of the models.
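A minimal sketch of the terminal-node energy features with PyWavelets follows; the decomposition level and mother wavelet are assumptions here, whereas the study searched over several mother wavelets.

```python
import numpy as np
import pywt

# energy of each terminal node of a wavelet packet decomposition,
# used as a feature vector for the classifier
x = np.random.default_rng(0).normal(size=4096)     # stand-in vibration signal
wp = pywt.WaveletPacket(data=x, wavelet="db4", mode="symmetric", maxlevel=3)
nodes = wp.get_level(3, order="natural")           # 2**3 = 8 terminal nodes
energies = np.array([np.sum(node.data ** 2) for node in nodes])
features = energies / energies.sum()               # relative energy per node
print(dict(zip([n.path for n in nodes], np.round(features, 4))))
```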
Article
Full-text available
There are growing demands for condition-based monitoring of gearboxes, and techniques to improve the reliability, effectiveness and accuracy of fault diagnosis are considered valuable contributions. Feature selection is still an important aspect of machine learning-based diagnosis in order to reach good performance in the diagnosis system. The main aim of this research is to propose a multi-stage feature selection mechanism for selecting the best set of condition parameters in the time, frequency and time-frequency domains, which are extracted from vibration signals for fault diagnosis purposes in gearboxes. The selection is based on genetic algorithms, proposing in each stage a new subset of the best features regarding the classifier performance in a supervised environment. The selected features are augmented at each stage and used as input for a neural network classifier in the next step, while a new subset of feature candidates is treated by the selection process. As a result, the inherent exploration and exploitation of the genetic algorithms for finding the best solutions of the selection problem are locally focused. The approach is tested on a dataset from a real test bed with several fault classes under different running conditions of load and velocity. The model performance for diagnosis is over 98%.
Article
This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines in selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.
Article
Fault diagnosis of a gearbox is a difficult problem due to the non-stationary vibration signals it generates. Usually, one method of fault diagnosis can only inspect one corresponding fault category. Vibration-based condition monitoring using machine learning methods is gaining momentum. In this paper, rough sets theory is used to diagnose gear faults in a gearbox. Through the analysis of the final reducts generated using rough sets theory, it is shown that this method is effective for diagnosing more than one type of fault in a gear. The performance of the rough set method is compared with that of the ID3 decision tree algorithm, and the results prove that the rough set method has greater capability to bring out the different fault conditions of the gearbox under investigation. The study reveals that the overall classification efficiency of the decision tree is to some extent better than the classification efficiency of the rough sets method.
Article
The paper presents an application of rough sets and statistical methods to feature reduction and pattern recognition. The presented description of rough sets theory emphasizes the role of rough set reducts in feature selection and data reduction in pattern recognition. The overview of methods of feature selection emphasizes feature selection criteria, including rough set-based methods. The paper also contains a description of the algorithm for feature selection and reduction based on the rough sets method proposed jointly with Principal Component Analysis. Finally, the paper presents numerical results of face recognition experiments using the learning vector quantization neural network, with feature selection based on the proposed principal components analysis and rough sets methods.
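Since attribute dependency from rough set theory underlies the dissimilarity measure of the present paper, a small sketch may help: the relative attribute dependency of a decision attribute on a subset of condition attributes can be computed as the ratio of distinct value tuples in two projections of the decision table. The toy table below is hypothetical.

```python
import pandas as pd

def relative_dependency(df, attrs, decision):
    # relative dependency of `decision` on the attribute subset `attrs`:
    # |distinct rows of df[attrs]| / |distinct rows of df[attrs + decision]|;
    # a value of 1 means `attrs` fully determines the decision
    proj_p = df[list(attrs)].drop_duplicates().shape[0]
    proj_pd = df[list(attrs) + [decision]].drop_duplicates().shape[0]
    return proj_p / proj_pd

df = pd.DataFrame({                 # tiny illustrative decision table
    "a1": [0, 0, 1, 1, 2, 2],
    "a2": [0, 1, 0, 1, 0, 1],
    "d":  [0, 0, 1, 1, 1, 1],
})
print(relative_dependency(df, ["a1"], "d"))   # 1.0: a1 alone determines d
print(relative_dependency(df, ["a2"], "d"))   # 0.5: a2 alone does not
```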
Article
Gearboxes are crucial transmission components in mechanical systems. Fault diagnosis is an important tool to maintain gearboxes in healthy conditions. It is challenging to recognize fault existence and, if any, failure patterns in such transmission elements due to their complicated configurations. This paper addresses a multimodal deep support vector classification (MDSVC) approach, which employs separation-fusion based deep learning in order to perform fault diagnosis tasks for gearboxes. Considering that different modalities can be made to describe the same object, multimodal homologous features of the gearbox vibration measurements are first separated into time, frequency and wavelet modalities, respectively. A Gaussian-Bernoulli deep Boltzmann machine (GDBM) without final output is subsequently suggested to learn pattern representations for features in each modality. A support vector classifier is finally applied to fuse the GDBMs in different modalities towards the construction of the MDSVC model. With the present model, "deep" representations from "wide" modalities improve fault diagnosis capabilities. Fault diagnosis experiments were carried out to evaluate the proposed method on both spur and helical gearboxes. The proposed model achieves the best fault classification rate in experiments when compared to representative deep and shallow learning methods. Results indicate that the proposed separation-fusion based deep learning strategy is effective for gearbox fault diagnosis.
Article
Defective bearing signatures can be detected by resonance demodulation of the vibration signals. The decision of the bearing fault detection largely depends on the quality of the identified resonant frequency band. Two key issues in locating the resonance frequency band are the proper segmentation of the frequency spectrum of interest and the criterion used to guide the search for the resonance band. To deal with these two issues, this paper proposes a criterion fusion approach to guide the spectral segmentation process. With the proposed approach, the frequency spectrum of the bearing signal is first divided into initial fine segments, which are then adaptively merged into different subsets using an enhanced bottom-up segmentation technique. To guide the spectral segmentation and merging process, three commonly used criteria, i.e., kurtosis, smoothness index and crest factor, are fused into a synthesized cost function using an entropy-based method. The final frequency band delivered by this approach has good coverage of the resonant band and is then used to demodulate the bearing signals. Both simulated and experimental signals have been employed to evaluate the proposed approach, which has also been compared to single-criterion methods. The comparison indicates that the fused criterion yields better results than those from the single criteria.
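To make the three criteria concrete, the sketch below computes kurtosis, crest factor, and a smoothness index for the envelope of one candidate band. The band edges and filter are arbitrary, the smoothness index is taken as the geometric-to-arithmetic mean ratio of the envelope (one common definition), and the entropy-based fusion itself is not reproduced.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert
from scipy.stats import gmean, kurtosis

def band_criteria(x, fs, lo, hi):
    # band-limit the signal, then score the band with the three criteria
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    xb = filtfilt(b, a, x)
    env = np.abs(hilbert(xb))                  # envelope of the band signal
    return {
        "kurtosis": kurtosis(xb),
        "crest_factor": np.max(np.abs(xb)) / np.sqrt(np.mean(xb ** 2)),
        "smoothness": gmean(env) / np.mean(env),   # one common definition
    }

fs = 20_000                                    # assumed sampling rate, Hz
x = np.random.default_rng(0).normal(size=fs)   # stand-in bearing signal
print(band_criteria(x, fs, lo=2_000, hi=4_000))
```

Scoring each candidate band this way and combining the normalized criteria, for instance with entropy-derived weights, reproduces the spirit of the fusion step described above.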