The area of predictive maintenance has gained considerable prominence in the last couple of years. With new algorithms and methodologies emerging across different learning methods, it remains a challenge for industries to decide which method is fit, robust, and provides the most accurate detection. Fault detection is one of the critical components of predictive maintenance; industries need to detect faults early and accurately. In a production environment, to minimize the cost of maintenance, it is sometimes necessary to build a model with minimal or no historical data. In such cases, unsupervised learning is a better option for model building. In this paper, we take simple vibration data collected from an exhaust fan and fit different unsupervised learning algorithms, namely the PCA T2 statistic, hierarchical clustering, K-Means, Fuzzy C-Means clustering, and model-based clustering, to test their accuracy, performance, and robustness. Finally, we propose a methodology to benchmark different algorithms and choose the final model.
The concept of predictive maintenance (PdM) was
proposed a few decades ago. PdM is a subset of planned
maintenance, but it did not gain prominence until the recent
decade. This rapid advance is mainly due to emerging
internet technologies, connected sensors, systems capable of
handling big data sets, and a growing recognition of the need
for these techniques. The growth can also be attributed to
the demand for high-quality products at the least cost and
with the shortest lead time. It is estimated that U.S.
industry spends $200 billion every year on maintenance of
plant equipment and facilities, and that ineffective
maintenance leads to a loss of more than $60 billion [1]. In
the food and beverage industry, failures and downtime were
estimated to account for 18% of OEE [2]. Over the years,
different architectures, algorithms, and methodologies have
been proposed. One of the most prominent is the
Watchdog Agent, a design that encloses various machine
learning algorithms [3] [11]. Other architectures include
the OSA-CBM architecture [4], the SIMAP architecture [5],
and a predictive maintenance framework [6]. Emerging
technologies such as Internet of Things (IoT) devices have
formed a gateway to connect to machines and their
subcomponents, not only to collect process data and
parameters but also to collect physical health indicators of
the machine such as vibration, pressure, temperature,
acoustics, viscosity, flow rate, and many others. This
information is widely used for early fault detection, fault
identification, health assessment of the machine, and
prediction of its future state. Some of this is made possible
by machine learning algorithms available across different
learning domains.
Machine learning is a subfield of artificial intelligence
(Figure 1). It can be defined as a program or an
algorithm that is capable of learning with minimal or no
additional support. Machine learning helps in solving many
problems in areas such as big data, vision, speech
recognition, and robotics [7]. Machine learning is classified
into three types. In supervised learning, both the predictors
and the response variables are known when building the
model; in unsupervised learning, only the predictors are
known and no response variable is available; and in
reinforcement learning, the agent learns actions and
consequences by interacting with the environment. In this
research, the main focus is on unsupervised learning. One of
the most commonly used approaches in unsupervised
learning is clustering, where observations are grouped
into clusters, either user-defined or model-defined, based on
the distance, model, density, class, or characteristic of the
variables. For this research, vibration data has been used. Data
collection, feature selection, and extraction are described
in the later sections.
Figure 1. Structure of learning methods.
All the programming in this research was performed in the
statistical programming language R. R is open source
software and was designed by Ross Ihaka and
Robert Gentleman in August 1993. As of today, there are
The primary goal of PdM is to reduce the cost of a
product or service and to provide a competitive advantage in
the market. Today, business analytics is embedded
across PdM to establish the need for it and to support
appropriate decisions. Business analytics can be viewed from
three different perspectives: (i) descriptive analytics,
(ii) predictive analytics, and (iii) prescriptive analytics
[16]. Descriptive analytics answers questions such as “what
happened in the past?” This is done by analyzing historical
data and summarizing it in charts. In maintenance, this step
is performed using control charts. Predictive analytics is an
extension of descriptive analytics in which historical data is
analyzed to predict future outcomes. In maintenance, it
is used to predict the type of failure and the time to
complete failure. Finally, prescriptive analytics is a process
of optimization that identifies the best alternatives to
minimize or maximize an objective. It answers questions such
as “what can be done?” In maintenance, it can be used to
optimize maintenance schedules to minimize the cost of
maintenance. In this paper, our primary focus is on
descriptive and predictive analytics to detect faults.
Predictive analytics has found applications in
various areas such as railway track maintenance,
vehicle monitoring [23], automotive subcomponents [8],
utility systems [19], computer systems, electrical grids [13],
aircraft maintenance [21], the oil and gas industry,
computational finance, and many more.
Fault detection is one of the concepts in predictive
maintenance that is well accepted in industry. Early
failure detection can potentially prevent catastrophic
machine failures. A recent study classifies fault detection
approaches into quantitative model-based methods,
qualitative model-based methods, and process
history-based methods [25].
Principal component analysis (PCA) is one of the oldest
and most prominent algorithms in wide use today. It
was first introduced by Karl Pearson in 1901. Since then, there
have been many hybrid approaches to PCA for fault
detection, such as kernel PCA [17], adaptive thresholds
using an exponentially weighted moving average for the T2
and Q statistics [9], the multiscale neighborhood
normalization-based multiple dynamic principal component
analysis (MNN-MDPCA) method [27], and independent
component analysis.
Another common approach to fault detection is
clustering. As with PCA, there are various
algorithms, such as neural network clustering and
subtractive clustering [28], K-means [10],
Gaussian mixture models [15], C-Means, hierarchical
clustering [22], and modified rank order clustering
(MROC) [33].
Fault detection is one of the most critical components of
predictive maintenance. It can be defined as the
process of identifying abnormal behavior of a subsystem.
Any deviation from standard behavior can be categorized
as a fault. In this section, we discuss different
algorithms, namely the principal component analysis (PCA) T2
statistic, hierarchical clustering, K-Means clustering,
C-Means, and model-based clustering, for fault detection and
benchmark their results on vibration monitoring data.
A. Data Collection
Vibration monitoring is one of the most commonly used
techniques to detect abnormalities in a machine subsystem. In
this research, a vibration sensor was set up on
an exhaust fan. Vibration was recorded every 240
minutes for 12 days at a sampling frequency of 2048 Hz on
both the X and Y axes. From these data, different
features were extracted, such as peak acceleration, peak
velocity, turning speed, RMS velocity, and damage
accumulation. Figure 2 shows time series plots of the features.
Figure 2. Feature data plot.
In Figure 2, we can see a trend developing near
the 60th observation. In this paper, we test how early
different algorithms can detect this fault.
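As a rough illustration of how two such features could be derived from one recording, the sketch below computes the peak and RMS amplitude of a sampled window. The function name `peak_and_rms`, the mean-removal step, and the synthetic 50 Hz test signal are assumptions made for this example only; the paper does not state how its features were computed, and velocity features would additionally require integrating the acceleration signal.

```python
import math


def peak_and_rms(samples):
    """Return (peak, rms) of one vibration window after removing its mean."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    peak = max(abs(c) for c in centered)
    rms = math.sqrt(sum(c * c for c in centered) / len(centered))
    return peak, rms


# Hypothetical one-second window at 2048 Hz: a 50 Hz sine of amplitude 2.0.
fs = 2048
window = [2.0 * math.sin(2 * math.pi * 50 * k / fs) for k in range(fs)]
peak, rms = peak_and_rms(window)
print(f"peak={peak:.4f}, rms={rms:.4f}")
```

For a pure sine the RMS is the amplitude divided by the square root of two, which gives a quick sanity check on the function.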
B. Feature Selection Using PCA
Not all extracted features are truly correlated with the
condition of the machine. If the right features are not
selected, a significant amount of noise is added to the
final model, which reduces its accuracy. One of the most
prominent algorithms used for dimensionality reduction is
principal component analysis (PCA). PCA
is a mathematical algorithm that reduces the
dimensionality of the data while retaining most of the
variation (information) in the data set [18]. Put simply,
it is an algorithm that identifies patterns in data and
expresses the data in a way that highlights their
similarities and differences [29].
Step 1: Consider a data matrix X
$X \in \mathbb{R}^{m \times n}$ (1)
where X is the data matrix, m is the number of rows, and n is the number of columns
Step 2: Subtract the mean from each dimension
$\bar{X} = X - \mathbf{1}_m \mu^{\top}$ (2)
where $\mu$ is the vector of column means of X
Step 3: Calculate the covariance matrix
$C = \frac{1}{m-1} \bar{X}^{\top} \bar{X}$ (3)
Step 4: Calculate the eigenvectors and eigenvalues of the
covariance matrix
$C p_i = \lambda_i p_i, \quad i = 1, \dots, n$ (4)
Step 5: Store the eigenvectors in a matrix
$P = [\, p_1 \;\; p_2 \;\; \cdots \;\; p_n \,]$ (5)
Step 6: Store eigenvalues in a diagonal matrix
$\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$ (6)
where $\Lambda$ holds the eigenvalues corresponding to the
principal components, and P contains the loading vectors
Step 7: Rank eigenvalues in decreasing order and choose top
“r” vectors to retain
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ (7)
Step 8: Retain “r” eigenvectors
$P_r = [\, p_1 \;\; p_2 \;\; \cdots \;\; p_r \,]$ (8)
Step 9: Calculate the principal components [U], which are the
data matrix projected onto the retained eigenvectors
$U = \bar{X} P_r$ (9)
The PCA summary (Figure 3) indicates that the first two
principal components explain 95.65% of the variance.
A scree plot of eigenvalues versus
principal components is shown in Figure 4. This plot can be
used to identify the components that capture significant
variance in the data.
From the summary and the scree plot, we can conclude that
the first two principal components capture most of the
variation compared to the rest of the principal components.
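The nine PCA steps can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' R code: the function name `pca`, the toy data, and the use of power iteration with deflation (in place of a library eigen-solver) are all assumptions made so the example needs no external packages.

```python
import math


def pca(X, r, iters=500):
    """Return (eigenvalues, loadings, scores) for the top-r components of X."""
    m, n = len(X), len(X[0])
    # Step 2: subtract the column means.
    means = [sum(row[j] for row in X) / m for j in range(n)]
    Xc = [[row[j] - means[j] for j in range(n)] for row in X]
    # Step 3: sample covariance matrix (n x n).
    C = [[sum(row[i] * row[j] for row in Xc) / (m - 1) for j in range(n)]
         for i in range(n)]
    # Steps 4-8: leading eigenpairs, largest first, by power iteration.
    A = [row[:] for row in C]
    eigenvalues, loadings = [], []
    for _ in range(r):
        v = [float(j + 1) for j in range(n)]  # arbitrary non-zero start vector
        for _ in range(iters):
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v]
            w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
            if all(x == 0.0 for x in w):
                break
            v = w
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        lam = sum(v[i] * A[i][j] * v[j] for i in range(n) for j in range(n))
        eigenvalues.append(lam)
        loadings.append(v)
        # Deflate: remove the component just found before the next pass.
        A = [[A[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    # Step 9: project the centered data onto the retained loading vectors.
    scores = [[sum(x * p for x, p in zip(row, vec)) for vec in loadings]
              for row in Xc]
    return eigenvalues, loadings, scores


# Hypothetical two-feature data lying close to the line y = 2x.
X = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
eigenvalues, loadings, scores = pca(X, r=1)
print(eigenvalues, loadings)
```

Because the toy data are almost perfectly collinear, nearly all of the variance falls on the first component, which mirrors the situation the scree plot is used to diagnose.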
Figure 3. Summary of PCA.
C. T2 Statistic
The T2 statistic is a multivariate statistical measure. The T2
statistic for a data observation x can be calculated by [12]
$T^2 = x^{\top} P_a \Lambda_a^{-1} P_a^{\top} x$ (10)
where $P_a$ contains the loading vectors of the a retained
principal components and $\Lambda_a$ is the diagonal matrix of
the corresponding eigenvalues.
The upper confidence limit for T2 is obtained using the
F-distribution:
$T^2_{\alpha} = \frac{a(n-1)(n+1)}{n(n-a)} F_{\alpha}(a, n-a)$ (11)
where n is the number of samples in the data, a is the number
of principal components, and α is the level of significance
[24]. The statistic for each observation is compared against
this threshold, and any value above the threshold is
considered out-of-control data. In this case, such values
indicate faulty data. The results for the vibration data are
shown in Figure 5.
Based on the T2 statistic results in Figure 5, we can
observe that the fault can be detected as early as
observation 41. This early detection helps
maintenance teams monitor these process changes and
take corrective action accordingly.
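A minimal sketch of the T2 calculation and its threshold is given below. It assumes the principal component scores and eigenvalues are already available, in which case the T2 value of an observation reduces to the sum of its squared scores divided by the matching eigenvalues. The function names, the toy scores, the choice of n = 40 and a = 2, and the F critical value of roughly 3.24 (read from an F table for α = 0.05) are all illustrative assumptions, not values taken from the paper.

```python
def t2_statistic(scores, eigenvalues):
    """T2 per observation: sum over components of score^2 / eigenvalue."""
    return [sum(t * t / lam for t, lam in zip(row, eigenvalues))
            for row in scores]


def t2_upper_limit(n, a, f_crit):
    """Upper confidence limit; f_crit is the F critical value F_alpha(a, n - a)."""
    return a * (n - 1) * (n + 1) / (n * (n - a)) * f_crit


def flag_faults(t2_values, limit):
    """Return 1-based indices of observations whose T2 exceeds the limit."""
    return [i for i, t2 in enumerate(t2_values, start=1) if t2 > limit]


# Hypothetical scores on two retained components and their eigenvalues.
scores = [[0.5, 0.1], [-1.0, 0.2], [6.0, 1.5]]
eigenvalues = [4.0, 0.25]

t2 = t2_statistic(scores, eigenvalues)
limit = t2_upper_limit(n=40, a=2, f_crit=3.24)  # assumed table value
faults = flag_faults(t2, limit)
print(t2, round(limit, 3), faults)
```

Only the third toy observation exceeds the limit, so it alone would be reported as out of control.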
D. Cluster Analysis
Cluster analysis is one of the unsupervised learning
methods. In cluster analysis, similar data are grouped into
clusters. Some of the most prominent methods
are K-Means clustering, C-Means clustering, and
hierarchical clustering. Clustering algorithms follow various
merging principles: iterative, hierarchical,
density-based, metasearch-controlled, and stochastic. In this
paper, we discuss one commonly used
hierarchical clustering method.
E. Optimal Number of Clusters
In cluster analysis, we need to know the optimal number
of clusters that can be formed. Although we know that we
have healthy data and faulty data, identifying the
optimal number of clusters in our data helps in
understanding the different states in the data and
representing the data more accurately. Many procedures are
available to identify the number of clusters, such as the
elbow method, the Bayesian information criterion (BIC), and
the nbClust package in R. The results of the elbow method are
shown in Figure 6, and those from nbClust [30] are shown in
Figure 7.
Figure 4. Scree plot to determine the variation between principal components.
Figure 5. T2 statistic results for training dataset and testing dataset.
From both procedures, shown in Figure 6 and Figure 7,
we find that three is the optimal number of
clusters. For fault detection, we can use three clusters and
hypothesize that they represent a normal condition, a warning
condition, and a faulty condition. In the following sections,
we examine the results that each of the clustering
algorithms provides.
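The elbow method can be sketched as follows: compute the within-cluster sum of squares (WSS) for an increasing number of clusters and pick the point after which adding a cluster no longer pays off. The elbow is normally judged by eye from the plot; the numeric rule used here (stop when the next cluster improves WSS by less than 5% of the total) is an assumed heuristic, as are the function names, the tiny one-dimensional K-means, and the toy data.

```python
def within_ss(values, k, iters=100):
    """WSS of a small 1-D k-means started from evenly spaced sorted points."""
    data = sorted(values)
    n = len(data)
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: (v - centers[c]) ** 2)
            groups[nearest].append(v)
        # Empty clusters keep their previous center.
        updated = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sum(min((v - c) ** 2 for c in centers) for v in data)


def elbow(wss_by_k, tol=0.05):
    """Smallest k whose next cluster improves WSS by less than tol of the total."""
    ks = sorted(wss_by_k)
    total = wss_by_k[ks[0]]
    for k, k_next in zip(ks, ks[1:]):
        if (wss_by_k[k] - wss_by_k[k_next]) / total < tol:
            return k
    return ks[-1]


# Hypothetical 1-D feature with three well-separated states.
values = [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2, 9.0, 9.1, 8.9, 9.2]
wss_by_k = {k: within_ss(values, k) for k in range(1, 7)}
best_k = elbow(wss_by_k)
print(best_k)
```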
Figure 6. Determining the optimal number of clusters based on elbow method.
Figure 7. Determining the number of clusters using nbClust package.
F. Hierarchical Clustering
Start by assigning each item to its own cluster, so that if
you have N items, you now have N clusters, each containing
just one item. Let the distances (similarities) between the
clusters equal the distances (similarities) between the items
they contain [24].
Step 1: Find the closest (most similar) pair of clusters and
merge them into a single cluster, so that now you have one
less cluster.
Step 2: Compute distances (similarities) between the new
cluster and each of the old clusters.
Step 3: Repeat steps 1 and 2 until all items are clustered into
a single cluster of size N.
In Figure 8, the clusters are formed from the feature
data using Ward's method. The results were identical whether
the feature data or the principal components were used. Three
clusters were formed: the first cluster includes
observations 1 to 40, the second cluster includes
observations 41 to 67, and the third cluster includes
observations 68 to 71. Based on domain knowledge,
we can label cluster 1 as the healthy data set, cluster 2 as
the warning data set, and cluster 3 as the faulty data set.
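The merge loop described in the steps above can be sketched as below, using Ward's criterion (merge the pair of clusters whose union gives the smallest increase in within-cluster sum of squares) and stopping once k clusters remain instead of continuing to a single cluster. This is a naive illustration with assumed function names and toy points; it is not the R routine used in the paper and is far too slow for large data sets.

```python
def ward_cost(points, a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    dim = len(points[0])
    ca = [sum(points[i][d] for i in a) / len(a) for d in range(dim)]
    cb = [sum(points[i][d] for i in b) / len(b) for d in range(dim)]
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap


def agglomerate(points, k):
    """Merge the cheapest pair of clusters repeatedly until k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every item starts alone
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: ward_cost(points, clusters[ij[0]],
                                                   clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical 2-D feature points forming three groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2),
          (9.0, 0.0), (9.2, 0.1)]
clusters = agglomerate(points, k=3)
print(clusters)
```

Each returned cluster is a list of observation indices, so contiguous index ranges like those reported for Figure 8 can be read off directly.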
G. K-Means and Fuzzy C-Means Clustering
K-means is one of the most common unsupervised
clustering algorithms. The goal of this straightforward
algorithm is to divide the data set into a predetermined
number of clusters based on distance. Here, we have used
Euclidean distance. The graphical results are shown in Figure 9.
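A compact sketch of K-means with Euclidean distance follows. The alternating assign-and-update loop is the standard Lloyd iteration; the farthest-point seeding is an assumption chosen here only to make the example deterministic (implementations such as R's typically use random starts), and the function names and toy points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100):
    """Return (labels, centers) after Lloyd iterations from farthest-point seeds."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers


# Hypothetical 2-D feature points forming three groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2),
          (9.0, 0.0), (9.2, 0.1)]
labels, centers = kmeans(points, k=3)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which observations share a label.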
C-means is a data clustering technique in which each data
point belongs to every cluster to some degree. Fuzzy C-Means
was first introduced by Bezdek [14]. Fuzzy C-Means
has been applied in various fields such as agriculture,
engineering, astronomy, chemistry, geology, image analysis
[14], medical diagnosis, shape analysis, and target
recognition [26]. The graphical results for C-Means are
shown in Figure 9.
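The idea of partial membership can be made concrete with a short Fuzzy C-Means sketch. It alternates the two standard updates: memberships from inverse distances to the centers, then centers as membership-weighted means. The function name, the fuzzifier m = 2, the starting guesses for the centers, and the toy points are assumptions made for illustration.

```python
def fuzzy_c_means(points, centers, m=2.0, iters=100, tol=1e-9):
    """Return (memberships, centers); each membership row sums to 1."""
    centers = [tuple(c) for c in centers]
    power = 1.0 / (m - 1.0)
    dim = len(points[0])
    memberships = []
    for _ in range(iters):
        memberships = []
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            if min(d2) == 0.0:
                # Point sits exactly on a center: full membership there.
                row = [1.0 if d == 0.0 else 0.0 for d in d2]
            else:
                row = [(1.0 / d) ** power for d in d2]
            total = sum(row)
            memberships.append([r / total for r in row])
        new_centers = []
        for i in range(len(centers)):
            w = [row[i] ** m for row in memberships]
            new_centers.append(tuple(
                sum(wj * p[d] for wj, p in zip(w, points)) / sum(w)
                for d in range(dim)))
        shift = max(sum((a - b) ** 2 for a, b in zip(c, nc))
                    for c, nc in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break  # centers have effectively stopped moving
    return memberships, centers


# Hypothetical 2-D feature points and rough starting guesses for the centers.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2),
          (9.0, 0.0), (9.2, 0.1)]
memberships, centers = fuzzy_c_means(points, [(1, 1), (4, 4), (8, 1)])
hard_labels = [row.index(max(row)) for row in memberships]
print(hard_labels)
```

Taking the largest membership per point gives a hard assignment that can be compared directly with the K-means result.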
Summary of K-Means and C-Means Clustering
Within cluster sum of squares by cluster:
[1] 16.758705 39.575966 8.823486
(between_SS / total_SS = 90.2 %)
From the Table III summary of K-means and C-means
clustering, we can observe that clusters of sizes 4, 27, and 40
are formed. Observations 1 to 40 formed one cluster, 41 to 67
formed the second cluster, and 68 to 71 formed the third
cluster. These results are the same as those of hierarchical
clustering.
Figure 8. Hierarchical clustering solution for fault identification.
H. Model-Based Clustering
A Gaussian mixture model (GMM) is used for modeling
data that comes from one of several groups: the groups
might be different from each other, but data points within the
same group can be well modeled by a Gaussian distribution
[20]. A Gaussian finite mixture model is fitted by the EM
algorithm, an iterative algorithm that starts from an initial
(possibly random) estimate and updates it at every iteration
until convergence is detected [31] [32]. The algorithm can be
initialized either with a set of initial parameters, starting
at the E-step, or with a set of initial weights, proceeding to
the M-step. The initial values can be set randomly or
chosen by some other method.
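The E-step and M-step can be illustrated with a one-dimensional, two-component mixture. This is a simplified sketch, not the multivariate model-based clustering used in the paper: the function names, the starting parameters, the variance floor, and the toy data are assumptions, and no model selection over the number of components is attempted.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iters=200, tol=1e-8):
    """Fit a 1-D Gaussian mixture by EM, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    prev_ll = None
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp, ll = [], 0.0
        for x in data:
            dens = [w * normal_pdf(x, mu, var)
                    for w, mu, var in zip(weights, means, variances)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for k in range(len(means)):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsing component
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break  # log-likelihood has stopped improving
        prev_ll = ll
    return weights, means, variances, resp


# Hypothetical 1-D feature with a low state and a high state.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1, 4.9]
weights, means, variances, resp = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
assignment = [r.index(max(r)) for r in resp]
print(means, assignment)
```

Here the procedure starts from initial parameters and therefore enters at the E-step, which is one of the two initialization routes described above.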
Summary of Classification
Mclust EVV (ellipsoidal, equal volume) model with five
log.likelihood    n     df    BIC         ICL
-57.23501         71    25    -221.037    -222.0734
Figure 9. K-Means and C-Means clustering for fault identification.
The results are summarized in Table 3. In the classification
from the Gaussian finite mixture model fitted by the EM
algorithm, a total of 5 components
were formed. Components 1 and 2 are assigned to
observations 1 to 40, component 3 consists of
observations 41 to 63, component 4 consists of
observations 64 to 67, and component 5 consists of
observations 68 to 71. It is interesting to note that the
critical fault is detected just as accurately as by the
other clustering algorithms.
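The BIC reported in the classification summary can be checked by hand. Assuming the sign convention in which BIC is twice the log-likelihood minus a penalty of df times the natural logarithm of n (so that larger values are better), the reported log-likelihood, n, and df give:

```latex
\mathrm{BIC} = 2 \log L - \mathrm{df} \cdot \ln n
             = 2(-57.23501) - 25 \ln 71
             \approx -114.470 - 106.567
             = -221.037
```

This agrees with the BIC value listed in the summary.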
In this research, we initially hypothesized two
states in the data: a healthy data set and an
unhealthy data set. Using PCA and the T2 statistic, we were able
to fit these hypothesized states and detect the faults 31
observations ahead, whereas without a tool, based only
on data plots, we could observe the trend only 11
observations ahead. As we moved on to fitting different
unsupervised clustering algorithms, we found most of the
clustering algorithms provided much more than the T2
Using the elbow method and the nbClust package, we were able
to identify that the optimal number of clusters
was three. Based on these results, when the data
were fitted with hierarchical clustering, K-means, and C-means,
the results were nearly identical. Based on prior
knowledge of the data, we were able to identify each of the
three states. The first state was identified as the healthy
state (since the model was calibrated on healthy data), the
second state as the warning state, and the third state
as the faulty state. These results are not surprising, as all
these algorithms are based on a distance measure.
Figure 10. Gaussian finite mixture model fitted by EM algorithm
For our final model, a Gaussian finite mixture model fitted
by the EM algorithm was used. Rather than requiring the number
of clusters as input, this model identifies the optimal number
of clusters and classifies the observations into groups
accordingly. Here, the model recognized a total of 5
components. Even with five components, upon closer
investigation we could observe an overlap between components
1 and 2 and between components 3 and 4. When these components
are regrouped,
we can observe a pattern much like that of the previous cluster
This research started out as a test bed to benchmark
different machine learning algorithms for early fault
detection using unsupervised learning. In our results, the T2
statistic provided more accurate results than the GMM
method, and no hypothesis was required to identify the
relationship between cluster and state. One of the main
benefits of this method is that, even when it is deployed in
a manufacturing environment with minimal or no
domain knowledge, one can identify a fault or critical
condition, which is not the case for cluster analysis. In
clustering, on the other hand, some information about the
data is needed to label the clusters as healthy, warning, or
critical. Clustering is nevertheless a better tool for
detecting different levels of faults, which is challenging
for the T2 statistic beyond a certain level. For example, when
machine maintenance is expensive, clustering is
a flexible option because machine health can be monitored
continuously until a critical level is reached.
In conclusion, although most algorithms
provided nearly identical results, each algorithm provided
deeper insight into the data. Hence, if the application is only
to detect faults, the T2 statistic is an excellent tool.
But if fault detection needs to be performed at different
levels, clustering algorithms are a better choice.
Fault detection is one of the preliminary analytics for
predictive maintenance. Hence, detecting faults accurately
is important. This work has so far been performed on
vibration data. The scope of this research can be extended
to other physics-based parameters and combinations of
these parameters. It would also be interesting to observe the
detection accuracy for a bigger sample size and multiple fault
