Article

Determining the number of principal components for best reconstruction

Abstract

A well-defined variance of reconstruction error (VRE) is proposed to determine the number of principal components in a PCA model for best reconstruction. Unlike most other methods in the literature, this proposed VRE method has a guaranteed minimum over the number of PC's corresponding to the best reconstruction. Therefore, it avoids the arbitrariness of other methods with monotonic indices. The VRE can also be used to remove variables that are little correlated with others and cannot be reliably reconstructed from the correlation-based PCA model. The effectiveness of this method is demonstrated with a simulated process.
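To make the abstract's idea concrete, the following is a minimal Python sketch of a VRE-style selection of the number of PCs. It assumes the unreconstructed-variance expression commonly reported in the follow-up literature (with the residual-subspace projector built from the retained loadings and S the correlation matrix of the autoscaled data); the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def vre_curve(X):
    """VRE-style curve over the number of retained PCs (hedged sketch)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale the data
    S = np.cov(Xs, rowvar=False)                         # ~ correlation matrix
    eigval, eigvec = np.linalg.eigh(S)
    eigvec = eigvec[:, np.argsort(eigval)[::-1]]         # loadings, descending order
    m = S.shape[0]
    vre = []
    for l in range(1, m):                                # candidate numbers of PCs
        P = eigvec[:, :l]
        Ct = np.eye(m) - P @ P.T                         # residual-subspace projector
        total = 0.0
        for j in range(m):
            xi = np.zeros(m)
            xi[j] = 1.0                                  # direction of variable j
            denom = xi @ Ct @ xi
            if denom < 1e-12:                            # variable cannot be reconstructed
                total = np.inf
                break
            u_j = (xi @ Ct @ S @ Ct @ xi) / denom**2     # unreconstructed variance of x_j
            total += u_j / S[j, j]                       # normalise by var(x_j)
        vre.append(total)
    return np.array(vre)

# usage: best_l = int(np.argmin(vre_curve(X))) + 1
```

The number of PCs is then taken as the value at which this curve attains its minimum; because the VRE has a guaranteed minimum, no arbitrary threshold is needed.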

... The selection of the appropriate number of principal components [41] is the key step in identifying the PCA model. In this study, the reconstruction error variance is minimized based on the interval data, as in [41], to determine the number of principal components for the PCA model. ...
... The selection of the appropriate number of principal components [41] is the key step in identifying the PCA model. In this study, the reconstruction error variance is minimized based on the interval data, as in [41], to determine the number of principal components for the PCA model. Generally, when PCA based on single-valued data is applied, the reconstruction method is used to estimate a variable based mainly on the PCA model. ...
... Generally, when PCA based on single-valued data is applied, the reconstruction method is used to estimate a variable based mainly on the PCA model. The accuracy of the reconstruction depends on the capability of the PCA model to reveal the relations between all variables [41], [42]. In [18] the authors propose the IPCA approach using variable reconstruction. ...
Article
Full-text available
The operation of heating, ventilation, and air conditioning (HVAC) systems is usually disturbed by many uncertainties such as measurement errors, noise, as well as temperature. Thus, this paper proposes a new multiscale interval principal component analysis (MSIPCA)-based machine learning (ML) technique for fault detection and diagnosis (FDD) of uncertain HVAC systems. The main goal of the developed MSIPCA-ML approach is to enhance the diagnosis performance, improve the indoor environment quality, and minimize the energy consumption in uncertain building systems. The model uncertainty is addressed by considering the interval-valued data representation. The performance of the proposed FDD is investigated using sets of synthetic and emulated data extracted under different operating conditions. The presented results confirm the high efficiency of the developed technique in monitoring uncertain HVAC systems due to the high diagnosis capabilities of the interval feature-based support vector machines and k-nearest neighbors and their ability to distinguish between the different operating modes of the HVAC system.
... The selection of the appropriate number of principal components [122,156,157] is the key step in identifying the PCA model. In this study, the reconstruction error variance is minimized based on the interval data, as in [122,123,156], to determine the number of principal components for the PCA model. ...
Thesis
Currently, indoor discomfort in residences and the energy consumption of buildings are the most important problems facing Heating, Ventilating, and Air Conditioning (HVAC) systems. Indeed, people spend 60% to 90% of their lives in buildings. Indoor comfort is essential for the health, productivity, and well-being of the occupants. Building HVAC systems account for more than 66% of annual energy consumption in the European Union. Nevertheless, it has become evident that HVAC systems operate effectively, or according to design intent, in only a small percentage of buildings. Studies have shown that malfunctions are one of the main reasons for the ineffective performance of these systems. It is estimated that an energy saving of 5% to 15% can be obtained simply by detecting faults and optimizing building control systems. Despite good progress in recent years, fault management techniques in building HVAC systems are still generally underdeveloped; in particular, there is still a lack of affordable, reliable, and scalable solutions for dealing with faults in HVAC systems. It is now well known that fault detection and diagnosis (FDD) are very important to guarantee the safety of HVAC systems, improve user comfort, improve energy efficiency, and reduce operating and maintenance costs. Nevertheless, efficient FDD techniques for HVAC systems remain a challenge, and the efficiency and reliability of HVAC systems are becoming the most important concerns. This thesis focuses on a number of issues that, in our opinion, are crucial to the development of reliable and scalable diagnostic solutions for building HVAC systems. The FDD strategies may be divided into two main phases: feature extraction and selection (FES), and fault classification (FC). In the FES step, the goal is to extract the most relevant and efficient features from the data. The feature extraction strategy used in this work is principal component analysis (PCA). The PCA model is the most well-known multivariate FES solution in data science. In order to further enhance the quality of the PCA-based FES, a multiscale representation is proposed with the aim of combining the ability of multivariate techniques to extract cross-correlations between variables with the capability of orthonormal wavelets to separate features from noise and approximately decorrelate the available measurements. Next, it is important to select the most relevant and informative features before performing the classification, which may improve the diagnosis efficiency. Therefore, to address the problem of feature selection, multivariate statistical charts, statistical measures, and Relief-based methods are used. Finally, the faults must be classified to diagnose the HVAC systems. To this aim, different machine learning techniques are developed for system monitoring and diagnosis. The second objective of this thesis is to extend the FDD strategies developed above to deal with model uncertainties in HVAC systems. To do that, new interval FDD approaches are developed. Real HVAC systems are often affected by different types of uncertainties, mainly due to measurement errors and noise, as well as temperature. The uncertainty in the model may be addressed by considering interval-valued data. Therefore, the developed techniques extend the above-proposed FDD approaches by taking into account the HVAC variable uncertainties.
Finally, the developed FDD approaches for certain and uncertain HVAC systems are validated and assessed using different data sets. Different scenarios and case studies are investigated in order to demonstrate the efficiency and robustness of the proposed techniques.
... Much as the strategies to define an appropriate number of principal components are well known in the literature (see, e.g., [6], [7]), there is a lack of consensus about how to adjust them to detect specific events of interest in PMU data. Therefore, a systematic evaluation of those procedures is necessary to perform dimensionality reduction and event detection with PCA effectively. ...
... 4) Variance of reconstruction error: In this method, further explained in [6], the optimal value of r is determined by the minimum variance of reconstruction error (VRE), in agreement with (15), considering a faulty observation x_f represented by an m-dimensional unitary vector ξ_i multiplied by a fault magnitude f and the correlation matrix of reconstruction error R. This procedure results in the best reconstruction of the variables, as the VRE decreases monotonically in the residual subspace and increases in the projection subspace with the number of principal components, and the selection of r can be adjusted to detect specific events of interest defined by x_f. ...
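For reference, the quantities described verbally in this snippet are often written as follows. This is a hedged transcription in standard notation (ξ_i a unit fault direction, f the fault magnitude, S the data covariance or correlation matrix, P_ℓ the retained loadings), not the exact equations of the cited works:

```latex
x_f = x^{*} + \xi_i f, \qquad
\tilde{C} = I - P_\ell P_\ell^{\mathsf T}, \qquad
u_i(\ell) = \frac{\xi_i^{\mathsf T}\,\tilde{C} S \tilde{C}\,\xi_i}
                 {\bigl(\xi_i^{\mathsf T}\tilde{C}\,\xi_i\bigr)^{2}}, \qquad
\mathrm{VRE}(\ell) = \sum_{i=1}^{m} \frac{u_i(\ell)}{\xi_i^{\mathsf T} S\,\xi_i},
```

with the best reconstruction obtained at the number of PCs ℓ that minimises VRE(ℓ).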
... Solving for the appropriate number of Principal Components (PCs) to retain in a static PCA model is an unsolved dilemma for which many methods have been proposed in the literature, as pointed out in Zwick and Velicer (1986). Common methods include Kaiser's rule (Kaiser, 1960), Cumulative Percent Variance (CPV) (Malinowski, 1991), Parallel Analysis (PA) (Horn, 1965), the scree test (Cattell, 1966), cross-validation (Wold, 1978), and the Variance of Reconstruction Error (VRE) (Qin and Dunia, 2000). Valle (1999) summarized and discussed such rules, where each criterion is derived from a particular rule of thumb and results in vague decisions for different PCA models and performance levels. ...
... Pursuing PCA modelling, the number of PCs is obtained using FDim evaluated on T, revealing FDim(T) = 1.8970 ≈ FDim(X) = 1.8943 for the 2×2 system, hence the number of PCs = ⌈d_X⌉ = 2; and for the 3×2 system, FDim(T) = 2.8813 ≈ FDim(X) = 2.8713, resulting in the number of PCs = ⌈d_X⌉ = 3. The results are shown in Fig. 2(b-c) for the two systems. To show superiority, FDim-based extraction of the number of PCs is compared to Kaiser's rule (Kaiser, 1960), CPV (Malinowski, 1991), PA (Horn, 1965), and VRE (Qin and Dunia, 2000), and the results are summarized in Table 4. It is obvious that CPV always overestimates the number of PCs, while Kaiser's rule (taking eigenvalues greater than 1), PA (with random profile generation), and VRE (looking for a minimum of the reconstruction error) provide vague estimates unrelated to the true relations; the thresholds, assumptions, random parallel profiles, and specific disturbances associated with these methods limit the uniqueness and optimality of the estimated number of PCs. ...
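For comparison with the snippet above, the two monotonic criteria it mentions are straightforward to compute. The sketch below shows one common reading of Kaiser's rule and CPV; the 90% CPV threshold is an illustrative assumption, not a value taken from the cited works.

```python
import numpy as np

def kaiser_rule(eigvals):
    """Kaiser's rule: keep PCs whose correlation-matrix eigenvalue exceeds 1."""
    return int(np.sum(np.asarray(eigvals) > 1.0))

def cpv_rule(eigvals, threshold=0.90):
    """Cumulative percent variance: smallest number of PCs explaining >= threshold.

    The 90% threshold is an illustrative assumption.
    """
    ev = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cpv = np.cumsum(ev) / ev.sum()
    return int(min(np.searchsorted(cpv, threshold) + 1, ev.size))

# usage with the eigenvalues of the autoscaled data's correlation matrix:
# eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
# print(kaiser_rule(eigvals), cpv_rule(eigvals))
```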
Article
A novel Dynamic Kernel PCA (DKPCA) method is developed for process monitoring in nonlinear dynamical systems. Classical DKPCA approaches still exhibit vague linearity assumptions to determine the number of principal components and to construct the dynamical structure. The optimal Static PCA (SPCA) and Dynamic PCA (DPCA) structures are constructed herein through the powerful theory of the nonlinear Fractal Dimension (FDim). While DKPCA offers generic data-driven modelling of nonlinear dynamical systems, the fractal correlation dimension provides an intrinsic measure of the data complexity, accounting for the nonlinear dynamics and chaotic behaviour. The proposed Fractal-based DKPCA (FDKPCA) integrates the two strategies to overcome SPCA/DPCA/DKPCA shortcomings; FDim allows verifying the degree of fitting and ensures optimal dimensionality reduction. The novel fault detection and diagnosis method is validated through seven applications using the Process Network Optimization (PRONTO) benchmark with real heterogeneous data; FDKPCA showed superior performance compared to contemporary approaches.
... According to the most significant information captured in the data via its projection, a PCA model with 3 directions has been constructed. This is confirmed by the minimization of the reconstruction error variance [28] (see Figure 4 and Table 4). ...
... The number of retained principal components has a significant impact on each step of the process modeling and monitoring scheme. Qin & Dunia [8,28] proposed to determine this parameter by minimizing the variance of reconstruction error. ...
Article
Fault detection and diagnosis (FDD) in the photovoltaic (PV) array has become a challenge due to the magnitudes of the faults, the presence of maximum power point trackers, non-linear PV characteristics, and the dependence on isolation efficiency. Thus, the aim of this paper is to develop an improved FDD technique for PV system faults. The common FDD technique generally has two main steps: feature extraction and selection, and fault classification. Multivariate feature extraction and selection is very important for multivariate statistical systems monitoring. It can reduce the dimension of the modeling data and improve the final monitoring accuracy. Therefore, in the proposed FDD approach, the principal component analysis (PCA) technique is used for extracting and selecting multivariate features and the supervised machine learning (SML) classifiers are applied for fault diagnosis. The FDD performance is established via different metrics for various PCA-based SML techniques using data extracted from different operating conditions of the grid-connected photovoltaic (GCPV) system. The obtained results confirm the feasibility and effectiveness of the proposed approaches for fault detection and diagnosis.
... A large number of rules are proposed in the literature to determine how many principal components (PCs) must be retained in the PCA model, such as the average eigenvalue (AE), cumulative percent variance (CPV), and variance of the reconstruction error (VRE), among others [2], [12], [13]. ...
... 3) Variance of the reconstruction error (VRE): This criterion was developed by Qin and Dunia to select the number of principal components based on the best reconstruction of the variables [13]. The variance of the reconstruction error (VRE) is a function of the number of PCs; it has a minimum which directly determines the number of PCs. ...
... However, the distinction between significant or insignificant eigenvalues may not be obvious due to modeling errors (disturbances and nonlinearities) and noise. Most methods to determine the number of principal components are rather subjective in the general practice of PCA [33]. Other methods are based on criteria actually used in system identification (Akaike information criterion, minimum description length,...) to determine the system order and emphasize the approximation of the data matrix X. Various techniques have been developed to estimate the number of PCs. ...
... (see Valle et al. (1999) for a survey of these methods). Qin & Dunia (2000) proposed to determine the number of principal components by minimizing the variance of the reconstruction error in the residual subspace. This variable reconstruction consists of estimating a variable from the other plant variables using the PCA model, i.e. using the redundancy relations between this variable and the others. ...
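The variable reconstruction mentioned here can be sketched as follows, assuming the usual residual-minimising reconstruction along a coordinate direction; the sample x and loading matrix P are assumed to refer to autoscaled data, and the names are illustrative.

```python
import numpy as np

def reconstruct_variable(x, P, j):
    """Estimate variable j of sample x from the other variables via the PCA model.

    Chooses the value of x[j] that minimises the residual-subspace distance (SPE),
    i.e. it exploits the redundancy captured by the loading matrix P (m x l).
    """
    m = P.shape[0]
    Ct = np.eye(m) - P @ P.T          # projector onto the residual subspace
    xi = np.zeros(m)
    xi[j] = 1.0                       # coordinate direction of the variable
    denom = xi @ Ct @ xi
    if denom < 1e-12:
        raise ValueError("variable cannot be reliably reconstructed")
    f = (xi @ Ct @ x) / denom         # correction along xi that minimises the SPE
    return x[j] - f                   # reconstructed value of variable j
```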
... A key issue to identify a PCA model is to select the adequate number of principal components [36,37,38]. The number of retained principal components has a significant impact on each step of the process modeling and monitoring scheme. ...
... The number of retained principal components has a significant impact on each step of the process modeling and monitoring scheme. Qin & Dunia [39,36,38] proposed to determine this parameter by minimizing the variance of reconstruction error. This method will be retained here to determine the number of principal components in the case of the interval-valued-data-based PCA model. ...
... A key issue to identify a PCA model is to select the adequate number of principal components [51,52,37]. The number of retained principal components has a significant impact on each step of the process modeling and monitoring scheme. ...
... The number of retained principal components has a significant impact on each step of the process modeling and monitoring scheme. Qin & Dunia [51,53,37] proposed to determine this parameter by minimizing the variance of reconstruction error. This method will be retained here to determine the number of principal components in the case of the interval-valued-data-based PCA model. ...
Article
In this paper, a new data-driven sensor fault detection and isolation (FDI) technique for interval-valued data is developed. The developed approach merges the benefits of the generalized likelihood ratio (GLR) with interval-valued data and principal component analysis (PCA). This paper has three main contributions. The first contribution is to develop a criterion based on the variance of the interval-valued reconstruction error to select the number of principal components to be kept in the PCA model. Secondly, interval-valued residuals are generated and a new GLR-based fault detection chart is developed. Lastly, an enhanced interval reconstruction approach for fault isolation is developed. The proposed strategy is applied to distillation column process monitoring and to an air quality monitoring network.
... The number of principal components d can be determined by the variance of reconstruction error method (Qin and Dunia, 2000). To determine whether a fault occurs when a new sample x is available, two indices consisting of the SPE and Hotelling's T² are frequently introduced. ...
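A minimal sketch of the two indices mentioned in this snippet, assuming an autoscaled sample and a PCA model with d retained components; control limits are omitted and the names are illustrative.

```python
import numpy as np

def monitoring_indices(x, P, retained_eigvals):
    """Hotelling's T2 and SPE (Q) for a new, autoscaled sample x.

    P               : m x d loading matrix (first d eigenvectors)
    retained_eigvals: the d eigenvalues associated with the retained PCs
    """
    t = P.T @ x                                          # score vector
    T2 = float(t @ (t / np.asarray(retained_eigvals)))   # Hotelling's T2
    residual = x - P @ t                                 # part not explained by the model
    SPE = float(residual @ residual)                     # squared prediction error (Q)
    return T2, SPE
```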
... A statistical PCA model can be developed using the first 100 samples from the measured IAQ data ( Table 2). The method of calculating unreconstructed variances for best reconstruction (Qin and Dunia, 2000) was implemented to determine the optimal number of principal components. Three principal components were chosen on the basis of searching for the lowest unreconstructed variance. ...
Article
This article proposes a combined principal component analysis (PCA) and local Fisher discriminant analysis (LFDA) scheme to improve the fault diagnosis performance of the indoor air quality (IAQ) measuring devices in subway stations. The combined scheme employs PCA for fault detection step and subsequently utilizes LFDA for diagnosing faulty IAQ sensors. A fault discriminant index based on LFDA discriminant components is proposed for fault diagnosis. Effectiveness of the proposed approach is demonstrated on the IAQ measuring system, where three types of IAQ sensor faults including bias fault, drifting fault, and complete failure fault are involved. Results demonstrate that diagnosing performance of LFDA is better than that of conventional Fisher discriminant analysis. The combined method has the capability of detecting and discriminating the sensor faults in the subway system.
... In recent years, progress in process monitoring, including data-based fault detection, or multivariate statistical process control, has been rapid [3]. Owing to the high-dimensional and redundant properties of the data, principal component analysis (PCA) is applied to deal with linearly correlated, multi-dimensional, and Gaussian process data by mapping the data set onto a lower-dimensional subspace [4]. ...
... (3) Thermal model: By the law of energy conservation, the thermal model is written as equation (4). ...
... Indeed, various criteria have been investigated in the literature. More details are available in Valle et al. (1999) and Qin and Dunia (2000). A measurement vector z ∈ R^m can be decomposed as,
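The snippet is cut off before the equation; in the PCA literature the decomposition it refers to is usually written as follows (standard notation assumed, not quoted from the citing paper):

```latex
z = \hat{z} + \tilde{z}, \qquad
\hat{z} = P P^{\mathsf T} z \;\in\; \mathrm{PCS}, \qquad
\tilde{z} = \bigl(I - P P^{\mathsf T}\bigr) z \;\in\; \mathrm{RS},
```

where P ∈ R^{m×ℓ} collects the first ℓ loading vectors, PCS is the principal component subspace, and RS is the residual subspace.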
Article
Full-text available
Data driven methods have been recognized as an efficient tool of multivariate statistical process control (MSPC). Contribution plots are also well known as a popular tool of principal components analysis (PCA), which is used for isolating sensor faults without the need for any a priori information. However, studies carried out in the literature unified contribution plots into three general approaches. Furthermore, they demonstrated that correct diagnosis based on contribution plots is not guaranteed for both single and multiple sensor faults. Therefore, to deal with this issue, the present paper highlights a new formula of contribution called relative variation of contribution (rVOC). Simulation results show that the proposed method of contribution can successfully perform the fault isolation task, in comparison with partial decomposition contribution (PDC) and its relative version (rPDC), based on their fault isolation rate (FIR).
... The AC method includes an autocorrelation function of the PCs [75]: a threshold equal to 0.5 is imposed and autocorrelation values lower than the threshold are a symptom of noise presence in the component; thus, the considered component should be discarded and not included in the PCA model. In addition, other methods rely on the covariance or the correlation matrix, e.g., AIC [76], MDL [77], and IEF [78]; other approaches, e.g., VRE [79] or PRESS [80], are characterized by an almost monotonic decreasing behavior in some cases. Under these conditions, it is difficult to find a minimum point and consequently the choice of the number of PCs may not be adequate. ...
Article
Full-text available
In the present work, the design and the implementation of a Fault Detection and Isolation (FDI) system for an industrial machinery is proposed. The case study is represented by a multishaft centrifugal compressor used for the syngas manufacturing. The system has been conceived for the monitoring of the faults which may damage the multishaft centrifugal compressor: instrument single and multiple faults have been considered as well as process faults like fouling of the compressor stages and break of the thrust bearing. A new approach that combines Principal Component Analysis (PCA), Cluster Analysis and Pattern Recognition is developed. A novel procedure based on the statistical test ANOVA (ANalysis Of VAriance) is applied to determine the most suitable number of Principal Components (PCs). A key design issue of the proposed fault isolation scheme is the data Cluster Analysis performed to solve the practical issue of the complexity growth experienced when analyzing process faults, which typically involve many variables. In addition, an automatic online Pattern Recognition procedure for finding the most probable faults is proposed. Clustering procedure and Pattern Recognition are implemented within a Fuzzy Faults Classifier module. Experimental results on real plant data illustrate the validity of the approach. The main benefits produced by the FDI system concern the improvement of the maintenance operations, the enhancement of the reliability and availability of the compressor, the increase in the plant safety while achieving reduction in plant functioning costs.
... PCA models are formed by retaining only the PCs that are descriptive of systematic variation in the data. Determination of the proper number of PCs can be done by several techniques [6,7]. ...
Chapter
Data-driven modeling framework is often more suitable to describe the real-world systems that are associated with complexities. The input–output data available from a system can be used to derive different forms of computationally efficient data-driven models for the purpose of prediction, state estimation, monitoring, and other process systems engineering applications. In this chapter, various data-driven methods and algorithms, including principal component analysis, projection to latent structures, nonlinear iterative partial least squares, artificial neural networks, and RBFN are presented. The state estimators derived based on these methods can serve various applications concerning process systems engineering.
... To obtain a good classification performance, it is important to extract statistical features via the MSPCA model by exhaustively enumerating some possible values. In the current study, the selected features extracted from the MSPCA model are the squared prediction error (SPE) statistic [35], the T² statistic [36], the squared weighted error (SWE) statistic [37], and the first retained principal components (T). ...
Article
Fault Detection and Isolation (FDI) in Heating, Ventilation, and Air Conditioning (HVAC) systems is an important approach to guarantee the human safety of these systems. Therefore, the implementation of an FDI framework is required to reduce the energy needs of buildings and improve indoor environment quality. The main goal of this paper is to merge the benefits of multiscale representation, Principal Component Analysis (PCA), and Machine Learning (ML) classifiers to improve the efficiency of detection and isolation in Air Conditioning (AC) systems. First, the multivariate statistical feature extraction and selection is achieved using the PCA method. Then, the multiscale representation is applied to separate features from noise and to approximately decorrelate the autocorrelation between available measurements. Third, the extracted and selected features are introduced to several machine learning classifiers for fault classification purposes. The effectiveness and higher classification accuracy of the developed Multiscale PCA (MSPCA)-based ML technique is demonstrated using two examples: synthetic data and simulated data extracted from Air Conditioning systems.
... However, the distinction between significant or insignificant eigenvalues may not be obvious due to modeling errors (disturbances and nonlinearities) and noise. Most methods to determine the number of principal components are rather subjective in the general practice of PCA [101]. Other methods are based on criteria actually used in system identification (Akaike information criterion, minimum description length, ...) to determine the system order and emphasize the approximation of the data matrix X. Various techniques have been developed to estimate the number of PCs. ...
Thesis
Process monitoring is becoming increasingly important to maintain reliable and safe process operation. Among the most important applications of process safety are those related to environmental and chemical processes. A critical fault in a chemical or a petrochemical process may not only cause a degradation in the process performance or lower its product quality, but it can also result in catastrophes that may lead to fatal accidents and substantial economic losses. Therefore, detecting anomalies in chemical processes is vital for their safe and proper operation. Also, abnormal atmospheric pollution levels negatively affect public health, animals, plants, and the climate, and damage natural resources. Therefore, monitoring air quality is also crucial for the safety of humans and the environment. Thus, the main aim of this study is to develop enhanced fault detection methods that can improve air quality monitoring and the operation of chemical processes. When a model of the monitored process is available, model-based monitoring methods rely on comparing the process measured variables with the information obtained from the available model. Unfortunately, accurate models may not be available, especially for complex chemical and environmental processes. In the absence of a process model, latent variable models, such as principal component analysis (PCA) and partial least squares (PLS), have been successfully used in monitoring processes with highly correlated process variables. When a process model is available, on the other hand, statistical hypothesis testing methods, such as the generalized likelihood ratio test (GLRT), have shown good fault detection abilities. In this thesis, extensions using nonlinear models and input latent variable regression techniques (such as PCA) are made to achieve further improvements and widen the applicability of the developed methods in practice. Also, kernel PCA is used to deal with process nonlinearities. Unfortunately, PCA and kernel PCA models are batch techniques, and they thus demand the availability of the process data before building the model. In most situations, however, fault detection is required online, i.e., as the data are collected from the process. Therefore, recursive PCA and kernel PCA techniques will be developed in order to extend the advantages of the developed statistical hypothesis testing techniques to online processes. The third objective of this work is to utilize the developed fault detection methods to enhance the monitoring of various chemical and environmental processes. The developed fault detection techniques are used to enhance monitoring of the concentration levels of various air pollutants, such as ozone, nitrogen oxides, sulfur oxides, dust, and others. Real air pollution data from France are used in this important application. The developed fault detection methods are also utilized to enhance the monitoring of various chemical processes such as a continuous stirred-tank reactor (CSTR) and the Tennessee Eastman process (TEP).
... The main consideration is that the dimension of the original data can be effectively reduced while the information in the original data is retained to the maximum extent. Qin argues that the best number of principal components is the one that minimizes the reconstruction error [5]. This idea only requires analysing the data, and does not need any prior knowledge. ...
Article
Full-text available
With the development of the times, the demand for high efficiency and reliability of machine performance in China's industry has become higher than ever before, and traditional equipment condition evaluation methods have encountered a huge challenge. As an important branch of machine learning, cluster analysis is widely used in fault diagnosis and other fields because of its advantages of requiring no prior knowledge and handling massive data. This paper introduces the concepts of principal component analysis and the K-means method based on time series to the field of oil data analysis. Through tests on the oil monitoring data of a steam turbine unit in a power plant, condition evaluation and classification of the equipment are carried out; the classification effect for 6 μm particles, dielectric constant, and 4 μm particles is very obvious, and that for water content is relatively obvious, so that each state can basically be distinguished. At the same time, on the basis of feature extraction, a special recurrent neural network (RNN) in deep learning, the long short-term memory (LSTM) network, is used to build a model of the data for prediction. With the powerful ability of LSTM to process temporal data, more irregular and non-linear trends are uncovered in a large amount of historical data, achieving better prediction results than traditional methods (ARIMA). Through clustering and prediction of the lubrication parameter data, early warning and abnormal diagnosis of the lubrication status can be realized, reducing damage to machinery and equipment.
... Several methods to choose the number of principal components l are discussed in [17], [18]. Using the loading matrices P and P̃, the measurement space is partitioned into the principal component subspace (PCS) and the residual subspace (RS). ...
... In this case, to separate the anomaly and background parts from X spectrally and spatially, it is necessary to determine the number of major PCs along each mode, denoted as N H , N W , and N S . The energy cumulative method [42] or the elbow method [43] is usually used to determine N H , N W , and N S . However, since the former requires an artificially specified ratio threshold and has no clear physical meaning, the latter is adopted in our algorithm. ...
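As a rough illustration of the elbow method mentioned here, the following sketch picks the point of maximum curvature of the sorted eigenvalue profile. This is only one common reading of the "elbow", not necessarily the variant used in the cited algorithm, and the indexing convention is an assumption.

```python
import numpy as np

def elbow_number_of_pcs(eigvals):
    """Pick the number of PCs at the 'elbow' of the sorted eigenvalue profile.

    Uses the point of maximum discrete curvature (largest second difference);
    this is one common convention, not necessarily the one used in the cited work.
    """
    ev = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    if ev.size < 3:
        return int(ev.size)
    curvature = ev[:-2] - 2.0 * ev[1:-1] + ev[2:]   # second difference
    return int(np.argmax(curvature) + 2)            # components up to and including the elbow
```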
Article
Full-text available
Sparse representation-based methods, as an important branch of anomaly detection (AD) technologies for hyperspectral imagery (HSI), have attracted extensive attention. How to construct an overcomplete background dictionary containing all background categories and excluding anomaly signatures is the focus. Traditional background dictionary construction methods first convert HSI into a two-dimensional matrix composed of independent spectral vectors, and then execute the subsequent construction operations. In this way, only spectral anomalies can be excluded from the background dictionary, whereas spatial anomalies still exist. To alleviate this problem, this paper proposes a novel AD algorithm through sparse representation with tensor decomposition-based dictionary construction and adaptive weighting. It has three main advantages. First, tensor representation allows the spectral and spatial characteristics of HSI to be preserved simultaneously, and Tucker decomposition achieves excellent separation between the background part and anomaly part by distinguishing them along three modes. Second, the K-means++ clustering operation is implemented on the background part so that the background dictionary used for sparse representation contains all background categories. Finally, an adaptive weighting matrix derived from the anomaly part further improves the distinction between background pixels and anomalies. Experiments on synthetic and real HSI datasets demonstrate the superiority of our proposed algorithm.
... The top left section in Table 1 summarizes a subset of methods that gained attention in the literature, most of which relate to the application of principal component analysis (PCA) to estimate the column space of A and rely on various assumptions. Depending on the assumptions imposed on r, the variance of the reconstruction error (VRE) [43] and the equality of eigenvalues test for maximum likelihood PCA [20] provide consistent estimations of m. The eigendecomposition of the scaled covariance matrix E{xx^T} gives a consistent estimation of the column space of A, even if s does not follow a normal distribution. ...
Article
Full-text available
This article develops readily applicable methods for estimating the intrinsic dimension of multivariate data sets. The proposed methods, which make use of theoretical properties of the empirical distribution functions of (pairwise or pointwise) distances, build on the existing concepts of (i) correlation dimensions and (ii) charting manifolds that are contrasted with (iii) a maximum likelihood technique and (iv) other recently proposed geometric methods including MiND and IDEA. This comparison relies on application studies involving simulated examples, a recorded data set from a glucose processing facility, as well as several benchmark data sets available from the literature. The performance of the proposed techniques is generally in line with other dimension estimators, specifically noting that the correlation dimension variants perform favorably to the maximum likelihood method in terms of accuracy and computational efficiency.
Article
Full-text available
This paper presents a comprehensive review of the historical development, the current state of the art, and prospects of data-driven approaches for industrial process monitoring. The subject covers a vast and diverse range of works, which are compiled and critically evaluated based on the different perspectives they provide. Data-driven modeling techniques are surveyed and categorized into two main groups: multivariate statistics and machine learning. Representative models, namely principal component analysis, partial least squares and artificial neural networks, are detailed in a didactic manner. Topics not typically covered by other reviews, such as process data exploration and treatment, software and benchmarks availability, and real-world industrial implementations, are thoroughly analyzed. Finally, future research perspectives are discussed, covering aspects related to system performance, the significance and usefulness of the approaches, and the development environment. This work aims to be a reference for practitioners and researchers navigating the extensive literature on data-driven industrial process monitoring.
Article
Most traditional multivariate statistical monitoring methods require an assumption that the observation values at a certain moment and at a past moment are statistically independent. However, in actual chemical and biological processes, the sample at a certain moment is often affected by the previous moment. Therefore, given the problems of frequent false alarms and poor detection ability of traditional principal component analysis, this article proposes a dynamic global–local preserving projections (DGLPP) algorithm. Unlike dynamic local preserving projections (DLPP) and dynamic principal component analysis (DPCA), DGLPP controls the global and local information retained in the dimensionality-reduced data by introducing weight coefficients, which makes the algorithm applicable to more types of industrial processes. Moreover, new parameter determination methods are also proposed for improved detection and diagnosis. Through the improved contribution graph method, the degree of influence of each variable on the fault can be seen, in order to monitor and isolate the fault. Finally, by verifying the operation of a multivariable process and two practical cases, the results show that, compared with the DPCA, DLPP, and global–local preserving projections (GLPP) methods, the performance of this method is significantly improved.
Article
A large amount of data generated in industrial processes exhibit multi‐modal, nonlinear, time‐domain correlation, and other characteristics. This poses great difficulty for the traditional principal component analysis (PCA) method, since it requires that the input data conform to a Gaussian distribution. However, the data may have autocorrelation, that is, the data at the current moment will be affected by the past data. To this end, this paper proposes an enhanced dynamic principal component analysis (DPCA) method based on hierarchical clustering analysis. On the basis of the DPCA algorithm, the idea of data classification and enhanced training is used to strengthen the training of the dimensionality reduction matrix. Then, calibration, on‐line monitoring, and fault diagnosis of process data can be conducted. Finally, this paper demonstrates that the performance of the proposed method is greatly improved compared with PCA and DPCA through the Tennessee Eastman process system.
Article
Background: Electrical tomography is widely recognized for its high time resolution and low cost. However, the implementation of electrical tomographic solutions has been hindered by the high computational overhead associated, which causes delays in the analysis, and numerical instability, that results in unclear reconstructed images. Therefore, it has been mostly applied offline, for qualitative tasks and with some delay. Applications requiring fast response times and quantification have been hindered or ruled out. Results: In this article, we propose a new process analytical technology soft sensor that maps directly electrical tomography signals to the relevant parameter to be monitored. The data acquisition and estimation steps occur almost instantaneously, and the final accuracy is very good (R² = 0.994). Significance and novelty: The proposed methodology opens up good prospects for real-time quantitative applications. It was successfully tested on a pilot piping installation where the target property is the interface height between two immiscible fluids.
Article
This paper proposes a new time‐varying process monitoring approach based on iterative‐updated semi‐supervised nonnegative matrix factorizations (ISNMFs). ISNMFs are a type of semi‐supervised model that constructs a semi‐nonnegative matrix factorization (SNMF) model of a process using both labelled and unlabelled samples. Compared with the existing nonnegative matrix factorizations (NMFs) where NMFs are referred to as matrix factorization algorithms that factorize a nonnegative matrix into two low‐rank nonnegative matrices whose product can well approximate the original nonnegative matrix, ISNMFs have advantages in terms of the model update and the use of labelled samples. The ISNMFs‐based process monitoring approach concerns fault detection and isolation and updates an SNMF model iteratively using the latest samples to capture the change of statistical property of time‐varying processes. Moreover, the proposed fault detection and isolation approach is supported by the k‐means algorithm in theory. At last, we demonstrate the superiority of ISNMFs over the existing NMFs in terms of fault detection and isolation through a case study on the penicillin fermentation process. This article is protected by copyright. All rights reserved.
Article
Full-text available
Principal Components Analysis (PCA) has been intensively studied and is widely applied in industrial process monitoring. The main purpose of using PCA is dimensionality reduction by extraction of a feature space that still contains most of the information in the original data set. Despite its success in this field, the most important obstacle faced is its sensitivity to noise; the fact that the majority of data collected from industrial processes are normally contaminated by noise makes it unreliable in some cases. To overcome these limitations, several strategies have been used. One of them combines robustness theory with the PCA method; such theory consists in robustifying the existing algorithms against noise or outliers. Fuzzy Robust Principal Components Analysis (FRPCA) is one result of such a combination that achieves better results compared with the classical method. In this work the FRPCA method is used and compared with the classical one to monitor a biological nitrogen removal process. The obtained results demonstrate the superior performance of this method compared with the conventional one.
Article
Fault detection and recognition are to recognize which type of operating mode the current operating mode belongs to, among the normal and faulty operating modes of an industrial process. This paper develops a method for fault detection and recognition using hybrid nonnegative matrix factorizations (HNMF) where the term ‘hybrid’ refers to the fact that these models utilize nonnegative matrix factorization objective functions built upon ideas from graph theory and information theory. Although HNMF absorb a variety of advanced theories and are significantly different from the existing nonnegative matrix factorizations (NMF), they are still convergent in theory. To achieve fault detection and recognition by HNMF, this paper designs a feasible technical roadmap for performing fault detection and recognition using HNMF. Due to the incorporation of NMF, graph theory, and information theory, HNMF show advantages over the existing NMF in terms of fault detection and recognition. More importantly, the proposed fault detection and recognition approach has advantages over the NMFs-based approaches, which is demonstrated through a case study on a penicillin fermentation process.
Article
Fault clustering attempts to partition a set of faulty samples into several clusters, allowing the exploration of the underlying pattern of faults. Nonnegative matrix factorizations (NMFs) are good candidates for fault clustering since they are inherently capable of data clustering and variants of the k‐means algorithm. However, NMFs always show poor performance in real‐world clustering applications for their naive data clustering mechanism. To improve the clustering performance of the existing NMFs and solve the fault clustering problem, this paper proposes a new type of NMFs, called small‐entropy nonnegative matrix factorizations (SENMFs). SENMFs impose a small amount of entropy on the cluster probability distribution of each sample to avoid ambiguous clustering results. Moreover, the algorithm for SENMFs is convergent in theory. We selected three types of faulty samples of the penicillin fermentation process for fault clustering. The case study results showed that SENMFs exceed the state‐of‐the‐art NMFs and k‐means in terms of fault clustering performance. This article is protected by copyright. All rights reserved.
Chapter
Traffic congestion has a negative impact on traffic performance because it increases travel time and air pollution. Therefore detecting traffic congestion is a key element in facilitating the development of efficient intelligent transportation systems. Motivated by the high capacity of principal component analysis (PCA) methods in describing the correlation structure underlying multivariate data, the aim of this chapter is demonstrating the performance of PCA-based methods in monitoring traffic flow. First, in this chapter, we present two commonly applied denoising methods that can be used for data pre-filtering, as traffic flow data are usually contaminated with noise. Then we present the basic steps needed to model multivariate traffic data using PCA. We also show five well-known procedures that are frequently used to determine the PCA model order, namely, cumulative percent variance, cross-validation, scree test, parallel analysis, and eigenvalue 1 rule. Essentially, to detect traffic congestion, PCA is constructed using congestion-free data. Then the new data are referenced to this model, where any anomalous traffic condition can be detected. Here we present four monitoring indices commonly used with PCA: SPE, T2, combined SPE and T2, and amalgamated exponential smoothing schemes. Then we assess the effectiveness of the presented PCA-based monitoring techniques using traffic measurements from the Old Bayshore Hwy on the south of Interstate 880 (I880) in California and Ashby Ave from the west of Interstate 80 (I80) highway in the San Francisco Bay Area. The results highlight that the PCA with nonparametric thresholds provided improved detection performance compared with the conventional PCA-based schemes. Finally, we discuss the limitations of the presented monitoring approaches and offer some plausible directions to rectify these limitations.
Article
Principal component analysis (PCA) was used to detect anomalies in wind tunnel measurements. Data were compiled from previous measurements for three two-dimensional airfoils in the Virginia Tech Stability Wind Tunnel with similar experimental arrangements. Measurements included the surface pressure distribution about the airfoil, total pressure distribution in the airfoil wake, static pressure in the wake, angle of attack, freestream velocity, flow temperature, and ambient pressure. These data were used to train the PCA scheme through an eigendecomposition of the measurement covariance matrix. A low-order reconstruction of this covariance matrix was then used to assess whether new measurements not included in the training set were anomalous by comparing predictions of the expected result with measured values. Results show that the method is very good at detecting anomalies in data from airfoils included in the training set as well as an additional airfoil outside of this set. Predictions of a particular measurement can be further improved by biasing the data used to construct the covariance matrix with observations that are more similar to the measurement of interest. However, this biasing reduces the ability of the method to predict results and therefore detect anomalies for experimental conditions outside of the range of the input data.
Article
Full-text available
Principal component analysis (PCA) and its variants have been widely used for process monitoring and quality control. A key issue of PCA-related methods is to select an appropriate number of principal components. However, few approaches for component selection consider the monitoring performance, and they usually rely on prior fault information. This paper develops an effective algorithm for dynamic component selection, which selects components for each sample based on a detection performance index, without need for prior fault information. Component selection is transformed into a stochastic optimization problem, whose optimal solution is derived analytically. Then, a subset of components that are sensitive to faults is obtained. The proposed method reduces the requirement for the detectable fault amplitude, which leads to better performance. Furthermore, the differences between PCA and the proposed method are discussed, and online computational complexity is analyzed. Case studies on a continuous, stirred tank heater, the Tennessee Eastman process, and a practical ultra-supercritical power plant demonstrate that the proposed method has better monitoring performance than PCA and some of its variants.
Article
Alarm systems are of vital importance in the safe and effective functioning of industrial plants, yet they frequently suffer from too many nuisance alarms (alarm overloading). It is necessary to intelligently enhance existing alarm systems and supply accurate information for the operators. Nowadays, process variables are more correlated and complicated. This correlation structure can be used as a basis to manage alarms efficiently. Hence, multivariate approaches are more appropriate. Designing a system aimed at reducing nuisance alarms is an essential phase to guarantee the reliable operation of a plant. Due to the definition of alarm limits, the problem of false alarms is inevitable in multivariate methods. In this paper, the conventional Principal Component Analysis (PCA) is applied to extract the sum of squared prediction errors (SPE), known as the Q statistic, and the Hotelling T² statistic. These statistics are used separately as alarm indicators, where their control limits are duly modified. Consequently, for each statistic, a nonlinear combination of alarm duration and alarm deviation is additionally exploited as a new requirement to activate an alarm or not. The resulting new index is fed to a delay timer with a defined parameter n. The implementation of this technique resulted in a significant reduction in the severity of alarm overloading. Historical data collected from a cement rotary kiln operating under healthy conditions are employed to adequately build the PCA model and extract the proposed alarming indexes. Then, various testing data sets, covering different types of faults occurring in the cement process, are used to assess the performance of the developed method. In comparison with the conventional PCA technique, alarms are better managed and almost all nuisance alarms are suppressed. The proposed method is more robust to false alarms and more sensitive to fault detection.
Article
Full-text available
In industrial process fault monitoring, it is very important to collect accurate data, but in the actual process, there are often various noises that are difficult to eliminate in the collected data due to sensor accuracy, measurement errors, or human factors. Existing statistical process monitoring methods often ignore the problem of data noise. To solve this problem, a sliding window wavelet denoising-global local preserving projections (SWWD-GLPP) process monitoring method is proposed. In the offline stage, the wavelet denoising method is used to denoise the offline data, the GLPP method is then used for offline modeling, and the control limit is obtained by the kernel density estimation method. In the online phase, the sliding window wavelet denoising method is used to denoise the online data in real time. The GLPP model is then used to compute the statistics, which are compared with the control limit to judge the fault situation; finally, the contribution graph method is used to determine the variable that caused the fault, so as to diagnose the fault. This article uses a numerical case to illustrate the effectiveness of the algorithm, and uses the Tennessee Eastman (TE) process to compare with the traditional principal component analysis (PCA) and GLPP methods to further prove the effectiveness and superiority of the method.
Article
This work proposes state-reconstruction strategies to effectively regain and/or maintain controllability of the system following the detection of cyber-attacks on sensor measurements. Working with a general class of nonlinear systems, of which the sensor measurements may be subject to cyber-attacks, robust control frameworks have been previously proposed to maintain the stability of the process in the presence of cyber-attacks. Moreover, machine-learning-based detection mechanisms could be employed to effectively detect the presence of and distinguish the particular types of cyber-attacks. This work further explores recuperation measures to be taken after the detection of cyber-attacks to mitigate its impact, and proposes a machine-learning-based state reconstruction approach to provide estimated state measurements based on the falsified state measurements. This approach ensures stable operation of the process before reliable sensor measurements are installed back online.
Article
This paper develops a brand new technique to perform safety monitoring using images for a type of complex industrial processes. The new safety monitoring technique is developed with the aid of a modified nonnegative matrix factorization algorithm which is called multiple-centroid strictly convex nonnegative matrix factorization (MCSCNMF). Specifically, MCSCNMF is designed deliberately in an all-new manner such that it can learn more accurate centroids for each group of samples than the existing nonnegative matrix factorization-like algorithms. Moreover, MCSCNMF takes advantage of multiple centroids instead of a single centroid to describe the complicated distribution of each type of samples. Both properties give MCSCNMF a powerful clustering performance and make it beneficial for developing higher-performance safety monitoring techniques than the existing nonnegative matrix factorization-like algorithms. We compare different types of monitoring methods in terms of false negatives, false positives, and average approximation error in an experiment on a steel coil painting process. Comparison results demonstrate that the MCSCNMF-based monitoring technique obtains the best experimental results compared with other monitoring methods all the time.
Article
As an essential component of data acquisition systems, sensors have been widely used, especially in the industrial and agricultural sectors. However, sensors are also prone to faults due to their harsh working environments. Therefore, the early identification of sensor faults is critical for taking corrective actions to mitigate the impact. This paper provides a comprehensive review of contemporary fault diagnosis techniques and helps researchers and practitioners to understand the current state-of-the-art development in this emerging field. The paper introduces the common fault types and causes in sensors, and the different types of fault diagnosis methods used in the industrial and agricultural sectors. It discusses the advantages and disadvantages of these methods, highlights the current challenges, and offers recommendations for future research directions.
Article
The advent of high-throughput genomic technologies has resulted in the accumulation of massive amounts of genomic information. However, biologists are challenged with how to effectively analyze these data. Machine learning can provide tools for better and more efficient data analysis. Unfortunately, because many plant biologists are unfamiliar with machine learning, its application in plant molecular studies has been restricted to a few species and a limited set of algorithms. Thus, in this study, we provide the basic steps for developing machine learning frameworks and present a comprehensive overview of machine learning algorithms and various evaluation metrics. Furthermore, we introduce sources of important curated plant genomic data and R packages to enable plant biologists to easily and quickly apply appropriate machine learning algorithms in their research. Finally, we discuss current applications of machine learning algorithms for identifying various genes related to resistance to biotic and abiotic stress. Broad application of machine learning and the accumulation of plant sequencing data will advance plant molecular studies.
Article
Principal component analysis (PCA) is a common tool in the literature and widely used for process monitoring and fault detection. Traditional PCA is associated with two well-known control charts, Hotelling's T² and the squared prediction error (SPE), as monitoring statistics. This paper develops the use of new measures based on a distribution dissimilarity technique named the Kullback-Leibler divergence (KLD) through PCA, by measuring the difference between online estimated and offline reference density functions. For processes with PCA scores following a multivariate Gaussian distribution, the KLD is computed on both the principal and residual subspaces defined by PCA in a moving window to extract the local disparity information. The potential of the proposed algorithm is afterwards demonstrated through an application to two well-known processes in the chemical industries: the Tennessee Eastman process as a reference benchmark and a three-tank system as an experimental validation. The monitoring performance was compared to recent results from other multivariate statistical process monitoring (MSPM) techniques. The proposed method showed superior robustness and effectiveness, recording the lowest average missed detection and false alarm rates in process fault detection.
Article
This paper presents a new optimized interval principal component analysis applied to detect and isolate actuator faults of an autonomous spacecraft involved in the rendezvous phase of the Mars sample return mission. Based on the exploitation of various arithmetic and interval analysis properties, the new interval model is built by solving the interval eigenpairs problem via the resolution of a parametric linear programming problem. The detection and isolation phases are performed by extending the classic methods to interval-valued data. The proposed method is applied to detect and isolate actuator faults that can occur on the spacecraft's thrusters. Based on data provided by a “high fidelity” industrial simulator developed by Thales Alenia Space, the obtained results demonstrate the effectiveness of the proposed interval fault diagnosis method in detecting and isolating thruster faults.
Article
This paper presents a monitoring approach for nonlinear processes based on a new semi-supervised kernel nonnegative matrix factorization (SKNMF). Unlike the existing nonnegative matrix factorization (NMF) and kernel nonnegative matrix factorization (KNMF), SKNMF is a semi-supervised matrix factorization algorithm that takes advantage of both labelled and unlabelled samples to improve performance. Labelled samples are samples whose memberships are already known, while unlabelled samples are those whose memberships are unknown. In fact, both NMF and KNMF are unsupervised algorithms and cannot make full use of labelled samples to improve performance. More importantly, we explain why labelled samples can improve performance even when the amount of labelled data is small. Finally, SKNMF induces a simultaneous fault detection and isolation scheme for online process monitoring. Case studies on a numerical example and a penicillin fermentation process (PFP) demonstrate that the proposed process monitoring approaches outperform existing process monitoring approaches.
Chapter
Multivariate statistical principles are introduced in this chapter as a basis for data-driven methods of fault detection and isolation. The properties of Principal Component Analysis (PCA) for data projection and dimensionality reduction are exploited to model process behaviour from historical data representing normal operating conditions. After formulating the PCA model in terms of projection and residual spaces, the method introduces the distance concept in both subspaces to define fault detection criteria. Two statistics, Hotelling's T² and the squared prediction error (SPE), are used for this purpose. Diagnosis functionality is provided by the ability to describe the magnitude of both statistics in terms of contributions of the original variables. The chapter ends with an illustrative example of the method.
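A minimal Python sketch of the two statistics mentioned in the chapter, assuming an autoscaled training set and a generic PCA model obtained by SVD; the data and the number of retained components are illustrative assumptions.

# Sketch: Hotelling's T^2 in the projection space and SPE in the residual space.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # autoscale the training data

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # PCA via SVD
l = 3                                             # retained principal components
P = Vt[:l].T                                      # loadings (projection space)
lam = (s[:l] ** 2) / (len(X) - 1)                 # variances of the scores

def t2_spe(x):
    """Return (T^2, SPE) for a scaled sample x."""
    t = P.T @ x                                   # scores
    x_hat = P @ t                                 # projection onto the PCA model
    return np.sum(t ** 2 / lam), np.sum((x - x_hat) ** 2)

x_new = rng.normal(size=6)
x_new[0] += 4.0                                   # simulate a biased sensor reading
print(t2_spe(x_new))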
Article
This paper proposes a simultaneous fault detection and isolation approach based on a novel transfer semi-nonnegative matrix factorization (TSNMF) algorithm. Unlike the existing nonnegative matrix factorization (NMF) algorithm, TSNMF takes advantage of a small number of labeled samples and of the geometric structure of the sample spaces to improve performance. Labeled samples are samples whose memberships are already known; unlabeled samples are those whose memberships are unknown. We demonstrate theoretically how labeled samples and the geometric structure of the sample spaces can improve fault detection and isolation performance. More importantly, the proposed approach achieves fault detection and isolation without the use of monitoring statistics, which makes it easier to implement than the existing approaches. In comparison with existing fault detection and isolation methods, the proposed detection and isolation scheme performs excellently. In addition, the proposed approach can readily be applied to newly arriving samples and generalizes well, which enables an online fault detection and isolation scheme. Finally, a numerical example and a case study on the penicillin fermentation process demonstrate the effectiveness of the proposed approaches.
Article
Full-text available
There are two problems with principal component analysis (PCA) as it is widely employed in multivariate statistical process monitoring (MSPM). On the one hand, selecting principal components according to the variance of the normal training dataset does not reflect the amount of fault information contained in the online data; useful fault information can therefore be lost, leading to poor monitoring performance. On the other hand, although the fault information contained in each principal component differs, principal components are treated equally in traditional PCA-based methods, so some useful fault information is suppressed. In order to reduce the dimension while preserving the information of every original variable as completely as possible, this paper selects key principal components using our previously proposed full variable expression method. Moreover, according to the accumulated reachability distance of the online data relative to that of the offline training data, the key principal components with large accumulated reachability distance are emphasized and weighted. Finally, a monitoring statistic is constructed to track the operating status, and the process monitoring performance of the proposed method is evaluated on an industrial process.
Article
In chemical process monitoring based on principal component analysis (PCA), sample data containing outliers and the optimal selection of principal components are two challenging problems that strongly affect monitoring performance. Firstly, a novel outlier detection method, a robust CDC-MVT-PCA method (CMP) that integrates CDC-MVT (the closest distance to centre and multivariate trimming) with PCA to identify and eliminate outliers, is proposed to clean the sample data. Secondly, based on the cleaned sample data, PCA is employed to obtain the PCs. The cumulative frequency representing the variability of each PC is defined to find the optimal PCs, which are able to represent the current variability information. Finally, selecting optimal PCs online based on the cumulative frequency of each PC (CF-PCA) is proposed to keep the most responsive components and thus improve monitoring performance. The effectiveness of the proposed robust fault monitoring algorithm is verified on a simple numerical simulation and the Tennessee Eastman process.
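As a loose sketch of the data-cleaning idea only (a plain distance-to-centre trim rather than the paper's CDC-MVT procedure, and without the cumulative-frequency PC selection), the following Python fragment removes the samples farthest from a robust centre before fitting PCA; the trimming fraction and the data are assumptions.

# Sketch: trim outlying samples by distance to a robust centre, then fit PCA on the rest.
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 4))
X[:10] += 8.0                                    # gross outliers

def trim_by_distance(X, keep=0.95):
    """Keep the fraction of samples closest to the median centre."""
    centre = np.median(X, axis=0)
    d = np.linalg.norm(X - centre, axis=1)
    return X[d <= np.quantile(d, keep)]

X_clean = trim_by_distance(X)
Xc = X_clean - X_clean.mean(axis=0)              # PCA on the cleaned data
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
print(X_clean.shape, np.round(s**2 / np.sum(s**2), 3))   # explained variance ratios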
Article
Identification of linear dynamic systems from input-output data has been a subject of study for several decades. A broad class of problems in this field pertain to what is widely known as the errors-in-variables (EIV) class, where both input and output are known with errors, in contrast to the traditional scenario where only outputs are assumed to be corrupted with errors. In this work, we present a novel and systematic approach to the identification of dynamic models for the EIV case in the principal component analysis (PCA) framework. The key contribution of this work is a dynamic iterative PCA (DIPCA) algorithm that has the ability to determine the correct order of the process, handle unequal measurement noise variances (the heteroskedastic case) and provide accurate estimates of the model together with the noise covariance matrix under large sample conditions. The work is presented within the scope of single-input, single-output (SISO) deterministic systems corrupted by white-noise errors. Simulation results are presented to demonstrate the effectiveness of the proposed method.
Conference Paper
Full-text available
The paper discusses an application of empirical models, based on the process characteristics and the abundance of data, to detect and identify important process shifts on a Du Pont plant, and describes the open-loop control problem associated with this design. The objective is to provide meaningful information that the operator can use to take appropriate action. The proposed system will recommend potential changes to controller setpoints to correct the deficiencies detected. Ultimately, the objective is to automate the changes.
Article
Full-text available
Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.
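A small Python sketch of the scaling advice above, assuming a simple threshold for deciding that a variable is "almost constant"; the threshold value is an arbitrary illustrative choice.

# Sketch: unit-variance scaling, leaving near-constant columns unscaled.
import numpy as np

def variance_scale(X, min_std=1e-3):
    """Center all columns; divide by the standard deviation only where it is not near zero."""
    Xc = X - X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    scale = np.where(std > min_std, std, 1.0)     # leave almost-constant variables unscaled
    return Xc / scale

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X[:, 4] = 7.0 + 1e-6 * rng.normal(size=100)       # an almost constant variable
print(variance_scale(X).std(axis=0))              # last column keeps its tiny spread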
Article
Full-text available
Several extensions are made to the theory of multivariate process monitoring via Principal Components Analysis (PCA). An important robustness issue is addressed: the continued use of the PCA model after detection of a sensor failure. Without some adjustment, a single failed sensor can obscure other failures, thus rendering the monitoring method useless. It is shown here that one can calculate an estimate of the output of the failed sensor that is most consistent with the PCA model of the process. This estimate allows continued use of the model. Under some circumstances, replacing the failed output with this estimate is equivalent to rebuilding the entire PCA model. Partial Least Squares (PLS) regression can be used in a manner similar to PCA for process monitoring. It is shown that PLS is fundamentally more sensitive to sensor failures than PCA. Unlike PCA, however, the PLS monitoring scheme maps state information into the model residuals. For this reason, changes in the process state covariance and autocovariance can invalidate calculated PLS model residual limits. The failed sensor problem is also solved for the PLS monitoring method. Keywords. Principal components; partial least squares; failure detection; multivariable systems.
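The following Python sketch illustrates the reconstruction idea for a single failed sensor: the failed value is replaced by the value most consistent with the PCA model, i.e. the one minimizing the squared residual. The data, model size, and closed-form solution shown here are illustrative assumptions rather than the paper's exact derivation.

# Sketch: estimate the output of a failed sensor so that it is most consistent with the PCA model.
import numpy as np

rng = np.random.default_rng(4)
T = rng.normal(size=(400, 2))                 # two latent factors
A = rng.normal(size=(2, 5))
X = T @ A + 0.05 * rng.normal(size=(400, 5))  # five correlated sensors
mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
Xs = (X - mu) / sd

P = np.linalg.svd(Xs, full_matrices=False)[2][:2].T   # loadings of a 2-PC model
R = np.eye(5) - P @ P.T                               # residual projector

def reconstruct_sensor(x_scaled, j):
    """Replace sensor j with the value minimizing the squared model residual."""
    others = np.delete(np.arange(len(x_scaled)), j)
    return -(R[j, others] @ x_scaled[others]) / R[j, j]

x = (X[0] - mu) / sd
true_val = x[3]
x[3] = 10.0                                   # failed sensor reading
print("reconstructed:", reconstruct_sensor(x, 3), "true (scaled):", true_val)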
Article
By means of factor analysis (FA) or principal components analysis (PCA) a matrix Y with the elements y_ik is approximated by the model

y_ik = α_k + Σ_{a=1..A} β_ia θ_ak + ∊_ik    (I)

Here the parameters α, β and θ express the systematic part of the data y_ik, “signal,” and the residuals ∊_ik express the “random” part, “noise.” When applying FA or PCA to a matrix of real data obtained, for example, by characterizing N chemical mixtures by M measured variables, one major problem is the estimation of the rank A of the matrix Y, i.e. the estimation of how much of the data y_ik is “signal” and how much is “noise.” Cross validation can be used to approach this problem. The matrix Y is partitioned and the rank A is determined so as to maximize the predictive properties of model (I) when the parameters are estimated on one part of the matrix Y and the prediction tested on another part of the matrix Y.
Article
The problem of using time-varying trajectory data measured on many process variables over the finite duration of a batch process is considered. Multiway principal-component analysis is used to compress the information contained in the data trajectories into low-dimensional spaces that describe the operation of past batches. This approach facilitates the analysis of operational and quality-control problems in past batches and allows for the development of multivariate statistical process control charts for on-line monitoring of the progress of new batches. Control limits for the proposed charts are developed using information from the historical reference distribution of past successful batches. The method is applied to data collected from an industrial batch polymerization reactor.
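A brief Python sketch of the batch-wise unfolding step that precedes ordinary PCA in this kind of batch analysis; the array dimensions and scaling are illustrative assumptions, and the construction of control limits discussed in the abstract is not reproduced.

# Sketch: unfold a (batches x variables x time) array batch-wise, then apply ordinary PCA.
import numpy as np

rng = np.random.default_rng(5)
n_batches, n_vars, n_times = 30, 4, 50
batch_data = rng.normal(size=(n_batches, n_vars, n_times))

# Batch-wise unfolding: each row holds the full trajectory of one batch
X = batch_data.reshape(n_batches, n_vars * n_times)
X = (X - X.mean(axis=0)) / (X.std(axis=0, ddof=1) + 1e-12)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :2] * s[:2]          # low-dimensional summary of each past batch
print(scores.shape)                # (30, 2)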
Article
Even though there has been a recent interest in the use of principal component analysis (PCA) for sensor fault detection and identification, few identification schemes for faulty sensors have considered the possibility of an abnormal operating condition of the plant. This article presents the use of PCA for sensor fault identification via reconstruction. The principal component model captures measurement correlations and reconstructs each variable by using iterative substitution and optimization. The transient behavior of a number of sensor faults in various types of residuals is analyzed. A sensor validity index (SVI) is proposed to determine the status of each sensor. On-line implementation of the SVI is examined for different types of sensor faults. The way the index is filtered represents an important tuning parameter for sensor fault identification. An example using boiler process data demonstrates attractive features of the SVI.
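The following Python sketch illustrates reconstruction by iterative substitution and a simple validity-type index formed as the ratio of the residual after reconstructing sensor j to the original residual; the paper's exact SVI definition, filtering, and tuning are not reproduced, and the data are synthetic.

# Sketch: iterative substitution reconstruction and a simple validity-type ratio per sensor.
import numpy as np

rng = np.random.default_rng(6)
T = rng.normal(size=(500, 2))
A = rng.normal(size=(2, 6))
X = T @ A + 0.05 * rng.normal(size=(500, 6))
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

P = np.linalg.svd(Xs, full_matrices=False)[2][:2].T   # loadings of a 2-PC model
C = P @ P.T                                           # projection matrix

def reconstruct_iter(x, j, n_iter=100):
    """Iteratively substitute sensor j with its model-based estimate."""
    z = x.copy()
    for _ in range(n_iter):
        z[j] = (C @ z)[j]
    return z

def spe(x):
    return float(np.sum((x - C @ x) ** 2))

x = Xs[0].copy()
x[2] += 5.0                                           # bias fault on sensor index 2
ratios = [spe(reconstruct_iter(x, j)) / spe(x) for j in range(6)]
print(np.round(ratios, 3))                            # smallest ratio points to the faulty sensor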
Article
In this paper we extend previous work by ourselves and other researchers in the use of principal component analysis (PCA) for statistical process control in chemical processes. PCA has been used by several authors to develop techniques to monitor chemical processes and detect the presence of disturbances [1–5]. In past work, we have developed methods which not only detect disturbances, but isolate the sources of the disturbances [4]. The approach was based on static PCA models, T² and Q charts [6], and a model bank of possible disturbances. In this paper we use a well-known ‘time lag shift’ method to include dynamic behavior in the PCA model. The proposed dynamic PCA model development procedure is desirable due to its simplicity of construction, and is not meant to replace the many well-known and more elegant procedures used in model identification. While dynamic linear model identification, and time lag shift are well known methods in model building, this is the first application we are aware of in the area of statistical process monitoring. Extensive testing on the Tennessee Eastman process simulation [7] demonstrates the effectiveness of the proposed methodology.
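A short Python sketch of the 'time lag shift' step, assuming two lags and a small synthetic data matrix; ordinary PCA would then be applied to the augmented matrix.

# Sketch: build a lagged (augmented) data matrix for dynamic PCA.
import numpy as np

def lagged_matrix(X, lags=2):
    """Stack X(t), X(t-1), ..., X(t-lags) column-wise."""
    n, m = X.shape
    blocks = [X[lags - k : n - k] for k in range(lags + 1)]
    return np.hstack(blocks)          # shape: (n - lags, m * (lags + 1))

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
Xd = lagged_matrix(X, lags=2)
print(Xd.shape)                        # (98, 9); ordinary PCA is then applied to Xd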
Article
With process computers routinely collecting measurements on large numbers of process variables, multivariate statistical methods for the analysis, monitoring and diagnosis of process operating performance have received increasing attention. Extensions of traditional univariate Shewhart, CUSUM and EWMA control charts to multivariate quality control situations are based on Hotelling's T² statistic. Recent approaches to multivariate statistical process control which utilize not only product quality data (Y), but also all of the available process variable data (X) are based on multivariate statistical projection methods (Principal Component Analysis (PCA) and Partial Least Squares (PLS)). This paper gives an overview of these methods, and their use for the statistical process control of both continuous and batch multivariate processes. Examples are provided of their use for analysing the operations of a mineral processing plant, for on-line monitoring and fault diagnosis of a continuous polymerization process and for the on-line monitoring of an industrial batch polymerization reactor.
Article
Fault detection and process monitoring using principal component analysis (PCA) have been studied intensively and applied to industrial processes. This paper addresses some fundamental issues in detecting and identifying faults. We give conditions for detectability, reconstructability, and identifiability of faults described by fault direction vectors. Such vectors can represent process as well as sensor faults using a unified geometric approach. Measurement reconstruction is used for fault identification, and consists of sliding the sample vector towards the PCA model along the fault direction. An unreconstructed variance is defined and used to determine the number of principal components for best fault identification and reconstruction. The proposed approach is demonstrated with data from a simulated process plant. Future directions on how to incorporate dynamics and multidimensional faults are discussed.
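The Python sketch below illustrates reconstruction along a known fault direction: the sample is slid along the direction by the amount that minimizes the squared residual. The unreconstructed variance criterion for choosing the number of principal components is not reproduced here; the data and fault direction are illustrative assumptions.

# Sketch: measurement reconstruction along a fault direction xi by minimizing the residual.
import numpy as np

rng = np.random.default_rng(8)
T = rng.normal(size=(400, 2))
A = rng.normal(size=(2, 5))
X = T @ A + 0.05 * rng.normal(size=(400, 5))
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

P = np.linalg.svd(Xs, full_matrices=False)[2][:2].T   # loadings of a 2-PC model
R = np.eye(5) - P @ P.T                               # residual projector

xi = np.zeros(5); xi[1] = 1.0               # assumed fault direction: bias in sensor index 1
x = Xs[0] + 4.0 * xi                        # faulty sample

f = (xi @ R @ x) / (xi @ R @ xi)            # fault magnitude minimizing the residual
x_rec = x - f * xi                          # reconstructed (corrected) sample
print("estimated magnitude:", round(f, 2))
print("SPE before/after:", round(float(x @ R @ x), 3), round(float(x_rec @ R @ x_rec), 3))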
Article
Factor Analysis, a multivariate technique for determining the major trends or factors in a data matrix, is shown in this paper to be appropriate for resolving biochemical reaction networks. As opposed to an algorithmic approach, the methods presented in this article are intended to be a highly interactive set of tools. The researcher can use these tools to investigate a data matrix of concentration-change measurements by proposing different reaction networks. Several tools are adapted from other fields, and a few new techniques are proposed. The new techniques involve the estimation (or extraction) of reaction stoichiometries and reaction extents when all the reactions are not present at all times. This article presents theoretical elements, simulation results as well as an application of the method to experimental data from the fed-batch production of Baker's yeast grown on glucose. Reaction stoichiometries and reaction extents are estimated for the reactions of glucose fermentation, glucose oxidation and ethanol oxidation.
A unified geometric approach to process and sensor fault identification
  • R Dunia
  • S J Qin
R. Dunia, S.J. Qin, A unified geometric approach to process and sensor fault identification, Comput. Chem., in press.
Factor Analysis in Chemistry, Wiley-Interscience
  • E R Malinowski
E.R. Malinowski, Factor Analysis in Chemistry, Wiley-Interscience, New York, 1991.
Determining the number of principal components
  • S Valle-Cervantes
  • W Li
  • S J Qin
Valle-Cervantes, S., Li, W., Qin, S.J., Determining the number of principal components, in: AIChE Annual Meeting, paper 224h, 15–20 November 1998, Miami, FL.

Fig. 3. VRE vs. the number of principal components for m=11.