Computational Chemistry Column
Column Editors:
Prof. Dr. H. Huber, University of Basel
Prof. Dr. K. Müller, F. Hoffmann-La Roche AG, Basel
Prof. Dr. H.P. Lüthi, Univ. of Geneva, ETH-Zürich
Chimia 55 (2001) 70-80
© Neue Schweizerische Chemische Gesellschaft
ISSN 0009-4293
Chemometrics and Modeling
Frederic Estienne, Yvan Vander Heyden, and D. Luc Massart*
Abstract: Chemometrics is a chemical discipline in which mathematical and statistical techniques are applied
to design experiments or to analyze chemical data. An important part of chemometrics is modeling, in which
one tries to relate two or more characteristics in such a way that the obtained model represents reality as closely
as possible. In this article some less known but useful regression methods such as orthogonal least squares,
inverse and robust regression are introduced and compared with the well-known classical least squares
regression method. Genetic algorithms are described as a means of carrying out feature selection for
multivariate regression. Regression methods such as principal component regression and partial least squares
are introduced as well as the use of N-way principal components.
Keywords: Analytical chemistry · Chemical data analysis · Chemometrics · Modeling · QSAR · Regression methods
Introduction

Chemometrics has been defined [1] as a chemical discipline that uses mathematics, statistics and formal logic (a) to design or select optimal experimental procedures, (b) to provide the maximum relevant chemical information by analyzing chemical data, and (c) to obtain knowledge about chemical systems.
In this article we will focus on how
chemometrics is used for modeling pur-
poses. However, first we should note
that, while modeling is probably the most
important area of chemometrics, there
are many other applications such as
method validation, optimization, statisti-
cal process control, signal processing,
etc.
*Correspondence: Prof. Dr. D.L. Massart
Farmaceutisch Instituut
Vrije Universiteit Brussel
Laarbeeklaan 103
B-1090 Brussels
Tel.: +32 2 477 47 34
Fax: +32 2 477 47 35
E-Mail: fabi@fabi.vub.ac.be
Modeling is applied when two or
more characteristics of the same objects
are measured or calculated and then relat-
ed to each other, for example the concen-
tration of a chemical compound to an in-
strumental signal, the chemical structure
of a drug to its activity or instrumental
responses to sensory characteristics. The
purpose of the modeling usually is to
make predictions (e.g. predict the con-
centration of a certain analyte in a sample
from a measured signal), but sometimes
simply to verify the nature of the relation-
ship.
The expertise of the authors is in the
use of chemometrics for analytical chem-
ical purposes and most examples will
therefore come from that area.
Classical Univariate Least Squares:
Straight Line Models
Before introducing some of the more sophisticated methods such as genetic algorithms, latent variable procedures or neural nets, we should look briefly at the classical univariate least squares methodology (often called ordinary least squares, OLS), which is what analytical chemists generally use to construct a (linear) calibration line. In most analytical techniques the concentration of a sample cannot be measured directly but is derived from a measured signal that is in direct relation to the concentration. Suppose x represents a concentration and y the corresponding measured instrumental signal. To be able to define a model $y = f(x)$, a relationship between x and y has to exist. The simplest and most convenient situation is when the relation is linear, which leads to a model of the type $y = b_0 + b_1 x$ and which represents a straight line. The coefficients $b_0$ and $b_1$ represent the intercept and the slope of the line. Relationships between y and x that follow a curved line can for instance be represented by a regression model of the type $y = b_0 + b_1 x + b_{11} x^2$.
The least squares regression analysis
is a methodology that allows the coeffi-
cients of a given model to be estimated.
For calibration purposes one usually fo-
cuses on straight line models which we
also will do in the rest of this section.
Conventionally the x values represent the so-called controlled or independent variable, i.e. the variable that is considered not to have a measurement error (or a negligible one), which is the concentration in our case. The y values represent the dependent variable, i.e. the measured response, which is considered to have a measurement error. The least squares approach allows $b_0$ and $b_1$ values to be obtained such that the model fits the measured points $(x_i, y_i)$ best (Fig. 1).

Fig. 1. Straight line fitting through a series of measured points.

The true relationship between x and y is considered to be $y = \beta_0 + \beta_1 x$, while the relationship between each $x_i$ and its measured $y_i$ can be represented as $y_i = b_0 + b_1 x_i + e_i$. The signal $y_i$ is composed of a component predicted by the model, $b_0 + b_1 x_i$, and a random component, $e_i$, the residual (Fig. 1). The least squares regression finds the estimates $b_0$ and $b_1$ for $\beta_0$ and $\beta_1$ by calculating the values for which $\sum e_i^2 = \sum (y_i - b_0 - b_1 x_i)^2$, the sum of the squared residuals, is minimal. This explains the name 'least squares'. Standard books about regression, including least squares approaches, are given in [2][3]. Analytical chemists can find information in [4][5].
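As a small illustration (with invented data, not taken from the article), the OLS estimates of $b_0$ and $b_1$ and the residual sum of squares can be computed directly from these definitions, for example in Python:

```python
import numpy as np

# Hypothetical calibration data: concentrations x and measured signals y.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.9, 4.2, 5.9, 8.1, 9.8])

# OLS estimates: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
print(b0, b1, np.sum(residuals ** 2))  # intercept, slope, residual sum of squares
```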
Some Variants of the Univariate Least Squares Straight Line Models

A fundamental assumption of OLS is that there are only errors in the direction of y. In some instances, two measured quantities are related to each other and the assumption then does not hold, because there are also measurement errors in x. This is for instance the case when two methods are compared to each other. Often one of these methods is a reference method and the other a new method, which is faster or cheaper, and a demonstration is required that the results of both methods are sufficiently similar. A certain number of samples are analyzed with both methods and a straight line model relating both series of measurements is obtained. If $\beta_0$ as estimated from $b_0$ is not more different from 0 than an a priori accepted bias, and $\beta_1$ as estimated by $b_1$ is not more different from 1 than a given amount, then one can accept that for practical purposes y = x. In its simplest statistical expression, this means that it is tested whether $\beta_0 = 0$ and $\beta_1 = 1$, or, to put it another way, whether $b_0$ is statistically different from 0 and/or $b_1$ is statistically different from 1. If this is the case then it is concluded that the two methods do not yield the same result but that there is a constant (intercept) or proportional (slope) systematic error or bias.

This means that one should calculate $b_0$ and $b_1$, and at first sight this could be done by OLS. However, both regression variables (not only $y_i$ but now also $x_i$) are subject to error, as already mentioned. This violates one of the key assumptions of the OLS calculations. It has been shown [5-8] that the computation of $b_0$ and $b_1$ according to the OLS method leads to wrong estimates of $\beta_0$ and $\beta_1$. Significant errors in the least squares estimate of $b_1$ can be expected if the ratio between the measurement error on the x values and the range of the x values is large. In that case OLS should not be used. To obtain correct values for $b_0$ and $b_1$ the sum of least squares must now be obtained in the direction given in Fig. 2. Such methods are sometimes called errors-in-variables models or orthogonal least squares. Detailed studies of the application of models of this type can be found in [9][10].

Fig. 2. The errors-in-variables model.
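A minimal sketch of one such errors-in-variables fit is orthogonal (total least squares) regression, which minimizes perpendicular distances and assumes comparable error variances on x and y; the function name and data handling below are our own illustration, not taken from [9][10]:

```python
import numpy as np

def orthogonal_fit(x, y):
    """Orthogonal (total least squares) straight-line fit: the sum of squared
    perpendicular distances to the line is minimized, so measurement errors
    in both x and y are taken into account (equal error variances assumed)."""
    X = np.column_stack([x - x.mean(), y - y.mean()])
    # The best-fitting line direction is the first principal axis of the
    # centered data, i.e. the right singular vector with the largest
    # singular value.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    dx, dy = vt[0]            # direction of largest variation
    b1 = dy / dx              # slope of the orthogonal fit
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```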
Another possibility is to apply inverse regression. The term inverse is applied in opposition to the usual calibration procedure. Calibration consists of measuring samples with a known characteristic and deriving a calibration line (or more generally a model). A measurement is then carried out for an unknown sample and its concentration is derived from the measurement result and the calibration line. In view of the assumptions of OLS, the measurement is the y-value and the concentration the x-value, i.e.

measurement = f(concentration)   (1)

This relationship can be inverted to become

concentration = f(measurement)   (2)

OLS is then applied in the usual way, meaning that the sum of the squared residuals is minimized in the direction of y, which is now the concentration. This may appear strange since, when the calibration line is computed, there are no errors in the concentrations. However, if it is taken into account that there will be an error in the predicted concentration of the unknown sample, then minimizing in this way means that one minimizes the prediction errors, which is what is important to the analytical chemist. It has indeed been shown that better results are obtained in this way [11-13]. The analytical chemist should therefore really apply Eqn. (2) instead of the usual Eqn. (1). In most cases the difference in prediction quality between both approaches is very small in practice, so that there is generally no harm in applying Eqn. (1). We will see however that when multivariate calibration is applied, inverse regression is the rule. It should be noted that, when the aim is not to predict y-values, but to obtain the best possible estimates of $\beta_0$ and $\beta_1$, inverse regression performs less well than the usual procedure.
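The difference between classical calibration, Eqn. (1), and inverse regression, Eqn. (2), can be made concrete with a short sketch (the calibration data are invented):

```python
import numpy as np

# Hypothetical calibration set: known concentrations c, measured signals s.
c = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
s = np.array([0.05, 1.98, 4.10, 5.95, 8.02])

# Classical calibration, Eqn. (1): signal = f(concentration);
# the fitted line is then inverted to predict a concentration.
b1, b0 = np.polyfit(c, s, 1)       # slope, intercept
s_new = 3.0                        # signal measured for an unknown sample
c_classical = (s_new - b0) / b1

# Inverse regression, Eqn. (2): concentration = f(measurement);
# the squared residuals are now minimized in the concentration direction.
a1, a0 = np.polyfit(s, c, 1)
c_inverse = a0 + a1 * s_new

print(c_classical, c_inverse)      # the two predictions usually differ little
```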
Robust Regression

One of the most frequently occurring difficulties for an experimentalist is the presence of outliers. The outliers may be due to experimental error or to the fact that the proposed model does not represent the data well enough. For example, if the postulated model is a straight line, and measurements are made in a concentration range where this is no longer true, the measurements obtained in that region will be model outliers. In Fig. 3 it is clear that the last point is not representative for the straight line fitted by the rest of the data. The outlier attracts the regression line computed by OLS. It is said to exert leverage on the regression line. One might think that outliers can be discovered by examining the residuals towards the line. As can be observed, this is not necessarily true: the outlier's residual is not much larger than that of some other data points.

Fig. 3. The leverage effect.

To avoid the leverage effect, the outlier(s) should be eliminated. One way to achieve this is to use more efficient outlier diagnostics than simply looking at residuals; Cook's squared distance or the Mahalanobis distance can for instance be used.

A more elegant way is to apply so-called robust regression methods. The easiest to explain is the single median method [14]. The slope between each pair of points is computed. For instance the slope between points 1 and 2 is 1.10, between 1 and 3 1.00, between 5 and 6 6.20. The complete list is 1.10, 1.00, 1.03, 0.95, 2.00, 0.90, 1.00, 0.90, 2.23, 1.10, 0.90, 2.67, 0.70, 3.45, 6.20. These are now ranked and the median slope (here the 8th value, 1.03) is chosen. All pairs of points of which the outlier is one point have high values and end up at the end of the ranking, so that they do not have an influence on the chosen median slope: even if the outlier were still more distant, the selected median would still be the same. A similar procedure for the intercept, which we will not explain in detail, leads to the straight line equation y = 0.00 + 1.03 x, which is close to the line obtained with OLS after eliminating the outlier. The single median method is not the best robust regression method. Better results are obtained with the least median of squares method (LMS) [15], iteratively reweighted regression [16] or biweight regression [17]. Comparing results of calibration lines obtained with OLS and with a robust method is one way of finding outliers towards a regression model [18].
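A sketch of the single median method follows. The intercept rule used here (the median of $y_i - b_1 x_i$) is one common choice; the article does not detail its own intercept procedure, and the data set below is invented (the six points behind the slope list in the text are not given):

```python
import numpy as np
from itertools import combinations

def single_median_fit(x, y):
    """Single median (Theil) regression as described in the text: the slope
    is the median of the slopes of all pairs of points, which makes the fit
    robust against a single outlier."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    b1 = np.median(slopes)
    # A simple, commonly used choice for the intercept (an assumption here):
    # the median of the pointwise intercepts y - b1 * x.
    b0 = np.median(y - b1 * x)
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 2.1, 3.0, 4.1, 4.9, 11.1])   # last point is an outlier
print(single_median_fit(x, y))                  # close to y = 0 + 1.0 x
```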
Multivariate (Multiple) Regression

Multivariate regression, also often called multiple regression or multiple linear regression (MLR) in the linear case, is used to obtain values for the b coefficients in an equation of the type

$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_m x_m$   (3)

where $x_1, x_2, \dots, x_m$ are different variables. In analytical spectroscopic applications, these variables could be the absorbances obtained at different wavelengths, y being a concentration or other characteristic of the samples to be predicted; in QSAR (the study of quantitative structure-activity relationships) they could be variables such as hydrophobicity (log P) or the Hammett electronic parameter $\sigma$, with y being some measure of biological activity. In experimental design, equations of the type

$y = b_0 + b_1 x_1 + b_2 x_2 + b_{12} x_1 x_2 + b_{11} x_1^2 + b_{22} x_2^2$   (4)
are used to describe a response y as a function of the experimental variables $x_1$ and $x_2$. Both Eqn. (3) and (4) are called linear, which may surprise the non-initiated, since the shape of the relationship between y and ($x_1$, $x_2$) is certainly not linear. The term linear should be understood as linear in the regression parameters. An equation such as $y = b_0 + \log(x - b_1)$ is non-linear [2].
It can be observed from the applica-
tions cited above that multiple regression
models occur quite often. We will first
consider the classical solution to estimate
the coefficients. Later we will describe
some more sophisticated methodologies
introduced by chemometricians, such as
those based on latent vectors.
As for the univariate case, the b-values are estimates of the true β-parameters and the estimation is done by minimizing a (sum of) squares. It can be shown that

$b = (X^T X)^{-1} X^T y$

where b is the vector containing the b-values from Eqn. (3), X is an n × m matrix containing the x-values for n samples (or objects, as they are often called) and m variables, and y is the vector containing the measurements for the n samples.
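A brief sketch of this estimate with simulated data; as the comment notes, a least squares solver is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 20 samples, m = 3 variables, plus an intercept column.
n, m = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
b_true = np.array([1.0, 0.5, -2.0, 0.3])
y = X @ b_true + 0.05 * rng.normal(size=n)

# b = (X^T X)^{-1} X^T y; solving the normal equations (or using lstsq)
# avoids computing the matrix inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ y)
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b, b_lstsq)
```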
One difficulty is that the inversion of the $X^T X$ matrix leads to unstable results when the x-variables are very correlated. As we will explain later, this happens for instance with spectroscopic data. There are two ways to avoid this problem. One is to select variables (variable selection or feature selection) such that correlation is reduced; the other is to combine the variables in such a way that the resulting summarizing variables are not correlated (feature reduction). Both feature selection and feature reduction lead to a smaller number of variables than the initial number, which by itself has important advantages.
The classical approach, which is found in many statistical packages, is the so-called stepwise regression, a feature selection method. The so-called forward selection procedure consists of first selecting the variable that is best correlated with y. Suppose this is found to be $x_i$. The model at this stage is restricted to $y = f(x_i)$. Then one tests all other variables by adding them to the model, which then becomes a model in two variables, $y = f(x_i, x_j)$. The variable $x_j$ which is retained together with $x_i$ is the one which, when added to the model, leads to the largest improvement compared to the original model $y = f(x_i)$. Then it is tested whether the observed improvement is significant. If not, the procedure stops and the model is restricted to $y = f(x_i)$. If the improvement is significant, $x_j$ is incorporated definitively in the model. It is then investigated which variable should be added as the third one and whether this yields a significant improvement. The procedure is repeated until finally no further improvement is obtained. The procedure is based on analysis of variance, and several variants such as backwards elimination (starting with all variables and successively eliminating the least important ones) or a combination of forward and backward methods have also been proposed. A sketch of the forward selection loop is given below. It should be noted that the criteria applied in the analysis of variance are such that variables are automatically selected that are less correlated. In certain contexts such as experimental design or QSAR, the reason for applying feature selection is not only to avoid the numerical difficulties described above, but also to explain relationships. The variables that are included in the regression equation have a chemical and physical meaning, and when a certain variable is retained it is considered that the variable influences the y-value, e.g. the biological activity, which then leads to proposals for causal relationships. Correct feature selection then becomes very important in those situations to avoid drawing wrong conclusions. A discussion comparing different strategies for feature selection in QSAR is given in [19]. One of the problems is that the procedures involve regressing many variables on y and chance correlations may then occur [20].
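The sketch below shows the skeleton of such a forward selection loop; the F-to-enter threshold of 4 is a common rule of thumb, not a value taken from the article:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit (intercept included)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    r = y - Xc @ b
    return r @ r

def forward_selection(X, y, f_enter=4.0):
    """Forward stepwise selection: at each step add the variable that most
    reduces the residual sum of squares, and stop when the partial F-value
    of the best candidate falls below the threshold f_enter."""
    n, m = X.shape
    selected, remaining = [], list(range(m))
    current_rss = np.sum((y - y.mean()) ** 2)
    while remaining:
        trials = [(rss(X[:, selected + [j]], y), j) for j in remaining]
        best_rss, best_j = min(trials)
        dof = n - len(selected) - 2          # residual degrees of freedom
        f_value = (current_rss - best_rss) / (best_rss / dof)
        if f_value < f_enter:
            break                            # improvement not significant
        selected.append(best_j)
        remaining.remove(best_j)
        current_rss = best_rss
    return selected
```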
There are other difficulties, for instance, the choice of experimental conditions, the samples or the objects. These should cover the experimental domain as well as possible and, where possible, follow an experimental design. This is demonstrated, for instance, in [21]. Outliers can also cause problems. Detection of multivariate outliers is not evident. As for univariate regression, robust regression is possible [15][22]. An interesting example in which multivariate robust regression is applied concerns an experimental design [23] carried out to optimize the yield of an organic synthesis.
Wide Data Matrices
Chemists often produce wide data
matrices, characterized by a relatively
small number of objects (a few tens to a
few hundred) and a very large number of
variables (many hundreds, at least). For
instance, analytical chemists now often
apply very fast spectroscopic methods,
such as near infrared spectroscopy (NIR).
Because of the rapid character of the
analysis, there is no time to dissolve the
sample or separate certain constituents.
The chemist tries to extract the informa-
tion required from the spectrum as such
and to do so he has to relate a y-value
such as an octane number of gasoline
samples or a protein content of wheat
samples to the absorbance at 500 to, in
some cases, 10000 wavelengths. The
e.g. 1000 variables for 100 objects, constitute the X matrix. Such matrices contain many more columns than rows and are therefore often called wide.
Very wide matrices are also encountered in QSAR. For instance, in comparative molecular field analysis (CoMFA), developed by Cramer [24], three-dimensional grids are laid over a set of molecules whose properties in reacting with other molecules or receptors one wants to predict. At each of the resulting lattice points electrostatic, hydrophobic and steric fields are computed. A typical grid of 5 × 2 × 2 nm³ with a spacing of 0.02 nm yields 2 500 000 such lattice points for each molecule. This huge set of data constitutes the X matrix and must be related to, e.g., biological activity data.
Feature selection/reduction then takes on a completely different complexity compared to the situations described in the preceding sections. It should be noted that variables in such matrices are often very correlated. This can for instance be expected for two neighboring wavelengths in a spectrum or the fields measured at adjacent locations in the CoMFA lattice. In what follows, we will explain which methods chemometricians use to model very large, wide and highly correlated data matrices.
Genetic Algorithms for Feature
Selection
Genetic algorithms are general optimization tools aiming at selecting the fittest solution to a problem. Suppose that, to keep it simple, nine variables are measured. Possible solutions are represented in Fig. 4. Selected variables are indicated by a 1, non-selected variables by a 0. Such solutions are sometimes, in analogy with genetics, called chromosomes in the jargon of the specialists.
By random selection a set of such solutions is obtained (in real applications often several hundreds). For each solution an MLR model is built using an equation such as (3) and the sum of squares of the residuals of the objects towards that model is determined. In the jargon of the field one says that the fitness of each solution is determined: the smaller the sum of squares, the better the model describes the data and the fitter the corresponding solutions are. Then follows what is described as the selection of the fittest (leading to names such as genetic algorithms or evolutionary computation). For instance, out of the, say, 100 original solutions, the 50 fittest are retained. They are called the parent generation. From these a child generation is obtained by reproduction and mutation.
Reproduction is explained in Fig. 5. Two randomly chosen parent solutions produce two child solutions by cross-over. The cross-over point is also chosen randomly. The first part of solution 1 and the second part of solution 2 together yield child solution 1′. Solution 2′ results from the first part of solution 2 and the second part of solution 1.

The child solutions are added to the selected parent solutions to form a new generation. This is repeated for many generations and the best solution from the final generation is retained. Each generation is additionally submitted to mutation steps. Here and there randomly chosen bits of the solution string are changed (0 to 1 or 1 to 0). This is illustrated in Fig. 6.
The need for the mutation step can be understood from Fig. 5. Suppose that the best solution is close to one of the child solutions in that figure, but should not include variable 9. However, because the value for variable 9 is 1 in both parents, it is also unavoidably 1 in the children. Mutation can change this and move the solutions in a better direction.
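A minimal sketch of such a genetic algorithm for feature selection follows; the population size, number of generations and mutation rate are illustrative choices, not values from the cited references:

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(mask, X, y):
    """Sum of squared residuals of an MLR model built on the selected
    variables; smaller means fitter."""
    if not mask.any():
        return np.inf
    Xs = np.column_stack([np.ones(len(y)), X[:, mask]])
    b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ b
    return r @ r

def genetic_selection(X, y, pop=100, keep=50, generations=30, p_mut=0.01):
    """Binary 'chromosomes' mark selected variables; the fittest half is kept
    as parents, children are made by single-point cross-over, and random
    bits are mutated (0 to 1 or 1 to 0)."""
    m = X.shape[1]
    population = rng.random((pop, m)) < 0.5
    for _ in range(generations):
        order = np.argsort([fitness(c, X, y) for c in population])
        parents = population[order[:keep]]          # selection of the fittest
        children = []
        while len(children) < pop - keep:
            p1, p2 = parents[rng.integers(keep, size=2)]
            cut = rng.integers(1, m)                # random cross-over point
            children.append(np.concatenate([p1[:cut], p2[cut:]]))
        population = np.vstack([parents, np.array(children)])
        mutate = rng.random(population.shape) < p_mut
        population = population ^ mutate            # flip bits here and there
    return population[np.argmin([fitness(c, X, y) for c in population])]
```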
Genetic algorithms were first pro-
posed by Holland [25]. They were intro-
duced in chemometrics by Lucasius et al.
[26] and Leardi [27]. They were applied
for instance in QSAR and molecular
modeling [28], conformational analysis
[29], multivariate calibration for the de-
termination of certain characteristics of
polymers [30] or octane numbers [31].
Reviews about applications in chemistry
can be found in [32][33]. There are sever-
al competing algorithms such as simulat-
ed annealing [34] or the immune algo-
rithm [35].
Latent Variables for Feature Reduction: Principal Components

The alternative to feature selection is to combine the variables into what we earlier called summarizing variables. Chemometricians call these latent variables, and obtaining such variables is called feature reduction. It should be understood that in this case no variables are discarded. The type of latent variable most commonly used is the principal component (PC). To explain it we will first consider the simplest possible situation. Two variables ($x_1$ and $x_2$) were measured for a certain number of objects and the number of variables should be reduced to one. In principal component analysis (PCA) this is achieved by defining a new axis or variable on which the objects are projected. The projections are called the scores, $s_1$, along principal component 1, PC1 (Fig. 7).
Fig. 4. A set of solutions for feature selection from nine variables for MLR.

Fig. 5. Genetic algorithms: the reproduction step.

Fig. 6. Genetic algorithms: the mutation step.
The projections along PC1 preserve the information present in the $x_1$-$x_2$ plot, namely that there are two groups of data. By definition, PC1 is drawn in the direction of the largest variation through the data. A second PC, PC2, can also be obtained. By definition it is orthogonal to the first one (Fig. 8a). The scores along PC1 and along PC2 can be plotted against each other, yielding what is called a score plot (Fig. 8b).

The reader observes that PCA decorrelates: while the data points in the $x_1$-$x_2$ plot are correlated, they are no longer so in the $s_1$-$s_2$ plot. This also means that there was correlated and therefore redundant information present in $x_1$ and $x_2$. PCA picks up all the important information in PC1 and the rest, along PC2, is noise and can be eliminated. By keeping only PC1, feature reduction is applied: the number of variables, originally two, has been reduced to one. This is achieved by computing the score along PC1 as:

$s_1 = w_1 x_1 + w_2 x_2$   (5)

In other words the score is a weighted sum of the original variables. The weights are known as loadings, and plots of the loadings are called loading plots.
This can now be generalized to m dimensions. In the m-dimensional space, PC1 is obtained as the axis of largest variation in the data; PC2 is orthogonal to PC1 and is drawn in the direction of largest remaining variation around PC1. It therefore contains less variation (and information) than PC1. PC3 is orthogonal to the plane of PC1 and PC2. It is drawn in the direction of largest variation around that plane, but contains less variation than PC2. In the same way PC4 is orthogonal to the hyperplane PC1, PC2, PC3 and contains still less variation, etc. For a matrix with dimensions n × m, N = min(n, m) PCs can be extracted. However, since each of them contains less and less information, at a certain point they contain only noise and the process can be stopped before reaching N. If fewer than N PCs are retained, feature reduction is achieved.
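A compact sketch of PCA by singular value decomposition of the column-centered data matrix; it returns the scores, the loadings (the w-values of Eqn. (5)) and the fraction of variance carried by each PC:

```python
import numpy as np

def pca(X, n_components):
    """PCA via the SVD of the column-centered data: scores are the
    projections of the objects, loadings are the weights of the original
    variables on each PC."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = s**2 / np.sum(s**2)      # fraction of variance per PC
    return scores, loadings, explained[:n_components]
```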
A very important application of principal components is to visually display the information present in the data set, and most multivariate data applications therefore start with score and/or loading plots. The score plots give information about the objects and the loading plots about the variables. Both can be combined into a biplot, which is all the more effective after certain types of data transformation, e.g. spectral mapping [36]. In Fig. 9 a score plot is shown for an investigation into the Maillard reaction, a reaction between sugars and amino acids [37]. The samples consist of reaction mixtures of different combinations of sugars and amino acids. The variables are the areas under the peaks of the reaction mixtures. The reactions are very complex: 159 different peaks were observed. Each of the samples is therefore characterized by its value for 159 variables. The PC1-PC2 score plot of Fig. 9 can be seen as a projection of the samples from 159-dimensional space to the two-dimensional space that best preserves the variance in the data. In the score plot different symbols are given to the samples according to the sugar that was present, and it is observed, for instance, that samples with rhamnose occupy a specific location in the score plot. This is only possible if they also occupy a different place in the original 159-dimensional space, i.e. their GC chromatogram is different. By studying different parts of the data and by including the information from the loading plots, it is then possible to understand the effect of the starting materials on the reaction mixture obtained.
Fig. 7. Feature reduction of two variables, $x_1$ and $x_2$, by a principal component.

Fig. 8. a) Second PC and b) score plot of the data in Fig. 7.
Fig. 9. PCA score plot of samples from the Maillard reaction (PC1 accounts for 58.01% of the variance). The samples with rhamnose have symbol ○.
Principal components have been used in many different fields of application. Whenever a table of samples × variables is obtained and some correlation between the variables is expected, a principal components approach is useful. Let us consider an environmental example [38]. In Fig. 10 the score plot is shown. The data consist of air samples taken at different times at the same sampling location. For each of the samples a capillary GC chromatogram was obtained. The different symbols given to the samples indicate different wind directions prevailing at the time of sampling. Clearly the wind direction has an effect on the sample compositions. To understand this better, Fig. 11 gives a plot of the loadings of a few of the variables involved. It is observed that the loadings on PC1 are all positive and not very different. Referring to Eqn. (5), and remembering that the loadings are the weights (the w-values), this means that the score on PC1 is simply a weighted sum of the variables and therefore a global indicator of pollution. The samples with the highest scores on PC1 are those with the highest degree of pollution. Along PC2 some variables have positive loadings and others negative loadings. Those of the aliphatic variables are positive and those of the aromatic variables are negative. It follows that samples with positive scores contain more aliphatic than aromatic compounds.

Combining PC1 and PC2, one can then conclude that samples with symbol × have an aliphatic character and that the total content increases with higher values on PC1. The same reasoning can be held for the samples with symbol •: they have an aromatic character. In fact, one could define new aliphaticity and aromaticity factors as in Fig. 12. This can be done in a more formal way using what is called factor analysis.
Fig. 10. PCA score plot of air samples.
Other Latent Variables
There are other types of latent variables. In projection pursuit [37][39] a latent variable is chosen such that, instead of the largest variation in the data set, it describes the largest inhomogeneity. In this way clusters or outliers can be observed more easily. Fig. 13 shows the result applied to the Maillard data of Fig. 9, and it can be observed that the cluster of rhamnose samples now stands out more clearly.

If the y-values are not characteristics observed for a set of samples, but the class affiliation of the samples (e.g. samples 1-10 belong to class A, samples 11-25 to class B), then a latent variable can be defined that describes the largest discrimination between the classes. Such latent variables are called canonical variates or sometimes linear discriminant functions and are the basis for supervised pattern recognition methods such as linear discriminant analysis. In the partial least squares (PLS) section, a further type of latent factor will be introduced.
N-way Methods

Some data have a more complex structure than the classical 2-way matrix or table. Typical examples are met, for instance, in environmental chemistry [40]. A set of n variables can be measured
in m different locations at p different
times. This leads to a 3-way data set with
dimensions n x m x p. The three ways (or
modes) are the variable mode, the loca-
tion mode and the time mode. This can of
course be generalized to a higher number
of modes, but for the sake of simplicity
we will restrict here to 3-way. The classi-
cal approach to study such data is to per-
form what is called unfolding. Unfolding
consists of rearranging a 3-way matrix
into a 2-way matrix. The 3-way array can
be considered as several 2-way tables
(slices of the original matrix), and these
tables can be put next to each other, lead-
ing to a new 2-way array (Fig. 14). This
rearranged matrix can be treated with
PCA. Considering the example of Fig. 14,
the scores will carry information about
the locations, and the loadings mixed in-
formation about the two other modes.
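The unfolding operation itself is a simple rearrangement; a sketch with an invented 3-way array (the dimensions are chosen arbitrarily):

```python
import numpy as np

# Hypothetical 3-way array: n variables x m locations x p times.
n, m, p = 4, 5, 6
X3 = np.random.default_rng(2).normal(size=(n, m, p))

# Unfolding that preserves the 'location' mode (as in Fig. 14): the slices
# are put next to each other, giving an m x (n*p) matrix that can then be
# treated with ordinary PCA.
X_unfolded = np.moveaxis(X3, 1, 0).reshape(m, n * p)
print(X_unfolded.shape)   # (5, 24)
```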
Unfolding can be performed in differ-
ent directions so that each of the three
modes is successively preserved in the
unfolded matrix. In this way, three differ-
ent PCA models can be built, the scores
of each of these models giving informa-
tion about one of the modes. This ap-
proach is called the Tucker1 model. It is the first of a series of Tucker models [41]. The most important of these is the Tucker3 model. Tucker3 is a true N-way method as it takes into account the multi-way structure of the data. It consists in building, through an iterative process, a score matrix for each of the modes, and a core matrix defining the interactions between the modes. As in PCA, the components in each mode are constrained to be orthogonal. The number of components can be different in each mode. A graphical representation of the Tucker3 model for 3-way data is given in Fig. 15. It appears as a sum, weighted by the core matrix G, of outer products between the factors stored as columns in the A, B and C score matrices.

Another common N-way model is the Parafac-Candecomp model, proposed simultaneously by Harshman and by Carroll and Chang [42][43]. Information about N-way methods (and software) can be found in [44-46]. Applications in process control [47][48], environmental chemistry [40][49], food chemistry [50], curve resolution [51] and several other fields have been published.

Fig. 14. Unfolding of a 3-way matrix, performed preserving the 'Location' dimension.

Fig. 15. Graphical representation of the Tucker3 model. n, m, and p are the dimensions of the original matrix X; w1, w2, and w3 are the numbers of components extracted on modes 1, 2 and 3, respectively, corresponding to the numbers of columns of the loading matrices A, B and C.
Fig. 11. PCA loading plot of a few variables (e.g. n-dodecane, o-xylene) measured on the air samples.

Fig. 12. New fundamental factors (an aliphatic and an aromatic factor) discovered on a score plot.
Fig. 13. Projection pursuit plot (PP1 vs. PC2) of samples from the Maillard reaction. The samples with rhamnose have symbol ○.

Principal Component Regression (PCR)

Until now we have applied latent variables only for display purposes.
Principal components can, however, also be used as the basis of a regression method. It is applied among others when the x-values constitute a wide X-matrix, e.g. for NIR calibration (see earlier). Instead of the original x-values one applies the reduced ones, the scores. Suppose m variables (e.g. 1000) were measured for n samples (e.g. 100). As explained earlier this requires either feature selection or feature reduction. The latter can be achieved by replacing the m x-values by the scores on the k significant PCs (e.g. 5). The X matrix then no longer consists of 100 × 1000 absorbance values but of 100 × 5 scores, since each of the 100 samples is now characterized by five scores instead of 1000 variables. The regression model is:

$y = T b$   (6)

Since the score matrix T is obtained from the original variables as $T = X W$, with W the matrix of weights (loadings), Eqn (6) becomes:

$y = X W b$

By using the principal components as intermediates it is therefore possible to solve the wide X matrix regression problem. It should be noted also that the principal components are by definition not correlated, so that the correlation problem mentioned earlier is also solved.
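A sketch of PCR along these lines (SVD for the scores, then regression of y on the k retained components); the function name and data handling are our own illustration:

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal component regression: regress y on the scores of the first
    k PCs of the column-centered X, then express the model in the original
    variables, y = X W b (Eqn. (6) with T = X W)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :k] * s[:k]                  # n x k score matrix
    W = Vt[:k].T                          # m x k loading matrix
    b = np.linalg.lstsq(T, y - y_mean, rcond=None)[0]
    coef = W @ b                          # regression vector in original variables
    intercept = y_mean - x_mean @ coef
    return coef, intercept
```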
Partial Least Squares (PLS)

The aim of PLS is the same as that of PCR, namely to model a set of y-values with the data contained in an (often) wide matrix of correlated variables. However, the approach is different. In PCR one works in two steps: in the first the scores are obtained and only the X matrix is involved; in the second, y is related to the scores. In PLS this is done in one step. The latent variables are obtained, not with the variation in X as criterion, as is the case for principal components, but such that the new latent variable shows maximal covariance between X and y. This means that the latent variable is built directly as a function of the relationship between y and X. In principle one therefore expects PLS to perform better than PCR, but in practice they often perform equally well. A tutorial can be found in [52]. Several algorithms are available; a very efficient one, requiring the least computer time according to our experience, is SIMPLS [53].
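A sketch of PLS1 with the classical NIPALS deflation scheme; the article recommends SIMPLS [53], which for a single y gives an equivalent model, but NIPALS is shorter to write down:

```python
import numpy as np

def pls1_nipals(X, y, k):
    """PLS1: each latent variable is chosen to have maximal covariance
    with y (cf. the text), extracted here by NIPALS with deflation."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(k):
        w = E.T @ f
        w /= np.linalg.norm(w)            # weight: direction of max covariance
        t = E @ w                         # score
        p = E.T @ t / (t @ t)             # X loading
        c = f @ t / (t @ t)               # y loading
        E = E - np.outer(t, p)            # deflate X
        f = f - c * t                     # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector for raw x
    intercept = y_mean - x_mean @ coef
    return coef, intercept
```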
Applications of PCR and PLS

PCR and PLS have been applied in many different fields. The following references constitute a somewhat haphazard selection from a very large literature. There are many analytical applications in the pharmaceutical industry [54], the petroleum industry [55], food science [56] and environmental chemistry [57]. The methods are used with near or mid infrared [58], chromatographic [59], Raman [60], UV [61] and potentiometric [62] data. A good overview of applications in QSAR is found in [63].

PLS2 and Other Methods that Describe the Relationship between Two Tables
Instead of relating one y-value to
many x-values, it is possible to model a
set of y-values with a set of x-values.
This means that one relates two matrices
Y and X, or in other words two tables.
For instance, one could measure for a
certain set of samples a number of senso-
ry characteristics on the one hand and ob-
tain analytical measures on the other.
This would yield two tables as depicted
in Fig. 16. One could then wonder if it is
possible to predict the sensory character-
istics from the (easier to measure) chemi-
cal measurements or at least to under-
stand which (combinations) of analytical
measurements are related to which senso-
ry characteristics. At the same time one
wants to obtain information about the
structure of each of the two tables (e.g. which analytical variables give similar information). PLS2 can be used for this purpose. Other methods that can be applied are, for instance, canonical correlation and reduced rank regression. An example relating 20 measurements of mechanical strength of meat patties to the sensory evaluation of textural attributes can be found in [64], and a comparison of methods in [65].

Fig. 16. Relating two 2-way tables (samples × analytical measurements and samples × sensory data).
Generalization

It is also possible to relate multi-way models to a vector of y-values or to 2-way tables. In the same way as with 2-way data, the latent variables obtained in multi-way models are then used to build the regression models. The multi-way analog of PCR would consist in modeling the original data with Tucker3 or Parafac, and then regressing the dependent y variable on the obtained scores. A more sophisticated n-way version of PLS (N-PLS) was also developed. The principle of N-PLS is to fit a model similar to Parafac, but aiming at maximizing the covariance between the dependent and independent variables instead of fitting a model in a least squares sense. The usefulness of such approaches will be apparent from Fig. 17. In process analysis, one is concerned with the quality of finished batches and this can be described by a number of quality parameters. At the same time, for each batch a number of variables can be measured on the process as a function of time [68]. This yields a two-way table on the one hand and a three-way one on the other. Relating these tables allows one to predict the quality of a batch from the measurements made during the process.

Fig. 17. Relating a two-way table (batches × quality measurements) and a three-way table (batches × on-line measurements × time).

Acknowledgements

Y. Vander Heyden is a postdoctoral fellow of the Fund for Scientific Research (FWO) - Vlaanderen.

Received: December 21, 2000
[1] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. de Jong, P.J. Lewi, J. Smeyers-Verbeke, 'Handbook of Chemometrics', Elsevier, Amsterdam, 1997.
[2] N.R. Draper, H. Smith, 'Applied Regression Analysis', Wiley, New York, 1981.
[3] J. Mandel, 'The Statistical Analysis of Experimental Data', Wiley & Sons, New York, 1964; Dover reprint, 1984.
[4] D.L. MacTaggart, S.D. Farwell, J. Assoc. Off. Anal. Chem. 1992, 75, 594.
[5] J.C. Miller, J.N. Miller, 'Statistics for Analytical Chemistry', 3rd ed., Ellis Horwood, Chichester, 1993.
[6] W.E. Deming, 'Statistical Adjustment of Data', Wiley, New York, 1943.
[7] P.T. Boggs, C.H. Spiegelman, J.R. Donaldson, R.B. Schnabel, J. Econometrics 1988, 38, 169.
[8] P.J. Cornbleet, N. Gochman, Clin. Chem. 1979, 25, 432.
[9] C. Hartmann, J. Smeyers-Verbeke, D.L. Massart, Analusis 1993, 21, 125.
[10] J. Riu, F.X. Rius, J. Chemometr. 1995, 9, 343.
[11] R.G. Krutchkoff, Technometrics 1967, 9, 425.
[12] V. Centner, D.L. Massart, S. de Jong, Fresenius J. Anal. Chem. 1998, 361, 2.
[13] B. Grientschnig, Fresenius J. Anal. Chem. 2000, 367, 497.
[14] H. Theil, Nederlandse Akademie van Wetenschappen Proc., Ser. A 1950, 53, 386.
[15] P.J. Rousseeuw, A.M. Leroy, 'Robust Regression and Outlier Detection', Wiley, New York, 1987.
[16] G.R. Phillips, E.R. Eyring, Anal. Chem. 1983, 55, 1134.
[17] F. Mosteller, J.W. Tukey, 'Data Analysis and Regression', Addison-Wesley, Reading, 1977.
[18] P. Van Keerberghen, J. Smeyers-Verbeke, R. Leardi, C.L. Karr, D.L. Massart, Chemom. Intell. Lab. Syst. 1995, 28, 73.
[19] H. Kubinyi, Quant. Struct.-Act. Relat. 1994, 13, 285.
[20] J.G. Topliss, R.J. Costello, J. Med. Chem. 1972, 15, 1066.
[21] M. Sergent, D. Mathieu, R. Phan-Tan-Luu, G. Drava, Chemom. Intell. Lab. Syst. 1995, 27, 153.
[22] A.C. Atkinson, J. Am. Stat. Assoc. 1994, 89, 1329.
[23] S. Morgenthaler, M.M. Schumacher, Chemom. Intell. Lab. Syst. 1999, 47, 127.
[24] R.D. Cramer III, D.E. Patterson, J.D. Bunce, J. Am. Chem. Soc. 1988, 110, 5959.
[25] J.H. Holland, 'Adaptation in Natural and Artificial Systems', University of Michigan Press, Ann Arbor, MI, 1975; revised reprint, MIT Press, Cambridge, 1992.
[26] C.B. Lucasius, M.L.M. Beckers, G. Kateman, Anal. Chim. Acta 1994, 286, 135.
[27] R. Leardi, R. Boggia, M. Terrile, J. Chemom. 1992, 6, 267.
[28] J. Devillers (Ed.), 'Genetic Algorithms in Molecular Modeling', Academic Press, London, 1996.
[29] M.L.M. Beckers, E.P.P.A. Derks, W.J. Melssen, L.M.C. Buydens, Comput. Chem. 1996, 20, 449.
[30] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Anal. Chem. 1995, 67, 4295.
[31] R. Meusinger, R. Moros, Chemom. Intell. Lab. Syst. 1999, 46, 67.
[32] P. Willett, Trends Biotechnol. 1995, 13, 516.
[33] D.B. Hibbert, Chemom. Intell. Lab. Syst. 1993, 19, 277.
[34] J.H. Kalivas, J. Chemom. 1991, 5, 37.
[35] X.G. Shao, Z.H. Chen, X.Q. Lin, Fresenius J. Anal. Chem. 2000, 366, 10.
[36] P.J. Lewi, Arzneim.-Forsch. 1976, 26, 1295.
[37] Q. Guo, W. Wu, F. Questier, D.L. Massart, C. Boucon, S. de Jong, Anal. Chem. 2000, 72, 2846.
[38] J. Smeyers-Verbeke, J.C. Den Hartog, W.H. Dekker, D. Coomans, L. Buydens, D.L. Massart, Atmos. Environ. 1984, 18, 2471.
[39] J.H. Friedman, J. Am. Stat. Assoc. 1987, 82, 249.
[40] P. Barbieri, C.A. Andersson, D.L. Massart, S. Predonzani, G. Adami, G.E. Reisenhofer, Anal. Chim. Acta 1999, 398, 227.
[41] L.R. Tucker, Psychometrika 1966, 31, 279.
[42] R. Harshman, UCLA Working Papers in Phonetics 1970, 16, 1.
[43] J.D. Carroll, J. Chang, Psychometrika 1970, 35, 283.
[44] C.A. Andersson, R. Bro, Chemom. Intell. Lab. Syst. 2000, 52, 1.
[45] M. Kroonenberg, 'Three-mode Principal Component Analysis. Theory and Applications', DSWO Press, Leiden, 1983; reprint 1989.
[46] R. Henrion, Chemom. Intell. Lab. Syst. 1994, 25, 1.
[47] P. Nomikos, J.F. MacGregor, AIChE J. 1994, 40, 1361.
[48] D.J. Louwerse, A.K. Smilde, Chem. Eng. Sci. 2000, 55, 1225.
[49] R. Henrion, Chemom. Intell. Lab. Syst. 1992, 16, 87.
[50] R. Bro, Chemom. Intell. Lab. Syst. 1998, 46, 133.
[51] A. de Juan, S.C. Rutan, R. Tauler, D.L. Massart, Chemom. Intell. Lab. Syst. 1998, 40, 19.
[52] P. Geladi, B.R. Kowalski, Anal. Chim. Acta 1986, 185, 1.
[53] S. de Jong, Chemom. Intell. Lab. Syst. 1993, 18, 251.
[54] K.D. Zissis, R.G. Brereton, S. Dunkerley, R.E.A. Escott, Anal. Chim. Acta 1999, 384, 71.
[55] C.J. de Bakker, P.M. Fredericks, Appl. Spectrosc. 1995, 49, 1766.
[56] S. Vaira, V.E. Mantovani, J.C. Robles, J.C. Sanchis, H.C. Goicoechea, Anal. Lett. 1999, 32, 3131.
[57] V. Simeonov, S. Tsakovski, D.L. Massart, Toxicol. Environ. Chem. 1999, 72, 81.
[58] J.B. Cooper, K.L. Wise, W.T. Welch, M.B. Summer, B.K. Wilt, R.R. Bledsoe, Appl. Spectrosc. 1997, 51, 1613.
[59] M.P. Montana, N.B. Pappano, N.B. Debattista, J. Raba, J.M. Luco, Chromatographia 2000, 51, 727.
[60] O. Svensson, M. Josefson, F.W. Langkilde, Chemom. Intell. Lab. Syst. 2000, 49, 49.
[61] F. Vogt, M. Tacke, M. Jakusch, B. Mizaikoff, Anal. Chim. Acta 2000, 422, 187.
[62] M. Baret, D.L. Massart, P. Fabry, C. Menardo, F. Conesa, Talanta 1999, 50, 541.
[63] S. Wold, in 'Chemometric Methods in Molecular Design', Ed. H. van de Waterbeemd, VCH, Weinheim, 1995.
[64] S. Beilken, L.M. Eadie, I. Griffiths, P.N. Jones, P.V. Harris, J. Food Sci. 1991, 56, 1465.
[65] B.G.M. Vandeginste, D.L. Massart, L.M.C. Buydens, S. de Jong, P.J. Lewi, J. Smeyers-Verbeke, 'Handbook of Chemometrics and Qualimetrics', Part B, Chapter 35, Elsevier, Amsterdam, 1998.
[66] R. Bro, H. Heimdal, Chemom. Intell. Lab. Syst. 1996, 34, 85.
[67] R. Bro, J. Chemom. 1996, 10, 47.
[68] C. Duchesne, J.F. MacGregor, Chemom. Intell. Lab. Syst. 2000, 51, 125.