BioMed Central
BMC Bioinformatics
Open Access
Research article
Integrating functional genomics data using maximum likelihood
based simultaneous component analysis
Robert A van den Berg*1, Iven Van Mechelen1, Tom F Wilderjans1,
Katrijn Van Deun1, Henk AL Kiers2 and Age K Smilde3
Address: 1SymBioSys, Katholieke Universiteit Leuven, Leuven, Belgium, 2Heymans Institute, University of Groningen, Groningen, The Netherlands
and 3Biosystems data analysis, Swammerdam Institute for Life Sciences, Universiteit van Amsterdam, Amsterdam, The Netherlands
Email: Robert A van den Berg* - robert.vandenberg@psy.kuleuven.be; Iven Van Mechelen - iven.vanmechelen@psy.kuleuven.be;
Tom F Wilderjans - tom.wilderjans@psy.kuleuven.be; Katrijn Van Deun - katrijn.vandeun@psy.kuleuven.be; Henk AL Kiers - h.a.l.kiers@rug.nl;
Age K Smilde - a.k.smilde@uva.nl
* Corresponding author
Abstract
Background: In contemporary biology, complex biological processes are increasingly studied by
collecting and analyzing measurements of the same entities that are collected with different
analytical platforms. Such data comprise a number of data blocks that are coupled via a common
mode. The goal of collecting this type of data is to discover biological mechanisms that underlie the
behavior of the variables in the different data blocks. The simultaneous component analysis (SCA)
family of data analysis methods is suited for this task. However, a SCA may be hampered by the
data blocks being subjected to different amounts of measurement error, or noise. To unveil the
true mechanisms underlying the data, it could be fruitful to take noise heterogeneity into
consideration in the data analysis. Maximum likelihood based SCA (MxLSCA-P) was developed for
this purpose. In a previous simulation study it outperformed normal SCA-P. This previous study,
however, did not mimic in many respects typical functional genomics data sets, such as, data blocks
coupled via the experimental mode, more variables than experimental units, and medium to high
correlations between variables. Here, we present a new simulation study in which the usefulness
of MxLSCA-P compared to ordinary SCA-P is evaluated within a typical functional genomics setting.
Subsequently, the performance of the two methods is evaluated by analysis of a real life Escherichia
coli metabolomics data set.
Results: In the simulation study, MxLSCA-P outperforms SCA-P in terms of recovery of the true
underlying scores of the common mode and of the true values underlying the data entries.
MxLSCA-P further performed especially better when the simulated data blocks were subject to
different noise levels. In the analysis of an E. coli metabolomics data set, MxLSCA-P provided a
slightly better and more consistent interpretation.
Conclusion: MxLSCA-P is a promising addition to the SCA family. The analysis of coupled
functional genomics data blocks could benefit from its ability to take different noise levels per data
block into consideration and improve the recovery of the true patterns underlying the data.
Moreover, the maximum likelihood based approach underlying MxLSCA-P could be extended to
custom-made solutions to specific problems encountered.
Published: 16 October 2009
BMC Bioinformatics 2009, 10:340 doi:10.1186/1471-2105-10-340
Received: 23 July 2009
Accepted: 16 October 2009
This article is available from: http://www.biomedcentral.com/1471-2105/10/340
© 2009 Berg et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
In contemporary biology, it is becoming increasingly common to
study complex biological processes by collecting and analyzing
measurements on the same entities from different
sources, such as transcriptomics, metabolomics, ChIP-
chip, or proteomics. The data originating from such meas-
urements can often be organized in matrices pertaining to
experimental units (e.g., tissues or culture samples) and
variables (e.g., genes or metabolites) that were measured
on these experimental units. The experimental units, also
referred to as objects, constitute the experimental mode of
the data, and the measured biochemical compounds the
variable mode. We will denote such matrices consisting of
measurements originating from different sources by data
blocks.
Data blocks with information on the same entities stem-
ming from different sources share one of the data modes;
as such we will further denote them by the term 'coupled
data'. For instance, Ishii and coworkers [1] simultaneously
collected metabolomics, transcriptomics, and proteomics
measurements from Escherichia coli chemostat cultures
with different mutants and environmental conditions.
This yields measurements coupled via the experimental
mode. Other examples of publications involving this type
of data are [2,3]. As an alternative, data blocks can be cou-
pled via the variable mode. This occurs, for instance, in
experiments in which transcriptomics measurements are
coupled with ChIP-chip measurements [4], or even with
ChIP-chip and motif data [5].
Often, the purpose of collecting coupled data will be to
discover biological mechanisms that underlie the behav-
ior of the variables in the different data blocks. For exam-
ple, when the measurements originate from experiments
in which metabolomics and transcriptomics analyses
were conducted, the researcher could be interested in
identifying regulatory mechanisms that coordinate a joint
response on metabolome and transcriptome level.
To arrive at a comprehensive synthesis of the information
on biological mechanisms underlying coupled data
blocks, the data blocks have to be analyzed simultane-
ously. For such a synthesis, the family of simultaneous
component analysis (SCA) methods is a natural choice.
SCA methods search for important patterns in the data
blocks and reveal the contributions of the variables and
the experimental units to these patterns, similar to princi-
pal component analysis (PCA). The identified patterns
can subsequently aid the discovery of the regulatory
mechanisms underlying the data.
However, a simultaneous analysis of multiple data blocks
may be hampered by the data blocks being heterogeneous
in a number of respects. For instance, measurements orig-
inating from different functional genomics platforms can
be subject to different amounts of measurement error, or
noise, related to the accuracy of the platforms in question.
The noise present in the different data blocks can obscure
the data patterns. Therefore, it can become more difficult
to extract information regarding these patterns. For this
reason, it could be fruitful to take data block noise into
consideration in the data analysis. In particular, when
data blocks are subject to different amounts of noise, it
seems desirable to treat the data block with more noise
with more caution.
Yet, the different noise levels have to be known in order to
be taken into consideration. Often, however, it is
unknown how much noise is present in each data block.
In that case, a method is needed that also estimates
the noise in each data block. Such a method was
proposed recently in the psychometrics field: MxLSCA-P,
a maximum likelihood based SCA method (Wilderjans,
T.F., Ceulemans, E., Van Mechelen, I., van den Berg, R.A.:
Simultaneous analysis of coupled data matrices subject to
different amounts of noise, submitted). MxLSCA-P explic-
itly estimates the noise levels per data block and integrates
these estimations in the overall analysis. In a simulation
study, MxLSCA-P outperformed standard SCA-P [6] when
recovering the underlying structure of simulated data
blocks that were subject to different noise levels.
One may wish to translate the results of the simulation
study mentioned above to the analysis of coupled func-
tional genomics data. There are, however, two obstacles
that prevent a direct translation. First, the data blocks sim-
ulated in the previous study were coupled via the variable
mode, while functional genomics measurements often
pertain to measurements coupled via the experimental
mode [1-3]. Different coupling leads to a rather different
kind of analysis, in particular with regard to the type of
preprocessing that is linked to different SCA methods [7-
10]. It is therefore not self-evident that the previous results
hold for data blocks coupled via the experimental mode.
Second, the simulation study did not consider data
aspects that are typical for functional genomics, such as,
having more variables than objects, and moderate to high
correlations between variables (e.g., between two co-regu-
lated genes) as the simulation was based on randomly
generated components.
In this paper we will present a new simulation study to
overcome these obstacles and to ascertain the relevance of
MxLSCA-P for the analysis of functional genomics data
coupled via the experimental mode. For this purpose we
will determine the performance of MxLSCA-P in a context
in which (i) the experimental mode is shared; and (ii) the
correlations between variables are realistic in that they
mimic the correlations observed in a real life microbial
metabolomics data set consisting of two coupled GC/MS
(gas chromatography combined with mass spectrometry)
and LC/MS (liquid chromatography combined with mass
spectrometry) data blocks. In addition, we will also apply
standard SCA-P and MxLSCA-P to the real life metabo-
lomics data set itself. Before presenting the results of the
analysis of simulated and real-life data sets, we will now
first explain SCA-P and MxLSCA-P. Subsequently, we will
outline the problem and setup of our new simulation
study.
Simultaneous component analysis
Notation
In this paper matrices and vectors will be indicated by
bold uppercase and lowercase characters as in Kiers [11].
Elements will further be denoted by lowercase running
indices that range from 1 to the corresponding uppercase
characters. For instance, the number of objects in a data
block will be indexed by i, running from 1 to I.
General SCA decomposition
The family of SCA methods [10] comprises a wide range
of component methods that share two characteristics.
First, they reduce the dimensionality of the data blocks by
decomposing the data blocks in components, and second
they do so while minimizing the loss of information. The
SCA methods distinguish themselves from other compo-
nents methods [10] by (i) simultaneously decomposing
coupled data blocks with the different data blocks taking
exchangeable roles, and (ii) allowing for block-specific
weighting of data blocks to capture particular aspects of
the data blocks more adequately.
In general, given a set of K data blocks Xk that share an
object mode with I objects and Jk variables, and a set of
prespecified block-specific weights wk, a SCA decomposition
is given by equation (1),
with T(I × R) denoting a score matrix for R components
shared by all K data blocks, Pk(Jk × R) the accompanying
block-specific loadings, and Ek(I × Jk) a residual matrix.
This decomposition of data blocks that share the object
mode will be the reference decomposition in this paper.
For other situations in which the data blocks share a variable
mode, the SCA decomposition is given by equation (2),
with Tk(Ik × R) denoting a block-specific score matrix for R
components, P(J × R) the loadings shared by all data
blocks, and Ek(Ik × J) a residual matrix.
Model estimation
For the estimation of T and Pk, objective function (3) is
minimized. The optimal matrices T and Pk that minimize (3)
can be estimated on the basis of identity (4),
where Xc = [w1X1 ... wkXk ... wKXK], of size I × Σ_{k=1}^{K} Jk,
is the concatenation of all wkXk, and Pc^T = [P1^T ... Pk^T ... PK^T],
so that Pc, of size (Σ_{k=1}^{K} Jk) × R, is the concatenation of
all Pk; the estimates can then be obtained by means of a singular
value decomposition (SVD) [10]. For identification purposes,
the components can be constrained to have a principal axis
orientation and T or Pc to be orthonormal.
The SVD of Xc is given in equation (5).
If T is chosen to be columnwise orthonormal, T can be
obtained by choosing the R left singular vectors associated
with the R largest singular values in S. The loadings Pc are
then obtained by multiplying the R right singular vectors
by the R associated largest singular values, as in equation (6),
where the subscript 'R' indicates the R largest singular values
and accompanying singular vectors. In case Pc is chosen to be
orthonormal, Pc is put equal to VR and T to URSR.
SCA with equal block weights
SCA with equal block weights (w1 = ... = wK = w > 0) was
proposed in the psychometrics literature as SCA-P [6] and
in the chemometrics literature as SUM-PCA [12]. Both
methods fit the general SCA decomposition as methods in
which equal weights are applied to the different data
blocks. In the remainder of this paper we will refer to this
method as SCA-P.
Choosing equal block weights implies that all the data
entries in the different data blocks are considered equally
important and that no further block-specific adjustments
are made to increase or decrease their relative influence.
The equations referenced in this section are:

wk Xk = T Pk^T + Ek                                                      (1)

wk Xk = Tk P^T + Ek                                                      (2)

min_{T, Pk}  Σ_{k=1}^{K} || wk Xk - T Pk^T ||²                           (3)

min_{T, Pk}  Σ_{k=1}^{K} || wk Xk - T Pk^T ||²  =  min_{T, Pc} || Xc - T Pc^T ||²    (4)

Xc = U S V^T                                                             (5)

Pc = VR SR                                                               (6)
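As a minimal sketch of the SVD-based estimation described above (equations (3) to (6)), consider the following numpy function. This is our own illustration, not code from the paper; the function name, argument layout, and toy dimensions are assumptions.

```python
import numpy as np

def sca_p(blocks, weights, R):
    """Sketch of an SCA decomposition via an SVD of the concatenated,
    weighted data blocks.

    blocks  : list of K arrays, each I x J_k, sharing the object mode
    weights : list of K block weights w_k (all equal for SCA-P)
    R       : number of components
    Returns the scores T (I x R, columnwise orthonormal) and a list of
    block-specific loadings P_k (J_k x R).
    """
    # Concatenate the weighted blocks: X_c = [w_1 X_1 ... w_K X_K]
    Xc = np.hstack([w * X for w, X in zip(weights, blocks)])
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :R]                 # the R left singular vectors
    Pc = Vt[:R].T * s[:R]        # P_c = V_R S_R
    # Split P_c back into the block-specific loadings P_k
    sizes = np.cumsum([X.shape[1] for X in blocks])[:-1]
    return T, np.split(Pc, sizes, axis=0)
```

With all weights equal, this corresponds to the 'one entry, one vote' SCA-P/SUM-PCA variant discussed below.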
This approach was coined a 'one entry, one vote' approach
[13]. The objective function of this method is given in equation (7).
MxLSCA-P
MxLSCA-P (Wilderjans, et al.: submitted) is a stochastic
extension of the generic SCA method (1). Unlike SCA-P, it
assumes that the residuals in Ek follow a normal distribution
with a mean of zero and an unknown block-specific
variance σk², as in equation (8).
The minus loglikelihood function for the MxLSCA-P
method is then given by equation (9) (Wilderjans, et al.: submitted),
in which c denotes a constant term that does not influence
the minimization of the minus loglikelihood function.
(This equation generalizes the equivalent equation in
(Wilderjans, et al.: submitted) that pertained to the two
block case. We minimize the minus loglikelihood in line
with the optimizations discussed previously.) The
improved performance of MxLSCA-P in the previous sim-
ulation study (Wilderjans, et al.: submitted) can be under-
stood from the different model assumptions made. SCA-
P implicitly assumes that noise across the different data
blocks is identically distributed, i.e., it maximizes the like-
lihood function based on the assumption that the noise is
distributed identically in the different data blocks. When
this assumption is violated and the noise is distributed
differently, the SCA-P model becomes misspecified,
unlike MxLSCA-P that specifically allows for those differ-
ences.
The objective function of MxLSCA-P (9) differs from the
general objective function for SCA methods (3) by the
introduction of block-specific noise parameters σk. These
noise parameters act as weights on the data blocks and
appear in a new term 'I Jk log σk'. Unlike in the general SCA
decomposition, in MxLSCA-P the block weights are to be
estimated as an integrated part of the analysis.
The parameters of MxLSCA-P (σk, T, and Pk) cannot be
estimated directly via an SVD. Therefore an alternating
least squares (ALS) algorithm [14,15] was developed
(Wilderjans, et al.: submitted).
In an ALS algorithm, the parameters to be estimated are
split into subsets that are alternatingly re-estimated conditionally
on each other. In particular, the following procedure is followed:
1. The algorithm is initiated by choosing values for σk.
These starting values for σk can be determined randomly
or rationally (e.g., based on a SCA-P). It is
advised to use multiple different starting values to
avoid getting stuck in local minima.
2. The scores T and loadings Pk are estimated conditional
on the values of σk via an SVD. This SVD optimizes
the following part of the objective function:
Σ_{k=1}^{K} (1 / (2σk²)) || Xk - T Pk^T ||².
3. New estimates of σk are calculated conditional
on the previous estimates T̂ and P̂k, according to
equation (10).
4. The current value of the objective function (9) is calculated.
The second, third, and fourth steps are repeated until a
convergence criterion is met (e.g., changes in the objective
function below a prespecified threshold).
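The four ALS steps above can be sketched in numpy as follows. This is our own illustrative implementation of the procedure as described, not the authors' code; the initialization range, iteration cap, and convergence tolerance are assumptions.

```python
import numpy as np

def mxlsca_p(blocks, R, n_iter=100, tol=1e-8, seed=0):
    """Sketch of the MxLSCA-P alternating least squares scheme.
    In practice multiple random starts are advised to avoid local minima;
    a single seeded start is used here for simplicity.
    """
    rng = np.random.default_rng(seed)
    I = blocks[0].shape[0]
    sigmas = rng.uniform(0.5, 1.5, size=len(blocks))  # step 1: initial sigma_k
    prev_obj = np.inf
    for _ in range(n_iter):
        # Step 2: estimate T and P_k conditional on sigma_k, via an SVD
        # of the blocks weighted by 1/sigma_k.
        Xc = np.hstack([X / s for X, s in zip(blocks, sigmas)])
        U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
        T = U[:, :R]
        Pc = Vt[:R].T * sv[:R]
        sizes = np.cumsum([X.shape[1] for X in blocks])[:-1]
        # Undo the 1/sigma_k weighting so P_k refers to the raw blocks.
        Ps = [P * s for P, s in zip(np.split(Pc, sizes, axis=0), sigmas)]
        # Step 3: re-estimate sigma_k from the block residuals.
        sigmas = np.array([
            np.linalg.norm(X - T @ P.T) / np.sqrt(I * X.shape[1])
            for X, P in zip(blocks, Ps)])
        # Step 4: objective (minus loglikelihood up to the constant c).
        obj = sum(I * X.shape[1] * np.log(s)
                  + np.linalg.norm(X - T @ P.T) ** 2 / (2 * s ** 2)
                  for X, P, s in zip(blocks, Ps, sigmas))
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return T, Ps, sigmas
```

Note how the noise estimates enter the SVD step as inverse weights, so that a noisier block is automatically down-weighted.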
The remaining equations referenced above are:

min_{T, Pk}  Σ_{k=1}^{K} || Xk - T Pk^T ||²                              (7)

e_{i,jk}  ~  i.i.d.  N(0, σk²)                                           (8)

-l(σk, T, Pk)  =  Σ_{k=1}^{K} I Jk log σk  +  Σ_{k=1}^{K} (1 / (2σk²)) || Xk - T Pk^T ||²  +  c     (9)

σ̂k²  =  || Xk - T̂ P̂k^T ||² / (I Jk)                                     (10)

Problem and setup of the simulation study
A simulation study was set up to assess the performance of
the SCA-P and MxLSCA-P methods for the analysis of
functional genomics data blocks coupled via the experimental
mode. The performance of the methods was evaluated
in terms of their ability to recover the true structures
(Tm, P1m, P2m, X1m, and X2m) underlying two simulated
data blocks subject to different simulation settings. To
improve the realism of the simulations, the data blocks
were simulated using the correlation structure of the variables
as observed in a real life GC/MS and LC/MS microbial
metabolomics data set (see Methods section).
Furthermore, different data characteristics that could
influence the analysis of coupled functional genomics
data blocks were varied. In particular, the following characteristics
were included as design factors (see Methods
section for detailed information):
• Noise level of the data blocks. Noise can hamper the
recovery of the true data structures, especially if the
noise levels of different coupled functional genomics
data blocks would differ. In the simulation study noise
was manipulated via two factors: (i) the noise ratio
between the two data blocks (factor Noise Ratio), and
(ii) the total amount of noise on the data blocks (fac-
tor Noise Total).
• Different numbers of variables per data block. In
functional genomics research, different data blocks
can considerably differ in the number of variables
(e.g., metabolomics and transcriptomics data sets can
consist of hundreds and thousands of variables,
respectively). Moreover, the number of variables is
generally larger than the number of objects which
induces collinearity in the data [16,17]. A SCA can be
influenced by these factors in two ways. First, when
the difference between the number of variables in dif-
ferent data blocks is large, the larger data block could
dominate the analysis. Second, induced collinearity
may hamper a correct estimation of the loadings.
In this simulation study a small and a large data block
were simulated with different numbers of variables
per data block (factor Number of variables). The total
number of variables was always larger than the
number of objects such that collinearity was always
present. The large data block used the correlation
structure observed in the GC/MS data set and the
small data block the correlation structure of the LC/
MS data set.
• Relative importance of the data blocks. The variation
present in one data block, and thus its importance, can
differ from other data blocks. This could influence the
recovery of the data structures, as data blocks with
high variation can dominate other data blocks. The
variation present in the data blocks is expressed, in an
SVD, by the singular values. Here, the relative
importance of a data block was manipulated by these
singular values (factor Singular value).
In addition to these factors, a factor Method was included
in the experimental design, with SCA-P and MxLSCA-P as its
two levels. Recovery performance and the impact
on it of the factors manipulated in the simulation study
were analyzed by means of an analysis of variance
(ANOVA).
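To illustrate the kind of data this design produces, the following toy generator builds two blocks with shared true scores, block-specific loadings, and block-specific noise levels. Unlike the actual study, which reused the correlation structure of the metabolomics data, it draws the loadings at random; the function name, dimensions, and noise fractions are all illustrative assumptions.

```python
import numpy as np

def simulate_coupled_blocks(I=22, J1=140, J2=44, R=3,
                            noise_frac=(0.2, 0.4), seed=1):
    """Toy generator for two data blocks coupled via the experimental
    mode, loosely mimicking the design factors above (more variables
    than objects, block-specific noise levels).
    """
    rng = np.random.default_rng(seed)
    T = rng.standard_normal((I, R))            # true common scores Tm
    P1 = rng.standard_normal((J1, R))          # true block loadings
    P2 = rng.standard_normal((J2, R))
    X1m, X2m = T @ P1.T, T @ P2.T              # true data block entries
    blocks = []
    for Xm, frac in zip((X1m, X2m), noise_frac):
        # Scale the noise so it accounts for exactly `frac` of the total
        # sum of squares of the block (factors Noise Total / Noise Ratio).
        E = rng.standard_normal(Xm.shape)
        E *= np.sqrt(frac / (1 - frac)) * np.linalg.norm(Xm) / np.linalg.norm(E)
        blocks.append(Xm + E)
    return blocks, (T, P1, P2, X1m, X2m)
```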
Results
Performance of the SCA methods on simulated data
The recovery by the two SCA methods of the true data
structures as measured by a Fisher-Z transformed modi-
fied RV coefficient [18] (RV-Z) was generally good. Recov-
ery performance appeared to depend both on the specific
structural aspect looked at, and on data characteristics as
manipulated in the simulation study (Table 1). Below we
will discuss the different data characteristics and their
influence on the recovery of the true structural aspects.
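A recovery measure of this kind can be sketched as follows. This reflects our reading of the modified RV coefficient of [18] (the diagonals of the cross-product matrices are discarded before correlating them) combined with a Fisher-Z transform; it is a sketch under that assumption, not the authors' code.

```python
import numpy as np

def modified_rv(X, Y):
    """Modified RV coefficient between two matrices sharing the row
    (object) mode: the cross-product matrices are compared after their
    diagonals are set to zero."""
    XX = X @ X.T
    YY = Y @ Y.T
    XX -= np.diag(np.diag(XX))
    YY -= np.diag(np.diag(YY))
    return np.sum(XX * YY) / np.sqrt(np.sum(XX * XX) * np.sum(YY * YY))

def fisher_z(r):
    """Fisher-Z transform, making near-1 RV values usable in an ANOVA."""
    return 0.5 * np.log((1 + r) / (1 - r))
```

Under this definition an RV of 0.9990 indeed maps to an RV-Z of about 3.8, consistent with the footnote of Table 2.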
Most importantly for the purpose of this research, the
main effect of the factor 'Method' and its interaction with
'Noise Ratio' appeared to be sizeable for the
recovery of the true scores (Tm) as well as of the true data
block entries (X1m and X2m). In particular, MxLSCA-P
performed on average significantly better than SCA-P
(Table 2). Moreover, as appears from Figure 1, in the case
of the recovery of Tm, X1m, and X2m, MxLSCA-P especially
outperforms SCA-P when the noise levels for the data
blocks differ. For the recovery of Tm (Figure 1, left panel),
Table 1: Excerpt from the ANOVA tables of the analysis of the
recovery of the true structures underlying the simulated data.

True structure  Factor                     df  F        η²
Tm              Noise Total                2   139 855  .44
                Method                     1   106 207  .17
                Noise Ratio * Noise Total  4   26 255   .17
                Method * Noise Ratio       2   23 912   .075
                Method * Noise Total       2   22 988   .072
P1m             Noise Total                2   370 120  .36
                Noise Ratio                2   361 152  .36
                Noise Ratio * Noise Total  4   141 444  .28
P2m             Noise Total                2   112 233  .37
                Noise Ratio                2   107 744  .35
                Noise Ratio * Noise Total  4   42 273   .28
X1m             Noise Total                2   39 177   .41
                Noise Ratio * Noise Total  4   10 644   .22
                Noise Ratio                2   9 239    .096
                Method                     1   16 792   .088
                Method * Noise Ratio       2   6 599    .069
X2m             Noise Total                2   21 643   .39
                Noise Ratio * Noise Total  4   7 018    .26
                Method * Noise Ratio       2   4 622    .084
                Method                     1   8 369    .076
                Noise Ratio                2   4 169    .076

F denotes the value of the F statistic, df the degrees of freedom, and
η² the effect size. Only the most important factors in terms of
η² (η² ≥ .050) are reported (all were significant at p < .0001).
recovery was best when the largest data block, X1m, was
the least noisy. The recovery of a particular data block was
further best (in an absolute as well as a relative sense) when
that data block was subject to the least amount of noise
(Figure 1, center panel: X1m; right panel: X2m). Furthermore,
the interaction between 'Method' and 'Noise Total'
was also sizeable for the recovery of Tm. This interaction
showed that the benefit of MxLSCA-P is largest when the
total noise level is low and becomes smaller
for higher total noise levels. The advantage of MxLSCA-P
over SCA-P for recovering the true underlying structures in
the presence of different noise levels did not carry over to
the recovery of the block-specific loadings (Table 2).
One might conjecture that this result is due to differences
in the number of implicit constraints on the different constituents
of the MxLSCA-P decomposition. The scores of
the SCA decomposition are constrained to be identical for
all data blocks; as a result, these scores may be prevented
from being misguided by the data. The loadings, however, are
not subject to such a restriction and, as a result, have more
freedom to deviate from the true model structure.
A sizeable main effect of 'Noise Total' was found for the
recovery of all true structural aspects. This effect is obvious,
with more noise leading to a poorer recovery. Furthermore,
for the recovery of all block-specific structural
aspects (i.e., the true loadings P1m, P2m, and the true data
block entries of X1m and X2m), the main effect of 'Noise
Ratio' was important as well, with the true structures being
recovered better when the corresponding data block was
less noisy. Furthermore, the interaction between 'Noise
Total' and 'Noise Ratio' was substantial for the recovery of
all data structures. This interaction is plotted in Figure 2
for the recovery of X1m and X2m (for the block-specific
loadings the pattern was similar). From Figure 2 it
becomes clear that the effect of 'Noise Ratio' (i.e., better
recovery when a particular block is relatively less noisy
Figure 1
Mean recovery of Tm, X1m, and X2m for all combinations of the levels of 'Method' and 'Noise Ratio'. The recoveries
of the different true structures Tm, X1m, and X2m are given from left to right, respectively. The RV-Z is indicated on the y-axis.
The two levels of 'Method' are indicated on the x-axis. The different lines indicate the different levels of the factor Noise
Ratio (red, dashed, square = NoiseX1 < NoiseX2; solid blue, diamond = Equal; black, dashed, triangle = NoiseX1 > NoiseX2).
Table 2: Mean recoveries (RV-Z) for the levels of the design
factor Method for the recovery of the true structures Tm, X1m,
X2m, P1m, and P2m.

True structure  Method     Recovery (RV-Z)  SE
Tm              SCA-P      3.9              .0036
                MxLSCA-P   5.5              .0036
X1m             SCA-P      4.2              .0075
                MxLSCA-P   5.3              .0075
X2m             SCA-P      4.0              .0075
                MxLSCA-P   5.0              .0075
P1m             SCA-P      4.3              .0022
                MxLSCA-P   4.3              .0022
P2m             SCA-P      4.3              .0039
                MxLSCA-P   4.4              .0039

SE denotes the standard error. RV-Z values of 3.8 and 5.0 correspond
to modified RV coefficients of 0.9990 and 0.9999, respectively. The
differences between SCA-P and MxLSCA-P are significant for Tm,
X1m, and X2m (p < .05).
than the other as compared to a situation with a reverse
noise ratio) shows up only in case of low to medium noise
levels. In addition, a very good recovery is observed in case
of the combination of a low total noise level and a Noise
Ratio of 1; the latter is due to the fact that this particular
combination implies a very low total noise level for the
whole of the two data blocks (10-3%). For the recovery of
the common scores, the interaction between 'Noise Total'
and 'Noise Ratio' took a slightly different shape: Now the
two conditions of 'Noise Ratio' that implied different
noise levels for the two data blocks resulted in a better
recovery in case of low and medium 'Total noise' levels.
Analysis of real life microbial metabolomics data
To obtain an as complete as possible overview of the
changes of the concentrations of metabolites in microbial
metabolomics, multiple analytical platforms are required
[19]. In this paper, E. coli metabolomics data consisting of
metabolite concentrations that were obtained using GC/
MS and LC/MS [20] were used. The data set consisted of
28 samples of batch fermentations with varying experi-
mental conditions (e.g., low oxygen, succinate or D-glu-
cose as sole carbon source, wild type or phenylalanine
overproducing strain) taken at different time points. In
general, different analytical platforms can perform differ-
ently with regard to reproducibility. Therefore, the analy-
sis could potentially benefit from an MxLSCA-P approach
that takes noise heterogeneity into account.
We subjected the data under study to MxLSCA-P and SCA-
P analyses with three components. The three components
were selected based on the scree plots of component anal-
yses of the individual data blocks. Subsequently, the
MxLSCA-P and SCA-P score plots were compared. The first
two components appeared to be very similar: On the first
component the samples obtained from succinate grown
cells differed strongly from the other samples; the second
component showed a separation between samples
obtained under low oxygen conditions and samples
obtained at late time points of both succinate grown cells
and wild type cells. However, differences between the two
methods became apparent for the third component. In
particular, the scores on the third MxLSCA-P component
were plotted as a function of time for those conditions for
which multiple time points were sampled. For all these
plots, profiles resembling typical batch fermentation
growth curves were found. In such a growth curve, the
cells first grow fast as a sufficient amount of nutrients is
available; next, when nutrients become depleted, growth
is halted and the curve starts to decline. A typical example
of such a profile in the MxLSCA-P scores is plotted in the
upper left corner of Figure 3. For SCA-P, such typical pro-
files were also found for five experimental conditions (see
e.g., Figure 3, upper right plot), but for two experimental
Figure 2
Mean recovery of X1m (left panel) and X2m (right panel) for all combinations of the levels of 'Noise Total' and
'Noise Ratio'. The RV-Z is indicated on the y-axis. The three levels of 'Noise Total' (Low, Medium, High) are indicated on the
x-axis. The different lines indicate the levels of the factor Noise Ratio (red, dashed, square = NoiseX1 < NoiseX2; solid blue,
diamond = Equal; black, dashed, triangle = NoiseX1 > NoiseX2). The RV-Z values were averaged over the other factors, e.g.,
the factor Method.
BMC Bioinformatics 2009, 10:340 http://www.biomedcentral.com/1471-2105/10/340
Page 8 of 12
(page number not for citation purposes)
conditions the patterns differed (see e.g., Figure 3, lower
right plot). (Note that the profile in the lower right plot
cannot simply be reflected to match the typical batch fer-
mentation profile, as reflections of SCA scores and load-
ings can only be performed on the entire score or loading
vector and not on a subset of it.)
The pattern of the block-specific loadings further nicely
complemented the pattern of the scores. In particular,
inspection of the loadings on the third component for the
LC/MS data block revealed high contributions for cell wall
precursors for peptidoglycan biosynthesis [21,22] (like
UDP-N-AAGDAA and UDP-N-AAGD) and nucleotides
(such as, UDP, UTP, CMP, CDP, and CTP) that are
involved in a wide range of cellular processes, among
Figure 3. Scores on the third component of MxLSCA-P (left) and SCA-P (right) for experimental conditions 4 and 10 (from top to bottom the first and second row of panels, respectively). On the x-axis, the different time points of sampling are presented, ranging from 'early' (1) to 'late' (3, 4, and 5). The y-axis indicates the score value in arbitrary units.
which cell wall biosynthesis [22]. Cell wall biosynthesis
can be linked to the growth phases in a batch fermenta-
tion, as metabolites involved in it are likely to fluctuate
depending on these growth phases. For instance, during
exponential growth, cell wall intermediates are required
for growth and cell division, whereas during the stationary
growth phase the demand for these intermediates is
expected to drop.
The MxLSCA-P block-specific loadings for the third com-
ponent pertaining to the GC/MS data block revealed con-
sistently large contributions for uncharacterized
disaccharides; for the corresponding SCA-P loadings this
was less clearly the case. Within the context of this study,
there are two likely roles for disaccharides in E. coli, which
both could relate to variation in metabolite concentra-
tions during the different phases of a batch fermentation:
(i) In cell wall biosynthesis, the different parts of the cell
wall have polysaccharides as a major constituent, for
instance, in peptidoglycan [21,22] and in lipopolysaccha-
rides [22,23]. (ii) Disaccharides could play a role in the
internal storage of excess carbon source, during condi-
tions under which another nutrient excluding carbon
source is limiting.
Summarizing, in this case study MxLSCA-P seemed better
able to extract biologically relevant information. MxLSCA-
P provided a more consistent link to the growth phases of
the batch fermentations, both through the common
scores and through the LC/MS data block loadings. Also,
the disaccharides involved in the MxLSCA-P loadings for
the GC/MS block are likely to link up with cellular proc-
esses related to the different batch fermentation growth
phases.
Discussion
MxLSCA-P was proposed to model coupled data blocks
with heterogeneous noise levels. In a previous simulation
study, MxLSCA-P was shown to outperform SCA-P in recovering the true structure underlying the data; that study, however, did not consider typical problems encountered in functional genomics studies. In the study presented in this manuscript, the previous study was extended to address these
problems typical for functional genomics: (i) the data
were coupled via the experimental mode, (ii) the simula-
tions were based on correlation structures observed in real
life data sets, (iii) collinearity was induced by ensuring the
data had more variables than objects. Our results showed
that MxLSCA-P also outperforms SCA-P in simulated data
that mimic functional genomics data more closely. In par-
ticular, MxLSCA-P was better able to recover the true
scores (Tm) and the true data blocks (X1m and X2m), especially
when the relative noise levels differed across data blocks.
Furthermore, MxLSCA-P provided a more consistent and
biologically more meaningful interpretation of the analy-
sis of the E. coli metabolomics case study. Therefore
MxLSCA-P seems to be the preferred choice over SCA-P for
the kind of data we have studied, but probably for other
kinds of data as well.
In SCA-P, the data blocks are given equal a priori block
weights as there is no a priori reason to treat the data
blocks differently. MxLSCA-P is an extension of SCA-P in
which, as an integrated part of the analysis, the equal a pri-
ori block weights are combined with data-driven a posteri-
ori weights that reflect the noise levels of the different data
blocks, so as to de-emphasize the noisiest data
blocks. Within the family of SCA methods, other methods
exist that a priori weigh the data blocks differently to
ensure that each block makes a "fair" contribution to the
analysis. Such a weighting can be based on different con-
ceptions of fairness [10], for instance, to ensure that each
data block has the same amount of variation [12], or that
data blocks with more redundant information are down-
weighted [24]. (The latter conception is the basis of mul-
tiple factor analysis, which was recently applied for the
analysis of coupled functional genomics data blocks by de
Tayrac and coworkers [25]). Those a priori weights to
ensure a fair block weighting, however, do not take into
account differences in measurement error, or noise levels.
Indeed, analogous to SCA-P, in other SCA methods it is
implicitly assumed that the noise in the data blocks is independently and identically normally distributed. Therefore,
these other SCA methods, too, could potentially benefit
from block-specific noise estimations on the basis of max-
imum likelihood extensions as discussed in the present
paper. Following such an approach, the a priori fairness
correction could be blended with block-specific noise esti-
mations.
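The interplay between equal a priori block weights and data-driven a posteriori noise weights can be illustrated with a small sketch (an illustration of the idea, not the authors' algorithm; the block sizes, rank, and noise levels are made up): alternate between fitting a common component model to the noise-weighted, concatenated blocks and re-estimating each block's noise variance from its residuals, so that the noisier block is automatically down-weighted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two coupled blocks sharing the same experimental mode (rows),
# the second block being much noisier than the first.
T_true = rng.standard_normal((20, 2))
X1 = T_true @ rng.standard_normal((2, 15)) + 0.1 * rng.standard_normal((20, 15))
X2 = T_true @ rng.standard_normal((2, 10)) + 1.0 * rng.standard_normal((20, 10))

blocks = [X1, X2]
sigma2 = [1.0, 1.0]          # initial noise variances (equal weights = SCA-P)

for _ in range(50):
    # 1) Weight each block by its estimated noise standard deviation.
    Xw = np.hstack([X / np.sqrt(s2) for X, s2 in zip(blocks, sigma2)])
    # 2) Fit a rank-2 simultaneous component model via a truncated SVD.
    U, d, Vt = np.linalg.svd(Xw, full_matrices=False)
    T = U[:, :2]                               # common scores
    # 3) Re-estimate block-specific noise variances from the residuals.
    sigma2 = [np.mean((X - T @ T.T @ X) ** 2) for X in blocks]

# The noisier block ends up with the larger estimated noise variance,
# so it contributes less to the weighted fit.
print(sigma2[1] > sigma2[0])   # True
```

In this sketch the a priori weights are equal (each block enters the concatenation once) while the a posteriori weights 1/sigma2 are estimated from the data, mirroring the blending described above.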
SCA-P assumes that the noise levels are equal for all data
blocks. Often, this assumption does not match with situ-
ations encountered in practice. MxLSCA-P addresses this
problem by allowing for different noise levels per data
block, and by only requiring that the noise within each data block is independently and identically normally distributed. Yet, it is possible that noise levels also vary
within a data block. For example, in addition to the fact
that different measurement platforms can have different
levels of reproducibility on average, within a measure-
ment platform some variables could be measured more or
less reliably than others (e.g., because of their chemical
properties). This example illustrates that MxLSCA-P could
benefit from allowing more complex 'within data block'
error variance structures. Such complex variance structures
could be incorporated following, for instance, a general-
ized least squares approach [26,27].
Research within the functional genomics field is not limited to static experiments; experiments in which samples are obtained over time are also often conducted (e.g.,
[28,29]). To discover time-related effects in the data,
MxLSCA-P could be extended using functional data anal-
ysis approaches [30].
Sometimes, the data sets collected in functional genomics
studies are incomplete and contain missing data entries,
for instance, due to experimental complications. The
MxLSCA-P method could be extended to handle data sets
containing missing values. For this, strategies like criss-
cross regression [31,32] could be adapted.
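As an illustration of the kind of extension meant here, a minimal iterative-imputation sketch in the spirit of criss-cross and weighted least squares fitting [31,32] (our own toy version, not the authors' implementation) could look as follows: missing entries are repeatedly replaced by the values of a low-rank model fitted to the completed data.

```python
import numpy as np

rng = np.random.default_rng(5)

# Exactly rank-2 data with roughly 10% of the entries missing at random.
X = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 30))
mask = rng.random(X.shape) < 0.1              # True = missing entry
X_obs = np.where(mask, np.nan, X)

# Iterative imputation: start from zeros, fit a rank-2 model,
# overwrite only the missing entries with the model values, repeat.
X_hat = np.where(mask, 0.0, X_obs)
for _ in range(200):
    U, d, Vt = np.linalg.svd(X_hat, full_matrices=False)
    X_model = (U[:, :2] * d[:2]) @ Vt[:2]
    X_hat = np.where(mask, X_model, X_obs)

# Discrepancy with the full data: the missing entries are recovered
# from the low-rank structure of the observed entries.
err = np.max(np.abs(X_hat - X))
print("max imputation error:", err)
```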
Conclusion
MxLSCA-P is a promising addition to the SCA family. Its
ability to take different noise levels per data block into
consideration and improve the recovery of the true pat-
terns underlying the data could be beneficial for the anal-
ysis of coupled data blocks originating from different
functional genomics sources. Moreover, the maximum
likelihood based approach to SCA offers room for further
extensions to allow for custom-made solutions to specific
problems encountered in functional genomics research.
Methods
Metabolomics data
The metabolomics data set consisted of E. coli metabo-
lomes (E. coli NST 74, a phenylalanine overproducing
strain, and E. coli W3110, the wild-type strain). The E. coli
strains were grown under different experimental condi-
tions as described elsewhere [20]. The samples were ana-
lyzed by GC/MS [33] and LC/MS [34]. The GC/MS
and LC/MS samples were measured in duplicate. The final
data blocks were manually cleaned up, removing spurious
and double entries. After averaging of the duplicate meas-
urements the data consisted of 28 experiments, 131
metabolites measured by GC/MS, and 44 metabolites
measured by LC/MS. The metabolite data were autoscaled
before analysis with SCA-P and MxLSCA-P. After autoscal-
ing, each variable had mean zero and standard deviation
one.
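Autoscaling amounts to column-wise centering and scaling; a minimal sketch (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((28, 5)) * 3.0 + 10.0   # stand-in raw data: 28 experiments

# Autoscale: subtract each variable's mean, divide by its standard deviation.
X_auto = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_auto.mean(axis=0), 0.0))    # True: mean zero per variable
print(np.allclose(X_auto.std(axis=0), 1.0))     # True: unit standard deviation
```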
Simulation study
Experimental design
A full factorial design was developed for the simulation
study. Each cell of the experimental design was independ-
ently repeated 20 times. The design consisted of the fol-
lowing factors:
• The first factor is 'Method' with the two levels refer-
ring to the two different methods, SCA-P and
MxLSCA-P.
• The second factor is 'NoiseX1'. This factor determines the amount of noise on X1s, the first simulated data block (see (11)). The levels of this factor are 10^-3, 6.67, and 13.33% of noise variation of the total variation of the X1s block. The specific percentages were chosen to simplify the conversion of data block noise levels into the factor 'Noise Total' (see below).
• The third factor is 'NoiseX2'. This factor determines the amount of noise on X2s and has the same levels as the factor 'NoiseX1', now pertaining to X2s.
• The fourth factor is 'Number of variables' per X block. The first and second number indicates the number of variables of X1s and X2s, respectively. The levels are '100 - 10', '70 - 20', and '40 - 30'.
• The fifth factor is the factor 'Singular value' and its three levels are '4, 2 & 2, 1'; '2, 1 & 2, 1'; and '2, 1 & 4, 2'. The first two values become the singular values of X1m, the true X1 data block, and the second two become those of X2m. Thus, for the first level of this factor, X1m receives singular values 4 and 2, and X2m 2 and 1. Note that these singular values are scaled to correct for the number of variables in each block before they become the final singular values of the X block (see section Data generation).
To improve the interpretation of the effect of different
noise levels on the recovery of the true underlying data
structures, the noise factors of the experimental design
were converted into a 'noise ratio between data blocks
(Noise Ratio)' and a 'sum of the noise levels (Noise Total)'
factor. These factors were not part of the simulation, but
were used instead of the factors 'NoiseX1' and 'NoiseX2' as
independent variables in the ANOVA:
• The Noise Ratio between data blocks factor consisted
of three levels:
NoiseX1 < NoiseX2, Equal, and NoiseX1 > NoiseX2.
• The Noise Total factor consisted of 'Low', 'Medium',
and 'High' noise levels over all the blocks. In this
study, 'Medium' was equal to NoiseX1 + NoiseX2 =
13.33. The sum of the noise levels smaller than 13.33
was 'Low', and larger was 'High'.
These converted factors remained orthogonal to the other
design factors and to each other.
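The conversion and its orthogonality can be checked with a small sketch (the tolerance used to group sums around 13.33 is our own choice, needed because 10^-3 + 13.33 and 6.67 + 6.67 only approximately equal 13.33):

```python
# Noise levels (% of total block variation) as in the design factors above.
levels = [1e-3, 6.67, 13.33]

def noise_ratio(n1, n2):
    if n1 < n2:
        return "NoiseX1 < NoiseX2"
    if n1 > n2:
        return "NoiseX1 > NoiseX2"
    return "Equal"

def noise_total(n1, n2):
    s = n1 + n2
    if abs(s - 13.33) < 0.1:        # sums of ~13.33 count as 'Medium'
        return "Medium"
    return "Low" if s < 13.33 else "High"

cells = [(n1, n2) for n1 in levels for n2 in levels]
combos = {(noise_ratio(a, b), noise_total(a, b)) for a, b in cells}

print(noise_total(0.001, 13.33))    # Medium
print(len(combos))                  # 9: each Ratio x Total combination occurs once
```

That every one of the 3 x 3 = 9 design cells maps to a distinct (Noise Ratio, Noise Total) pair is exactly what makes the two converted factors orthogonal to each other.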
Data generation
Generation of the data blocks under the experimental design relied on Equation (11),

Xks = Tm (Pkm)T + Ekm,   (11)

where Pkm and Ekm are generated under the design factors, and the matrix Xks refers to the kth simulated data block. The true model parameters are indicated by 'm'. For completeness, the true data block is given by Xkm = Tm (Pkm)T. The simulation study was performed in Matlab R2008a (The MathWorks).
The true loading matrices P1m and P2m were generated based on the correlation matrices of the real life metabolomics data blocks. The data block obtained by GC/MS consisted of more variables than the LC/MS data block. Therefore, the GC/MS data block was used in the generation of the loadings for the largest data block in this simulation, P1m, and the LC/MS data block was used for the generation of P2m. The following procedure was followed in each simulation for the generation of the loadings:
• Randomly select Jk variables from Xkreal (Ireal × Jkreal). The label 'real' indicates that these variables pertain to the real life measurements. The number of variables Jk was given by the relevant design factor. Note that care was taken that Jk is sufficiently smaller than Jkreal to ensure the subset of selected variables was sufficiently different in each simulation.
• Calculate the correlation matrix Ckreal (Jk × Jk).
• Extract the two normalized singular vectors belonging to the two largest singular values of Ckreal. These two vectors form Vkm (Jk × 2).
• Obtain the diagonal matrix Skm (2 × 2) based on the factor 'Singular value'. Scale Skm by multiplying by √Jk to correct for differences in block size.
• Obtain the true loading matrix Pkm (Jk × 2) by multiplication of Vkm with Skm: Pkm = Vkm Skm.
For each simulation, the true score matrix Tm (20 × 2) was obtained from the left singular vectors of a centered matrix of which the elements were independently drawn from a standard normal distribution. The elements of the noise matrix Ekm (20 × Jk) were obtained in each simulation by independently drawing values from N(0, σk²). The variance parameter σk² was set such that the expected variation of Ekm was a certain percentage of the total variation. This percentage was given by the design factors NoiseX1 and NoiseX2 for the largest and smallest data block, respectively.
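Putting the pieces together, one simulated data block according to Equation (11) could be generated as in the following sketch (the loadings are a random stand-in, and reading 'total variation' as signal variation plus expected noise variation is our assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
I, J_k, noise_pct = 20, 40, 6.67          # rows, columns, design noise level (%)

# True scores: left singular vectors of a centered standard-normal matrix.
M = rng.standard_normal((I, 2))
M -= M.mean(axis=0)
T, _, _ = np.linalg.svd(M, full_matrices=False)   # (20 x 2), orthonormal columns

P = rng.standard_normal((J_k, 2))                 # stand-in for the true loadings
X_true = T @ P.T                                  # true data block (Eq. 11 without noise)

# Choose sigma^2 so that the expected noise variation is noise_pct % of the
# total variation: if E[||E||^2] / (||X_true||^2 + E[||E||^2]) = f, then
# E[||E||^2] = f / (1 - f) * ||X_true||^2.
f = noise_pct / 100
sigma2 = (f / (1 - f)) * np.sum(X_true ** 2) / (I * J_k)
E = rng.normal(0.0, np.sqrt(sigma2), size=(I, J_k))

X_sim = X_true + E                                # simulated block (Eq. 11)
print(X_sim.shape)                                # (20, 40)
```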
Recovery of the true data structures
As performance measure for the different methods, the recovery of the true component matrices Tm and Pkm and of the true data blocks Xkm from the simulated data blocks Xks by the different SCA methods was determined. The closer the estimated components resembled the true component matrices, the better a method performs. The recovery of the data structures was measured by the modified RV coefficient [18], a matrix correlation measure, as goodness of recovery measure. The modified RV coefficient ranges between -1 and 1, where 1 means perfect recovery. The modified RV coefficient is insensitive to orthogonal rotations; therefore, we expect values close to 1. The modified RV coefficients were transformed using the Fisher-Z transformation to allow for values on the entire real line instead of between -1 and 1; thus, a larger number indicates a better recovery. The transformed values are referred to as RV-Z. The recovery of the true data blocks Xkm was also analyzed by the sum of squared differences per data block. This different recovery measure did not change the conclusions of this paper. Therefore, the RV-Z measure was used as recovery measure for all data structures. The recovery measures obtained from the simulation study were analyzed by ANOVA using the GLM procedure of the software package SAS 9.2 (SAS Institute). All factors were considered fixed for the ANOVA.
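Both recovery measures are easy to state in code; the following sketch implements the modified RV coefficient in the sense of Smilde et al. [18] (the RV coefficient computed after deleting the diagonals of the inner-product matrices) together with the Fisher-Z transform, and checks the insensitivity to orthogonal rotations:

```python
import numpy as np

def modified_rv(X, Y):
    """Modified RV coefficient [18]: RV on the off-diagonal parts of XX' and YY'."""
    XX = X @ X.T
    YY = Y @ Y.T
    XX -= np.diag(np.diag(XX))            # delete the diagonals to reduce the
    YY -= np.diag(np.diag(YY))            # high-dimensionality bias of plain RV
    return np.sum(XX * YY) / np.sqrt(np.sum(XX * XX) * np.sum(YY * YY))

def fisher_z(r):
    """Fisher-Z transform: maps (-1, 1) onto the entire real line."""
    return 0.5 * np.log((1 + r) / (1 - r))

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 50))

# Recovery that is perfect up to an orthogonal rotation gives a modified RV of 1.
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
print(round(modified_rv(X, X @ Q), 6))    # 1.0
print(fisher_z(0.0))                      # 0.0
```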
Authors' contributions
RVDB performed the simulations, the analysis of the
metabolomics data, and the writing of the manuscript.
IVM recognized the usability of MxLSCA-P for the analysis
of functional genomics data and aided in the interpretation of the results and the writing of the manuscript. TFW provided useful suggestions for the setup of the simulation
study and the interpretation of the results. KVD provided
useful suggestions for the setup of the simulation study,
the ANOVA, the interpretation of the results, and the writ-
ing of the manuscript. HALK and AKS provided useful
suggestions for the interpretation of the results. All
authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank Dr. Mariët van der Werf (TNO Quality of
Life, the Netherlands) for providing the E. coli metabolomics data set. We
would also like to thank Dr. David Magis for interesting discussions. This
work was supported by the Research Fund of the Katholieke Universiteit
Leuven (EF/05/007 SymBioSys) and by IWT-Flanders (IWT/060045/SBO
Bioframe).
References
1. Ishii N, Nakahigashi K, Baba T, Robert M, Soga T, Kanai A, Hirasawa
T, Naba M, Hirai K, Hoque A, Ho PY, Kakazu Y, Sugawara K, Igarashi
S, Harada S, Masuda T, Sugiyama N, Togashi T, Hasegawa M, Takai Y,
Yugi K, Arakawa K, Iwata N, Toya Y, Nakayama Y, Nishioka T,
Shimizu K, Mori H, Tomita M: Multiple High-Throughput Analy-
ses Monitor the Response of E. coli to Perturbations. Science
2007, 316(5824):593-597.
2. Hirai MY, Yano M, Goodenowe DB, Kanaya S, Kimura T, Awazuhara
M, Arita M, Fujiwara T, Saito K: Integration of transcriptomics
and metabolomics for understanding of global responses to
nutritional stresses in Arabidopsis thaliana. Proc Natl Acad Sci
USA 2004, 101(27):10205-10210.
3. Bradley PH, Brauer MJ, Rabinowitz JD, Troyanskaya OG: Coordi-
nated Concentration Changes of Transcripts and Metabo-
lites in Saccharomyces cerevisiae. PLoS Comput Biol 2009,
5:e1000270.
4. Yu H, Luscombe NM, Qian J, Gerstein M: Genomic analysis of
gene expression relationships in transcriptional regulatory
networks. Trends Genet 2003, 19(8):422-427.
5. Lemmens K, Dhollander T, De Bie T, Monsieurs P, Engelen K, Smets
B, Winderickx J, De Moor B, Marchal K: Inferring transcriptional
modules from ChIP-chip, motif and microarray data. Genome
Biol 2006, 7(5):R37.
6. Kiers HAL, ten Berge JMF: Hierarchical relations between
methods for simultaneous component analysis and a tech-
nique for rotation to a simple simultaneous structure. Br J
Math Stat Psychol 1994, 47:109-126.
7. Timmerman ME, Kiers HAL: Four simultaneous component
models for the analysis of multivariate time series from
more than one subject to model intraindividual and interin-
dividual differences. Psychometrika 2003, 68:105-121.
8. Bro R, Smilde AK: Centering and scaling in component analy-
sis. J Chemom 2003, 17:16-33.
9. van den Berg RA, Hoefsloot HCJ, Westerhuis JA, Smilde AK, van der
Werf MJ: Centering, scaling, and transformations: improving
the biological information content of metabolomics data.
BMC Genomics 2006, 7:142.
10. Van Deun K, Smilde AK, van der Werf MJ, Kiers HAL, Van Mechelen I: A structured overview of simultaneous component based data integration. BMC Bioinformatics 2009, 10:246.
11. Kiers HAL: Towards a standardized notation and terminology
in multiway analysis. J Chemom 2000, 14(3):105-122.
12. Smilde AK, Westerhuis JA, de Jong S: A framework for sequential
multiblock component methods. J Chemom 2003, 17:323-337.
13. Wilderjans TF, Ceulemans E, Van Mechelen I: Simultaneous analy-
sis of coupled data blocks differing in size: A comparison of
two weighting schemes. Comput Stat Data An 2009,
53(4):1086-1098.
14. Carroll J, Chang JJ: Analysis of individual differences in multidi-
mensional scaling via an n-way generalization of 'Eckart-
Young' decomposition. Psychometrika 1970, 35(3):283-319.
15. Kroonenberg P, de Leeuw J: Principal component analysis of
three-mode data by means of alternating least squares algo-
rithms. Psychometrika 1980, 45:69-97.
16. Kiers HAL, Smilde AK: A comparison of various methods for
multivariate regression with highly collinear variables. Stat
Methods Appl 2007, 16(2):193-228.
17. Eilers PHC, Boer JM, van Ommen GJ, van Houwelingen HC: Classi-
fication of microarray data with penalized logistic regres-
sion. Proceedings of SPIE 2001, 4266:187-198.
18. Smilde AK, Kiers HAL, Bijlsma S, Rubingh CM, van Erk MJ: Matrix
correlations for high-dimensional data: the modified RV-
coefficient. Bioinformatics 2009, 25(3):401-405.
19. van der Werf MJ, Overkamp KM, Muilwijk B, Coulier L, Hankemeier
T: Microbial metabolomics: Toward a platform with full
metabolome coverage. Anal Biochem 2007, 370:17-25.
20. Smilde AK, van der Werf MJ, Bijlsma S, van der Werff-van der Vat
BJC, Jellema RH: Fusion of mass-spectrometry-based metabo-
lomics data. Anal Chem 2005, 77(20):6729-6736.
21. van Heijenoort J: Recent advances in the formation of the bac-
terial peptidoglycan monomer unit. Nat Prod Rep 2001,
18:503-519.
22. Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham JL, Paley S, Paulsen IT, Peralta-Gil M, Karp PD: EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res 2005, 33(suppl_1):D334-D337.
23. Bos MP, Robert V, Tommassen J: Biogenesis of the gram-nega-
tive bacterial outer membrane. Annu Rev Microbiol 2007,
61:191-214.
24. Pagès J: Collection and analysis of perceived product inter-dis-
tances using multiple factor analysis: Application to the
study of 10 white wines from the Loire Valley. Food Qual Pref
2005, 16(7):642-649.
25. de Tayrac M, Le S, Aubry M, Mosser J, Husson F: Simultaneous
analysis of distinct Omics data sets with integration of bio-
logical knowledge: Multiple Factor Analysis approach. BMC
Genomics 2009, 10:32.
26. Bro R, Sidiropoulos ND, Smilde AK: Maximum likelihood fitting
using ordinary least squares algorithms. J Chemom 2002,
16:387-400.
27. Johnston J, DiNardo J: Econometric Methods 4th edition. New York: McGraw Hill Higher Education; 1997. [ISBN 978-0071259644]
28. Rubingh CM, Bijlsma S, Jellema RH, Overkamp KM, van der Werf MJ,
Smilde AK: Analyzing Longitudinal Microbial Metabolomics
Data. J Proteome Res 2009, 8(9):4319-4327.
29. Blanchard JL, Wholey WY, Conlon EM, Pomposiello PJ: Rapid
Changes in Gene Expression Dynamics in Response to
Superoxide Reveal SoxRS-Dependent and Independent
Transcriptional Networks. PLoS ONE 2007, 2(11):e1186.
30. Ramsay J, Silverman BW: Functional Data Analysis 2nd edition. New
York: Springer; 2005. [ISBN-10: 038740080X]
31. Kiers HAL: Weighted least squares fitting using ordinary least
squares algorithms. Psychometrika 1997, 62(2):251-266.
32. Gabriel KR, Zamir S: Lower Rank Approximation of Matrices
by Least Squares with Any Choice of Weights. Technometrics
1979, 21(4):489-498.
33. Koek M, Muilwijk B, van der Werf MJ, Hankemeier T: Microbial
metabolomics with gas chromatography mass spectrome-
try. Anal Chem 2006, 78(4):1272-1281.
34. Coulier L, Bas R, Jespersen S, Verheij E, van der Werf MJ, Hankemeier
T: Simultaneous Quantitative Analysis of Metabolites Using
Ion-Pair Liquid Chromatography-Electrospray Ionization
Mass Spectrometry. Anal Chem 2006, 78(18):6573-6582.
... Unless external information on the amount of noise in each data block is available, such a weighing strategy requires tools to estimate this information from the data at hand. For this purpose, Wilderjans et al. [54] propose a minimal stochastic extension of the SCA model (5) that includes a homoscedastic normally distributed error term with a block-specific error variance [in [55], a similar extension was proposed for (10)]. Maximum-likelihood estimation of the resulting models then automatically implies a down-weighing of more noisy data blocks. ...
... Maximum-likelihood estimation of the resulting models then automatically implies a down-weighing of more noisy data blocks. In [44] and [55], results of (synthetic and realistic) simulation studies are presented that show that the proposed weighing scheme implies a better recovery of the underlying true SCA structure (which is an argument in favor of the weighing scheme in question). The gain in recovery is sizeable only, however, if the amount of noise is larger in the data blocks that are larger in size. ...
Article
Full-text available
We start from a few examples of coupled behavioral sciences data, along with associated research questions and data-analytic methods. Linking up with these, we introduce a few concepts and distinctions, by means of which we specify the focus of this paper: 1) data that take the form of a collection of coupled matrices that are linked in either the experimental unit or the variable data mode; 2) associated with questions about the mechanisms underlying these data matrices; 3) which are to be addressed by data-analytic methods that rely on a submodel per data matrix, with a common parameterization of the shared data mode. Next, we outline the principles of two closely related families within this focus: the families of multiblock component- and factor-based models for data fusion (while considering both deterministic and stochastic model variants). Then, we review developments within these families to capture both similarities and differences between the different data matrices under study. We follow with a discussion on recent attempts to address quite a few challenges in data fusion based on multiblock component and factor models, including whether and how to differentially weigh the different data matrices under study, and problems such as dealing with large numbers of variables, outliers, and missing values. While the focus of this paper is on data and modeling contributions from the behavioral sciences, we point in a concluding section at their relevance for other domains and at the importance of related methods developed in those domains.
... There is a need for statistical methods appropriate for doing an integrative analysis of coupled binary and quantitative data sets in biology research. The standard SCA models [65,66] that use column centering processing steps and least-squares loss criteria are not appropriate for binary data sets. Recently, iClusterPlus [67] was proposed as a factor analysis framework to model discrete and quantitative data sets simultaneously by exploiting the properties of exponential family distributions. ...
Preprint
In systems biology, it is common to measure biochemical entities at different levels of the same biological system. One of the central problems for the data fusion of such data sets is the heterogeneity of the data. This thesis discusses two types of heterogeneity. The first one is the type of data, such as metabolomics, proteomics and RNAseq data in genomics. These different omics data reflect the properties of the studied biological system from different perspectives. The second one is the type of scale, which indicates the measurements obtained at different scales, such as binary, ordinal, interval and ratio-scaled variables. In this thesis, we developed several statistical methods capable to fuse data sets of these two types of heterogeneity. The advantages of the proposed methods in comparison with other approaches are assessed using comprehensive simulations as well as the analysis of real biological data sets.
... There is a need for statistical methods appropriate for doing the integrative analysis of coupled binary and quantitative data sets in biology research. The standard SCA models that use column centering processing steps and least squares loss criteria are not appropriate for binary data sets [26,25]. Recently, [19] derived a factor analysis framework to model discrete and quantitative data sets simultaneously by exploiting the properties of exponential family distributions. ...
Preprint
In the current era of systems biological research there is a need for the integrative analysis of binary and quantitative genomics data sets measured on the same objects. We generalize the simultaneous component analysis (SCA) model, a canonical tool for the integrative analysis of multiple quantitative data sets, from the probabilistic perspective to explore the underlying dependence structure present in both these distinct measurements. Similar as in the SCA model, a common low dimensional subspace is assumed to represent the shared information between these two distinct measurements. However, the generalized SCA model can easily be overfit by using exact low rank constraint, leading to very large estimated parameters. We propose to use concave penalties in the low rank matrix approximation framework to mitigate this problem of overfitting and to achieve a low rank constraint simultaneously. An efficient majorization algorithm is developed to fit this model with different concave penalties. Realistic simulations (low signal to noise ratio and highly imbalanced binary data) are used to evaluate the performance of the proposed model in exactly recovering the underlying structure. Also, a missing value based cross validation procedure is implemented for model selection. In addition, exploratory data analysis of the quantitative gene expression and binary copy number aberrations (CNA) measurement obtained from the same 160 cell lines of the GDSC1000 data sets successfully show the utility of the proposed method.
... However, these standard methods cannot handle complementary data obtained from multiple platforms. A commonly used method to fuse data from multiple platforms, to reveal their underlying relationships, is simultaneous component analysis (SCA) [6][7][8]. While useful for certain questions, it does not specifically classify or make predictions about class membership. ...
Article
Full-text available
Combining different metabolomics platforms can contribute significantly to the discovery of complementary processes expressed under different conditions. However, analysing the fused data might be hampered by the difference in their quality. In metabolomics data, one often observes that measurement errors increase with increasing measurement level and that different platforms have different measurement error variance. In this paper we compare three different approaches to correct for the measurement error heterogeneity, by transformation of the raw data, by weighted filtering before modelling and by a modelling approach using a weighted sum of residuals. For an illustration of these different approaches we analyse data from healthy obese and diabetic obese individuals, obtained from two metabolomics platforms. Concluding, the filtering and modelling approaches that both estimate a model of the measurement error did not outperform the data transformation approaches for this application. This is probably due to the limited difference in measurement error and the fact that estimation of measurement error models is unstable due to the small number of repeats available. A transformation of the data improves the classification of the two groups.
Article
In the current era of systems biology research, there is a need for the integrative analysis of binary and quantitative genomics data sets measured on the same objects. One standard tool of exploring the underlying dependence structure present in multiple quantitative data sets is the simultaneous component analysis (SCA) model. However, it does not have any provisions when a part of the data are binary. To this end, we propose the generalized SCA (GSCA) model, which takes into account the distinct mathematical properties of binary and quantitative measurements in the maximum likelihood framework. Like in the SCA model, a common low‐dimensional subspace is assumed to represent the shared information between these two distinct types of measurements. To achieve a low rank solution, we propose to use a concave variant of the nuclear norm penalty. An efficient majorization algorithm is developed to fit this model with different concave penalties. Realistic simulations (low signal‐to‐noise ratio and highly imbalanced binary data) are used to evaluate the performance of the proposed model in recovering the underlying structure. Also, a missing value based cross‐validation procedure is implemented for model selection. We illustrate the usefulness of the GSCA model for exploratory data analysis of quantitative gene expression and binary copy number aberration measurements obtained from the GDSC1000 data sets.
Article
Full-text available
Interdisciplinary research often involves analyzing data obtained from different data sources with respect to the same subjects, objects, or experimental units. For example, global positioning system (GPS) data have been coupled with travel diary data, resulting in a better understanding of traveling behavior. The GPS data and the travel diary data are very different in nature, and, to analyze the two types of data jointly, one often uses data integration techniques, such as the regularized simultaneous component analysis (regularized SCA) method. Regularized SCA is an extension of the (sparse) principal component analysis model to the case where at least two data blocks are jointly analyzed, which, in order to reveal the joint and unique sources of variation, relies heavily on proper selection of the set of variables (i.e., component loadings) in the components. Regularized SCA requires a proper variable selection method to either identify the optimal values for tuning parameters or stably select variables. By means of two simulation studies with various noise and sparseness levels in the simulated data, we compare six variable selection methods: cross-validation (CV) with the “one-standard-error” rule, repeated double CV (rdCV), BIC, Bolasso with CV, stability selection, and the index of sparseness (IS), a lesser-known but computationally efficient method. Results show that IS is the best-performing variable selection method.
Article
In the behavioral sciences, many research questions pertain to a regression problem in that one wants to predict a criterion on the basis of a number of predictors. Although in many cases ordinary least squares regression will suffice, sometimes the prediction problem is more challenging, for three reasons. First, multiple highly collinear predictors can be available, making it difficult to grasp their mutual relations as well as their relations to the criterion. In that case, it may be very useful to reduce the predictors to a few summary variables, on which one regresses the criterion and which at the same time yield insight into the predictor structure. Second, the population under study may consist of a few unknown subgroups that are characterized by different regression models. Third, the obtained data are often hierarchically structured, with, for instance, observations being nested within persons or participants within groups or countries. Although some methods have been developed that partially meet these challenges (i.e., principal covariates regression (PCovR), clusterwise regression (CR), and structural equation models), none of these methods adequately deals with all of them simultaneously. To fill this gap, we propose the principal covariates clusterwise regression (PCCR) method, which combines the key ideas behind PCovR (de Jong & Kiers in Chemom Intell Lab Syst 14(1–3):155–164, 1992) and CR (Späth in Computing 22(4):367–373, 1979). The PCCR method is validated by means of a simulation study and by applying it to cross-cultural data regarding satisfaction with life.
Article
The integration of multiblock high-throughput data from multiple sources is one of the major challenges in several disciplines, including metabolomics, computational biology, genomics, and clinical psychology. A main challenge in this line of research is to obtain interpretable results that (1) give insight into the common and distinctive sources of variation associated with the multiple and heterogeneous data blocks and (2) facilitate the identification of relevant variables. We present a novel variable selection method for performing data integration, providing easily interpretable results, and recovering underlying data structure such as common and distinctive components. The flexibility and applicability of this method are showcased via numerical simulations and an application to metabolomics data.
Book
This chapter reviews different chemometrics methods for the analysis of genomics, transcriptomics, proteomics, metabolomics, and metagenomics datasets. It discusses a range of statistical data integration techniques.
Article
Full-text available
Behavioral researchers often obtain information about the same set of entities from different sources. A main challenge in the analysis of such data is to reveal, on the one hand, the mechanisms underlying all of the data blocks under study and, on the other hand, the mechanisms underlying a single data block or a few such blocks only (i.e., common and distinctive mechanisms, respectively). A method called DISCO-SCA has been proposed by which such mechanisms can be found. The goal of this article is to make the DISCO-SCA method more accessible, in particular for applied researchers. To this end, first we will illustrate the different steps in a DISCO-SCA analysis, with data stemming from the domain of psychiatric diagnosis. Second, we will present in this article the DISCO-SCA graphical user interface (GUI). The main benefits of the DISCO-SCA GUI are that it is easy to use, strongly facilitates the choice of model selection parameters (such as the number of mechanisms and their status as being common or distinctive), and is freely available.
Article
Full-text available
Classification of microarray data needs a firm statistical basis. In principle, logistic regression can provide it, modeling the probability of class membership with (transforms of) linear combinations of explanatory variables. However, classical logistic regression does not work for microarrays, because generally there will be far more variables than observations. One problem is multicollinearity: the estimating equations become singular and have no unique and stable solution. A second problem is over-fitting: a model may fit a data set well but perform badly when used to classify new data. We propose penalized likelihood as a solution to both problems. The values of the regression coefficients are constrained in a similar way as in ridge regression. All variables play an equal role; there is no ad hoc selection of the most relevant or most expressed genes. The dimension of the resulting system of equations is equal to the number of variables and generally will be too large for most computers, but it can be dramatically reduced with the singular value decomposition of some matrices. The penalty is optimized with AIC (Akaike's information criterion), which essentially is a measure of prediction performance. We find that penalized logistic regression performs well on a public data set (the MIT ALL/AML data).
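The SVD trick mentioned in this abstract can be sketched as follows: because the ridge-penalized coefficients lie in the row space of X, one can fit the penalized logistic model in the n-dimensional score space T = US and map the solution back via V. This is a hedged illustration, not the paper's implementation; the synthetic data, penalty value `lam`, and the plain Newton iteration are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Microarray-like" shape: far more variables (p) than observations (n).
n, p = 40, 500
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def ridge_logistic_svd(X, y, lam=10.0, iters=50):
    """L2-penalized logistic regression fitted in the n-dimensional
    SVD score space, then mapped back to the p original variables."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U * s                      # n x r score matrix, r = min(n, p)
    gamma = np.zeros(T.shape[1])
    for _ in range(iters):         # Newton steps on the penalized likelihood
        mu = 1 / (1 + np.exp(-T @ gamma))
        W = mu * (1 - mu)
        H = T.T @ (T * W[:, None]) + lam * np.eye(T.shape[1])
        gamma = gamma + np.linalg.solve(H, T.T @ (y - mu) - lam * gamma)
    return Vt.T @ gamma            # coefficients in the original space

beta = ridge_logistic_svd(X, y)
prob = 1 / (1 + np.exp(-X @ beta))
```

The system solved at each step is r x r rather than p x p, which is what makes the p >> n case tractable on ordinary hardware.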
Article
This paper presents a standardized notation and terminology to be used for three- and multiway analyses, especially when these involve (variants of) the CANDECOMP/PARAFAC model and the Tucker model. The notation also deals with basic aspects such as symbols for different kinds of products, and terminology for three- and higher-way data. The choices for terminology and symbols to be used have to some extent been based on earlier (informal) conventions. Simplicity and reduction of the possibility of confusion have also played a role in the choices made.
Article
Functional data analysis (FDA) models data using functions or functional parameters. The complexity of the functions is not assumed to be known in advance, so that methods are used for approximating these with as much flexibility as the data require. Keywords: functional parameters; splines; basis functions; roughness penalty; dynamic model; registration
Article
Reduced rank approximation of matrices has hitherto been possible only by unweighted least squares. This paper presents iterative techniques for obtaining such approximations when weights are introduced. The techniques involve criss-cross regressions with careful initialization. Possible applications of the approximation are in modelling, biplotting, contingency table analysis, fitting of missing values, checking outliers, etc.
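The criss-cross regression idea above can be sketched directly: alternate between weighted least squares fits of the row factors and the column factors, each step decreasing the weighted reconstruction error. This is a minimal sketch under assumed synthetic data and an SVD-based initialization ("careful initialization" in the abstract's terms); it is not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Rank-2 matrix observed with noise, plus a positive weight matrix W
# (e.g. inverse error variances; zero weights would encode missing cells).
n, m, k = 15, 10, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, m)) + rng.normal(scale=0.05, size=(n, m))
W = rng.uniform(0.5, 2.0, size=(n, m))

def weighted_low_rank(X, W, k, iters=100):
    """Alternating ('criss-cross') weighted regressions minimising
    || sqrt(W) * (X - A @ B.T) ||_F^2."""
    A = np.linalg.svd(X, full_matrices=False)[0][:, :k]  # initialise from unweighted SVD
    B = np.zeros((X.shape[1], k))
    for _ in range(iters):
        for j in range(X.shape[1]):   # regress each column of X on A
            Wa = A * W[:, j][:, None]
            B[j] = np.linalg.solve(Wa.T @ A, Wa.T @ X[:, j])
        for i in range(X.shape[0]):   # regress each row of X on B
            Wb = B * W[i][:, None]
            A[i] = np.linalg.solve(Wb.T @ B, Wb.T @ X[i])
    return A, B

A, B = weighted_low_rank(X, W, k)
err = np.sqrt(np.mean(W * (X - A @ B.T) ** 2))
```

Each half-step is an exact weighted least squares solve, so the objective is monotonically non-increasing; with all weights equal the fixed point coincides with the truncated SVD.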
Article
The concept of sensory distance between products is familiar. It is used mainly in graphical displays, for example those provided by principal components analysis carried out on conventional profiles. In such a methodology, the importance (or weight) granted to the descriptors by the subjects is not taken into account in the averaging process. To take these weights into account, two practices are classically used: similarity scaling and multidimensional sorting. Here a third way is proposed: the direct collection of the perceived inter-product distances from each subject. In this method, each subject is asked to position the products on a large sheet of paper according to the sensory distances he perceives between the products. Thus, for each subject, the data consist of a two-dimensional configuration of the products. These configurations can be analysed by multiple factor analysis. This procedure is described here and applied to 10 white wines from the Loire Valley; the results are interpreted with the help of a classical descriptive sensory analysis.
Article
Discusses methods for simultaneous component analysis (SCA) of scores of 2 or more groups of individuals on the same variables. A method is developed for SCA such that for each set essentially the same component structure (ST) is found (SCA-ST). The method is compared with those that use the same component weights matrix (SCA-W) or the same pattern matrix (SCA-P) across data sets. SCA-W always explains the highest amount of variance and SCA-ST the lowest. These explained variances can be compared to the amount of variance explained by separate principal components analyses. It is shown how, for cases where SCA-ST does not fit well, one can use SCA-W (and SCA-P) to find out if and how correlational structures differ. Facilitating the interpretation of an SCA-ST solution is discussed. Rotational freedom is exploited in a specially designed simple structure rotation technique for SCA-ST, which is illustrated on an empirical data set.
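The SCA-P variant described above (one pattern matrix shared by all groups, group-specific scores) can be sketched via the SVD of the vertically stacked, per-group centered data. This is a bare-bones illustration with simulated groups; the function name `sca_p` and the per-group centering choice are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

# Three hypothetical groups of individuals measured on the same 6 variables.
blocks = [rng.normal(size=(n_k, 6)) for n_k in (30, 25, 40)]

def sca_p(blocks, k=2):
    """SCA-P sketch: one loading (pattern) matrix P shared by all groups,
    group-specific scores T_k, obtained from the SVD of the stacked data."""
    centered = [X - X.mean(axis=0) for X in blocks]     # centre each group
    U, s, Vt = np.linalg.svd(np.vstack(centered), full_matrices=False)
    P = Vt[:k].T                                        # common loadings
    scores = [X @ P for X in centered]                  # per-group scores
    return scores, P

scores, P = sca_p(blocks)
```

SCA-W and SCA-ST add further constraints (equal weights, or essentially equal structure per set) and therefore explain at most as much variance as this unconstrained-scores solution.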
Article
Multiblock or multiset methods are starting to be used in chemistry and biology to study complex data sets. In chemometrics, sequential multiblock methods are popular; that is, methods that calculate one component at a time and use deflation for finding the next component. In this paper a framework is provided for sequential multiblock methods, including hierarchical PCA (HPCA; two versions), consensus PCA (CPCA; two versions) and generalized PCA (GPCA). Properties of the methods are derived and characteristics of the methods are discussed. All this is illustrated with a real five-block example from chromatography. The only methods with clear optimization criteria are GPCA and one version of CPCA. Of these, GPCA is shown to give inferior results compared with CPCA.
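The sequential one-component-at-a-time scheme mentioned above can be sketched for consensus PCA: block loadings and block scores are combined into a super score via super weights, iterated to a fixed point, after which the blocks would be deflated and the procedure repeated. This is a sketch of one NIPALS-style CPCA variant only; the published versions differ in their normalisation choices, and the data here are simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Three blocks sharing the same 50 objects (rows), different variables.
n = 50
blocks = [rng.normal(size=(n, p)) for p in (4, 6, 5)]
blocks = [Xb - Xb.mean(axis=0) for Xb in blocks]   # column-centre each block

def cpca_component(blocks, iters=500, tol=1e-10):
    """One consensus-PCA component, NIPALS style (sketch of one variant)."""
    t = blocks[0][:, 0] / np.linalg.norm(blocks[0][:, 0])   # start: first column
    for _ in range(iters):
        loadings = [Xb.T @ t for Xb in blocks]              # block loadings
        T = np.column_stack(                                # block scores
            [Xb @ p / (p @ p) for Xb, p in zip(blocks, loadings)])
        w = T.T @ t
        w /= np.linalg.norm(w)                              # super weights
        t_new = T @ w
        t_new /= np.linalg.norm(t_new)                      # super score
        if np.linalg.norm(t_new - t) < tol:
            return t_new, w
        t = t_new
    return t, w

t, w = cpca_component(blocks)
```

For further components, each block would be deflated by the rank-one part explained by the super score before rerunning the loop; which quantity is deflated is exactly where the HPCA/CPCA/GPCA variants diverge.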