Article

Clusterwise Parafac to identify heterogeneity in three-way data

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

In many research areas, the Parafac model is adopted to disclose the underlying structure of three-way three-mode data. In this model, a set of latent variables, called components, is sought that captures the complex interaction between the elements of the three modes. An important assumption of this model is that these components are the same for all elements of the three modes. In many cases, however, it makes sense to assume that the components may differ (i.e., qualitative differences in underlying component structure) across groups of elements of one of the modes. Therefore, in this paper, we present Clusterwise Parafac. In this new model, the elements of one of the three modes are assigned to a limited number of mutually exclusive clusters and, simultaneously, the data within each cluster are modeled with Parafac. As such, elements that belong to the same cluster are assumed to be governed by the same components, whereas elements that are assigned to different clusters have a different underlying component structure. To evaluate the performance of the new Clusterwise Parafac strategy, an extensive simulation study is conducted. Moreover, the strategy is applied to sensory profiling data regarding different cheeses.
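The alternation described in the abstract (assign the elements of one mode to clusters, fit a Parafac model within each cluster) can be illustrated with a short NumPy sketch. This is not the authors' algorithm: the function names `khatri_rao`, `parafac_als`, and `clusterwise_parafac`, the random initialization, the fixed iteration counts, and the handling of empty clusters are all assumptions made for illustration, and the first mode is arbitrarily chosen as the clustered mode.

```python
import numpy as np

def khatri_rao(U, V):
    # Column-wise Kronecker product; rows are indexed by (u, v) with u varying slowest.
    return np.einsum('ur,vr->uvr', U, V).reshape(U.shape[0] * V.shape[0], -1)

def parafac_als(X, R, n_iter=50, seed=0):
    """Plain ALS for an R-component Parafac model of an I x J x K array (illustrative)."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    X1 = X.reshape(I, J * K)                     # mode-1 unfolding
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)  # mode-2 unfolding
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)  # mode-3 unfolding
    for _ in range(n_iter):
        A = X1 @ np.linalg.pinv(khatri_rao(B, C).T)
        B = X2 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = X3 @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

def clusterwise_parafac(X, Q, R, n_outer=20, seed=0):
    """Alternate between per-cluster Parafac fits and reassignment of mode-1 elements."""
    rng = np.random.default_rng(seed)
    I = X.shape[0]
    labels = rng.integers(0, Q, size=I)          # assumed: random starting partition
    X1 = X.reshape(I, -1)
    for _ in range(n_outer):
        losses = np.full((I, Q), np.inf)
        for q in range(Q):
            members = np.flatnonzero(labels == q)
            if members.size == 0:
                continue                         # assumed: an empty cluster stays empty
            _, B, C = parafac_als(X[members], R, seed=seed + q)
            Z = khatri_rao(B, C)
            # Least-squares scores of every element under cluster q's components.
            A_all = X1 @ np.linalg.pinv(Z.T)
            losses[:, q] = ((X1 - A_all @ Z.T) ** 2).sum(axis=1)
        new_labels = losses.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, losses.min(axis=1).sum()
```

Such a scheme can end in a local optimum, so in practice one would presumably run it from several starting partitions and keep the solution with the lowest loss.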

... In the present paper, we restrict ourselves to the situation where a number of loadings in only one of the three modes will be constrained to zero. For this situation, the approach where in each row all but one of the loadings are constrained to zero has been studied repeatedly (e.g., [3][4][5][6]). These approaches are usually denoted as clustering approaches, or more appropriately, nonoverlapping clustering or partitioning approaches. ...
... As has been mentioned, the methods in this way actually can be seen as methods for overlapping clustering, or in special cases nonoverlapping clustering. Thus, there is a relation with the clusterwise CP approach proposed by [5]. Their approach offers nonoverlapping clusters, with for each cluster one or more CP components. ...
Article
When one interprets Candecomp/Parafac (CP) solutions for analyzing three-way data, small loadings are often ignored, that is, considered to be zero. Rather than just considering them zero, it seems better to actually model such values as zero. This can be done by successive modeling approaches as well as by a simultaneous modeling approach. This paper offers algorithms for three such approaches, and compares them on the basis of empirical data and a simulation study. The conclusion of the latter was that, under realistic circumstances, all approaches recovered the underlying structure well, when the number of values to constrain to zero was given. Whereas the simultaneous modeling approach seemed to perform slightly better, differences were very small and not substantial. Given that the simultaneous approach is far more time consuming than the successive approaches, the present study suggests that for practical purposes successive approaches for modeling zeros in the CP model seem to be indicated.
... Note that $\alpha_{j(q)}$ is undefined when variable $j$ does not belong to cluster $G_q$; in that case, $\alpha_{j(q)}$ is taken equal to 0. The CLV3W criterion is equivalent to the Clusterwise Parafac criterion when clustering the elements of the second mode with $Q$ clusters and one component in each cluster [15]. Moreover, this latter criterion clearly shows the equivalence of the CLV3W model with the ParaFac with Optimally Clustered Variables (PFOCV) model designed for clustering the elements of the first mode [16]. To partition the variables, starting from an initial variable clustering (into $Q$ clusters), CLV3W runs an alternating least squares (ALS) algorithm, which alternates between two updating steps as follows: (a) To update the cluster membership of a variable $j$, the criterion ...
... is computed for each cluster $G_q$ and variable $j$ is assigned to the cluster $G_q$ for which $f_{jq}$ is minimal [11,15]; (b) After updating the cluster membership of all the variables, a Parafac model [17][18][19] with one component is fitted to each cluster $G_q$, that is to say, to the three-way array $X^{(q)}$ associated with cluster $G_q$ ($q = 1, \ldots, Q$). The three-way array $X^{(q)}$ is the array obtained by only taking the data slices $X_j$ of $X$ associated with variables $j$ belonging to $G_q$. ...
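The two alternating steps in this excerpt can be sketched in NumPy as follows. The excerpt elides the exact criterion, so this sketch assumes that the criterion for variable j and cluster q is the residual sum of squares of the slice of variable j after fitting it with the rank-one structure of cluster q and an optimal loading; that choice, the function names (`rank_one_parafac`, `reassign`, `clv3w_sketch`), the SVD-based start, and the balanced random starting partition are illustrative assumptions, not the published CLV3W algorithm.

```python
import numpy as np

def rank_one_parafac(Xq, n_iter=100):
    """Rank-one Parafac of an I x Jq x K array; returns t (I), alpha (Jq), w (K)."""
    I, Jq, K = Xq.shape
    # Assumed start: leading singular vectors of the mode-1 and mode-3 unfoldings.
    t = np.linalg.svd(Xq.reshape(I, -1), full_matrices=False)[0][:, 0]
    w = np.linalg.svd(Xq.transpose(2, 0, 1).reshape(K, -1), full_matrices=False)[0][:, 0]
    for _ in range(n_iter):
        alpha = np.einsum('ijk,i,k->j', Xq, t, w)
        t = np.einsum('ijk,j,k->i', Xq, alpha, w)
        t /= np.linalg.norm(t)
        w = np.einsum('ijk,j,i->k', Xq, alpha, t)
        w /= np.linalg.norm(w)
    alpha = np.einsum('ijk,i,k->j', Xq, t, w)   # least-squares loadings for unit-norm t, w
    return t, alpha, w

def reassign(X, models):
    """Step (a): assign each variable to the cluster with the smallest assumed criterion."""
    total = (X ** 2).sum(axis=(0, 2))           # squared norm of each slice X_j
    f = np.empty((X.shape[1], len(models)))
    for q, (t, w) in enumerate(models):
        alpha = np.einsum('ijk,i,k->j', X, t, w)
        f[:, q] = total - alpha ** 2            # residual after the best rank-one fit
    return f.argmin(axis=1), f

def clv3w_sketch(X, Q, n_outer=20, seed=0):
    rng = np.random.default_rng(seed)
    J = X.shape[1]
    labels = rng.permutation(np.arange(J) % Q)  # balanced start, assumes J >= Q
    models = [None] * Q
    for _ in range(n_outer):
        for q in range(Q):                      # step (b): one component per cluster
            members = np.flatnonzero(labels == q)
            if members.size:                    # an emptied cluster keeps its last model
                t, _, w = rank_one_parafac(X[:, members, :])
                models[q] = (t, w)
        new_labels, f = reassign(X, models)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, f
```

Here `X` is taken to be an I x J x K array with the variables in the second mode, matching the excerpt's statement that the slices of the variables in a cluster form that cluster's three-way array.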
Article
The set‐up of comprehensive studies in life sciences involving a longitudinal dimension, as appears in time‐scale metabolomics, calls for the use of dimension reduction techniques for three‐way data structures (e.g., samples by variables by time points). For this purpose, a clustering around latent variables for three‐way data approach, CLV3W, has been proposed. CLV3W aims at both partitioning the variables into nonoverlapping clusters and estimating within each cluster a rank‐one Parafac model consisting of a latent component (resp. a weighting system) associated with the first mode (resp. third mode) and a vector of loadings reflecting the degree of closeness of each variable of the second mode to its cluster. In this paper, two constrained CLV3W models are discussed. First, a nonnegativity constraint is defined implying that clusters are composed of positively correlated variables. Second, it is proposed to constrain the weighting system to be the same for all clusters. These two constraints aim at providing more parsimonious models with configurations that are easier to interpret. The appropriateness of both constraints is evaluated in a simulation study and illustrated on two case studies pertaining to sensory evaluation and metabolomics data. Regarding the first case study, CLV3W yields the identification of two consumer segments together with one common emotional pleasantness dimension associated with coffee aromas. CLV3W analysis of human preterm breast milk metabolomics data provided three clusters of lipid species that are responsible for specific functions (i.e., milk fat globules membrane‐constituents, fatty acid oxidation‐products, lipid mediators such as eicosanoids and endocannabinoids).
... A second example is given by cross-cultural studies in which inhabitants of different countries fill in the same questionnaire [4]. A third example, on which we will focus in this paper, is sensory profiling data, where a number of assessors rate different food samples with respect to a set of sensory attributes [5,6]. ...
... 6. Simulation study ...
Article
Multivariate multigroup data are collected in many fields of science, where the so‐called groups pertain to, for instance, experimental groups or countries the participants are nested in. To summarize the main information in such data, principal component analysis (PCA) is highly popular. PCA reduces the variables to a few components that are linear combinations of the original variables. Researchers usually assume those components to be the same across the groups and aim to apply a simultaneous component analysis. To investigate whether this assumption is reasonable, one often analyzes the groups separately and computes a similarity index between the group‐specific component loadings of the variables. In many cases, however, most variables have highly similar loadings across the groups, but a few variables, which we will call “outlying variables,” behave differently, indicating that a simultaneous analysis is not warranted. In such cases, the outlying variables should be removed before proceeding with the simultaneous analysis. To do so, the variables are ranked according to their relative outlyingness. Although some procedures have been proposed that yield such an outlyingness ranking, they might not be optimal, because they all rely on the same choice of similarity coefficient without evaluating other alternatives. In this paper, we give an overview of other options and report extensive simulations in which we investigate how this choice affects the correctness of the outlyingness ranking. We also illustrate the added value of the outlying variable approach by means of sensometric data on different bread samples.
... The data are three-way three mode data, and have been analyzed with three-way models previously [5,28]. Here, we apply a multi-set model. ...
... [Footnote 8: Note that since the block sizes are equal, the weights that result from the normalizing division are identical across blocks.] ... the number of components R using CHull [22,23], considering the models with 1 up to 8 components (as previous models included 6 (constrained) components at maximum [28]), taking the explained variance as the goodness of fit and the number of components as the complexity measure. CHull indicated 2 as the optimal number of components, which is in line with the PARAFAC model of this data [5] as well as other multiset analyses [30,31]. ...
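The CHull selection mentioned in this excerpt can be illustrated with a small sketch. It assumes the usual description of the procedure: keep the models on the upper boundary of the convex hull of the (complexity, fit) points, then pick the hull model at which the gain in fit per unit of complexity drops most sharply (the largest scree-test ratio). The function name `chull_select` and the example numbers in the usage note are hypothetical and not taken from the cited analysis.

```python
import numpy as np

def chull_select(complexity, fit):
    """Return (index of selected model, indices of hull models, scree-test ratios).

    Assumes higher fit is better and that all complexity values are distinct.
    """
    order = np.argsort(complexity)
    c = np.asarray(complexity, dtype=float)[order]
    f = np.asarray(fit, dtype=float)[order]
    hull = []                                    # positions in the sorted arrays
    for i in range(len(c)):
        if hull and f[i] <= f[hull[-1]]:
            continue                             # no gain in fit over a simpler model
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            # Drop i1 if it lies on or below the segment from i0 to i.
            if (f[i1] - f[i0]) * (c[i] - c[i1]) <= (f[i] - f[i1]) * (c[i1] - c[i0]):
                hull.pop()
            else:
                break
        hull.append(i)
    if len(hull) < 3:
        raise ValueError("need at least three hull models to compute a ratio")
    st = np.array([
        ((f[hull[h]] - f[hull[h - 1]]) / (c[hull[h]] - c[hull[h - 1]]))
        / ((f[hull[h + 1]] - f[hull[h]]) / (c[hull[h + 1]] - c[hull[h]]))
        for h in range(1, len(hull) - 1)
    ])
    best = int(order[hull[1 + int(st.argmax())]])
    return best, [int(order[h]) for h in hull], st
```

For instance, with hypothetical explained-variance values `[0.40, 0.70, 0.74, 0.77, 0.79, 0.80, 0.805, 0.808]` for 1 to 8 components, the function selects the two-component model, because the gain from 1 to 2 components is far larger than any later gain.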
Article
Simultaneous component analysis (SCA) is a fruitful approach to disclose the structure underlying data stemming from multiple sources on the same objects. This kind of data can be organized in blocks. To identify which component relates to all, and which to some sources, the block structure in the data should be taken into account. In this paper, we propose a new rotation criterion, Blockwise Simplimax, that aims at block simplicity of the loadings, implying that for some components all variables in a block have a zero loading. We also present an associated model selection criterion, to aid in selecting the required degree of simplicity for the data at hand. An extensive simulation study is conducted to evaluate the performance of Blockwise Simplimax and the associated model selection criterion, and to compare it with a sparse competitor, namely Sparse group SCA. In the conditions considered Blockwise Simplimax performed reasonably well, and either performed equally well as, or clearly outperformed Sparse group SCA. The model selection criterion performed well in simple conditions. The usefulness of Blockwise Simplimax and Sparse group SCA is illustrated using sensory profiling data regarding different cheeses.
... Such a criterion can be formalized by computing the Scree-Ratio index (Wilderjans & Ceulemans, 2013): ...
Article
A novel clustering model, CPclus, for three-way data concerning a set of objects on which variables are measured by different subjects is proposed. The main aim of the proposal is to simultaneously summarize the objects through clusters and both variables and subjects through components. The object clusters are found by adopting a K-means-based strategy where the centroids are reduced according to the Candecomp/Parafac model in order to exploit the three-way structure of the data. The clustering process is carried out in order to reveal between-cluster differences in mean. Least-squares fitting is performed by using an iterative alternating least-squares algorithm. Model selection is addressed by considering an elbow-based method. An extensive simulation study and some real-life applications show the effectiveness of the proposal, also in comparison with its potential competitors.
... This is the reason that a large volume of the literature has been devoted to relative approaches [29,30]. As an alternative to scaling and normalization, the use of different weights per dataset (modality/sensor) in the course of the fusion process has been proposed [31,32,33,34,35]. ...
Chapter
Data fusion is the joint analysis of multiple inter-related datasets that provide complementary views of the same phenomenon. The process of correlating and fusing information from multiple sources generally allows more accurate inferences than those that the analysis of a single dataset can yield. Data fusion is a multifaceted concept with clear advantages but at the same time with numerous challenges that need to be carefully addressed. Coupled tensor decompositions have been proved successful in a plethora of data fusion applications, in view of their uniqueness properties and their unique ability to discover and fuse latent multidimensional information from inter-linked datasets. The aim of this chapter is to provide a brief overview of the data fusion concept and its advantages and challenges, with a discussion of coupled tensor decomposition models and methods, showing their power in solving data fusion tasks, as compared to matrix decomposition-based approaches. A few relevant applications are overviewed, particularly the fusion of electroencephalography and functional magnetic resonance imaging data.
... In recent research [19], algorithms are proposed to build sparse loading matrices. Other authors [20,21] have used clusters in the Parafac model to obtain a better interpretation of the results. In all the above proposals, constraints are imposed on one mode. ...
Article
In this paper, we extend the use of disjoint orthogonal components to three-way table analysis with the parallel factor analysis model. Traditional methods, such as scaling, orthogonality constraints, non-negativity constraints, and sparse techniques, do not guarantee that interpretable loading matrices are obtained in this model. We propose a novel heuristic algorithm that allows simple structure loading matrices to be obtained by calculating disjoint orthogonal components. This algorithm is also an alternative approach for solving the well-known degeneracy problem. We carry out computational experiments by utilizing simulated and real-world data to illustrate the benefits of the proposed algorithm.
... To facilitate the application of CSSCA, we propose a model selection procedure to determine the number of clusters and the level of sparsity with the best balance between the model fit (i.e., the total loss) and the model complexity. Wilderjans and Ceulemans (2013) showed that a sequential model selection strategy may have several advantages. Adapted to solving the model selection problem of CSSCA, the sequential strategy includes two steps: (1) pick the optimal number of clusters $K$ and (2) determine the optimal level of sparsity $S$, given the selected $K$. To illustrate our model selection procedure, assume that $K$ and $S$ are selected from ascending candidate sets ($K_1, K_2, \ldots$ ...
Article
Social and behavioral studies more and more often yield multi-block data, which consist of novel blocks of data (e.g., data from wearable devices) and traditional blocks of data (e.g., survey data) collected from the same sample. Multi-block data offer researchers valuable insights into complex social mechanisms, where several influences act together. Yet such mechanisms are likely to differ among subgroups. Hence, fully revealing the composite mechanisms underlying multi-block data is challenging, since proper clustering analysis of such data requires methods that simultaneously detect the covariation of variables underlying all data blocks and the group differences therein. Additionally, the methods should be able to handle high-dimensional datasets, which might include many irrelevant variables. Here, we present Clusterwise Sparse Simultaneous Component Analysis (CSSCA), a method that groups the subjects that are driven by the same mechanisms and, at the same time, extracts cluster-specific components that model these mechanisms. By imposing structure constraints, CSSCA further distinguishes common mechanisms that underlie all data blocks from distinctive mechanisms that only underlie one or a few data blocks. In extensive simulations, CSSCA delivered convincing results in recovering the clusters and their associated component structures across various conditions. More importantly, CSSCA showed a clear advantage over existing methods when substantial cluster differences in the component structure were present. We demonstrated the usefulness of CSSCA in an application to data stemming from a study on personality.
... Furthermore, as these classical three-way methods actually make the rather strict assumption that component scores assigned to entities of one mode are homogeneous across entities of other modes, methods that make it possible to capture such heterogeneity in the component scores have recently been proposed. In particular, one may use a clusterwise extension of the Parafac model (Wilderjans and Ceulemans 2013), or (clusterwise extensions of) simultaneous components analysis (De Roover, Ceulemans, Timmerman, Vansteelandt, Stouten, and Onghena 2012b; De Roover, Timmerman, Van Mechelen, and Ceulemans 2013b; De Roover, Ceulemans, Timmerman, Nezlek, and Onghena 2013a; Helwig 2013). Simultaneous components analysis and its (clusterwise) extensions can be estimated using the R package multiway or a software program based on MATLAB code (De Roover, Ceulemans, and Timmerman 2012a). ...
Article
The analysis of binary three-way data (i.e., persons who indicate which attributes apply to each of a set of objects) may be of interest in several substantive domains such as sensory profiling, marketing research or personality assessment. Latent class probabilistic latent feature models (LCPLFMs) may be used to explain binary object-attribute associations on the basis of a small number of binary latent variables (called latent features). As LCPLFMs aim to model object-attribute associations using a small number of latent features, they may be better suited to analyzing data with many objects/attributes than standard multilevel latent class models, which do not include such a dimension reduction. In this paper we describe new functions of the plfm package for analyzing binary three-way data with LCPLFMs. The new functions provide a flexible modeling approach as they make it possible to (1) specify different assumptions for modeling statistical dependencies between object-attribute pairs, (2) use different assumptions for modeling parameter heterogeneity across persons, (3) conduct a confirmatory analysis by constraining specific parameters to pre-specified values, (4) inspect results with print, summary and plot methods. As an illustration, the models are applied to analyze data on the perception of midsize cars, and to study the situational determinants of anger-related behavior.
... However, a small simulation study with unequally sized clusters shows that our initialization procedure works reasonably well in that case, except for some specific situations (i.e., very noisy data with a large number of underlying clusters). 1 These results are in line with simulation studies regarding other methods that also combine clustering and component analysis (e.g., Wilderjans & Ceulemans, 2013; Wilderjans, Ceulemans, & Kuppens, 2012; De Roover, Ceulemans, Timmerman, Vansteelandt, Stouten, & Onghena, 2012). 2. Initialize P X , T (W), and p k Y conditional on the initial partition matrix C initial: A singular value decomposition is conducted on X: X = KML. ...
Article
In the behavioral sciences, many research questions pertain to a regression problem in that one wants to predict a criterion on the basis of a number of predictors. Although in many cases ordinary least squares regression will suffice, sometimes the prediction problem is more challenging, for three reasons. First, multiple highly collinear predictors can be available, making it difficult to grasp their mutual relations as well as their relations to the criterion. In that case, it may be very useful to reduce the predictors to a few summary variables, on which one regresses the criterion and which at the same time yield insight into the predictor structure. Second, the population under study may consist of a few unknown subgroups that are characterized by different regression models. Third, the obtained data are often hierarchically structured, with, for instance, observations being nested into persons or participants within groups or countries. Although some methods have been developed that partially meet these challenges (i.e., principal covariates regression (PCovR), clusterwise regression (CR), and structural equation models), none of these methods adequately deals with all of them simultaneously. To fill this gap, we propose the principal covariates clusterwise regression (PCCR) method, which combines the key ideas behind PCovR (de Jong & Kiers in Chemom Intell Lab Syst 14(1–3):155–164, 1992) and CR (Späth in Computing 22(4):367–373, 1979). The PCCR method is validated by means of a simulation study and by applying it to cross-cultural data regarding satisfaction with life.
... In other words, the data blocks pertain to the sample by attribute data matrices of the different panelists. In accordance with previous analyses of these data (Bro et al., 2008; Ceulemans, Timmerman, & Kiers, 2011; De Roover, Timmerman, Van Mechelen, & Ceulemans, 2013; Wilderjans & Ceulemans, 2013), we opted to discard all variance differences between the attributes by scaling each of them to a variance of one across all data blocks (preprocessing option of the software). This implies that possible differences in variances between panelists are retained. ...
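The preprocessing described in this excerpt (scaling each attribute to a variance of one across all data blocks, so that variance differences between panelists are retained) could look roughly like the following. The function name `scale_attributes`, the use of the population variance (`ddof=0`), and the absence of centering are assumptions; the excerpt does not specify them.

```python
import numpy as np

def scale_attributes(blocks):
    """Scale each attribute (column) to unit variance over the rows of all blocks.

    blocks: list of (samples x attributes) arrays, one per panelist.
    A single scaling factor per attribute is shared by all blocks, so
    between-panelist differences in variance are left intact.
    """
    stacked = np.vstack(blocks)
    sd = stacked.std(axis=0)        # assumed: population SD; attributes must not be constant
    return [np.asarray(b, dtype=float) / sd for b in blocks]
```

Scaling each block separately would instead remove the between-panelist variance differences that the excerpt says are retained.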
Article
MultiLevel Simultaneous Component Analysis (MLSCA) is a data-analytical technique for multivariate two-level data. MLSCA sheds light on the associations between the variables at both levels by specifying separate submodels for each level. Each submodel consists of a component model. Although MLSCA has already been successfully applied in diverse areas within and outside the behavioral sciences, its use is hampered by two issues. First, as MLSCA solutions are fitted by means of iterative algorithms, analyzing large data sets (i.e., data sets with many level one units) may take a lot of computation time. Second, easily accessible software for estimating MLSCA models is lacking so far. In this paper, we address both issues. Specifically, we discuss a computational shortcut for MLSCA fitting. Moreover, we present the MLSCA package, which was built in MATLAB, but is also available in a version that can be used on any Windows computer, without having MATLAB installed.
Article
Background: fMRI resting state networks (RSNs) are used to characterize brain disorders. They also show extensive heterogeneity across patients. Identifying systematic differences between RSNs in patients, i.e. discovering neurofunctional subtypes, may further increase our understanding of disease heterogeneity. Currently, no methodology is available to estimate neurofunctional subtypes and their associated RSNs simultaneously. New method: We present an unsupervised learning method for fMRI data, called Clusterwise Independent Component Analysis (C-ICA). This enables the clustering of patients into neurofunctional subtypes based on differences in shared ICA-derived RSNs. The parameters are estimated simultaneously, which leads to an improved estimation of subtypes and their associated RSNs. Results: In five simulation studies, the C-ICA model is successfully validated using both artificially and realistically simulated data (N = 30–40). The successful performance of the C-ICA model is also illustrated on an empirical data set consisting of Alzheimer’s disease patients and elderly control subjects (N = 250). C-ICA is able to uncover a meaningful clustering that partially matches (balanced accuracy = .72) the diagnostic labels and identifies differences in RSNs between the Alzheimer and control cluster. Comparison with other methods: Both in the simulation study and the empirical application, C-ICA yields better results compared to competing clustering methods (i.e., a two-step clustering procedure based on single-subject ICAs and a Group ICA plus dual regression variant thereof) that do not simultaneously estimate a clustering and associated RSNs. Indeed, the overall mean adjusted Rand Index, a measure for cluster recovery, equals 0.65 for C-ICA and ranges from 0.27 to 0.46 for competing methods. Conclusions: The successful performance of C-ICA indicates that it is a promising method to extract neurofunctional subtypes from multi-subject resting state-fMRI data. This method can be applied to fMRI scans of patient groups to study (neurofunctional) subtypes, which may eventually further increase understanding of disease heterogeneity.
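The adjusted Rand Index used above as a measure of cluster recovery compares two partitions of the same objects while correcting for chance agreement. A minimal pure-Python sketch of the standard (Hubert and Arabie) formula is given below; the function name is hypothetical and the code is unrelated to the software used in the cited study.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same n >= 2 objects."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label sequences of equal length >= 2")
    # Pairs of objects placed together in both partitions, in a, and in b.
    together_both = sum(comb(v, 2) for v in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    together_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)   # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0          # degenerate case, e.g. both partitions have one cluster
    return (together_both - expected) / (maximum - expected)
```

A value of 1 indicates identical partitions (up to relabeling of the clusters), and values near 0 indicate agreement no better than chance.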
Article
Many research questions pertain to a regression problem assuming that the population under study is not homogeneous with respect to the underlying model. In this setting, we propose an original method called Combined Information criterion CLUSterwise elastic-net regression (Ciclus). This method handles several methodological and application-related challenges. It is derived from both the information theory and the microeconomic utility theory and maximizes a well-defined criterion combining three weighted sub-criteria, each being related to a specific aim: getting a parsimonious partition, compact clusters for a better prediction of cluster-membership, and a good within-cluster regression fit. The solving algorithm is monotonically convergent, under mild assumptions. The Ciclus principle provides an innovative solution to two key issues: (i) the automatic optimization of the number of clusters, (ii) the proposal of a prediction model. We applied it to elastic-net regression in order to be able to manage high-dimensional data involving redundant explanatory variables. Ciclus is illustrated through both a simulation study and a real example in the field of omic data, showing how it improves the quality of the prediction and facilitates the interpretation. It should therefore prove useful whenever the data involve a population mixture as for example in biology, social sciences, economics or marketing.
Article
A least-squares bilinear clustering framework for modelling three-way data, where each observation consists of an ordinary two-way matrix, is introduced. The method combines bilinear decompositions of the two-way matrices with clustering over observations. Different clusterings are defined for each part of the bilinear decomposition, which decomposes the matrix-valued observations into overall means, row margins, column margins and row–column interactions. Therefore up to four different classifications are defined jointly, one for each type of effect. The computational burden is greatly reduced by the orthogonality of the bilinear model, such that the joint clustering problem reduces to separate problems which can be handled independently. Three of these sub-problems are specific cases of k-means clustering; a special algorithm is formulated for the row–column interactions, which are displayed in clusterwise biplots. The method is illustrated via an empirical example, and the interpretation of the interaction biplots is discussed. Supplemental materials for this paper, including the dedicated R package, are available online.
Article
In neuroscience, clustering subjects based on brain dysfunctions is a promising avenue to subtype mental disorders as it may enhance the development of a brain-based categorization system for mental disorders that transcends and is biologically more valid than current symptom-based categorization systems. As changes in functional connectivity (FC) patterns have been demonstrated to be associated with various mental disorders, one appealing approach in this regard is to cluster patients based on similarities and differences in FC patterns. To this end, researchers collect three-way fMRI data measuring neural activation over time for different patients at several brain locations and apply Independent Component Analysis (ICA) to extract FC patterns from the data. However, due to the three-way nature and huge size of fMRI data, classical (two-way) clustering methods are inadequate to cluster patients based on these FC patterns. Therefore, a two-step procedure is proposed where, first, ICA is applied to each patient’s fMRI data and, next, a clustering algorithm is used to cluster the patients into homogeneous groups in terms of FC patterns. As some clustering methods used operate on similarity data, the modified RV-coefficient is adopted to compute the similarity between patient specific FC patterns. An extensive simulation study demonstrated that performing ICA before clustering enhances the cluster recovery and that hierarchical clustering using Ward’s method outperforms complete linkage hierarchical clustering, Affinity Propagation and Partitioning Around Medoids. Moreover, the proposed two-step procedure appears to recover the underlying clustering better than (1) a two-step procedure that combines PCA with clustering and (2) Clusterwise SCA-ECP, which performs PCA and clustering in a simultaneous fashion. 
Additionally, the good performance of the proposed two-step procedure using ICA and Ward’s hierarchical clustering is illustrated in an empirical fMRI data set regarding dementia patients.
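The modified RV-coefficient mentioned above measures the similarity between two matrices that share their rows, after discarding the diagonals of their cross-product matrices. A minimal NumPy sketch under that usual definition follows; the function names are hypothetical, and it is assumed here that the patient-specific FC patterns are stored as matrices with the same rows (for example, the same brain locations) for every patient.

```python
import numpy as np

def modified_rv(X, Y):
    """Modified RV-coefficient between X (n x p) and Y (n x q), same rows in both."""
    Sx = X @ X.T
    Sy = Y @ Y.T
    np.fill_diagonal(Sx, 0.0)       # the modification: ignore the diagonal elements
    np.fill_diagonal(Sy, 0.0)
    return (Sx * Sy).sum() / np.sqrt((Sx ** 2).sum() * (Sy ** 2).sum())

def rv_matrix(patterns):
    """Pairwise modified RV similarities for a list of matrices with equal row counts."""
    n = len(patterns)
    S = np.eye(n)
    for a in range(n):
        for b in range(a + 1, n):
            S[a, b] = S[b, a] = modified_rv(patterns[a], patterns[b])
    return S
```

The resulting similarity matrix, or one minus it as a dissimilarity, is the kind of input that similarity-based clustering methods such as those compared in the abstract operate on. Whether the study used exactly this conversion is not stated in the abstract.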
Article
The STATIS method has been successfully applied to the analysis of sensory profiling data and other kinds of data in sensometrics. We discuss its use and benefits and compare its outcomes to alternative methods for the analysis of multiblock data arising in situations such as projective mapping and free sorting experiments. More importantly, a method of clustering a collection of datasets measured on the same individuals, called CLUSTATIS, is introduced. It is based on the optimization of a criterion and consists of a hierarchical cluster analysis and a partitioning algorithm akin to the K-means algorithm. The procedure of analysis can be seen as an extension of the cluster analysis of variables around latent components (CLV, Vigneau & Qannari, 2003) to the case of blocks of variables. Alongside the determination of the clusters, a latent configuration is determined by the STATIS method. The interest of CLUSTATIS in sensometrics is discussed and illustrated on the basis of two case studies pertaining to projective mapping (also called Napping) and free sorting tasks, respectively.
Article
In consumer studies, segmentation has been widely applied to identify consumer subsets on the basis of their preference for a set of products. From the last decade onwards, a more comprehensive evaluation of product performance has led researchers to take into account various kinds of information, such as consumer emotion assessments or hedonic measures on several aspects, like taste, visual appearance and flavor. This multi-attribute evaluation of products naturally yields a three-way (products by consumers by attributes) data structure. In order to identify segments of consumers on the basis of such three-way data, the Three-Way Cluster analysis around Latent Variables (CLV3W) approach (Wilderjans & Cariou, 2016) is considered. This method groups the consumers into clusters and estimates for each cluster an associated latent product variable and attribute weights, along with a set of consumer loadings, which may be used for the purpose of cluster-specific product characterization. As consumers who rate the products along the attributes in an opposite way (i.e., raters' disagreement) should not be in the same cluster, in this paper we propose to add a non-negativity constraint on the consumer loadings and to integrate this constraint within the versatile CLV3W approach. This non-negatively constrained criterion implies that the latent variable for each cluster is determined such that consumers within each cluster are as strongly related as possible, in terms of a positive covariance, to this latent product component. This approach is applied to a consumer emotion ratings dataset related to coffee aromas.
Article
Parallel factor analysis (PARAFAC) is a useful multivariate method for decomposing three-way data that consist of three different types of entities simultaneously. This method estimates trilinear components, each of which is a low-dimensional representation of a set of entities, often called a mode, to explain the maximum variance of the data. Functional PARAFAC permits the entities in different modes to be smooth functions or curves, varying over a continuum, rather than a collection of unconnected responses. The existing functional PARAFAC methods handle functions of a one-dimensional argument (e.g., time) only. In this paper, we propose a new extension of functional PARAFAC for handling three-way data whose responses are sequenced along both a two-dimensional domain (e.g., a plane with x- and y-axis coordinates) and a one-dimensional argument. Technically, the proposed method combines PARAFAC with basis function expansion approximations, using a set of piecewise quadratic finite element basis functions for estimating two-dimensional smooth functions and a set of one-dimensional basis functions for estimating one-dimensional smooth functions. In a simulation study, the proposed method appeared to outperform the conventional PARAFAC. We apply the method to EEG data to demonstrate its empirical usefulness.
Article
In psychology, studying multivariate dynamical processes within a person is gaining ground. An increasingly often used method is vector autoregressive (VAR) modeling, in which each variable is regressed on all variables (including itself) at the previous time points. This approach reveals the temporal dynamics of a system of related variables across time. A follow-up question is how to analyze data of multiple persons in order to grasp similarities and individual differences in within-person dynamics. We focus on the case where these differences are qualitative in nature, implying that subgroups of persons can be identified. We present a method that clusters persons according to their VAR regression weights, and simultaneously fits a shared VAR model to all persons within a cluster. The performance of the algorithm is evaluated in a simulation study. Moreover, the method is illustrated by applying it to multivariate time series data on depression-related symptoms of young women.
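The regression step described in this abstract (each variable regressed on all variables at the previous time point) can be sketched for a single person as a lag-1 VAR fitted by ordinary least squares. This is only a minimal illustration of that building block, not the clustering method the paper presents; the function name `fit_var1`, the lag order of one, and the simulated series are assumptions.

```python
import numpy as np

def fit_var1(Y):
    """Least-squares VAR(1) fit: Y[t] ~ intercept + Phi @ Y[t - 1]."""
    Y = np.asarray(Y, dtype=float)
    # Design matrix: a column of ones (intercept) plus the lagged values.
    past = np.column_stack([np.ones(len(Y) - 1), Y[:-1]])
    coef, *_ = np.linalg.lstsq(past, Y[1:], rcond=None)
    intercept = coef[0]
    Phi = coef[1:].T  # Phi[j, i] is the weight of variable i at t-1 on variable j at t
    return intercept, Phi

# Hypothetical demo: simulate one person's bivariate series and recover the weights.
rng = np.random.default_rng(0)
Phi_true = np.array([[0.5, 0.2], [-0.3, 0.4]])
Y = np.zeros((2000, 2))
for t in range(1, len(Y)):
    Y[t] = Phi_true @ Y[t - 1] + rng.standard_normal(2)
intercept_hat, Phi_hat = fit_var1(Y)
print(np.round(Phi_hat, 2))
```

A clusterwise variant would alternate between assigning persons to clusters and refitting one such weight matrix per cluster; that step is not shown here.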
Article
To detect panel disagreement, we propose the clustering around latent variables for three-way data (CLV3W) approach which extends the clustering of variables around latent components (CLV) approach to three-way data typically obtained from a conventional sensory profiling procedure (i.e., assessors rating products on various descriptors). The CLV3W method groups the descriptors into Q clusters and estimates for each cluster an associated latent sensory component such that the attributes within each cluster are as much related (i.e., highest squared covariance) as possible with the latent component. Simultaneously, for each latent sensory component separately, a system of weights is estimated that yields information regarding the extent to which an assessor (dis)agrees with the rest of the panel according to the latent sensory component under study. Our new approach is illustrated with a dataset pertaining to Quantitative Descriptive Analysis applied to cider varieties. It is shown that CLV3W, as opposed to related approaches, is able to detect differential panel disagreement on various latent sensory components.
Article
Quite a few studies in the behavioral sciences result in hierarchical time profile data, with a number of time profiles being measured for each person under study. Associated research questions often focus on individual differences in profile repertoire, that is, differences between persons in the number and the nature of profile shapes that show up for each person. In this paper, we introduce a new method, called KSC-N, that parsimoniously captures such differences while neatly disentangling variability in shape and amplitude. KSC-N induces a few person clusters from the data and derives for each person cluster the types of profile shape that occur most for the persons in that cluster. An algorithm for fitting KSC-N is proposed and evaluated in a simulation study. Finally, the new method is applied to emotional intensity profile data.
Article
Investigating interdyad (i.e. couples of a client and their usual caregiver) differences in naturally occurring patterns of staff reactions to challenging behaviour (e.g. self-injurious, stereotyped and aggressive/destructive behaviour) of clients with severe or profound intellectual disabilities is important to optimise client-staff interactions. Most studies, however, fail to combine a naturalistic setup with a person-level analysis, in that they do not involve a careful inspection of the interdyad differences and similarities. In this study, the recently proposed Clusterwise Hierarchical Classes Analysis (HICLAS) method is adopted and applied to data in which video fragments (recorded in a naturalistic setting) of a client showing challenging behaviour and the staff reacting to it were analysed. In a Clusterwise HICLAS analysis, the staff-client dyads are grouped into a number of clusters and the prototypical behaviour-reaction patterns that are specific for each cluster (i.e. interdyad differences and similarities) are revealed. Clusterwise HICLAS discloses clear interdyad differences (and similarities) in the prototypical patterns of clients' challenging behaviour and the associated staff reactions, complementing and qualifying the results of earlier studies in which only general patterns were disclosed. The usefulness and clinical relevance of Clusterwise HICLAS are demonstrated. In particular, Clusterwise HICLAS may capture idiosyncratic aspects of staff-client interactions, which may stimulate direct support workers to adopt person-centred support practices that take the specific abilities of the client into account.
Article
In this article, the basic principles and rationale for three-mode analysis of three-way data are introduced. The wealth of the three-mode world is sketched, and the power of its methods is illustrated with a three-mode component analysis of the way children cope in several school-related situations and an individual difference scaling analysis of the perception of pain.
Article
In this paper, two major models for three-way profile data, i.e., the Parafac model and the Tucker3 model, are discussed from the point of view of application. Topics treated are handling the data before analysis, model choice, choice of dimensionality, model fit, algorithmic hazards during the analyses, and interpretation and validation of the results. These issues are discussed in some detail so that prospective users can take guidance for analysing their own data. The data provided by Japanese girls and their parents about the parenting style in their family are the major vehicle for demonstrating the issues touched upon. The general results from these data are that parental styles consisted of three groups of behaviours: Acceptance, Control and Rejection, and Discipline. Within families the parenting behaviours of fathers and mothers are seen as parallel rather than at cross purposes, both by the daughters and the parents themselves. Moreover, daughters and parents largely agree about the parenting style itself. Notwithstanding, there are also families in which daughters and parents disagree about the parenting style, in particular about Acceptance and Control, but not about Discipline.
Article
In this work, a method for differentiation of gasoline according to its geographical origin is presented. Comprehensive two-dimensional gas chromatography-flame ionization detection (GC×GC-FID) combined with multivariate analysis was used to differentiate Brazilian and Venezuelan gasoline samples. Pattern recognition of the GC×GC-FID chromatograms was performed by parallel factor analysis (PARAFAC) and it was successfully applied for the differentiation of these gasoline samples. Eluates with medium to high volatility, both aliphatic and aromatic, were responsible for this differentiation, based on the inspection of the score and loading graphs generated by PARAFAC.
Article
PDF version of the monograph published by DSWO Press (Leiden, 1993). This version (2005) is essentially the same as the original one, published in 1993 by DSWO Press (Leiden). In particular, all material has been kept on the same pages. Apart from a few typographic and language corrections, the following non-trivial changes have been made.
• Question 34 (and the answer) has been deleted (page 84).
• The answers to questions 36 a and b (page 84) have been rephrased.
• Question 49c of page 87 has been rephrased.
Article
Purpose: This study proposes the best clustering method(s) for different distance measures under two different conditions using the cophenetic correlation coefficient. Methods: In the first one, the data has multivariate standard normal distribution without outliers for and the second one is with outliers (5%) for . The proposed method is applied to simulated multivariate normal data via MATLAB software. Results: According to the results of the simulation, the Average (especially for ) and Centroid (especially for and ) methods are recommended under both conditions. Conclusions: This study hopes to contribute to the literature by supporting better decisions on the selection of appropriate cluster methods using subgroup sizes, variable numbers, subgroup means and variances.
Article
Given multivariate multiblock data (e.g., subjects nested in groups are measured on multiple variables), one may be interested in the nature and number of dimensions that underlie the variables, and in differences in dimensional structure across data blocks. To this end, clusterwise simultaneous component analysis (SCA) was proposed which simultaneously clusters blocks with a similar structure and performs an SCA per cluster. However, the number of components was restricted to be the same across clusters, which is often unrealistic. In this paper, this restriction is removed. The resulting challenges with respect to model estimation and selection are resolved.
Article
Dissolved organic matter (DOM) quality reflects numerous environmental processes, including primary production and decomposition, redox gradients, hydrologic transport, and photochemistry. Fluorescence spectroscopy can detect groups of DOM compounds sensitive to these processes. However, different environmental gradients (e.g., redox, DOM provenance) can have confounding effects on DOM fluorescence spectra. This study shows how these confounding effects can be removed through discriminant analyses on parallel factor modeling results. Using statistically distinct end-members, we resolve spatiotemporal trends in redox potential and DOM provenance within and between adjacent vegetation communities in the patterned ridge and slough landscape of the Everglades, where biogeochemical differences between vegetation communities affect net peat accretion rates and persistence of landscape structure. Source discrimination of DOM in whole-water samples and peat leachates reveals strong temporal variability associated with seasonality and passage of a hurricane and indicates that hurricane effects on marsh biogeochemistry persist for longer periods of time (>1 year) than previously recognized. Using the DOM source signal as a hydrologic tracer, we show that the system is hydrologically well mixed when surface water is present, and that limited transport of flocculent detritus occurs in surface flows. Redox potential discrimination shows that vertical redox gradients are shallower on ridges than in sloughs, creating an environment more favorable to decomposition and diagenesis. The sensitivity, high resolution, rapidity, and precision of these statistical analyses of DOM fluorescence spectra establish the technique as a promising performance measure for restoration or indicator of carbon cycle processes in the Everglades and aquatic ecosystems worldwide.
Book
In multivariate analysis, the data usually have two ways and/or two modes. This book treats principal component analysis of data that can be characterised by three ways and/or modes, like subjects by variables by conditions or occasions. The book extends the work on three-mode factor analysis by Tucker and the work on individual differences scaling by Carroll and colleagues. The many examples give a true feeling of the working of the techniques.
Article
SUMMARY We generalize Kruskal's fundamental result on the uniqueness of trilinear decomposition of three-way arrays to the case of multilinear decomposition of four- and higher-way arrays. The result is surprisingly general and simple and has several interesting ramifications. Copyright © 2000 John Wiley & Sons, Ltd.
Article
In many research domains different pieces of information are collected regarding the same set of objects. Each piece of information constitutes a data block, and all these (coupled) blocks have the object mode in common. When analyzing such data, an important aim is to obtain an overall picture of the structure underlying the whole set of coupled data blocks. A further challenge consists of accounting for the differences in information value that exist between and within (i.e., between the objects of a single block) data blocks. To tackle these issues, analysis techniques may be useful in which all available pieces of information are integrated and in which at the same time noise heterogeneity is taken into account. For the case of binary coupled data, however, only methods exist that go for a simultaneous analysis of all data blocks but that do not account for noise heterogeneity. Therefore, in this paper, the SIMCLAS model, being a Hierarchical Classes model for the simultaneous analysis of coupled binary two-way matrices, is presented. In this model, noise heterogeneity between and within the data blocks is accounted for by downweighting entries from noisy blocks/objects within a block. In a simulation study it is shown that (1) the SIMCLAS technique recovers the underlying structure of coupled data to a very large extent, and (2) the SIMCLAS technique outperforms a Hierarchical Classes technique in which all entries contribute equally to the analysis (i.e., noise homogeneity within and between blocks). The latter is also demonstrated in an application of both techniques to empirical data on categorization of semantic concepts.
Article
Mixture analysis is commonly used for clustering objects on the basis of multivariate data. When the data contain a large number of variables, regular mixture analysis may become problematic because a large number of parameters need to be estimated for each cluster. To tackle this problem, the mixtures of factor analyzers (MFA) model was proposed, which combines clustering with exploratory factor analysis (EFA). MFA model selection is rather intricate as both the number of clusters and the number of underlying factors have to be determined. To this end, the Akaike (AIC) and the Bayesian Information Criterion (BIC) are often used. AIC and BIC try to identify a model that optimally balances model fit and model complexity. In this paper, the recently proposed CHull (Ceulemans & Kiers, 2006) method, which also balances model fit and complexity, is presented as an interesting alternative model selection strategy for MFA. In an extensive simulation study, the performance of AIC, BIC, and CHull is compared. AIC performs poorly and systematically selects overly complex models, whereas BIC performs slightly better than CHull when considering the best model only. However, when taking model selection uncertainty into account by looking at the first three models retained, CHull outperforms BIC. This especially holds in more complex, and thus, more realistic, situations (e.g., more clusters, factors, noise in the data, and overlap among clusters).
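For readers unfamiliar with how AIC and BIC trade off fit against complexity, a minimal sketch of the two standard formulas follows. The log-likelihoods, parameter counts, and sample size in the demo are made-up numbers for illustration only, not results from the paper, and the helper names `aic` and `bic` are assumptions.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 log L + 2 p (lower is better)."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: -2 log L + p ln(n) (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Hypothetical candidates: (label, log-likelihood, number of free parameters).
candidates = [
    ("2 clusters, 1 factor", -1250.0, 30),
    ("3 clusters, 2 factors", -1190.0, 65),
]
n_obs = 200
for label, loglik, n_params in candidates:
    print(label, round(aic(loglik, n_params), 1), round(bic(loglik, n_params, n_obs), 1))
```

Because ln(n) exceeds 2 once n is 8 or more, BIC penalizes each extra parameter more heavily than AIC for all but very small samples, which is consistent with AIC tending toward more complex models.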
Article
To achieve an insightful clustering of multivariate data, we propose subspace K-means. Its central idea is to model the centroids and cluster residuals in reduced spaces, which allows for dealing with a wide range of cluster types and yields rich interpretations of the clusters. We review the existing related clustering methods, including deterministic, stochastic, and unsupervised learning approaches. To evaluate subspace K-means, we performed a comparative simulation study, in which we manipulated the overlap of subspaces, the between-cluster variance, and the error variance. The study shows that the subspace K-means algorithm is sensitive to local minima but that the problem can be reasonably dealt with by using partitions of various cluster procedures as a starting point for the algorithm. Subspace K-means performs very well in recovering the true clustering across all conditions considered and appears to be superior to its competitor methods: K-means, reduced K-means, factorial K-means, mixtures of factor analyzers (MFA), and MCLUST. The best competitor method, MFA, showed a performance similar to that of subspace K-means in easy conditions but deteriorated in more difficult ones. Using data from a study on parental behavior, we show that subspace K-means analysis provides a rich insight into the cluster characteristics, in terms of both the relative positions of the clusters (via the centroids) and the shape of the clusters (via the within-cluster residuals).
Article
Single linkage using S. C. Johnson's (see PA, Vol 41:16046) minimum method, complete linkage using Johnson's maximum method, average linkage using R. R. Sokal and P. H. Sneath's (1963) unweighted pair-wise group mean average linkage, and the minimum variance method of J. H. Ward (1963) were compared in terms of their accuracy in solving 50 data sets. A mixture model was used for generating the data sets. The minimum variance method obtained the highest accuracy values. The accuracy scores of all methods were negatively correlated with the overlap among the clusters and the ellipticity of the clusters. Future users of cluster analysis are cautioned that there exists a wide variety of cluster analysis methods, that different methods can yield very different solutions, and that users should be careful to skeptically test the classifications generated by cluster analysis methods. (61 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
When analyzing data, researchers are often confronted with a model selection problem (e.g., determining the number of components/factors in principal components analysis [PCA]/factor analysis or identifying the most important predictors in a regression analysis). To tackle such a problem, researchers may apply some objective procedure, like parallel analysis in PCA/factor analysis or stepwise selection methods in regression analysis. A drawback of these procedures is that they can only be applied to the model selection problem at hand. An interesting alternative is the CHull model selection procedure, which was originally developed for multiway analysis (e.g., multimode partitioning). However, the key idea behind the CHull procedure (identifying a model that optimally balances model goodness of fit/misfit and model complexity) is quite generic. Therefore, the procedure may also be used when applying many other analysis techniques. The aim of this article is twofold. First, we demonstrate the wide applicability of the CHull method by showing how it can be used to solve various model selection problems in the context of PCA, reduced K-means, best-subset regression, and partial least squares regression. Moreover, a comparison of CHull with standard model selection methods for these problems is performed. Second, we present the CHULL software, which may be downloaded from http://ppw.kuleuven.be/okp/software/CHULL/, to assist the user in applying the CHull procedure.
Article
Parallel factor (PARAFAC) analysis and fluorescence spectroscopy were applied in the evaluation of yogurt during storage. Fluorescence landscapes with excitation wavelengths from 270 to 550 nm and emission wavelengths in the range 310–590 nm were obtained from front-face fluorescence measurements directly on yogurt samples during two storage experiments over a period of 5 weeks at 4 °C. PARAFAC analysis of the fluorescence landscapes revealed three fluorophores present in the yogurt, all strongly related to the storage conditions. The fluorescence signal was resolved into excitation and emission profiles of the pure fluorescent compounds, which are suggested to be tryptophan, riboflavin and lumichrome. Thus, it is concluded that fluorescence spectroscopy in combination with chemometrics has potential as a fast method for monitoring the oxidative stability and quality of yogurt. Regression models between fluorescence landscapes and riboflavin content, determined by the traditional chemical analysis, were built, yielding a root mean square error of cross-validation of 0.09 ppm riboflavin, corresponding to 7% of the mean riboflavin content in the yogurt samples. Regression models based on PARAFAC scores, Partial Least Squares (PLS) and N-PLS were compared and yielded only minor differences with respect to prediction error. Several missing values appear in the fluorescence data matrices, for all emission wavelengths below the excitation wavelength. Substituting some of the missing values with zeros was observed to have a large impact on the model solution and the computation time. It is concluded that at least 43% of the missing values in the present data set need to be substituted in order to obtain meaningful PARAFAC models.
Article
In this paper the general theory of multiway multiblock component and covariates regression models is explained. Unlike in existing methods such as multiblock PLS and multiblock PCA, in the newly proposed method a different number of components can be selected for each block. Furthermore, the method can be generalized to incorporate multiway blocks to which any multiway model can be applied. The method is a direct extension of principal covariates regression and therefore works in a simultaneous fashion in which a clearly defined objective criterion is minimized. It can be tuned to fulfil the requirements of the user. Algorithms to calculate the components will be presented. The method will be illustrated with two three-block examples and compared to existing approaches. The first example is with two-way data and the second example is with a three-way array. It will be shown that predictions are as good as with the existing methods, but because for most blocks fewer components are required, diagnostic properties of the method are improved. Copyright © 2000 John Wiley & Sons, Ltd.
Article
Systematic errors observed when using inductively coupled plasma atomic emission spectrometry (ICP-AES) and mass spectrometry (ICP-MS) for the multi-element determination in acid digests of environmental samples (tea leaves) were evaluated. Two chemometric approaches, experimental design and principal component analysis, were used in order to establish the errors associated with each stage of the analytical method: sample digestion (effect of the "number of acid digestions") and measurement step (effect of the "number of replicates" and the "calibration"). The elements under study were Co, Cr, Cs, Cu, Ni, Pb, Rb and Ti by ICP-MS, and Ba, Ca, Fe, Mg, Mn, Sr and Zn by ICP-MS and ICP-AES. Flame atomic absorption spectrometry was used for comparative purposes. A Chinese tea certified reference material with certified concentration for most of the elements was employed as the sample matrix. Variance estimation was made from ANOVA outputs from a 4¹ × 4¹ × 2¹ full factorial design (FFD).
Article
Combinatorial data analysis (CDA) as a generic term can be discussed within the framework of two distinct tasks of data analysis: exploratory and confirmatory. A confirmatory CDA approach compares some given data set to a specific structure that is conjectured for it a priori; the empirically observed degree of correspondence is evaluated by reference to what could be observed using all possible structures of the same form that could have been conjectured (for example, one-way analysis-of-variance can be viewed as comparing a given structure defined by a partitioning of subjects into groups, to all possible partitions that could have been formed). Alternatively, an exploratory strategy would seek one possible structure from some given class that best fits the given data set. In both cases, the data and the various structures considered are typically coded as matrices; thus, a confirmatory task usually involves an assessment of similar patterning across matrices; an exploratory task seeks to optimize such a pattern of correspondence by locating the most appropriate matrix reorganization.
Article
The performance of six hierarchical clustering methods (given by one algorithm, Wishart [1969]) is compared on bivariate and multivariate normal Monte Carlo samples. The methods are stopped with the correct number of clusters and compared with respect to correct classification (placing pairs of points in the same or different clusters correctly or incorrectly) and with each other (both methods agree or disagree in placing a pair of points in the same or differing clusters).
Article
A Monte Carlo study was made of the recovery of cluster structure in binary data by five hierarchical techniques, with a view to finding which data structure factors influenced recovery and to determining differences between clustering methods with respect to these factors. Recovery was found to increase as the number of groups decreased, as the number of variables increased, as the mixing proportions tended towards equality, and as the number of observations was increased. Single link was found to be much worse than the other clustering techniques.
Book
This book presents methods for analyzing multiway data by applying multiway component techniques. Multiway analysis is a special branch of the larger field of multivariate statistics that extends the standard methods for two-way data, such as component analysis, factor analysis, cluster analysis, correspondence analysis, and multidimensional scaling, to multiway data. The multiway techniques are applicable across a range of fields, from the social and behavioral sciences to agriculture, environmental sciences and chemistry.
Chapter
A procedure is developed for clustering objects in a low-dimensional subspace of the column space of an objects by variables data matrix. The method is based on the K-means criterion and seeks the subspace that is maximally informative about the clustering structure in the data. In this low-dimensional representation, the objects, the variables and the cluster centroids are displayed jointly. The advantages of the new method are discussed, an efficient alternating least-squares algorithm is described, and the procedure is illustrated on some artificial data.
Article
We propose a dimension-reducing k-means clustering procedure based on a projection pursuit (PP) technique. The clustering structure of high-dimensional data in terms of low-dimensional projected points is analyzed, and the strong consistency of the estimates of the cluster centers and the projection orientations is shown.
Article
This paper shows how methods of cluster analysis, principal component analysis, and multidimensional scaling may be combined in order to obtain an optimal fit between a classification underlying some set of objects 1,…,n and its visual representation in a low-dimensional Euclidean space ℝˢ. We propose several clustering criteria and corresponding k-means-like algorithms which are based either on a probabilistic model or on geometrical considerations leading to matrix approximation problems. In particular, an MDS-clustering strategy is presented for displaying not only the n objects using their pairwise dissimilarities, but also the detected clusters and their average distances.
Article
This paper describes a sensitive excitation–emission matrix fluorescence (EEM) method for simultaneously measuring the contents of two estrogens, estriol (E3) and estrone (E1), in liquid cosmetic samples with the aid of a second-order calibration method based on a parallel factor analysis (PARAFAC) algorithm. Before processing the obtained three-way data, a better region of the excitation and emission spectra was purposely selected. PARAFAC was then applied to acquire the clean spectra and predict the individual concentrations of the analytes of interest even in the presence of uncalibrated interferences. The standard curves of the two analytes are linear within a concentration range of 0–0.736 μg mL⁻¹ of E3 and 0–18.000 μg mL⁻¹ of E1, with correlation coefficients typically greater than 0.99. In the analysis of watermelon frost anti-acne toner sold online, the limit of detection (LOD) of E3 is 4.7 ng mL⁻¹ with an accuracy of 102.3–113.7%, and for E1, the LOD is 96.1 ng mL⁻¹ with an accuracy of 92.3–111.0%. In the analysis of pagoda flower relaxing lotion from the commercial market in Changsha, the LOD of E3 is 8.9 ng mL⁻¹ with an accuracy of 95.0–107.1%, and for E1, the LOD is 76.9 ng mL⁻¹ with an accuracy of 98.6–119.3%. Generally, a new avenue has been opened up to determine estrogens quantitatively in cosmetic samples. This methodology will achieve greater development and gradually become a more routine approach in cosmetic quality control due to its advantages of high sensitivity, simple pretreatment procedure and non-destructive nature.
Article
This paper describes the application of a trilinear parallel factor analysis (PARAFAC) to study systematic error during the multi-element determination of a range of analytes in acid digests of solid samples (tea leaves) by ICP-AES and ICP-MS. The three variables studied were the “number of digestions”, in order to assess the systematic error associated with the sample pre-treatment, and the “number of replicates” and “calibration”, to provide information on the systematic error associated with the analytical determination itself. The elements under study were Co, Cr, Cu, Ni, Pb, Rb and Ti by ICP-MS, and Ba, Ca, Fe, Mg, Mn, Sr and Zn by both ICP-MS and ICP-AES. For some elements flame atomic absorption spectrometry was used for comparative purposes. A Chinese tea certified reference material containing many of the metals above was used in the study. The results obtained were compared to results from ANOVA. It was found that the systematic error, expressed as the sum of squares after PARAFAC, was quite different from the results obtained using ANOVA due to the very different way in which the models are built. The PARAFAC approach is shown to be straightforward to implement and robust.
Article
An abundance of methods exists to regress a y variable on a set of x variables collected in a matrix X. In the chemical sciences a growing number of problems translate into arrays of measurements X and Y, where X and Y are three-way arrays or multiway arrays. In this paper a general model is described for regressing such a multiway Y on a multiway X, while taking into account three-way structures in X and Y. A global least squares optimization problem is formulated to estimate the parameters of the model. The model is described and illustrated with a real industrial example from batch process operation. An algorithm is given in an appendix. Copyright © 1999 John Wiley & Sons, Ltd.
Article
Regression and principal components analysis (PCA) are two of the most widely used techniques in chemometrics. In this paper, these methods are compared by considering their application to linear, two-dimensional data sets with a zero intercept. The need for accommodating measurement errors with these methods is addressed and various techniques to accomplish this are considered. Seven methods are examined: ordinary least squares (OLS), weighted least squares (WLS), the effective variance method (EVM), multiply weighted regression (MWR), unweighted PCA (UPCA), and two forms of weighted PCA. Additionally, five error structures in x and y are considered: homoscedastic equal, homoscedastic unequal, proportional equal, proportional unequal, and random. It is shown that for certain error structures, several of the methods are mathematically equivalent. Furthermore, it is demonstrated that all of the methods can be unified under the principle of maximum likelihood estimation, embodied in the general case by MWR. Extensive simulations show that MWR produces the most reliable parameter estimates in terms of bias and mean-squared error. Finally, implications for modeling in higher dimensions are considered.
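As a rough illustration of the simplest comparison in the abstract above, the sketch below fits a zero-intercept line to two-dimensional data in two ways: ordinary least squares (errors assumed only in y) and unweighted PCA (the direction of the leading eigenvector of the uncentered 2×2 moment matrix). The function names are hypothetical, and none of the weighted methods (WLS, EVM, MWR, weighted PCA) are shown; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
import math


def ols_slope_through_origin(xs, ys):
    """Least-squares slope of y = b*x with no intercept (errors in y only)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx


def upca_slope_through_origin(xs, ys):
    """Slope of the first principal axis of the uncentered moment matrix.

    The moment matrix is [[sxx, sxy], [sxy, syy]]; its leading eigenvector
    is proportional to (sxy, lambda_max - sxx), which gives the closed form
    used below.
    """
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    if sxy == 0:
        raise ValueError("x and y have zero cross-moment; slope is undefined")
    diff = syy - sxx
    return (diff + math.sqrt(diff * diff + 4.0 * sxy * sxy)) / (2.0 * sxy)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 2.0]
    print("OLS slope: ", ols_slope_through_origin(xs, ys))
    print("UPCA slope:", upca_slope_through_origin(xs, ys))
```

On noise-free data the two slopes coincide; with scatter in both coordinates the OLS slope is pulled toward zero relative to the PCA slope, which is one reason the treatment of measurement error matters.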
Article
This book is an introduction to the field of multi-way analysis for chemists and chemometricians. Its emphasis is on the ideas behind the method and its practical applications. Sufficient mathematical background is given to provide a solid understanding of the ideas behind the method. There are currently no other books on the market which deal with this method from the viewpoint of its applications in chemistry. Applicable in many areas of chemistry. No comparable volume currently available. The field is becoming increasingly important.
Article
A common representation of data within the context of multidimensional scaling (MDS) is a collection of symmetric proximity (similarity or dissimilarity) matrices for each of M subjects. There are a number of possible alternatives for analyzing these data, which include: (a) conducting an MDS analysis on a single matrix obtained by pooling (averaging) the M subject matrices, (b) fitting a separate MDS structure for each of the M matrices, or (c) employing an individual differences MDS model. We discuss each of these approaches, and subsequently propose a straightforward new method (CONcordance PARtitioning—ConPar), which can be used to identify groups of individual-subject matrices with concordant proximity structures. This method collapses the three-way data into a subject×subject dissimilarity matrix, which is subsequently clustered using a branch-and-bound algorithm that minimizes partition diameter. Extensive Monte Carlo testing revealed that, when compared to K-means clustering of the proximity data, ConPar generally provided better recovery of the true subject cluster memberships. A demonstration using empirical three-way data is also provided to illustrate the efficacy of the proposed method.
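The collapsing step mentioned in the abstract above, from M subject proximity matrices to a single subject×subject dissimilarity matrix, can be sketched as follows. The abstract does not say how concordance between two subjects is measured, so the choice here (one minus the Pearson correlation between the off-diagonal entries of the two matrices) is purely an assumption for illustration, as are all function names. The subsequent branch-and-bound clustering is not shown.

```python
import math


def upper_triangle(matrix):
    """Return the strictly upper-triangular entries of a square matrix as a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0 or sd_v == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_u * sd_v)


def subject_dissimilarity(matrices):
    """Collapse M symmetric proximity matrices into an M x M dissimilarity matrix.

    ASSUMPTION: dissimilarity = 1 - Pearson correlation of off-diagonal entries.
    This is an illustrative choice, not necessarily the measure used by ConPar.
    """
    vectors = [upper_triangle(m) for m in matrices]
    count = len(vectors)
    result = [[0.0] * count for _ in range(count)]
    for s in range(count):
        for t in range(s + 1, count):
            d = 1.0 - pearson(vectors[s], vectors[t])
            result[s][t] = d
            result[t][s] = d
    return result


if __name__ == "__main__":
    subject_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
    subject_b = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]  # same structure, rescaled
    subject_c = [[0, 3, 2], [3, 0, 1], [2, 1, 0]]  # reversed ordering
    for row in subject_dissimilarity([subject_a, subject_b, subject_c]):
        print(row)
```

Under this assumed measure, subjects whose proximities differ only by a positive rescaling are at distance zero, so any partitioning of the resulting matrix would place them in the same group.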
Article
role and relationships among statistical methods / basic mathematical propositions and formulations / alternative models: components and factors / properties and formulas for the full factor model / unique resolution and the tests of its attainment / factor invariance, identification, and interpretation / deciding the number of factors / reticular and strata models for higher-order factors / some modifications, developments, and conditions of the main factor model / strategies in the practical use of factor analysis / questions of statistical significance and use of computer procedures (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Discusses methods for simultaneous component analysis (SCA) of scores of 2 or more groups of individuals on the same variables. A method is developed for SCA such that for each set essentially the same component structure (ST) is found (SCA-ST). The method is compared with those that use the same component weights matrix (SCA-W) or the same pattern matrix (SCA-P) across data sets. SCA-W always explains the highest amount of variance and SCA-ST the lowest. These explained variances can be compared to the amount of variance explained by separate principal components analyses. It is shown how, for cases where SCA-ST does not fit well, one can use SCA-W (and SCA-P) to find out if and how correlational structures differ. Facilitating the interpretation of an SCA-ST solution is discussed. Rotational freedom is exploited in a specially designed simple structure rotation technique for SCA-ST, which is illustrated on an empirical data set. (PsycINFO Database Record (c) 2012 APA, all rights reserved)