Chapter

Principal Component Analysis and Factor Analysis

Authors: I. T. Jolliffe

Abstract

Principal component analysis has often been dealt with in textbooks as a special case of factor analysis, and this tendency has been continued by many computer packages which treat PCA as one option in a program for factor analysis—see Appendix A2. This view is misguided since PCA and factor analysis, as usually defined, are really quite distinct techniques. The confusion may have arisen, in part, because of Hotelling’s (1933) original paper, in which principal components were introduced in the context of providing a small number of ‘more fundamental’ variables which determine the values of the p original variables. This is very much in the spirit of the factor model introduced in Section 7.1, although Girschick (1936) indicates that there were soon criticisms of Hotelling’s method of PCs, as being inappropriate for factor analysis. Further confusion results from the fact that practitioners of ‘factor analysis’ do not always have the same definition of the technique (see Jackson, 1981). The definition adopted in this chapter is, however, fairly standard.
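To make the distinction concrete, the following sketch (not part of the chapter; it uses scikit-learn on simulated data) fits both models to the same dataset: PCA simply rotates the observed variables into orthogonal directions of maximal variance, whereas factor analysis fits an explicit latent-variable model with variable-specific noise terms.

```python
# Minimal sketch (not from the original chapter): contrasting PCA and
# factor analysis as usually defined. PCA finds orthogonal linear
# combinations of the observed variables with maximal variance, while
# the factor model explains the observed covariances through a small
# number of latent common factors plus variable-specific noise.
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
n, p = 500, 6
# Simulated data roughly following a one-factor model with added noise.
factor = rng.normal(size=(n, 1))
loadings = rng.uniform(0.5, 1.0, size=(1, p))
X = factor @ loadings + 0.5 * rng.normal(size=(n, p))

pca = PCA(n_components=2).fit(X)
fa = FactorAnalysis(n_components=2).fit(X)

print("PCA explained variance ratio:", pca.explained_variance_ratio_)
print("PCA component directions:\n", pca.components_)
print("FA loadings:\n", fa.components_)
print("FA specific (noise) variances:", fa.noise_variance_)
```

The fitted PCA components are purely variance-driven directions, while the factor-analysis output separates common loadings from the per-variable noise variances, which is exactly the modelling difference the abstract insists on.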

... These methods find and extract linear relationships that explain the correlation between multiple datasets and between variables, as well as highlight batch effects or outliers. Owing to the high dimensionality in omics datasets, they have been used in exploratory analysis to better understand molecular pathways in cells and their role in diseases [103], as well as in extracting patterns from gene expression data where there is typically a large amount of noise [20,79,157]. ...
... SVD can detect weak expression patterns by finding and extracting small signals from gene expression data, where there is typically a significant amount of noise. PCA searches for a linear projection of the data that preserves the variance while minimizing noise and redundancy [79]. As such, it yields a lower-dimensional representation of the data with small reconstruction errors. ...
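A minimal illustration of this point, on synthetic data rather than the gene-expression data discussed above: project onto a few principal components and measure how small the reconstruction error stays.

```python
# Hedged illustration (not from the cited work): project noisy data onto
# a few principal components and measure the reconstruction error, as a
# proxy for how much signal the low-dimensional representation keeps.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Toy "expression matrix": 200 samples x 50 variables, a few strong
# patterns buried in noise.
signal = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))
X = signal + 0.3 * rng.normal(size=(200, 50))

pca = PCA(n_components=3).fit(X)
Z = pca.transform(X)                 # low-dimensional scores
X_hat = pca.inverse_transform(Z)     # reconstruction from 3 PCs

mse = np.mean((X - X_hat) ** 2)
print("variance explained by 3 PCs:", pca.explained_variance_ratio_.sum())
print("mean squared reconstruction error:", mse)
```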
Thesis
Full-text available
Data visualization has been shown to be an important tool in knowledge discovery, being used alongside data analysis to identify and highlight patterns, trends and outliers, aiding users in decision-making. The need for analyzing unstructured and increasingly larger datasets has led to the continued emergence of visualization tools that seek to provide methods that facilitate the exploration and analysis of such datasets. Many fields of study still face the challenges inherent to the analysis of complex multidimensional datasets, such as the field of computational biology, whose research of infectious diseases must contend with large protein-protein interaction networks with thousands of genes that vary in expression values over time. Throughout this thesis, we explore the visualization of multivariate data through CroP, a data visualization tool with a coordinated multiple views framework that allows users to adapt the workspace to different problems through flexible panels. While CroP is able to process generic relational, temporal and multivariate quantitative data, it also presents methods directed at the analysis of biological data. This data can be represented through various layouts and functionalities that not only highlight relationships between different variables, but also dig-down into discovered patterns in order to better understand their sources and their effects. In particular, we can highlight the exploration of time-series through our dynamic and parameter-based implementation of layouts that bend timelines to visually represent how datasets behave over time. The implemented models and methods are demonstrated through experiments with diverse multivariate datasets, with a focus on gene expression time-series datasets, and complemented with a discussion on how these contributed to the creation of comprehensible visualizations, facilitated data analysis, and promoted pattern discovery. We also validate CroP through model and interface tests performed with participants from both the fields of information visualization and computational biology. As we present our research and a discussion of its results, we can highlight the following contributions: an analysis of the available range of visualization models and tools for multivariate datasets, as well as modern data analysis methods that can be used cooperatively to explore such datasets; a coordinated multiple views framework with a modular workspace that can be adapted to the analysis of varied problems; dynamic visualization models that explore the representation of complex multivariate datasets, combined with modern data analysis methods to highlight and analyze significant events and patterns; a visualization tool that incorporates the developed framework, visualization models and data analysis methods into a platform that can be used by different types of users.
... Two components emerged with eigenvalues greater than 1, accounting for 81.97% of the overall variance (Table 6). Although a small fraction of the information contained in the five dimensions was lost, most of the information was retained, indicating effective data compression and feature extraction (Jolliffe, 1986). The weighted average of component 1 and component 2 was formed as organisational commitment, which could then be used in the regression test. ...
... revealed a 0.282-unit reduction in turnover intention for each unit increase in organisational commitment (Table 8). The Beta (−0.282) should not be compared with the previous regression results, since information loss in data compression and feature extraction cannot be avoided (Jolliffe, 1986). Taking the significance level (p < 0.01) and Beta (−0.282) into account, organisational commitment still had a significant impact on turnover intention, which sufficiently supports hypothesis H1. ...
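The retention-and-scoring procedure described in these excerpts can be sketched as follows; the survey items below are simulated stand-ins, and the variance-based weighting is only one plausible reading of the "weighted average" used by the cited study.

```python
# Sketch under stated assumptions (the items and data are hypothetical):
# retain components with eigenvalues > 1 (Kaiser criterion) and form a
# variance-weighted composite score that can then enter a regression.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 2))
items = latent @ rng.uniform(0.6, 1.0, size=(2, 5)) + 0.5 * rng.normal(size=(300, 5))
X = StandardScaler().fit_transform(items)     # standardised survey items

pca = PCA().fit(X)
eigenvalues = pca.explained_variance_
keep = eigenvalues > 1.0                      # Kaiser criterion
scores = pca.transform(X)[:, keep]

# Weight each retained component by its share of the retained variance.
weights = eigenvalues[keep] / eigenvalues[keep].sum()
composite = scores @ weights
print("retained components:", keep.sum())
print("composite score shape:", composite.shape)
```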
... Observations were recorded on phenological traits viz., days to flower initiation, days to 50% flowering, days to pod initiation and days to maturity, and on eleven quantitative traits viz., plant height (cm), stem height at first fruiting node (cm), number of primary branches per plant, number of secondary branches per plant, total number of pods per plant, number of effective pods per plant, number of seeds per pod, 100-seed weight (g), biological yield per plant (g), harvest index (%) and seed yield per plant (g). A popular method of dimension reduction is PCA (Massy 1965, Jolliffe 1986), which looks for linear combinations of the columns of X with maximal variance or, alternatively, high information. Therefore, the objective of this research was to assess the chickpea germplasm with the aim of identifying and ranking significant genotypes and traits with the help of principal component analysis, for initiating a hybridization program to create improved chickpea cultivars. ...
Article
An experiment was conducted to evaluate 30 promising chickpea genotypes grown in a Randomized Complete Block Design with three replications during rabi 2018-19. Observations were recorded on 15 traits viz., days to flower initiation, days to 50% flowering, days to pod initiation, days to maturity, plant height (cm), stem height at first fruiting node (cm), number of primary branches per plant, number of secondary branches per plant, total number of pods per plant, number of effective pods per plant, number of seeds per pod, 100-seed weight (g), biological yield per plant (g), harvest index (%) and seed yield per plant (g). Principal component analysis showed that, out of 15, only 6 principal components (PCs) displayed eigenvalues greater than 1 and together explained approximately 86.7% of the variation in the attributes under investigation. The genotypes from PC1 were given due importance because this component accounted for 28.6% of the total variability. The genotypes JG 63 × JG 4958, (JG 74 × JG 315) – 14, ICC 96029 × ICC 11551, JG 24, JG 11 × JG 14, ICCV 15119, ICC 552241 × JG 11 and JG 11 × RVSSG-1 had the highest positive PC values for yield-related traits. Thus, these genotypes can be utilized in a chickpea improvement program.
... The second PC, on the other hand, is constrained to be orthogonal to the first PC and is intended to capture the majority of the remaining variance, and so on [29]. According to Table 4, the 1st and 2nd principal components in this analysis account for an overall variance of at least 99.980% for the characteristics of interest. ...
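A brief check of the two properties mentioned here, on arbitrary synthetic data rather than the cited dataset: the leading component directions are orthogonal, and their cumulative explained variance is read off directly.

```python
# Minimal sketch (not tied to the data in the cited work): the second PC
# is constrained to be orthogonal to the first, and the cumulative
# explained variance of the leading PCs is reported by the model.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

pca = PCA().fit(X)
pc1, pc2 = pca.components_[0], pca.components_[1]
print("PC1 . PC2 =", np.dot(pc1, pc2))            # ~0: orthogonal directions
cumvar = np.cumsum(pca.explained_variance_ratio_)
print("cumulative variance of first two PCs:", cumvar[1])
```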
Article
Full-text available
In order to extract meaningful interpretations from large datasets and deliver their value to application areas, chemical data analysis has become a serious challenge in the development and application of new protocols, techniques and methodologies for the mathematical modelling community and other data science societies. Therefore, in the present work, rapid and robust box-and-whisker plot and multivariate principal component analysis (PCA) statistical techniques are proposed for the evaluation of thermodynamic molecular property data of benign fuel structures. We observed that the box-and-whisker plot technique explored all of the thermochemical molecular properties precisely and described the symmetrical distribution of the data about the median values with respect to the rise in temperature. Moreover, applying the PCA technique, the score plots of the PCs diagnosed peculiar variations in the molecular properties after a certain peak temperature, with descending variation in the statistical parameters. Furthermore, the PCA parameters segregated not only the thermodynamic properties of propanol and butanol but also their variations with temperature. Thus, we concluded that box-and-whisker and PCA statistical techniques are robust and rapid methods for the assessment and evaluation of large molecular thermodynamic datasets.
... PC1 represents the eigenvector and is the dominant principal component which represents most of the information variance, and contributes more out of the total variability. PC2, PC3, PC4 and PC5 represent lesser variance in descending order [28,29]. ...
Article
Full-text available
The craving for organic cocoa beans has resulted in fraudulent practices such as mislabeling and adulteration, collectively known as food fraud, prompting the international cocoa market to call for the authenticity of organic cocoa beans to be verified before export. In this study, we proposed robust models using laser-induced fluorescence (LIF) and chemometric techniques for the rapid classification of cocoa beans as either organic or conventional. The LIF measurements were conducted on cocoa beans harvested from organic and conventional farms. From the results, conventional cocoa beans exhibited a higher fluorescence intensity compared to organic ones. In addition, a general peak wavelength shift was observed when the cocoa beans were excited using a 445 nm laser source. These results highlight distinct characteristics that can be used to differentiate between organic and conventional cocoa beans. Identical compounds were found in the fluorescence spectra of both the organic and conventional beans. With preprocessed fluorescence spectra data and utilizing principal component analysis, classification models such as Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Neural Network (NN) and Random Forest (RF) were employed. The LDA and NN models yielded 100.0% classification accuracy for both training and validation sets, while 99.0% classification accuracy was achieved in the training and validation sets using the SVM and RF models. The results demonstrate that employing a combination of LIF and either LDA or NN can be a reliable and efficient technique to classify authentic cocoa beans as either organic or conventional. This technique can play a vital role in maintaining integrity and preventing fraudulent practices in the cocoa bean supply chain.
... If the correlation coefficients assume low values (<0.3), further PCA studies should be discontinued. However, if the values in the correlation matrix reach high values (>0.3), the principal components should be determined [55]. For the data obtained here the correlations are greater than 0.3, and it is therefore justified to apply the procedure of separating the principal components and analysing them further. ...
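The screening step described above can be sketched as follows; the 0.3 threshold comes from the excerpt, while the data and the "share of correlations" rule of thumb are illustrative assumptions.

```python
# Hedged sketch of the screening step: inspect the off-diagonal
# correlations and proceed with PCA only if enough of them exceed 0.3
# in absolute value (the data and decision rule here are illustrative).
import numpy as np

rng = np.random.default_rng(4)
latent = rng.normal(size=(150, 2))
X = latent @ rng.uniform(0.5, 1.0, size=(2, 6)) + 0.4 * rng.normal(size=(150, 6))

R = np.corrcoef(X, rowvar=False)
off_diag = R[np.triu_indices_from(R, k=1)]
share_above = np.mean(np.abs(off_diag) > 0.3)
print("share of |r| > 0.3:", share_above)
if share_above > 0.5:        # illustrative rule of thumb, not from the source
    print("correlations are substantial; PCA is worth pursuing")
else:
    print("variables are nearly uncorrelated; PCA adds little")
```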
... Among the feature extraction techniques that were used during the event, we can further define different types: 1) techniques that calculate summary measures (indices) based on the creation of weighted combinations of exposures that predict the health outcome; 2) feature projection techniques such as Principal Component Analysis (PCA) (Jolliffe, 1986), Factor Analysis (FA), Principal Component Pursuit (PCP) (Candes et al., 2009) or Multiple Correspondence Analysis (MCA) (Blasius and Greenacre, 2006) to represent the data in a low dimensional space; and 3) techniques that provide clustering of participants sharing a similar exposome profile, which could be predictive of the outcome, i.e. supervised, or not. ...
Article
Full-text available
Polyethylene (PE) is the most widely used type of plastic food packaging, in which chemicals can potentially migrate into packaged foods. The implications of using and recycling PE from a chemical perspective remain underexplored. This study is a systematic evidence map of 116 studies looking at the migration of food contact chemicals (FCCs) across the lifecycle of PE food packaging. It identified a total of 377 FCCs, of which 211 were detected to migrate from PE articles into food or food simulants at least once. These 211 FCCs were checked against the inventory FCCs databases and EU regulatory lists. Only 25% of the detected FCCs are authorized by EU regulation for the manufacture of food contact materials. Furthermore, a quarter of authorized FCCs exceeded the specific migration limit (SML) at least once, while one-third (53) of non-authorised FCCs exceeded the threshold value of 10 μg/kg. Overall, evidence on FCCs migration across the PE food packaging lifecycle is incomplete, especially at the reprocessing stage. Considering the EU's commitment to increase packaging recycling, a better understanding and monitoring of PE food packaging quality from a chemical perspective across the entire lifecycle will enable the transition towards a sustainable plastics value chain.
... Attention Networks [266, 277, 307, 308, 309, 310, 311]; Representation Disentanglement [112, 278, 312, 313, 314, 315, 316, 317, 318, 319]; Explanation Generation [275, 320, 321, 322]; Hybrid Transparent and Black-box Methods ...
Thesis
Artificial Intelligence has been developing exponentially over the last decade. Its evolution is mainly linked to the progress of graphics card processors, which accelerate the computation of learning algorithms, and to access to massive volumes of data. This progress has been principally driven by a search for high-quality prediction models, making them extremely accurate but opaque. Their large-scale adoption is hampered by their lack of transparency, prompting the emergence of eXplainable Artificial Intelligence (XAI). This new line of research aims at fostering the use of learning models based on mass data by providing methods and concepts to obtain explanatory elements concerning their functioning. However, the youth of this field causes a lack of consensus and cohesion around its key definitions and objectives. This thesis contributes to the field from two perspectives, one theoretical, concerning what XAI is and how to achieve it, and one practical. The first is based on a thorough review of the literature, resulting in two contributions: 1) the proposal of a new definition of Explainable Artificial Intelligence and 2) the creation of a new taxonomy of existing explainability methods. The practical contribution consists of two learning frameworks, both based on a paradigm aiming at linking the connectionist and symbolic paradigms.
... The parallel coordinates plot shows the user-selected network traffic data. Furthermore, the scatter plots display high-dimensional data re-scaled into a lower-dimensional space (i.e., a 2D display space) by applying different dimension reduction techniques such as t-SNE [50], UMAP [51], and PCA [52]. The heatmap visualization is added to show a global representation of the data attributes with pixel-based visualization. ...
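A hedged sketch of this projection step using scikit-learn's PCA and t-SNE on stand-in data (the traffic features themselves are not available here); UMAP would slot in the same way but requires the separate umap-learn package.

```python
# Illustrative sketch (not the cited system's code): re-scale
# high-dimensional records into a 2D display space with PCA and t-SNE.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 20))           # stand-in for traffic features

xy_pca = PCA(n_components=2).fit_transform(X)
xy_tsne = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=5).fit_transform(X)
print("PCA layout:", xy_pca.shape, "t-SNE layout:", xy_tsne.shape)
```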
Article
Full-text available
Network traffic data analysis is important for securing our computing environment and data. However, analyzing network traffic data requires tremendous effort because of the complexity of continuously changing network traffic patterns. To assist the user in better understanding and analyzing the network traffic data, an interactive web-based visualization system is designed using multiple coordinated views, supporting a rich set of user interactions. For advancing the capability of analyzing network traffic data, feature extraction is considered along with uncertainty quantification to help the user make precise analyses. The system allows the user to perform a continuous visual analysis by requesting incrementally new subsets of data with updated visual representation. Case studies have been performed to determine the effectiveness of the system. The results from the case studies support that the system is well designed to understand network traffic data by identifying abnormal network traffic patterns.
... All appendices referred to in this paper are available by clicking the following link: https://www.ntnu.no/documents/1265701259/1281473463/Appendices.A_I.pdf/0dfebb34-067c-bd3a-8f85-5b96a026210c?t=1667216631547 (accessed on 6 November 2022) (Breiman 1998; Connelly 2020; Hess and Hess 2019; Hintze and Nelson 1998; Jolliffe 1986; Lever et al. 2016; Nixon et al. 2019). ...
Article
Full-text available
Banks’ credit scoring models are required by financial authorities to be explainable. This paper proposes an explainable artificial intelligence (XAI) model for predicting credit default on a unique dataset of unsecured consumer loans provided by a Norwegian bank. We combined a LightGBM model with SHAP, which enables the interpretation of explanatory variables affecting the predictions. The LightGBM model clearly outperforms the bank’s actual credit scoring model (Logistic Regression). We found that the most important explanatory variables for predicting default in the LightGBM model are the volatility of utilized credit balance, remaining credit in percentage of total credit and the duration of the customer relationship. Our main contribution is the implementation of XAI methods in banking, exploring how these methods can be applied to improve the interpretability and reliability of state-of-the-art AI models. We also suggest a method for analyzing the potential economic value of an improved credit scoring model.
... Multivariate data visualization is performed using principal component analysis (PCA) (Jolliffe, 1986). The multivariate data of the reaction system is dimensionally reduced using PCA to 3 dimensions (maximum dimensions for visualization). ...
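The reduction to three components for display might look like this; the process data are replaced by random stand-ins, and the 3D scatter uses matplotlib.

```python
# Sketch of the visualization step (reaction data not reproduced here):
# reduce the multivariate measurements to three principal components,
# the maximum that a single 3D scatter plot can show directly.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 12))           # stand-in for process variables

Z = PCA(n_components=3).fit_transform(X)
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(Z[:, 0], Z[:, 1], Z[:, 2], s=5)
ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.set_zlabel("PC3")
plt.show()
```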
Article
Full-text available
Batch or semi-batch chemical reaction units often require multiple operational phases to convert reactants to valuable products. In various chemical production facilities, the decision to switch between such operational phases has to be confirmed and registered by the operating personnel. Imprecise switching of phases can waste a significant amount of time and energy in the reaction unit, harming plant sustainability and costs. Additionally, automation of phase switching is rarely used due to the challenges of batch-to-batch variance, sensor instability, and various process uncertainties. Here, we demonstrate that by using a machine learning approach that includes optimized noise removal methods and a neural network (obtained via neural architecture search), the real-time reaction completion can be precisely tracked (R² > 0.98). Furthermore, we show that the latent space of the evolved neural network can be transferred from predicting reaction completion to classifying the reaction operational phase via optimal transfer learning. From the optimally transfer-learned network, a novel phase switch index is proposed to act as a digital phase switch and is shown to be capable of reducing total reactor operation time, subject to verification by an operator. These intelligent analytics were studied on a reactive distillation unit for a reaction of monomers and acids to polyester in the Netherlands. The combined analytics gave a potential 5.4% saving in reaction batch time, 10.6% saving in reaction energy, and 10.5% reduction in carbon emissions. For the operator, this method also saves up to 6 hours during the end discharge of the reaction.
... 10 × 10; [13,17), [22,23), [27,28), [32,33), [37,38), [43,48), [57,58), [67,68), Here, 10 × 10 at the beginning indicates the original image size without which reconstruction of the image from the encoding would not be possible. An interval such as [13,17) indicates that the image array contains a 1 from position 13 to position 16 (indexing is assumed to be zero-based and the intervals do not include upper bounds). ...
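Reading the encoding back into an image can be done as below; the intervals are copied from the excerpt (which may be truncated there), and the zero-based, half-open convention follows the description above.

```python
# Hedged reconstruction of the encoding described above (the layout is
# an assumption): a flattened 10x10 binary image in which each half-open
# interval [a, b) marks a run of ones.
import numpy as np

shape = (10, 10)
intervals = [(13, 17), (22, 23), (27, 28), (32, 33),
             (37, 38), (43, 48), (57, 58), (67, 68)]   # as listed in the excerpt

flat = np.zeros(shape[0] * shape[1], dtype=int)
for a, b in intervals:
    flat[a:b] = 1                  # zero-based, upper bound excluded
image = flat.reshape(shape)
print(image)
```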
Preprint
In this thesis, we propose an alternative characterization of the notion of Configuration Space, which we call Visual Configuration Space (VCS). This new characterization allows an embodied agent (e.g., a robot) to discover its own body structure and plan obstacle-free motions in its peripersonal space using a set of its own images in random poses. Here, we do not assume any knowledge of geometry of the agent, obstacles or the environment. We demonstrate the usefulness of VCS in (a) building and working with geometry-free models for robot motion planning, (b) explaining how a human baby might learn to reach objects in its peripersonal space through motor babbling, and (c) automatically generating natural looking head motion animations for digital avatars in virtual environments. This work is based on the formalism of manifolds and manifold learning using the agent's images and hence we call it Motion Planning on Visual Manifolds.
... Principal Components Analysis (PCA) (Jolliffe 1986;Hotelling 1933) is one of the most popular algorithms for linear dimensionality reduction. Given a data set, the PCA algorithm finds the directions (vectors) along which the data has a maximum variance and deduces the relative importance of these directions. ...
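A from-scratch sketch of this description (not the cited authors' code): the principal directions are eigenvectors of the covariance matrix, and the eigenvalues give their relative importance.

```python
# Compute the directions of maximum variance and their relative
# importance directly from the covariance matrix of centred data.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)                      # centre the data

cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]        # sort by decreasing variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("directions of maximum variance (columns):\n", eigenvectors)
print("relative importance:", eigenvalues / eigenvalues.sum())
```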
Preprint
Fast radio bursts (FRBs) are one of the most mysterious astronomical transients. Observationally, they can be classified into repeaters and apparently non-repeaters. However, due to the lack of continuous observations, some repeaters may have been incorrectly recognized as non-repeaters. In a series of two papers, we intend to solve this problem with machine learning. In this second paper of the series, we focus on an array of unsupervised machine learning methods. We apply multiple unsupervised machine learning algorithms to the first CHIME/FRB catalog to learn their features and classify FRBs into different clusters without any premise about the FRBs being repeaters or non-repeaters. These clusters reveal the differences between repeaters and non-repeaters. Then, by comparing with the identities of the FRBs in the observed classes, we evaluate the performance of the various algorithms and analyze the physical meaning behind the results. Finally, we recommend a list of the most credible repeater candidates as targets for future observing campaigns to search for repeated bursts, in combination with the results presented in Paper I using supervised machine learning methods.
Chapter
Full-text available
In recent years, a vast amount of data has been accumulated across various fields in industry and academia, and with the rise of artificial intelligence and machine learning technologies, knowledge discovery and high-precision prediction from such data are in demand. However, real-world data is diverse, including network data that represents relationships, data with multiple modalities or views, and data that is distributed across multiple institutions and requires a certain level of confidentiality.
Conference Paper
Air pollution poses a significant environmental challenge, adversely affecting the health of millions worldwide. Consequently, accurate prediction of pollutant levels has become increasingly crucial to prevent and mitigate the negative impacts of air pollution. This research introduces a Python-based artificial neural network algorithm for predicting PM2.5 levels in Medellín, Colombia, leveraging meteorological and emission data. The model utilizes a Multilayer Perceptron Neural Network, incorporating principal component analysis (PCA) and K-means clustering to determine the optimal number of hidden layers and neurons. Additionally, trend and correlation analyses were conducted to identify the most relevant predictors by examining the relationship between the available variables and the target variable (PM2.5). Model performance is assessed using Mean Square Error and Mean Absolute Error.
Article
Network data is ubiquitous, arising in telecommunications, transport systems, online social networks, protein-protein interactions, etc. Owing to the huge scale and complexity of network data, earlier machine learning systems struggled to understand it. On the other hand, multi-granular cognitive computation simulates the problem-solving process of human brains: it simplifies complex problems and solves them from the easier to the harder. Therefore, the application of multi-granularity problem-solving ideas or methods to network data mining is increasingly adopted by researchers, either intentionally or unintentionally. This paper looks into the domain of network representation learning (NRL) and systematically reviews the research work in this field in recent years. We find that, in dealing with the complexity of networks while pursuing efficient use of computing resources, multi-granularity solutions become an excellent path that is hard to avoid. Although there are several surveys of NRL, to the best of our knowledge we are the first to survey NRL from the perspective of multi-granular computing. This paper sets out the challenges that NRL faces. Furthermore, the feasibility of addressing these challenges with multi-granular computing methodologies is analyzed and discussed, and some potential key scientific problems in applying multi-granular computing to NRL research are sorted out and prospected.
Chapter
Can a machine or algorithm discover or learn the elliptical orbit of Mars from astronomical sightings alone? Johannes Kepler required two paradigm shifts to discover his First Law regarding the elliptical orbit of Mars. Firstly, a shift from the geocentric to the heliocentric frame of reference. Secondly, the reduction of the orbit of Mars from a three- to a two-dimensional space. We extend AI Feynman, a physics-inspired tool for symbolic regression, to discover the heliocentricity and planarity of Mars’ orbit and emulate his discovery of Kepler’s first law.
Chapter
Devising a model of a dynamical system from raw observations of its states and evolution requires characterising its phase space, which includes identifying its dimension and state variables. Recently, Boyuan Chen and his colleagues proposed a technique that uses intrinsic dimension estimators to discover the hidden variables in experimental data. The method uses estimators of the intrinsic dimension of the manifold of observations. We present the results of a comparative empirical performance evaluation of various candidate estimators. We expand the repertoire of estimators proposed by Chen et al. and find that several estimators not initially suggested by the authors outperform the others.
Article
Full-text available
The use of principal component analysis (PCA) is common in the analysis of relationships between scores of variables in domestic animals and can assist in the selection and breeding programmes of such animals. This study adopted PCA to describe the growth traits of male and female Ross 308 broiler chickens. A total of 100 male and 100 female Ross 308 broiler chickens were used for the study. Traits measured were body weight (BW), body length (BL), keel length (KL), thigh length (TL) and wing length (WL). The descriptive statistics indicated average values of BW (2161.68g), BL (22.22cm), KL (16.53cm), TL (16.52cm), WL (21.10cm) and SL (18.32cm) for male Ross 308 broiler chickens, and BW (2059.22g), BL (21.56cm), KL (15.99cm), TL (16.27cm), WL (20.83cm) and SL (17.89cm) for female Ross 308 broiler chickens. The observed correlation coefficients, ranging from r=0.78 to r=0.97 for males and from r=0.78 to r=0.95 for females, were positive and highly significant (P<0.01). The PCA results revealed that three principal components were extracted, explaining 98.64% and 97.63% of the total variation in the original variables for male and female Ross 308 broiler chickens respectively, while PC1 accounted for 92.905% and 91.04% respectively. Kaiser-Meyer-Olkin (KMO) values of 0.912 and 0.875 for male and female chickens respectively were termed Marvelous and Meritorious, with a Bartlett's test value of 2418.795 at a determinant of 0.88E for both male and female Ross 308 broiler chickens. These components could serve as a template for selection criteria for growth traits in male and female broiler chickens.
Article
Full-text available
Voice quality derived from long-term laryngeal settings stands out as a potentially individualizing trait of speakers. This places it in an advantageous situation with respect to other phonetic parameters used in forensic linguistics. However, anyone confronted with its analysis will immediately run into a methodological difficulty stemming from its inherently multidimensional nature. In this lies its main disadvantage and the fundamental reason why its analysis is not always considered in the traditional approach used in the comparison of speakers for identification purposes. Based on an experimental inquiry on voice disguised by means of falsetto, this study shows that it is possible to work with a reduced set of laryngeal features responsible for voice quality and facilitate its interpretation and explanation, which is a critical issue for forensic practice.
Article
Full-text available
Motivation: The exponential growth of genomic sequencing data has created ever-expanding repositories of gene networks. Unsupervised network integration methods are critical to learn informative representations for each gene, which are later used as features for downstream applications. However, these network integration methods must be scalable to account for the increasing number of networks and robust to an uneven distribution of network types within hundreds of gene networks. Results: To address these needs, we present Gemini, a novel network integration method that uses memory-efficient high-order pooling to represent and weight each network according to its uniqueness. Gemini then mitigates the uneven network distribution through mixing up existing networks to create many new networks. We find that Gemini leads to more than a 10% improvement in F1 score, 15% improvement in micro-AUPRC, and 63% improvement in macro-AUPRC for human protein function prediction by integrating hundreds of networks from BioGRID, and that Gemini's performance significantly improves when more networks are added to the input network collection, while Mashup and BIONIC embeddings' performance deteriorates. Gemini thereby enables memory-efficient and informative network integration for large gene networks and can be used to massively integrate and analyze networks in other domains. Availability and implementation: Gemini can be accessed at: https://github.com/MinxZ/Gemini.
Article
Full-text available
Exposure of antimalarial herbal drugs (AMHDs) to ultraviolet radiation (UVR) affects the potency and integrity of the AMHDs. Instant classification of AMHDs exposed to UVR (UVR-AMHDs) from unexposed ones (Non-UVR-AMHDs) would be beneficial for public health safety, especially in warm regions. For the first time, this work combined laser-induced autofluorescence (LIAF) with chemometric techniques to classify UVR-AMHDs and Non-UVR-AMHDs. LIAF spectra were recorded from 200 ml of each of the UVR-AMHDs and Non-UVR-AMHDs. To extract useful information from the spectral fingerprint, principal component (PC) analysis was used. The performance of five chemometric algorithms: random forest (RF), neural network (NN), support vector machine (SVM), linear discriminant analysis (LDA), and k-nearest neighbour (KNN), was compared after optimization by validation. When the standardized spectral data were used as the input variables, KNN, SVM, NN, and RF achieved a classification accuracy of 100% for UVR-AMHDs, while LDA achieved 98.8%. Meanwhile, a classification accuracy of 100% was obtained for KNN, LDA, SVM, and NN when the raw spectral data were used as input, with RF reaching 99.9%. Classification accuracies above 99.74 ± 0.26% at 3 PCs were obtained from the chemometric models in both the training and testing sets. The results showed that LIAF, combined with chemometric techniques, can be used to classify UVR-AMHDs and Non-UVR-AMHDs for consumer confidence in malaria-prone regions. The technique offers a non-destructive, rapid, and viable tool for identifying UVR-AMHDs in resource-poor countries.
Article
Unsupervised feature selection attempts to select a small number of discriminative features from original high-dimensional data and preserve the intrinsic data structure without using data labels. As an unsupervised learning task, most previous methods often use a coefficient matrix for feature reconstruction or feature projection, and a certain similarity graph is widely utilized to regularize the intrinsic structure preservation of original data in a new feature space. However, a similarity graph with poor quality could inevitably affect the final results. In addition, designing a rational and effective feature reconstruction/projection model is not easy. In this paper, we introduce a novel and effective unsupervised feature selection method via multiple graph fusion and feature weight learning (MGF2WL) to address these issues. Instead of learning the feature coefficient matrix, we directly learn the weights of different feature dimensions by introducing a feature weight matrix, and the weighted features are projected into the label space. Aiming to exploit sufficient relation of data samples, we develop a graph fusion term to fuse multiple predefined similarity graphs for learning a unified similarity graph, which is then deployed to regularize the local data structure of original data in a projected label space. Finally, we design a block coordinate descent algorithm with a convergence guarantee to solve the resulting optimization problem. Extensive experiments with sufficient analyses on various datasets are conducted to validate the efficacy of our proposed MGF2WL.
Article
Full-text available
Each year a considerable amount of money is spent in the production of several national and international University rankings that may deeply influence the students’ enrollment. However, all such rankings are based almost exclusively on numerical indicators weakly related to the quality of the learning process and do not consider the perceptions of the “end users”, i.e. the learners. Recently, as part of the activity promoted by the ASLERD (Association for Smart Learning Ecosystems and Regional Development), we have developed an alternative approach to benchmark learning ecosystems. Such novel approach is based on: a) the detection of the degree of satisfaction related to the levels of the Maslow’s Pyramid of needs, and b) the detection of indicators related with the achievement of the state of “flow” by the actors involved in the learning processes. In this paper we report on the first implementation of such benchmarking approach that involved six European Campuses and more than 700 students. The critical analysis of the outcomes allowed us to identify: a) the set of the most relevant indicators; b) a “smartness” axis in the plan of the first two principal components derived by applying a Principal Component Analysis (PCA) to the spaces of the selected indicators.
Article
Full-text available
Dimensionality reduction (DR) methods create 2D scatterplots of high-dimensional data for visual exploration. As such scatterplots are often used to reason about the cluster structure of the data, this requires DR methods with good cluster preservation abilities. Recently, Sharpened DR (SDR) was proposed to enhance the ability of existing DR methods to create scatterplots with good cluster structure. Following this, SDR-NNP was proposed to speed up the computation of SDR by deep learning. However, both SDR and SDR-NNP require careful tuning of four parameters to control the final projection quality. In this work, we extend SDR-NNP to simplify its parameter settings. Our new method retains all the desirable properties of SDR and SDR-NNP. In addition, our method is stable with respect to the settings of all its parameters, making it practically a parameter-free method, and it also increases the quality of the produced projections. We support our claims by extensive evaluations involving multiple datasets, parameter values, and quality metrics.
Article
Full-text available
Multivariate functions have a central place in the development of techniques present in many domains, such as machine learning and optimization research. However, only a few visual techniques exist to help users understand such multivariate problems, especially in the case of functions that depend on complex algorithms and variable constraints. In this paper, we propose a technique that enables the visualization of high-dimensional surfaces defined by such multivariate functions using a two-dimensional pixel map. We demonstrate two variants of it: OptMap, focused on optimization problems, and RegSurf, focused on regression problems in machine learning. Both techniques are simple to implement, computationally efficient, and generic with respect to the nature of the high-dimensional data they address. We show how the two techniques can be used to visually explore a wide variety of optimization and regression problems.
Article
Full-text available
Understanding how a machine learning classifier works is an important task in machine learning engineering. However, doing this for an arbitrary classifier is, in general, difficult. We propose to leverage visualization methods for this task. For this, we extend a recent technique called Decision Boundary Map (DBM), which graphically depicts how a classifier partitions its input data space into decision zones separated by decision boundaries. We use a supervised, GPU-accelerated technique that computes bidirectional mappings between the data and projection spaces to solve several shortcomings of DBM, such as accuracy and speed. We present several experiments showing that the resulting method, SDBM, generates results which are easier to interpret, are far less prone to noise, and compute significantly faster than DBM, while maintaining the genericity and ease of use of DBM for any type of single-output classifier. We also show, in addition to earlier work, that SDBM is stable with respect to various types and amounts of changes of the training set used to construct the visualized classifiers. This property was, to our knowledge, not investigated for any comparable method for visualizing classifier decision maps, and is essential for the deployment of such visualization methods in analyzing real-world classification models.
Article
Full-text available
Analysis of population structure and genomic ancestry remains an important topic in human genetics and bioinformatics. Commonly used methods require high-quality genotype data to ensure accurate inference. However, in practice, laboratory artifacts and outliers are often present in the data. Moreover, existing methods are typically affected by the presence of related individuals in the dataset. In this work, we propose a novel hybrid method, called SAE-IBS, which combines the strengths of traditional matrix decomposition-based (e.g., principal component analysis) and more recent neural network-based (e.g., autoencoders) solutions. Namely, it yields an orthogonal latent space enhancing dimensionality selection while learning non-linear transformations. The proposed approach achieves higher accuracy than existing methods for projecting poor quality target samples (genotyping errors and missing data) onto a reference ancestry space and generates a robust ancestry space in the presence of relatedness. We introduce a new approach and an accompanying open-source program for robust ancestry inference in the presence of missing data, genotyping errors, and relatedness. The obtained ancestry space allows for non-linear projections and exhibits orthogonality with clearly separable population groups.
Article
Clonorchis sinensis is a carcinogenic liver fluke that causes clonorchiasis - a neglected tropical disease (NTD) affecting ~35 million people worldwide. No vaccine is available, and chemotherapy relies on one anthelmintic, praziquantel. This parasite has a complex life history and is known to infect a range of species of intermediate (freshwater snails and fish) and definitive (piscivorous) hosts. Despite this biological complexity and the impact of this biocarcinogenic pathogen, there has been no previous study of molecular variation in this parasite on a genome-wide scale. Here, we conducted the first extensive nuclear genomic exploration of C. sinensis individuals (n = 152) representing five distinct populations from mainland China, and one from Far East Russia, and revealed marked genetic variation within this species between 'northern' and 'southern' geographic regions. The discovery of this variation indicates the existence of biologically distinct variants within C. sinensis, which may have distinct epidemiology, pathogenicity and/or chemotherapeutic responsiveness. The detection of high heterozygosity within C. sinensis specimens suggests that this parasite has developed mechanisms to readily adapt to changing environments and/or host species during its life history/evolution. From an applied perspective, the identification of invariable genes could assist in finding new intervention targets in this parasite, given the major clinical relevance of clonorchiasis. From a technical perspective, the genomic-informatic workflow established herein will be readily applicable to a wide range of other parasites that cause NTDs.
Chapter
Dimensionality reduction (DR) is an essential tool for the visualization of high-dimensional data. The recently proposed Self-Supervised Network Projection (SSNP) method addresses DR with a number of attractive features, such as high computational scalability, genericity, stability and out-of-sample support, computation of an inverse mapping, and the ability of data clustering. Yet, SSNP has an involved computational pipeline using self-supervision based on labels produced by clustering methods and two separate deep learning networks with multiple hyperparameters. In this paper we explore the SSNP method in detail by studying its hyperparameter space and pseudo-labeling strategies. We show how these affect SSNP’s quality and how to set them to optimal values based on extensive evaluations involving multiple datasets, DR methods, and clustering algorithms.
Conference Paper
Full-text available
To solve general multi-objective multigraph shortest path problems, this paper proposes an algorithm (MOMGA*) that incorporates an online likely-admissible learning-based heuristic function to accelerate the solution-finding process. MOMGA* is an extended and generalised version of the airport multi-objective A* (AMOA*) algorithm that is tailored for a specific application problem. The online heuristic function is added and developed using artificial neural networks that estimate the costs between two nodes based on their metrics. To implement this metric-based prediction, a graph embedding technique is adopted to learn node feature representations. Results on a range of benchmark multi-objective multigraphs show that (i) in the absence of heuristic information, MOMGA* can deliver the same Pareto optimal solutions as AMOA* does, while requiring less computational time, and (ii) empowered by the likely-admissible learning- based heuristics, MOMGA* is able to provide a set of optimal and near-optimal solutions and strike a good balance between optimality and tractability.
Chapter
We describe an efficient and scalable spherical graph embedding method. The method uses a generalization of the Euclidean stress function for Multi-Dimensional Scaling adapted to spherical space, where geodesic pairwise distances are employed instead of Euclidean distances. The resulting spherical stress function is optimized by means of stochastic gradient descent. Quantitative and qualitative evaluations demonstrate the scalability and effectiveness of the proposed method. We also show that some graph families can be embedded with lower distortion on the sphere, than in Euclidean and hyperbolic spaces.
Chapter
In machine learning, making a model interpretable for humans is becoming more relevant. Trust in and understanding of a model greatly increase its deployability. Interpretability and explainability are terms that refer to the understanding of a machine learning model. The relation between these two terms and the requirements for data mining tools are not always clearly defined. This chapter provides a framework for interpretability and provides a taxonomy of interpretability based on the literature. Properties of interpretability are related to the domain and to the methods involved. A distinction is made between inherently interpretable models and post-hoc interpretable models, which in the literature are also referred to as explainable models. This overview will argue that inherently interpretable models are more favorable for deployment than explainable models, which are not as reliable as inherently interpretable models.
Article
Full-text available
In Singular Spectrum Analysis (SSA) window length is a critical tuning parameter that must be assigned by the practitioner. This paper provides a theoretical analysis of signal-noise separation and reconstruction in SSA that can serve as a guide to optimal window choice. We establish numerical bounds on the mean squared reconstruction error and present their almost sure limits under very general regularity conditions on the underlying data generating mechanism. We also provide asymptotic bounds for the mean squared separation error. Evidence obtained using simulation experiments indicates that the theoretical properties are reflected in observed behaviour, even in relatively small samples, and the results indicate how an optimal choice for the window length can be made.
Presentation
Photogrammetry is a survey technique that allows the construction of three-dimensional (3D) models from digitized data outputs. In recent years it has established itself as one of the best techniques for building 3D models, widely used in various fields such as life and earth sciences, medicine, architecture, topography, archaeology, crime scene investigation, engineering, etc. In particular, close-range photogrammetry has found several applications in osteological studies, making it possible to create 3D databases available for subsequent qualitative and quantitative studies.
Article
Full-text available
Introduction We sought to develop a novel method for a fully automated, robust quantification of protein biomarker expression within the epithelial component of high-grade serous ovarian tumors (HGSOC). Rather than defining thresholds for a given biomarker, the objective of this study in a small cohort of patients was to develop a method applicable to the many clinical situations in which immunomarkers need to be quantified. We aimed to quantify biomarker expression by correlating it with the heterogeneity of staining, using a non-subjective choice of scoring thresholds based on classical mathematical approaches. This could lead to a universal method for quantifying other immunohistochemical markers to guide pathologists in therapeutic decision-making. Methods We studied a cohort of 25 cases of HGSOC for which three biomarkers predictive of the response observed ex vivo to the BH3 mimetic molecule ABT-737 had been previously validated by a pathologist. We calibrated our algorithms using Stereology analyses performed by two experts to detect immunohistochemical staining and epithelial/stromal compartments. Immunostaining quantification within Stereology grids of hexagons was then performed for each histological slice. To define thresholds from the staining distribution histograms and to classify staining within each hexagon as low, medium, or high, we used the Gaussian Mixture Model (GMM). Results Stereology analysis of this calibration process produced a good correlation between the experts for both epithelium and immunostaining detection. There was also a good correlation between the experts and image processing. Image processing clearly revealed the respective proportions of low, medium, and high areas in a single tumor and showed that this parameter of heterogeneity could be included in a composite score, thus decreasing the level of discrepancy. Therefore, agreement with the pathologist was increased by taking heterogeneity into account. Conclusion and discussion This simple, robust, calibrated method using basic tools and known parameters can be used to quantify and characterize the expression of protein biomarkers within the different tumor compartments. It is based on known mathematical thresholds and takes the intratumoral heterogeneity of staining into account. Although some discrepancies need to be diminished, correlation with the pathologist’s classification was satisfactory. The method is replicable and can be used to analyze other biological and medical issues. This non-subjective technique for assessing protein biomarker expression uses a fully automated choice of thresholds (GMM) and defined composite scores that take the intra-tumor heterogeneity of immunostaining into account. It could help to avoid the misclassification of patients and its subsequent negative impact on therapeutic care.
Article
The emergence of new technologies has changed the way clinicians perform diagnosis. Medical imaging plays a crucial role in this process, given the amount of information that such non-invasive techniques usually provide. Despite the high quality offered by these images and the expertise of clinicians, the diagnostic process is not a straightforward task, since different pathologies can have similar signs and symptoms. For this reason, it is extremely useful to assist this process with an automatic tool that reduces the bias when analyzing this kind of image. In this work, we propose an ensemble classifier based on probabilistic Support Vector Machines (SVM) in order to identify relevant patterns while providing information about the reliability of the classification. Specifically, each image is divided into patches, and the features contained in each of them are extracted by applying kernel principal component analysis (PCA). The use of base classifiers within an ensemble allows our system to identify the informative patterns regardless of their size or location. The decisions of the individual patches are then combined according to the reliability of each individual classification: the lower the uncertainty, the higher the contribution. Performance is evaluated in a real scenario, distinguishing between pneumonia patients and controls from chest Computed Tomography (CCT) images, yielding an accuracy of 97.86%. The high performance obtained and the simplicity of the system (deep learning on CCT images would greatly increase the computational cost) evidence the applicability of our proposal in a real-world environment.
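A loose sketch of the pipeline this abstract describes, with hypothetical patch sizes, synthetic data, and a simple confidence-based weighting rule standing in for the authors' actual settings.

```python
# Sketch under stated assumptions (data, patch layout and weighting rule
# are illustrative, not the published method): kernel PCA features per
# patch, a probabilistic SVM per patch, and a combination that weights
# each patch by the confidence of its own prediction.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(8)
n_images, n_patches, patch_dim = 120, 9, 64
X = rng.normal(size=(n_images, n_patches, patch_dim))   # stand-in patches
y = rng.integers(0, 2, size=n_images)                   # 0 = control, 1 = case

# One base classifier per patch position.
models = [make_pipeline(KernelPCA(n_components=10, kernel="rbf"),
                        SVC(probability=True))
          for _ in range(n_patches)]
for p, model in enumerate(models):
    model.fit(X[:, p, :], y)

# Combine patch decisions, weighting confident patches more heavily.
probs = np.stack([m.predict_proba(X[:, p, :])[:, 1]
                  for p, m in enumerate(models)], axis=1)
confidence = np.abs(probs - 0.5)                         # low uncertainty -> high weight
weights = confidence / confidence.sum(axis=1, keepdims=True)
combined = (probs * weights).sum(axis=1)
prediction = (combined > 0.5).astype(int)
print("training accuracy of the combined decision:", (prediction == y).mean())
```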
Article
Declarative grammar is becoming an increasingly important technique for understanding visualization design spaces. The GoTreeScape system presented in the paper allows users to navigate and explore the vast design space implied by GoTree, a declarative grammar for visualizing tree structures. To provide an overview of the design space, GoTreeScape, which is based on an encoder-decoder architecture, projects the tree visualizations onto a 2D landscape. Significantly, this landscape takes the relationships between different design features into account. GoTreeScape also includes an exploratory framework that allows top-down, bottom-up, and hybrid modes of exploration to support the inherently undirected nature of exploratory searches. Two case studies demonstrate the diversity with which GoTreeScape expands the universe of designed tree visualizations for users. The source code associated with GoTreeScape is available at https://github.com/bitvis2021/gotreescape.
Article
Full-text available
Mirror self-recognition (MSR) assessed by the Mark Test has been the staple test for the study of animal self-awareness. When tested in this paradigm, corvid species return discrepant results, with only the Eurasian magpies and the Indian house crow successfully passing the test so far, whereas multiple other corvid species fail. The lack of replicability of these positive results and the large divergence in applied methodologies calls into question whether the observed differences are in fact phylogenetic or methodological, and, if so, which factors facilitate the expression of MSR in some corvids. In this study, we (1) present new results on the self-recognition abilities of common ravens, (2) replicate results of azure-winged magpies, and (3) compare the mirror responses and performances in the mark test of these two corvid species with a third corvid species: carrion crows, previously tested following the same experimental procedure. Our results show interspecies differences in the approach of and the response to the mirror during the mirror exposure phase of the experiment as well as in the subsequent mark test. However, the performances of these species in the Mark Test do not provide any evidence for their ability of self-recognition. Our results add to the ongoing discussion about the convergent evolution of MSR and we advocate for consistent methodologies and procedures in comparing this ability across species to advance this discussion.
Article
Full-text available
Complexity is one of the major attributes of the visual perception of texture. However, very little is known about how humans visually interpret texture complexity. A psychophysical experiment was conducted to visually quantify the seven texture attributes of a series of textile fabrics: complexity, color variation, randomness, strongness, regularity, repetitiveness, and homogeneity. It was found that the observers could discriminate between the textures with low and high complexity using some high-level visual cues such as randomness, color variation, strongness, etc. The results of principal component analysis (PCA) on the visual scores of the above attributes suggest that complexity and homogeneity could be essentially the underlying attributes of the same visual texture dimension, with complexity at the negative extreme and homogeneity at the positive extreme of this dimension. We chose to call this dimension visual texture complexity. Several texture measures including the first-order image statistics, co-occurrence matrix, local binary pattern, and Gabor features were computed for images of the textiles in sRGB, and four luminance-chrominance color spaces (i.e., HSV, YCbCr, Ohta’s I1I2I3, and CIELAB). The relationships between the visually quantified texture complexity of the textiles and the corresponding texture measures of the images were investigated. Analyzing the relationships showed that simple standard deviation of the image luminance channel had a strong correlation with the corresponding visual ratings of texture complexity in all five color spaces. Standard deviation of the energy of the image after convolving with an appropriate Gabor filter and entropy of the co-occurrence matrix, both computed for the image luminance channel, also showed high correlations with the visual data. In this comparison, sRGB, YCbCr, and HSV always outperformed the I1I2I3 and CIELAB color spaces. The highest correlations between the visual data and the corresponding image texture features in the luminance-chrominance color spaces were always obtained for the luminance channel of the images, and one of the two chrominance channels always performed better than the other. This result indicates that the arrangement of the image texture elements that impacts the observer’s perception of visual texture complexity cannot be represented properly by the chrominance channels. This must be carefully considered when choosing an image channel to quantify the visual texture complexity. Additionally, the good performance of the luminance channel in the five studied color spaces proves that variations in the luminance of the texture, or as one could call the luminance contrast, plays a crucial role in creating visual texture complexity.
Article
Full-text available
Background: Large medical centers in urban areas, like Los Angeles, care for a diverse patient population and offer the potential to study the interplay between genetic ancestry and social determinants of health. Here, we explore the implications of genetic ancestry within the University of California, Los Angeles (UCLA) ATLAS Community Health Initiative, an ancestrally diverse biobank of genomic data linked with de-identified electronic health records (EHRs) of UCLA Health patients (N=36,736).
Methods: We quantify the extensive continental and subcontinental genetic diversity within the ATLAS data through principal component analysis, identity-by-descent, and genetic admixture. We assess the relationship between genetically inferred ancestry (GIA) and >1500 EHR-derived phenotypes (phecodes). Finally, we demonstrate the utility of genetic data linked with EHR to perform ancestry-specific and multi-ancestry genome- and phenome-wide scans across a broad set of disease phenotypes.
Results: We identify 5 continental-scale GIA clusters, including European American (EA), African American (AA), Hispanic Latino American (HL), South Asian American (SAA), and East Asian American (EAA) individuals, and 7 subcontinental GIA clusters within the EAA GIA corresponding to Chinese American, Vietnamese American, and Japanese American individuals. Although we broadly find that self-identified race/ethnicity (SIRE) is highly correlated with GIA, we still observe marked differences between the two, emphasizing that the populations defined by these two criteria are not analogous. We find a total of 259 significant associations between continental GIA and phecodes even after accounting for individuals' SIRE, demonstrating that for some phenotypes, GIA provides information not already captured by SIRE. GWAS identifies significant associations for liver disease in the 22q13.31 locus across the HL and EAA GIA groups (HL p-value=2.32×10⁻¹⁶, EAA p-value=6.73×10⁻¹¹). A subsequent PheWAS at the top SNP reveals significant associations with neurologic and neoplastic phenotypes specifically within the HL GIA group.
Conclusions: Overall, our results explore the interplay between SIRE and GIA within a disease context and underscore the utility of studying the genomes of diverse individuals through biobank-scale genotyping linked with EHR-based phenotyping.
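As a sketch of the PCA step this abstract relies on for inferring ancestry clusters, the following uses purely synthetic genotypes (population sizes, SNP counts, and allele frequencies are invented for illustration): 0/1/2 allele counts are standardized per SNP and the top principal components are taken from an SVD, which separates the two simulated populations along PC1.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_pop, n_snps = 150, 2000

# Two synthetic populations with slightly diverged allele frequencies
freqs_a = rng.uniform(0.05, 0.95, n_snps)
freqs_b = np.clip(freqs_a + rng.normal(0, 0.1, n_snps), 0.01, 0.99)
geno = np.vstack([
    rng.binomial(2, freqs_a, (n_per_pop, n_snps)),
    rng.binomial(2, freqs_b, (n_per_pop, n_snps)),
]).astype(float)

# Standardize each SNP by its estimated allele frequency
p_hat = geno.mean(axis=0) / 2.0
geno_std = (geno - 2 * p_hat) / np.sqrt(2 * p_hat * (1 - p_hat))

# Top principal components via SVD of the standardized genotype matrix
u, s, _ = np.linalg.svd(geno_std, full_matrices=False)
pc_scores = u[:, :2] * s[:2]

# Individuals from the two simulated populations fall at opposite ends of PC1
print(pc_scores[:3].round(2))
print(pc_scores[-3:].round(2))
```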
Article
Sensing solutions provide a rich feature set for electromechanical load monitoring and diagnostics. For example, qualities that describe the operation of an electromechanical load can include measurements of power, torque, vibration, electrical current demand, and electrical harmonic content. If properly interpreted, these measurements can be utilized for energy management, condition-based maintenance, and fault detection and diagnostics. When monitoring several loads from an aggregate data stream, a well-posed feature space will permit not only load identification but also the characterization of faults and gradual changes in the health of an individual machine. Many feature selection methods assume static and generalizable data, without considering concept drift and evolving behavior over time. This paper presents a method for evaluating load separability in a feature space prior to the application of a pattern classifier, while accounting for changing operating conditions and load variability. A four-year load dataset is used to validate the method.
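The paper's specific separability criterion is not spelled out in the abstract, so the following is only a stand-in sketch: it scores class separability in a feature space with a generic Fisher-style ratio of between-class to within-class scatter, which could be recomputed per time window to expose concept drift. The loads, features, and labels are synthetic.

```python
import numpy as np

def fisher_separability(features, labels):
    """Trace of between-class scatter divided by trace of within-class scatter."""
    classes = np.unique(labels)
    overall_mean = features.mean(axis=0)
    d = features.shape[1]
    s_within = np.zeros((d, d))
    s_between = np.zeros((d, d))
    for c in classes:
        x_c = features[labels == c]
        mean_c = x_c.mean(axis=0)
        s_within += (x_c - mean_c).T @ (x_c - mean_c)
        diff = (mean_c - overall_mean)[:, None]
        s_between += len(x_c) * (diff @ diff.T)
    return np.trace(s_between) / np.trace(s_within)

rng = np.random.default_rng(2)
labels = rng.integers(0, 3, 300)                            # three hypothetical loads
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])    # e.g., power vs. harmonic content
features = centers[labels] + rng.normal(0, 1.0, (300, 2))   # synthetic feature vectors

print(f"separability score = {fisher_separability(features, labels):.2f}")
```

A higher ratio suggests the load classes are easier to tell apart before any classifier is trained; tracking the score over successive windows of the data stream gives a rough check for drifting operating conditions.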
Article
Friedman and Tukey (1974) introduced the term "projection pursuit" for a technique for the exploratory analysis of multivariate data sets; the method seeks out "interesting" linear projections of the multivariate data onto a line or a plane. In this paper, we show how to set Friedman and Tukey's idea in a more structured context than they offered. This makes it possible to offer some suggestions for the reformulation of the method, and thence to identify a computationally efficient approach to its implementation. We illustrate its application to empirical data, and discuss its practical attractions and limitations. Extensions by other workers to problems such as non-linear multiple regression and multivariate density estimation are discussed briefly within the same framework.
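As a rough illustration of the idea, and not Friedman and Tukey's actual index or optimizer, the following sketch scores many random unit directions by the absolute excess kurtosis of the resulting one-dimensional projection and keeps the least Gaussian one. The data are synthetic, with a bimodal structure hidden along one coordinate.

```python
import numpy as np

rng = np.random.default_rng(3)

def excess_kurtosis(x):
    """Departure from Gaussian tail behaviour of a 1-D sample."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

def pursue(data, n_directions=2000):
    """Return the random unit direction with the most 'interesting' projection."""
    best_dir, best_score = None, -np.inf
    for _ in range(n_directions):
        w = rng.normal(size=data.shape[1])
        w /= np.linalg.norm(w)
        score = abs(excess_kurtosis(data @ w))
        if score > best_score:
            best_dir, best_score = w, score
    return best_dir, best_score

# Synthetic data: only the first coordinate carries a bimodal structure
x = rng.normal(0, 1, (500, 5))
x[:, 0] += np.where(rng.random(500) < 0.5, -3.0, 3.0)

direction, score = pursue(x)
print("best direction:", direction.round(2), "score:", round(score, 2))
```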
Article
Electronic computers greatly facilitate carrying out factor analysis. Computers will help in solving the communality problem and the question of the number of factors, as well as the question of arbitrary factoring and the problem of rotation. "Cloacal short-cuts will not be necessary and the powerful methods of Guttman will be feasible." A library of programs essential for factor analysis is described, and the use of medium-sized computers such as the IBM 650 is deprecated for factor analysis.