Figure 1 - uploaded by Michael K Winson
Partial Least Squares 1 (PLS1) was used to classify the samples. The PLS1 model was trained on 50 spectra. The chosen model used 11 components, which gave the minimum RMS error in the prediction values (0.3302) for a validation set of 50 previously unseen spectra. This model was then applied to 100 previously unseen spectra, comprising 50 control and 50 saline-grown samples. The plot below shows the model's predicted output values for this test set. 
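The component-selection step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it fits PLS1 with a plain NumPy NIPALS routine, picks the number of components that minimises RMS error on a held-out validation set, and then scores a separate test set. The synthetic spectra, the 0/1 class coding (control = 0, saline-grown = 1), the 0.5 decision threshold, the 100-variable spectrum length, the cap of 15 components, and all function names are hypothetical; only the 50/50/100 split sizes come from the caption.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit PLS1 (single response) with NIPALS; return (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                      # weight vector: covariance of X with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                         # scores
        tt = t @ t
        p = Xk.T @ t / tt                  # X loadings
        q_k = (yk @ t) / tt                # y loading
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - q_k * t                  # deflate y
        W.append(w)
        P.append(p)
        q.append(q_k)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)  # regression vector B = W (P'W)^-1 q
    return x_mean, y_mean, coef


def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def select_components(X_train, y_train, X_val, y_val, max_components):
    """Return the component count with the lowest validation RMS error."""
    errors = []
    for k in range(1, max_components + 1):
        model = pls1_fit(X_train, y_train, k)
        errors.append(rmse(y_val, pls1_predict(model, X_val)))
    best_k = int(np.argmin(errors)) + 1
    return best_k, errors


# Hypothetical synthetic "spectra": two classes that differ along one signature.
rng = np.random.default_rng(0)
n_vars = 100
signature = rng.normal(size=n_vars)


def make_spectra(n):
    y = np.repeat([0.0, 1.0], n // 2)      # assumed coding: 0 = control, 1 = saline
    X = rng.normal(size=(n, n_vars)) + np.outer(y, signature)
    return X, y


X_train, y_train = make_spectra(50)        # training set
X_val, y_val = make_spectra(50)            # validation set for choosing components
X_test, y_test = make_spectra(100)         # unseen test set, 50 per class

best_k, errors = select_components(X_train, y_train, X_val, y_val, max_components=15)
final_model = pls1_fit(X_train, y_train, best_k)
test_pred = pls1_predict(final_model, X_test)
accuracy = float(np.mean((test_pred > 0.5) == (y_test == 1.0)))
print(f"components={best_k}, validation RMSE={errors[best_k - 1]:.4f}, "
      f"test accuracy={accuracy:.2f}")
```

Keeping the validation set separate from the final test set, as the caption describes, means the reported test predictions are not influenced by the choice of component count.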

Source publication
Chapter
Genetic programming, in conjunction with advanced analytical instruments, is a novel tool for the investigation of complex biological systems at the whole-tissue level. In this study, samples from tomato fruit grown hydroponically under both high- and low-salt conditions were analysed using Fourier-transform infrared spectroscopy (FTIR), with the ai...

Citations

... In this study, the evaluation was done by visualizing the data, and it was not clear how the feature selection process was performed or how many GP runs were performed. The analysis of metabolic data using GP was investigated in [22]. The study aimed to detect changes in the levels of biochemicals among thousands of biochemical features. ...
Chapter
Biomarker detection in LC-MS data depends mainly on the feature selection algorithm, as the number of features is extremely high while the number of samples is very small. This makes the classification of these data sets extremely challenging. In this paper we propose the use of genetic programming (GP) for subset feature selection in LC-MS data, which works by maximizing the signal-to-noise ratio of the selected features. The proposed method was applied to eight LC-MS data sets with different sample sizes and different concentration levels of the spiked biomarkers. We evaluated the accuracy of selection against the list of biomarkers and also the classification accuracy of the selected features using support vector machine (SVM) and Naive Bayes (NB) classifiers. Features selected by the proposed GP method achieved perfect classification accuracy for most of the data sets. The results show that the proposed method strikes a reasonable compromise between the detection rate of the biomarkers and the classification accuracy for all data sets. The method was also compared to linear Support Vector Machine-Recursive Feature Elimination (SVM-RFE) and the t-test for feature selection, and the results show that the biomarker detection rate of the proposed approach is higher.
... They help evolve solutions to complex problems that are simple and intelligible, generating equations essentially in the form of rules, thereby having both desirable properties (accuracy and intelligibility) mentioned above. GP has been used successfully by us in identifying metabolites in terms of their involvement in particular processes (Gilbert et al., 1999; Johnson et al., 2000; Kell et al., 2001; Kell, 2002; Allen et al., 2003; Goodacre, 2003; Goodacre and Kell, 2003; Allen et al., 2004). A particular trend is towards voting methods of various kinds (Bauer and Kohavi, 1999; Dietterich, 2000; Breiman, 2001a), in which ensembles of "weak" learners contribute to more robust classifications via a committee voting approach (Bishop, 1995) than is possible with single classifiers alone (Hastie et al., 2001). ...
Article
Metabolomics, like other omics methods, produces huge datasets of biological variables, often accompanied by the necessary metadata. However, regardless of the form in which these are produced they are merely the ground substance for assisting us in answering biological questions. In this short tutorial review and position paper we seek to set out some of the elements of “best practice” in the optimal acquisition of such data, and in the means by which they may be turned into reliable knowledge. Many of these steps involve the solution of what amount to combinatorial optimization problems, and methods developed for these, especially those based on evolutionary computing, are proving valuable. This is done in terms of a “pipeline” that goes from the design of good experiments, through instrumental optimization, data storage and manipulation, the chemometric data processing methods in common use, and the necessary means of validation and cross-validation for giving conclusions that are credible and likely to be robust when applied in comparable circumstances to samples not used in their generation.
Conference Paper
Biomarker detection in LC-MS data depends mainly on feature selection algorithms, as the number of features is extremely high while the number of samples is very small. This makes classification of these data sets extremely challenging. In this paper we propose the use of genetic programming (GP) for subset feature selection in LC-MS data, which works by maximizing the signal-to-noise ratio of the selected features. The proposed method was applied to eight LC-MS data sets with different sample sizes and different concentration levels of the spiked biomarkers. We evaluated the accuracy of selection against the list of biomarkers and also the classification accuracy of the selected features using support vector machine (SVM) and Naive Bayes (NB) classifiers. Features selected by the proposed GP method achieved perfect classification accuracy for most of the data sets. The results show that the proposed method strikes a reasonable compromise between the detection rate of the biomarkers and the classification accuracy for all data sets. The method was also compared to linear Support Vector Machine-Recursive Feature Elimination (SVM-RFE) and the t-test for feature selection, and the results show that the biomarker detection rate of the proposed approach is higher.
Conference Paper
This paper is a manifesto aimed at computer scientists interested in developing and applying scientific discovery methods. It argues that: science is experiencing an unprecedented “explosion” in the amount of available data; traditional data analysis methods cannot deal with this increased quantity of data; there is an urgent need to automate the process of refining scientific data into scientific knowledge; inductive logic programming (ILP) is a data analysis framework well suited for this task; and exciting new scientific discoveries can be achieved using ILP scientific discovery methods. We describe an example of using ILP to analyse a large and complex bioinformatic database that has produced unexpected and interesting scientific results in functional genomics. We then point a possible way forward to integrating machine learning with scientific databases to form intelligent databases.
Article
At present, the assignment of function to novel genes uncovered by the systematic genome-sequencing programmes is a problem. Many studies anticipate that this can be achieved by analysing patterns of gene expression via the transcriptome, proteome and metabolome. Thus, functional genomics is, in part, an exercise in pattern classification. Because many genes have known functional classes, the problem of predicting their functional class is a supervised learning problem. However, most pattern classification methods that have been applied to the problem have been unsupervised clustering methods. Consequently, the best classification tools have not always been used. Furthermore, the present functional classes are suboptimal and new unsupervised clustering methods are needed to improve them. Better-structured functional classes will facilitate the prediction of biochemically testable functions.
Article
Kell, D. B. (2002). Metabolomics and machine learning: explanatory analysis of complex metabolome data using genetic programming to produce simple, robust rules. Molecular Biology Reports, 29(1-2), 237-241.