Article

Knowledge-based analysis of microarray gene expression

... The analysis of microarray expression data through automatic learning methods showed that genes with similar function frequently display similar expression patterns. This suggests a functional relationship among genes whose expression fluctuates in parallel [12–15]. This correlation between biological function and expression pattern suggests that machine learning algorithms could be successfully applied to predict function from expression data [16–20]. ...
... A correlation between expression profile and biological function was also demonstrated in Drosophila [54–57] and humans, although in the latter case somewhat obscured by a less complete functional annotation of the genome [12]. On the other hand, a clear mapping of functional gene groups to expression profiles was demonstrated in rats [14]. Since then, the use of automatic learning methods to assign putative functions to genes, based not only on their expression profiles but also on protein-protein interactions and on structural similarities, has led to a broad diversity of strategies and to a profuse bibliography [16–20, 58]. ...
Article
Full-text available
Background: Assembly and function of neuronal synapses require the coordinated expression of a yet undetermined set of genes. Although roughly a thousand genes are expected to be important for this function in Drosophila melanogaster, just a few hundreds of them are known so far. Results: In this work we trained three learning algorithms to predict a "synaptic function" for genes of Drosophila using data from a whole-body developmental transcriptome published by others. Using statistical and biological criteria to analyze and combine the predictions, we obtained a gene catalogue that is highly enriched in genes of relevance for Drosophila synapse assembly and function but still not recognized as such. Conclusions: The utility of our approach is that it reduces the number of genes to be tested through hypothesis-driven experimentation.
... Brown et al. [3] compare five machine learning techniques to predict functionally distinct gene classes using gene expression data. They find that SVMs outperform the other techniques (i.e. ...
... All three levels have been applied, since prior research suggests no clear preference for a particular classification level for the problem considered here. Different machine learning algorithms have been proposed [3,4,5,6,7,8] as being superior to others for processing microarray data. Considering all the aspects described above, different models for regression and classification have been selected for this study. ...
Article
Full-text available
Model-based prediction is dependent on many choices ranging from the sample collection and prediction endpoint to the choice of algorithm and its parameters. Here we studied the effects of such choices, exemplified by predicting sensitivity (as IC50) of cancer cell lines towards a variety of compounds. For this, we used three independent sample collections and applied several machine learning algorithms for predicting a variety of endpoints for drug response. We compared all possible models for combinations of sample collections, algorithm, drug, and labeling to an identically generated null model. The predictability of treatment effects varies among compounds, i.e. response could be predicted for some but not for all. The choice of sample collection plays a major role towards lowering the prediction error, as does sample size. However, we found that no algorithm was able to consistently outperform the others and there was no significant difference between regression and two- or three-class predictors in this experimental setting. These results indicate that response-modeling projects should direct efforts mainly towards sample collection and data quality, rather than method adjustment.
... The points on the boundaries are called support vectors, while our optimal separating hyperplane is located in the middle of the margin. The literature suggests SVM classifiers have superior and robust performance in identifying predictive biomarkers in the setting of high-dimensional microarray gene expression data [14–18]. To overcome overfitting due to the small number of arrays and the large number of features, we also applied the recursive feature elimination (SVM-RFE) algorithm [19] to the SVM to remove the features with the smallest ranking criterion, which corresponds to the components of the SVM weight vector that are smallest in absolute value. ...
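The SVM-RFE procedure described in this snippet can be sketched as follows with scikit-learn; the data, feature counts, and parameters below are illustrative placeholders, not the study's arrays.

```python
# Minimal SVM-RFE sketch (illustrative data; not the study's arrays or settings).
# At each step the linear-SVM weight vector is inspected and the features with
# the smallest |w| components are eliminated, as described in the snippet above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Toy stand-in for a microarray matrix: 40 "arrays" x 500 "probes".
X, y = make_classification(n_samples=40, n_features=500, n_informative=10,
                           random_state=0)

svm = SVC(kernel="linear", C=1.0)            # RFE needs a linear kernel for coef_
rfe = RFE(estimator=svm, n_features_to_select=20, step=0.1)  # drop 10% per round
rfe.fit(X, y)

selected = np.where(rfe.support_)[0]
print("retained probe indices:", selected)
print("training accuracy on the reduced panel:",
      rfe.estimator_.score(X[:, selected], y))
```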
Article
Full-text available
Background Chronic Lung Allograft Dysfunction (CLAD) is the main limitation to long-term survival after lung transplantation. Although CLAD is usually not responsive to treatment, earlier identification may improve treatment prospects. Methods In a nested case control study, 1-year post transplant surveillance bronchoalveolar lavage (BAL) fluid samples were obtained from incipient CLAD (n = 9) and CLAD free (n = 8) lung transplant recipients. Incipient CLAD cases were diagnosed with CLAD within 2 years, while controls were free from CLAD for at least 4 years following bronchoscopy. Transcription profiles in the BAL cell pellets were assayed with the HG-U133 Plus 2.0 microarray (Affymetrix). Differential gene expression analysis, based on an absolute fold change (incipient CLAD vs no CLAD) >2.0 and an unadjusted p-value ≤0.05, generated a candidate list containing 55 differentially expressed probe sets (51 up-regulated, 4 down-regulated). Results The cell pellets in incipient CLAD cases were skewed toward immune response pathways, dominated by genes related to recruitment, retention, activation and proliferation of cytotoxic lymphocytes (CD8⁺ T-cells and natural killer cells). Both hierarchical clustering and a supervised machine learning tool were able to correctly categorize most samples (82.3% and 94.1% respectively) into incipient CLAD and CLAD-free categories. Conclusions These findings suggest that a pathobiology, similar to AR, precedes a clinical diagnosis of CLAD. A larger prospective investigation of the BAL cell pellet transcriptome as a biomarker for CLAD risk stratification is warranted.
... Most protein-protein interaction (PPI)-based bioinformatics studies for predicting disease related genes are based on direct PPIs, although the systems and dimensions considered vary. A machine-learning approach [9] to analyzing protein-protein interaction data has become popular and has been applied to diverse biological problems, including gene classification [10], prediction of function, and cancer tissue classification. Some approaches used to predict disease genes are based on using combined PPI network topological features [11, 12] to construct a combined classifier, or on analysis of protein sequences [5]. ...
Article
Full-text available
Abdominal aortic aneurysm (AAA) is frequently lethal and has no effective pharmaceutical treatment, posing a great threat to human health. Previous bioinformatics studies of the mechanisms underlying AAA relied largely on the detection of direct protein-protein interactions (level-1 PPI) between the products of reported AAA-related genes. Thus, some proteins not suspected to be directly linked to previously reported genes of pivotal importance to AAA might have been missed. In this study, we constructed an indirect protein-protein interaction (level-2 PPI) network based on common interacting proteins encoded by known AAA-related genes and successfully predicted previously unreported AAA-related genes using this network. We used four methods to test and verify the performance of this level-2 PPI network: cross validation, human AAA mRNA chip array comparison, literature mining, and verification in a mouse CaPO4 AAA model. We confirmed that the new level-2 PPI network is superior to the original level-1 PPI network and proved that the top 100 candidate genes predicted by the level-2 PPI network shared similar GO functions and KEGG pathways compared with positive genes.
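As a rough illustration of the level-2 idea described in this abstract, the sketch below links two seed genes whenever they share a common level-1 interactor; all gene and protein identifiers are made up, and networkx is assumed to be available.

```python
# Sketch of building a level-2 (indirect) PPI network from level-1 interactions:
# two seed genes are joined when they share at least one common level-1
# interactor. All identifiers are illustrative.
import itertools
import networkx as nx

level1 = nx.Graph()
level1.add_edges_from([
    ("GENE_A", "P1"), ("GENE_B", "P1"),        # P1 is a common interactor
    ("GENE_B", "P2"), ("GENE_C", "P2"),
    ("GENE_C", "P3"),                          # P3 links to only one seed
])
seeds = {"GENE_A", "GENE_B", "GENE_C"}

level2 = nx.Graph()
level2.add_nodes_from(seeds)
for g1, g2 in itertools.combinations(seeds, 2):
    shared = set(level1.neighbors(g1)) & set(level1.neighbors(g2))
    if shared:                                  # indirect link via common partner(s)
        level2.add_edge(g1, g2, via=sorted(shared))

print(level2.edges(data=True))
```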
... Three methods, SIFTER [17], FlowerPower [18], and Orthostrapper [19], use phylogenetic trees to transfer functions to target genes in the evolutionary context. There are other function prediction methods that consider coexpression patterns of genes [20–24], 3D structures of proteins [25–34], and interacting proteins in large-scale protein-protein interaction networks [35–40]. To evaluate the function prediction performances of AFP methods on a large scale, the Critical Assessment of Function Annotation (CAFA) was developed as a community-wide experiment [41]. ...
Article
Full-text available
Background: Functional annotation of novel proteins is one of the central problems in bioinformatics. With the ever-increasing development of genome sequencing technologies, more and more sequence information is becoming available to analyze and annotate. To achieve fast and automatic function annotation, many computational (automated) function prediction (AFP) methods have been developed. To objectively evaluate the performance of such methods on a large scale, community-wide assessment experiments have been conducted. The second round of the Critical Assessment of Function Annotation (CAFA) experiment was held in 2013-2014. Evaluation of participating groups was reported in a special interest group meeting at the Intelligent Systems in Molecular Biology (ISMB) conference in Boston in 2014. Our group participated in both CAFA1 and CAFA2 using multiple, in-house AFP methods. Here, we report benchmark results of our methods obtained in the course of preparation for CAFA2 prior to submitting function predictions for CAFA2 targets. Results: For CAFA2, we updated the annotation databases used by our methods, protein function prediction (PFP) and extended similarity group (ESG), and benchmarked their function prediction performances using the original (older) and updated databases. Performance evaluation for PFP with different settings and ESG are discussed. We also developed two ensemble methods that combine function predictions from six independent, sequence-based AFP methods. We further analyzed the performances of our prediction methods by enriching the predictions with prior distribution of gene ontology (GO) terms. Examples of predictions by the ensemble methods are discussed. Conclusions: Updating the annotation database was successful, improving the Fmax prediction accuracy score for both PFP and ESG. Adding the prior distribution of GO terms did not make much improvement. Both of the ensemble methods we developed improved the average Fmax score over all individual component methods except for ESG. Our benchmark results will not only complement the overall assessment that will be done by the CAFA organizers, but also help elucidate the predictive powers of sequence-based function prediction methods in general.
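For readers unfamiliar with the Fmax score discussed above, the following is a simplified sketch of how it can be computed from per-protein GO-term scores; it omits the ontology propagation used in the full CAFA protocol, and the toy predictions are invented for illustration.

```python
# Simplified Fmax calculation over prediction scores for GO terms
# (no ontology propagation; toy data, not CAFA targets).
import numpy as np

true_terms = {"prot1": {"GO:1", "GO:2"}, "prot2": {"GO:3"}}
pred_scores = {"prot1": {"GO:1": 0.9, "GO:4": 0.3},
               "prot2": {"GO:3": 0.6, "GO:2": 0.2}}

def fmax(true_terms, pred_scores, thresholds=np.linspace(0.01, 1.0, 100)):
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for prot, truth in true_terms.items():
            pred = {g for g, s in pred_scores.get(prot, {}).items() if s >= t}
            tp = len(pred & truth)
            if pred:                  # precision only over proteins with predictions
                precisions.append(tp / len(pred))
            recalls.append(tp / len(truth))
        if precisions:
            p, r = np.mean(precisions), np.mean(recalls)
            if p + r > 0:
                best = max(best, 2 * p * r / (p + r))
    return best

print("Fmax =", fmax(true_terms, pred_scores))
```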
... Predictions used a cut-off significance p-value of 0.0001 (for dose-rate and irradiation classification) and 0.001 (for dose classification) between classes to determine the classifier gene set. Support Vector Machines [31] were used for classification of samples between two categories, and Diagonal Linear Discriminant Analysis, which avoids complex models with excessive parameters so as to prevent overfitting without loss of performance [32], was used for classifications with more than two categories. The algorithms tested the classifier gene set for accuracy, sensitivity, and specificity [24], and we used the leave-one-out cross-validation method to compute misclassification rates. ...
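A minimal sketch of the leave-one-out validation step mentioned in that passage, using a linear SVM for a two-category comparison on synthetic data (the Diagonal Linear Discriminant Analysis used for the multi-category cases is not reproduced here):

```python
# Leave-one-out cross-validation of a two-class SVM classifier,
# reporting the misclassification rate (toy data, illustrative settings).
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=30, n_features=200, n_informative=8,
                           random_state=1)

scores = cross_val_score(SVC(kernel="linear"), X, y, cv=LeaveOneOut())
print("misclassification rate:", 1.0 - scores.mean())
```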
Article
Full-text available
The effects of dose-rate and its implications for radiation biodosimetry methods are not well studied in the context of large-scale radiological scenarios. There are significant health risks to individuals exposed to an acute dose, but a realistic scenario would include exposure to both high and low dose-rates, from both external and internal radioactivity. It is therefore important to understand the biological response to prolonged exposure and, further, to discover biomarkers that can be used to estimate damage from low dose-rate exposures and propose appropriate clinical treatment. We irradiated human whole blood ex vivo to three doses, 0.56 Gy, 2.23 Gy and 4.45 Gy, using two dose rates: acute (1.03 Gy/min) and low dose-rate (3.1 mGy/min). After 24 h, we isolated RNA from blood cells and these were hybridized to Agilent Whole Human Genome microarrays. We validated the microarray results using qRT-PCR. Microarray results showed that there were 454 significantly differentially expressed genes after prolonged exposure to all doses. After acute exposure, 598 genes were differentially expressed in response to all doses. Gene ontology terms enriched in both sets of genes were related to immune processes and B-cell mediated immunity. Genes responding to acute exposure were also enriched in functions related to natural killer cell activation and cell-to-cell signaling. As expected, the p53 pathway was found to be significantly enriched at all doses and by both dose-rates of radiation. A support vector machine classifier was able to distinguish between dose-rates with 100% accuracy using leave-one-out cross-validation. In this study we found that low dose-rate exposure can result in distinctive gene expression patterns compared with acute exposures. We were able to successfully distinguish low dose-rate exposed samples from acute dose exposed samples at 24 h, using a gene expression-based classifier. These genes are candidates for further testing as markers to classify exposure based on dose-rate.
... The Support Vector Machine (SVM) was originally proposed for solving binary classification problems by Cortes and Vapnik [1, 2], and then extended by Hsu and Lin [3, 4] for dealing with multi-class classification problems by constructing one binary classifier for each pair of distinct classes. Based on the Vapnik-Chervonenkis (VC) dimension theory [5] and structural risk minimization (SRM) [6], SVM has been successfully applied to address small-sample, nonlinear, and high-dimensional learning problems such as text categorization [7–9], pattern recognition [10–12], time-series prediction [13, 14], gene expression profile analysis [15–17], and protein analysis [4, 18]. SVM classifies data objects by identifying the optimal separating hyperplanes among classes. ...
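The pairwise (one-versus-one) construction described above, which LIBSVM-style implementations also use internally, can be sketched as follows on synthetic data:

```python
# One-vs-one multi-class SVM: one binary classifier per pair of classes
# (3 classes -> 3 pairwise classifiers). Data are synthetic.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=20, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print("number of pairwise classifiers:", len(ovo.estimators_))   # 3
print("training accuracy:", ovo.score(X, y))
```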
Article
Full-text available
Choosing an appropriate kernel is critical when classifying a new problem with a Support Vector Machine. So far, more attention has been paid to constructing new kernels and choosing suitable parameter values for a specific kernel function than to kernel selection. Furthermore, most current kernel selection methods focus on seeking the best kernel with the highest classification accuracy via cross-validation; they are time-consuming and ignore the differences in the number of support vectors and in the CPU time of SVM with different kernels. Considering the tradeoff between classification success ratio and CPU time, there may be multiple kernel functions performing equally well on the same classification problem. Aiming to automatically select appropriate kernel functions for a given data set, we propose a multi-label learning based kernel recommendation method built on data characteristics. For each data set, a meta-knowledge data base is first created by extracting the feature vector of data characteristics and identifying the corresponding applicable kernel set. Then the kernel recommendation model is constructed on the generated meta-knowledge data base with the multi-label classification method. Finally, appropriate kernel functions are recommended for a new data set by the recommendation model according to the characteristics of the new data set. Extensive experiments over 132 UCI benchmark data sets, with five different types of data set characteristics, eleven typical kernels (Linear, Polynomial, Radial Basis Function, Sigmoidal function, Laplace, Multiquadric, Rational Quadratic, Spherical, Spline, Wave and Circular), and five multi-label classification methods demonstrate that, compared with existing kernel selection methods and the most widely used RBF kernel function, SVM with the kernel function recommended by our proposed method achieved the highest classification performance.
... The SVM technique is a useful tool for data classification and regression, and has become important in machine learning and data mining. In general, SVM performs better when compared with existing methods such as neural networks and decision trees [29–31]. Recently, the application of SVM in medicine has grown rapidly. ...
Article
Full-text available
Polysomnography (PSG) is treated as the gold standard for diagnosing obstructive sleep apnea (OSA). However, it is labor-intensive, time-consuming, and expensive. This study evaluates the validity of overnight pulse oximetry as a diagnostic tool for moderate to severe OSA patients. A total of 699 patients with possible OSA were recruited for overnight oximetry and PSG examination at the Sleep Center of a University Hospital from Jan. 2004 to Dec. 2005. By excluding 23 patients with poor oximetry recordings, poor EEG signals, or respiratory artifacts resulting in a total recording time of less than 3 hours; 12 patients with total sleeping time (TST) of less than 1 hour, possibly because of insomnia; and 48 patients whose ages were less than 20 or more than 85 years, data from 616 patients were used for further study. By further excluding 76 patients with TST < 4 h, a group of 540 patients with TST ≥ 4 h was used to study the effect of insufficient sleeping time. An Alice 4 PSG recorder (Respironics Inc., USA) was used to monitor patients with suspected OSA and to record their PSG data. After statistical analysis and feature selection, models built on support vector machines (SVM) were then used to diagnose severe and moderate-to-severe OSA patients with thresholds of AHI = 30 and AHI = 15, respectively. The SVM models designed based on the oxyhemoglobin desaturation index (ODI) derived from oximetry measurements provided an accuracy of 90.42-90.55%, a sensitivity of 89.36-89.87%, a specificity of 91.08-93.05%, and an area under the ROC curve (AUC) of 0.953-0.957 for the diagnosis of severe OSA patients, and achieved an accuracy of 87.33-87.77%, a sensitivity of 87.71-88.53%, a specificity of 86.38-86.56%, and an AUC of 0.921-0.924 for the diagnosis of moderate-to-severe OSA patients. The predictive performance of ODI for diagnosing severe OSA patients is better than for diagnosing moderate-to-severe OSA patients. Overnight pulse oximetry provides satisfactory diagnostic performance in detecting severe OSA patients. Home-style oximetry may be a tool for severe OSA diagnosis.
... However, such studies are commonly associated with the overfitting problem [50,51]. To overcome this, we used SVM to construct the classifier model [52–54], coupled with a procedure of 10-fold cross-validation (in which our samples were partitioned into randomly assigned training and testing sets for the model to be validated 10 times) and 200 rounds of bootstrap resampling (in which the partitioning and cross-validation were randomized and repeated 200 times). Such procedures help reduce overfitting and provide a reliable estimate of the performance of the model [55]. ...
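A sketch of the validation scheme described in this snippet: 10-fold cross-validation whose partitioning is re-randomized many times to stabilize the performance estimate. The repeated re-partitioning stands in for the study's 200-round bootstrap resampling, and the data and repeat count here are purely illustrative.

```python
# 10-fold cross-validation repeated with re-randomized partitions to obtain
# a more stable performance estimate (toy data; 20 repeats instead of 200).
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=34, n_informative=5,
                           random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```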
Article
Full-text available
Studies of methylation biomarkers for cervical cancer often involved only few randomly selected CpGs per candidate gene analyzed by methylation-specific PCR-based methods, with often inconsistent results from different laboratories. We evaluated the role of different CpGs from multiple genes as methylation biomarkers for high-grade cervical intraepithelial neoplasia (CIN). We applied a mass spectrometry-based platform to survey the quantitative methylation levels of 34 CpG units from SOX1, PAX1, NKX6-1, LMX1A, and ONECUT1 genes in 100 cervical formalin-fixed paraffin-embedded (FFPE) tissues. We then used nonparametric statistics and Random Forest algorithm to rank significant CpG methylations and support vector machine with 10-fold cross validation and 200 times bootstrap resampling to build a predictive model separating CIN II/III from CIN I/normal subjects. We found only select CpG units showed significant differences in methylation between CIN II/III and CIN I/normal groups, while mean methylation levels per gene were similar between the two groups for each gene except PAX1. An optimal classification model involving five CpG units from SOX1, PAX1, NKX6-1, and LMX1A achieved 81.2% specificity, 80.4% sensitivity, and 80.8% accuracy. Our study suggested that during CIN development, the methylation of CpGs within CpG islands is not uniform, with varying degrees of significance as biomarkers. Our study emphasizes the importance of not only methylated marker genes but also specific CpGs for identifying high-grade CINs. The 5-CpG classification model provides a promising biomarker panel for the early detection of cervical cancer.
... Data modelling methods based on machine learning, such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM), have been extensively used in bioinformatics and molecular biology [15,16,17]. More recently, these techniques have been introduced to solve medical classification and medical prediction problems and aid clinical decision making [18,19,20,21]. ...
Article
Full-text available
Epidemiological evidence suggests that vitamin D deficiency is linked to various chronic diseases. However direct measurement of serum 25-hydroxyvitamin D (25(OH)D) concentration, the accepted biomarker of vitamin D status, may not be feasible in large epidemiological studies. An alternative approach is to estimate vitamin D status using a predictive model based on parameters derived from questionnaire data. In previous studies, models developed using Multiple Linear Regression (MLR) have explained a limited proportion of the variance and predicted values have correlated only modestly with measured values. Here, a new modelling approach, nonlinear radial basis function support vector regression (RBF SVR), was used in prediction of serum 25(OH)D concentration. Predicted scores were compared with those from a MLR model. Determinants of serum 25(OH)D in Caucasian adults (n = 494) that had been previously identified were modelled using MLR and RBF SVR to develop a 25(OH)D prediction score and then validated in an independent dataset. The correlation between actual and predicted serum 25(OH)D concentrations was analysed with a Pearson correlation coefficient. Better correlation was observed between predicted scores and measured 25(OH)D concentrations using the RBF SVR model in comparison with MLR (Pearson correlation coefficient: 0.74 for RBF SVR; 0.51 for MLR). The RBF SVR model was more accurately able to identify individuals with lower 25(OH)D levels (<75 nmol/L). Using identical determinants, the RBF SVR model provided improved prediction of serum 25(OH)D concentrations and vitamin D deficiency compared with a MLR model, in this dataset.
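The comparison described in this abstract, RBF-kernel support vector regression against multiple linear regression scored by Pearson correlation on held-out data, can be sketched as below; the synthetic data, determinants, and hyperparameters are not those of the study.

```python
# RBF support vector regression vs. multiple linear regression for a
# continuous outcome, compared by Pearson correlation (synthetic data).
from scipy.stats import pearsonr
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=494, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "MLR": LinearRegression(),
    "RBF SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    r, _ = pearsonr(y_te, pred)
    print(f"{name}: Pearson r = {r:.2f}")
```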
... SVM is a machine learning tool that is being extensively used for classification and optimization of complex problems. It is particularly attractive to biological sequence analysis due to its ability to handle noise, large datasets, large input spaces and high variability [46,47]. In this study all of the SVM models have been developed using libSVM [48]. ...
Article
Full-text available
The diversity of functions carried out by EF hand-containing calcium-binding proteins is due to various interactions made by these proteins as well as the range of affinity levels for Ca2+ displayed by them. However, accurate methods are not available for prediction of binding affinities. Here, amino acid patterns of canonical EF hand sequences obtained from available crystal structures were used to develop a classifier that distinguishes Ca2+-binding loops and non Ca2+-binding regions with 100% accuracy. To investigate further, we performed a proteome-wide prediction for E. histolytica, and classified known EF-hand proteins. We compared our results with published methods on the E. histolytica proteome scan, and demonstrated our method to be more specific and accurate for predicting potential canonical Ca2+-binding loops. Furthermore, we annotated canonical EF-hand motifs and classified them based on their Ca2+-binding affinities using support vector machines. Using a novel method generated from position-specific scoring metrics and then tested against three different experimentally derived EF-hand-motif datasets, predictions of Ca2+-binding affinities were between 87 and 90% accurate. Our results show that the tool described here is capable of predicting Ca2+-binding affinity constants of EF-hand proteins. The web server is freely available at http://202.41.10.46/calb/index.html.
... DNA microarray analysis is an important tool in medicine and the life sciences, because it simultaneously measures the expression levels of thousands of genes. In the past few years, many multivariate data analysis methods have been developed and applied to extract the full potential from microarray experiments, including cluster analysis [1,2], support vector machines (SVM) [3,4], self-organizing maps (SOMs) [5,6], artificial neural networks (ANN) [7,8], partial least squares (PLS) [9], and non-negative matrix factorization (NMF) [10,11,12,13,14]. Cluster analysis may be the most widely used method, at least in part because it generates an intuitive tree to visualize clusters and, as an unsupervised technique, groups samples and genes into different clusters. ...
Article
Full-text available
DNA microarray analysis is characterized by obtaining a large number of gene variables from a small number of observations. Cluster analysis is widely used to analyze DNA microarray data to make classification and diagnosis of disease. Because there are so many irrelevant and insignificant genes in a dataset, a feature selection approach must be employed in data analysis. The performance of cluster analysis of this high-throughput data depends on whether the feature selection approach chooses the most relevant genes associated with disease classes. Here we proposed a new method using multiple Orthogonal Partial Least Squares-Discriminant Analysis (mOPLS-DA) models and S-plots to select the most relevant genes to conduct three-class disease classification and prediction. We tested our method using Golub's leukemia microarray data. For three classes with subtypes, we proposed hierarchical orthogonal partial least squares-discriminant analysis (OPLS-DA) models and S-plots to select features for two main classes and their subtypes. For three classes in parallel, we employed three OPLS-DA models and S-plots to choose marker genes for each class. The power of feature selection to classify and predict three-class disease was evaluated using cluster analysis. Further, the general performance of our method was tested using four public datasets and compared with those of four other feature selection methods. The results revealed that our method effectively selected the most relevant features for disease classification and prediction, and its performance was better than that of the other methods.
... specialized vacuoles in phagocytic cells) [48] to kill engulfed cells [49]. In cnidarians, ROS production has been observed in the hydroid Hydra vulgaris exposed to the immune stimulant lipopolysaccharide (LPS) [50] and in reef corals during thermal and UV-induced bleaching [51,52], possibly due to the breakdown of the mitochondrial and photosynthetic membranes [53,54]. In WBD-infected corals, it is possible that phagocytosis is aimed either at the removal of invading pathogens or at clearing damaged apoptotic cells [55]. ...
Article
Full-text available
Coral diseases are among the most serious threats to coral reefs worldwide, yet most coral diseases remain poorly understood. How the coral host responds to pathogen infection is an area where very little is known. Here we used next-generation RNA-sequencing (RNA-seq) to produce a transcriptome-wide profile of the immune response of the Staghorn coral Acropora cervicornis to White Band Disease (WBD) by comparing infected versus healthy (asymptomatic) coral tissues. The transcriptome of A. cervicornis was assembled de novo from A-tail selected Illumina mRNA-seq data from whole coral tissues, and parsed bioinformatically into coral and non-coral transcripts using existing Acropora genomes in order to identify putative coral transcripts. Differentially expressed transcripts were identified in the coral and non-coral datasets to identify genes that were up- and down-regulated due to disease infection. RNA-seq analyses indicate that infected corals exhibited significant changes in gene expression across 4% (1,805 out of 47,748 transcripts) of the coral transcriptome. The primary response to infection included transcripts involved in macrophage-mediated pathogen recognition and ROS production, two hallmarks of phagocytosis, as well as key mediators of apoptosis and calcium homeostasis. The strong up-regulation of the enzyme allene oxide synthase-lipoxygenase suggests a key role of the allene oxide pathway in coral immunity. Interestingly, none of the three primary innate immune pathways - Toll-like receptors (TLR), Complement, and prophenoloxydase pathways, were strongly associated with the response of A. cervicornis to infection. Five-hundred and fifty differentially expressed non-coral transcripts were classified as metazoan (n = 84), algal or plant (n = 52), fungi (n = 24) and protozoans (n = 13). None of the 52 putative Symbiodinium or algal transcript had any clear immune functions indicating that the immune response is driven by the coral host, and not its algal symbionts.
... Genes identified from the data of monkeys (CASP1, CD38, LAG3, SOCS1, EEIFD, and TNFSF13B) were deemed factors affecting the speed of disease progression [12]. Machine learning has recently been shown to be an effective strategy for accurate classification of phenotypes based on transcriptome data (gene expression microarrays) [13,14,15,16,17]. Among these methods, the minimum redundancy maximum relevance (mRMR) method is robust and represents a broad spectrum of characteristics [18,19]. ...
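A bare-bones version of the mRMR idea mentioned above (relevance of each feature to the class label minus its average redundancy with already-selected features, both estimated by mutual information); this is a simplified illustration on synthetic data, not the implementation used in the cited work.

```python
# Greedy mRMR feature selection: at each step pick the feature with the
# highest (relevance - mean redundancy) score, where both terms are
# mutual-information estimates (simplified illustration, synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = make_classification(n_samples=100, n_features=30, n_informative=6,
                           random_state=0)

relevance = mutual_info_classif(X, y, random_state=0)       # MI(feature, class)
selected, remaining = [], list(range(X.shape[1]))

for _ in range(8):                                           # pick 8 features
    scores = []
    for j in remaining:
        if selected:
            red = np.mean([mutual_info_regression(X[:, [j]], X[:, s],
                                                  random_state=0)[0]
                           for s in selected])
        else:
            red = 0.0
        scores.append(relevance[j] - red)
    best = remaining[int(np.argmax(scores))]
    selected.append(best)
    remaining.remove(best)

print("mRMR-selected feature indices:", selected)
```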
Article
Full-text available
Since statistical relationships between HIV load and CD4+ T cell loss have been demonstrated to be weak, searching for host factors contributing to the pathogenesis of HIV infection becomes a key point for both understanding the disease pathology and developing treatments. We applied the Maximum Relevance Minimum Redundancy (mRMR) algorithm to a set of microarray data generated from the CD4+ T cells of viremic non-progressors (VNPs) and rapid progressors (RPs) to identify host factors associated with the different responses to HIV infection. Using the mRMR algorithm, 147 genes were identified. Furthermore, we constructed a weighted molecular interaction network with the existing protein-protein interaction data from the STRING database and identified 1331 genes on the shortest paths among the genes identified with mRMR. Functional analysis shows that functions relating to apoptosis play important roles during the pathogenesis of HIV infection. These results bring new insights into the understanding of HIV progression.
... For that, a set of genes that share a common function, for example genes coding for ribosomal proteins, is included in the positive training set, and a separate ensemble of genes known not to code for ribosomal proteins is defined as the negative set. An SVM classifier (SVMC) was applied in [251] and multilayer perceptrons in [252] to predict the functions of yeast genes based on gene expression data. More recently, a modified K-NN learning algorithm was proposed in [253]. ...
Article
Full-text available
Protein function prediction is one of the most challenging problems in the post-genomic era. With the advances in high-throughput techniques, the number of newly identified proteins has been increasing exponentially. However, the functional characterization of these new proteins has not increased in the same proportion. To fill this gap, a large number of computational methods have been proposed in the literature. Early approaches explored homology relationships to associate known functions with newly discovered proteins. Nevertheless, these approaches tend to fail when a new protein is considerably different (divergent) from other known ones. Accordingly, more accurate approaches that use expressive data representations and explore sophisticated computational techniques are urgently required. Regarding these points, this review provides a comprehensible description of machine learning approaches that are currently applied to protein function prediction problems. We start by defining several problems involved in understanding protein function and describing how machine learning can be applied to these problems. We aim to expose, in a systematic framework, the role of these techniques in protein function inference, which is sometimes difficult to follow due to the rapid evolution of the field. With this purpose in mind, we highlight the most representative contributions and recent advancements, and provide an insightful categorization and classification of machine learning methods in functional proteomics.
... To classify each image, a support vector machine (SVM) was used [27]. An SVM maps input feature vectors into a higher-dimensional space and constructs an optimal hyperplane separating a set of training data into two groups [28–30]. The authors used a Gaussian Radial Basis Function (RBF) kernel with a default scaling factor (sigma) of 1. ...
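A quick sketch of such an RBF-kernel SVM. Note that the "scaling factor" sigma convention differs between libraries: for a Gaussian kernel written as exp(-||x-z||^2 / (2*sigma^2)), sigma = 1 corresponds to gamma = 0.5 in scikit-learn's exp(-gamma*||x-z||^2) form (for the exp(-||x-z||^2 / sigma^2) convention it would be gamma = 1). The data are synthetic.

```python
# SVM with a Gaussian (RBF) kernel. With the kernel written as
# exp(-||x - z||^2 / (2 * sigma^2)), sigma = 1 corresponds to gamma = 0.5
# in scikit-learn's exp(-gamma * ||x - z||^2) parameterization.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=15, random_state=0)

sigma = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), C=1.0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```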
Article
Full-text available
Liquid-based cytology (LBC) in conjunction with Whole-Slide Imaging (WSI) enables the objective and sensitive and quantitative evaluation of biomarkers in cytology. However, the complex three-dimensional distribution of cells on LBC slides requires manual focusing, long scanning-times, and multi-layer scanning. Here, we present a solution that overcomes these limitations in two steps: first, we make sure that focus points are only set on cells. Secondly, we check the total slide focus quality. From a first analysis we detected that superficial dust can be separated from the cell layer (thin layer of cells on the glass slide) itself. Then we analyzed 2,295 individual focus points from 51 LBC slides stained for p16 and Ki67. Using the number of edges in a focus point image, specific color values and size-inclusion filters, focus points detecting cells could be distinguished from focus points on artifacts (accuracy 98.6%). Sharpness as total focus quality of a virtual LBC slide is computed from 5 sharpness features. We trained a multi-parameter SVM classifier on 1,600 images. On an independent validation set of 3,232 cell images we achieved an accuracy of 94.8% for classifying images as focused. Our results show that single-layer scanning of LBC slides is possible and how it can be achieved. We assembled focus point analysis and sharpness classification into a fully automatic, iterative workflow, free of user intervention, which performs repetitive slide scanning as necessary. On 400 LBC slides we achieved a scanning-time of 13.9±10.1 min with 29.1±15.5 focus points. In summary, the integration of semantic focus information into whole-slide imaging allows automatic high-quality imaging of LBC slides and subsequent biomarker analysis.
... On the other hand, SIFTER [23], FlowerPower [24], and Orthostrapper [25] employ phylogenetic trees to transfer functions to target genes in the evolutionary context. There are other function prediction methods considering co-expression patterns of genes [26–30], 3D structures of proteins [31–39] as well as interacting proteins in large-scale protein-protein interaction networks [40–45]. For the advancement of such computational techniques it is very important that there are community-wide efforts for objective evaluation of prediction accuracy. ...
Article
Full-text available
Many Automatic Function Prediction (AFP) methods were developed to cope with an increasing growth of the number of gene sequences that are available from high throughput sequencing experiments. To support the development of AFP methods, it is essential to have community wide experiments for evaluating performance of existing AFP methods. Critical Assessment of Function Annotation (CAFA) is one such community experiment. The meeting of CAFA was held as a Special Interest Group (SIG) meeting at the Intelligent Systems in Molecular Biology (ISMB) conference in 2011. Here, we perform a detailed analysis of two sequence-based function prediction methods, PFP and ESG, which were developed in our lab, using the predictions submitted to CAFA. We evaluate PFP and ESG using four different measures in comparison with BLAST, Prior, and GOtcha. In addition to the predictions submitted to CAFA, we further investigate performance of a different scoring function to rank order predictions by PFP as well as PFP/ESG predictions enriched with Priors that simply adds frequently occurring Gene Ontology terms as a part of predictions. Prediction accuracies of each method were also evaluated separately for different functional categories. Successful and unsuccessful predictions by PFP and ESG are also discussed in comparison with BLAST. The in-depth analysis discussed here will complement the overall assessment by the CAFA organizers. Since PFP and ESG are based on sequence database search results, our analyses are not only useful for PFP and ESG users but will also shed light on the relationship of the sequence similarity space and functions that can be inferred from the sequences.
... This limitation is easily overcome by machine learning techniques like the Support Vector Machine (SVM). SVM is a supervised learning algorithm, which has been found to be useful in recognition and discrimination of hidden patterns in complex datasets [19]. Prediction methods based on SVM have been successfully exploited in many research problems involving complex sequence or biological datasets [20–22], like remote protein similarity detection [23], DNA methylation status [24], protein domains [25] and multiclass cancer diagnosis [26]. ...
Article
Full-text available
Functional annotation of protein sequences with low similarity to well-characterized protein sequences is a major challenge of computational biology in the post-genomic era. The cyclin protein family is one such important family of proteins, consisting of sequences with low sequence similarity, which makes the discovery of novel cyclins and the establishment of orthologous relationships amongst cyclins a difficult task. The currently identified cyclin motifs and cyclin-associated domains do not represent all of the identified and characterized cyclin sequences. We describe a Support Vector Machine (SVM) based classifier, CyclinPred, which can predict cyclin sequences with high efficiency. The SVM classifier was trained with features of selected cyclin and non-cyclin protein sequences. The training features of the protein sequences include amino acid composition, dipeptide composition, secondary structure composition and PSI-BLAST generated Position Specific Scoring Matrix (PSSM) profiles. Results obtained from leave-one-out cross-validation (jackknife test), self-consistency and holdout tests show that the SVM classifier trained with features of the PSSM profile was more accurate than classifiers based on either of the other features alone or hybrids of these features. A cyclin prediction server, CyclinPred, has been set up based on the SVM model trained with PSSM profiles. CyclinPred prediction results show that the method may be used as a cyclin prediction tool, complementing conventional cyclin prediction methods.
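To make the sequence-derived features in this abstract concrete, the sketch below computes the amino acid composition (20 values) and dipeptide composition (400 values) of a made-up protein sequence; the secondary structure and PSSM features used by CyclinPred are not reproduced.

```python
# Amino acid composition (20-dim) and dipeptide composition (400-dim)
# feature vectors for a protein sequence (toy sequence, no PSSM features).
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    n = len(seq)
    return [seq.count(a) / n for a in AA]

def dipeptide_composition(seq):
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    return [pairs.count(a + b) / total for a, b in product(AA, repeat=2)]

seq = "MSSPAKRQLIVDEFGHKLMNPQRSTVWYACD"   # made-up sequence
features = aa_composition(seq) + dipeptide_composition(seq)
print(len(features))                       # 420 features per sequence
```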
... On the other hand, SIFTER [14], FlowerPower [15] and Orthostrapper [16] employ phylogenetic trees to transfer functions to target genes in the evolutionary context. There are other function prediction methods that consider co-expression patterns [17–21], 3D structures of proteins [22–30] as well as protein-protein interaction networks [31–36]. Although existing AFP methods show numerous successful predictions, moonlighting proteins may pose a challenge as they are known to show multiple functions that are diverse in nature [37–39]. ...
Article
Full-text available
Advancements in function prediction algorithms are enabling large scale computational annotation for newly sequenced genomes. With the increase in the number of functionally well characterized proteins it has been observed that there are many proteins involved in more than one function. These proteins characterized as moonlighting proteins show varied functional behavior depending on the cell type, localization in the cell, oligomerization, multiple binding sites, etc. The functional diversity shown by moonlighting proteins may have significant impact on the traditional sequence based function prediction methods. Here we investigate how well diverse functions of moonlighting proteins can be predicted by some existing function prediction methods. RESULTS: We have analyzed the performances of three major sequence based function prediction methods,PSI-BLAST, the Protein Function Prediction (PFP), and the Extended Similarity Group (ESG) on predicting diverse functions of moonlighting proteins. In predicting discrete functions of a set of 19 experimentally identified moonlighting proteins, PFP showed overall highest recall among the three methods. Although ESG showed the highest precision, its recall was lower than PSI-BLAST. Recall by PSI-BLAST greatly improved when BLOSUM45 was used instead of BLOSUM62. CONCLUSION: We have analyzed the performances of PFP, ESG, and PSI-BLAST in predicting the functional diversity of moonlighting proteins. PFP shows overall better performance in predicting diverse moonlighting functions as compared with PSI-BLAST and ESG. Recall by PSI-BLAST greatly improved when BLOSUM45 was used. This analysis indicates that considering weakly similar sequences in prediction enhances the performance of sequence based AFP methods in predicting functional diversity of moonlighting proteins. The current study will also motivate development of novel computational frameworks for automatic identification of such proteins.
... Single biomarkers are less likely to furnish sufficient sensitivity and specificity for most applications [35]. Several classification methods have been utilized, including a variant of linear discriminant analysis [68], support vector machines [69–71], Bayesian regression [72], partial least squares [73], principal component regression [74], and between-group analysis [75]. The performance of a statistical prediction model should be tested and assessed by various statistical measures such as the classification error rate and area under the receiver operating characteristic curve, the product of posterior classification probabilities [76–78], and an index called the misclassification-penalized posterior [79]. ...
Article
Full-text available
The recent advent of "-omics" technologies has heralded a new era of personalized medicine. Personalized medicine refers to the ability to segment a heterogeneous patient population into subsets whose response to a therapeutic intervention within each subset is homogeneous. This new paradigm in healthcare is beginning to affect both research and clinical practice. The key to success in personalized medicine is to uncover molecular biomarkers that drive individual variability in clinical outcomes or drug responses. In this review, we begin with an overview of personalized medicine in breast cancer and illustrate the most frequently encountered statistical approaches in the recent literature tailored to uncovering gene signatures.
... Molecular classification of NSCLC using an objective quantitative test can be highly accurate and could be translated into a diagnostic platform for broad clinical application [40]. Sequence-derived structural and physicochemical descriptors have frequently been used in machine learning prediction of protein structural and functional classes [19,20,21,22,23,24], protein-protein interactions [24,25,26,41], subcellular locations [27,28,42,43], peptides containing specific properties [29,44], microarray data [45] and protein secondary structure prediction [46]. These descriptors serve to represent and distinguish proteins or peptides of different structural, functional and interaction profiles by exploring their distinguishing features in the compositions, correlations, and distributions of the constituent amino acids and their structural and physicochemical properties [18,20,26,30]. This has shown that currently used descriptor sets are generally useful for classifying proteins and that prediction performance may be enhanced by exploring combinations of descriptors [47]. ...
Article
Full-text available
Rapid distinction between small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) tumors is very important in the diagnosis of this disease. Furthermore, sequence-derived structural and physicochemical descriptors are very useful for machine learning prediction of protein structural and functional classes and for classifying proteins. Herein, the classification of lung tumors based on 1497 attributes derived from the structural and physicochemical properties of protein sequences (based on genes defined by microarray analysis) is investigated through a combination of attribute weighting, supervised and unsupervised clustering algorithms. Eighty percent of the weighting methods selected features such as autocorrelation, dipeptide composition and distribution of hydrophobicity as the most important protein attributes in the classification of the SCLC, NSCLC and COMMON classes of lung tumors. The same results were observed with most tree induction algorithms, while descriptors of hydrophobicity distribution were high in protein sequences COMMON to both groups and the distribution of charge in these proteins was very low, showing that COMMON proteins were very hydrophobic. Furthermore, the composition of polar dipeptides in SCLC proteins was higher than in NSCLC proteins. Some clustering models (alone or in combination with attribute weighting algorithms) were able to nearly classify SCLC and NSCLC proteins. The Random Forest tree induction algorithm (evaluated with leave-one-out and 10-fold cross-validation) shows more than 86% accuracy in clustering and predicting the three different lung cancer tumor classes. Here, for the first time, the application of data mining tools to effectively classify three classes of lung cancer tumors with respect to the importance of dipeptide composition, autocorrelation and distribution descriptors is reported.
... While binary (two-class) classification has been extensively studied over the past few years [1,3,4,8,12], the multi-class classification case has received little attention [13,16,5]. In this paper, we focus on multi-class classification, and compare several gene ranking methods, including new variants of correlation coefficients, using different microarray datasets. ...
Article
Full-text available
The fundamental power of microarrays lies in the ability to conduct parallel surveys of gene expression patterns for tens of thousands of genes across a wide range of cellular responses, phenotypes and conditions. Thus microarray data contain an overwhelming number of genes relative to the number of samples, presenting challenges for meaningful pattern discovery. This paper provides a comparative study of gene selection methods for multi-class classification of microarray data. We compare several feature ranking techniques, including new variants of correlation coefficients, and the Support Vector Machine (SVM) method based on Recursive Feature Elimination (RFE). The results show that feature selection methods improve SVM classification accuracy in different kernel settings. The performance of feature selection techniques is problem-dependent. SVM-RFE shows an excellent performance in general, but often gives lower accuracy than correlation coefficients in low dimensions.
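A small sketch contrasting the two families of ranking methods compared in this paper: a univariate per-gene score (an ANOVA F-statistic as a stand-in for the correlation-coefficient variants) versus SVM-RFE, on a synthetic three-class data set.

```python
# Univariate gene ranking (ANOVA F as a stand-in for correlation-based scores)
# versus multivariate SVM-RFE ranking, on a synthetic three-class data set.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, f_classif
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=60, n_features=300, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

f_scores, _ = f_classif(X, y)
univariate_top = f_scores.argsort()[::-1][:15].tolist()

rfe = RFE(LinearSVC(max_iter=10000), n_features_to_select=15, step=0.1).fit(X, y)
rfe_top = rfe.support_.nonzero()[0].tolist()

print("univariate top 15:", sorted(univariate_top))
print("SVM-RFE top 15:  ", sorted(rfe_top))
print("overlap:", len(set(univariate_top) & set(rfe_top)))
```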
... SVMs make predictions and give final classification decisions by learning automatically from existing knowledge [13]. Recently, SVMs have become very popular in applications to a wide variety of biological questions [13,14,15,16,17,18], including gene classification, function prediction and cancer tissue classification. To a certain extent, identifying candidate genes for a complex disease can be regarded as a problem of distinguishing disease genes from non-disease genes, which is exactly the kind of problem that SVMs are suited to. ...
Article
Full-text available
Predicting candidate genes using gene expression profiles and unbiased protein-protein interactions (PPI) contributes greatly to deciphering the pathogenesis of complex diseases. Recent studies have shown that there are significant disparities in network topological features between non-disease and disease genes in protein-protein interaction settings. Integrated methods can consider these characteristics comprehensively in a biological network. In this study, we introduce a novel computational method, based on combined network topological features, to construct a combined classifier and then use it to predict candidate genes for coronary artery disease (CAD). As a result, 276 novel candidate genes were predicted and were found to share similar functions with known disease genes. The majority of the candidate genes were cross-validated by three other methods. Our method will be useful in the search for candidate genes of other diseases.
... That is, on condition that the empirical risk is zero, SVM can achieve the best generalization ability by maximizing the margin. SVMs have been used to handle many problems in bioinformatics [10, 18, 35–37]. In this paper, an integrated software package for support vector classification, LIBSVM (version 2.71) [38], is employed to predict antibody interaction sites and antigen class. ...
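The snippet above calls LIBSVM directly; an equivalent sketch using scikit-learn's SVC, which wraps LIBSVM, is shown below with random stand-in features rather than the antibody patch encodings of the study (so the printed accuracy is only chance level).

```python
# SVC (a LIBSVM wrapper) used for a classification task analogous to the one
# above; the features here are random stand-ins for sequence-derived patch
# encodings, not the antibody data of the study.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))              # 120 residue patches, 40 features each
y = rng.integers(0, 2, size=120)            # two distance-range classes

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```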
Article
Full-text available
In this paper, a machine learning approach known as the support vector machine (SVM) is employed to predict the distance between an antibody's interface residues and the antigen in antigen–antibody complexes. The heavy chains, light chains and the corresponding antigens of 37 antibodies are extracted from the antibody–antigen complexes in the Protein Data Bank. According to different distance ranges, sequence patch sizes and antigen classes, a number of computational experiments are conducted to describe the distance between an antibody's interface residues and the antigen using antibody sequence information. The high prediction accuracy in both self-consistency and cross-validation tests indicates that the sequence information derived from antibody structure contributes substantially to predicting the distance between an antibody's interface residues and the antigen. Furthermore, the antigen class is predicted by SVM from residue composition information belonging to different distance ranges, which may be of practical significance.
... With the help of gene expression profiles, heterogeneous cancers can be classified into appropriate subtypes [2], [3], [4]. Recently, different kinds of machine learning and statistical methods, e.g., [5], [6], [7], [8], have been used to classify cancers using microarray gene expression data. ...
Conference Paper
Full-text available
Accurate classification of cancers based on microarray gene expression is very important for doctors to choose a proper treatment. In this paper, we apply to this problem a novel radial basis function (RBF) neural network that allows for large overlaps among the hidden kernels of the same class. We tested our RBF network on three data sets, i.e., the lymphoma data set, the small round blue cell tumors (SRBCT) data set, and the ovarian cancer data set. The results on all three data sets show that our RBF network is able to achieve 100% accuracy with far fewer genes than previously published methods.
... Clustering is a well-known technique for gathering identical or similar elements from a large amount of data, and has various uses in data mining, pattern recognition, computer vision, and neural networks. Until now, a variety of clustering algorithms have been proposed to analyze gene expression data: hierarchical clustering [4], self-organizing maps [12], k-means [11], and support vector machines [1]. Some successful results have been reported, but there is no single outstanding method in the gene expression analysis community. ...
Conference Paper
Full-text available
Various clustering methods have been proposed for the analysis of gene expression data, but conventional clustering algorithms have several critical limitations: how to set parameters such as the number of clusters, initial cluster centers, and so on. In this paper, we propose a semi-parametric model-based clustering algorithm in which the underlying model is a mixture of Gaussians. Each gene expression data point builds a Gaussian kernel, and the uncertainty of microarray data is naturally integrated into the data representation. Our algorithm provides a principled method to automatically determine the parameters - the number of components in the mixture and the mean, covariance, and weight of each Gaussian - by the mean-shift procedure (Comaniciu and Meer, 1999) and curvature fitting. After this initialization, the expectation maximization (EM) algorithm is employed for clustering to achieve maximum likelihood (ML). The performance of our algorithm is compared with the standard EM algorithm using real as well as synthetic data.
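A minimal sketch of the model-based clustering idea in this abstract: a Gaussian mixture fitted by EM. The paper's mean-shift and curvature-fitting initialization, which determines the number of components automatically, is not reproduced; here the component count is simply chosen by BIC over a small range on synthetic data.

```python
# Gaussian-mixture (EM) clustering of expression profiles, with the number of
# components chosen by BIC as a simple stand-in for the paper's automatic
# mean-shift initialization (synthetic data).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=0)

models = [GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 8)]
best = min(models, key=lambda m: m.bic(X))
labels = best.predict(X)

print("components selected by BIC:", best.n_components)
print("cluster sizes:", np.bincount(labels))
```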
... The Support Vector Machine (SVM) is a binary classification method proposed by Vapnik et al. (1995), originally designed for classification and regression tasks [53]. The SVM method has been employed for pattern recognition problems in computational biology, including gene expression analysis [54], protein–protein interactions [55], protein fold class prediction [37,56], and protein–nucleotide interactions [21]. SVM achieves a high performance level when a high degree of diversity exists in the dataset, because the SVM classifier depends only on the support vectors and the classifier function is not influenced by the entire dataset. ...
... Support vector machines (SVMs) [31] are commonly used as supervised learning methods for classification in computational biology and image processing tasks [32–34]. The starting point for training an SVM is a set of training data whose class membership is known: ...
Article
Full-text available
The upcoming quantification and automation in biomarker-based histological tumor evaluation will require computational methods capable of automatically identifying tumor areas and differentiating them from the stroma. As no single generally applicable tumor biomarker is available, pathology routinely uses morphological criteria as a spatial reference system. We here present and evaluate a method capable of performing this classification in immunofluorescence histological slides solely using a DAPI background stain. Due to the restriction to a single color channel, this is inherently challenging. We formed cell graphs based on the topological distribution of the tissue cell nuclei and extracted the corresponding graph features. By using topological, morphological and intensity-based features, we could systematically quantify and compare the discrimination capability that individual features contribute to the overall algorithm. We here show that when classifying fluorescence tissue slides in the DAPI channel, morphological and intensity-based features clearly outpace topological ones, which have been used exclusively in related previous approaches. We assembled the 15 best features to train a support vector machine based on Keratin-stained tumor areas. On a test set of TMAs with 210 cores of triple-negative breast cancers, our classifier was able to distinguish between tumor and stroma tissue with a total overall accuracy of 88%. Our method yields first results on the discrimination capability of feature groups, which is essential for automated tumor diagnostics. Also, it provides an objective spatial reference system for the multiplex analysis of biomarkers in fluorescence immunohistochemistry.
... On the training samples, a 10-fold cross-validation (CV) approach was applied for the optimization of the regularization parameter on a logarithmic scale from 10⁻⁴ to 10⁺⁶. When the polynomial kernel function was used, a three-dimensional grid was required for the additional optimization of the kernel tuning parameters (the scaling factor and the degree d), both varying on a linear scale from 1 to 5. Contrary to the typical use of the polynomial kernel without scaling [33-35], a scaled kernel was considered, as this has been shown to increase test performance [36]. The optimal parameter values were chosen corresponding to the model with the highest 10-fold train area under the receiver operating characteristic curve (AUC). ...
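A sketch of a comparable tuning scheme, assuming a scikit-learn SVC with a polynomial kernel stands in for the classifier used in the cited work: the regularization parameter is searched on a logarithmic scale from 10⁻⁴ to 10⁺⁶, two polynomial-kernel parameters on a linear grid from 1 to 5, and the model with the best 10-fold cross-validated AUC is retained. The mapping of the grid onto C, degree, and coef0 is an assumption for illustration.

```python
# Illustrative grid search: log-scale regularization, linear-scale polynomial
# kernel parameters, model selection by 10-fold cross-validated AUC.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=120, n_features=30, random_state=0)

param_grid = {
    "C": 10.0 ** np.arange(-4, 7),        # 1e-4 ... 1e+6
    "degree": [1, 2, 3, 4, 5],            # polynomial degree d
    "coef0": [1, 2, 3, 4, 5],             # second polynomial tuning parameter
}
search = GridSearchCV(SVC(kernel="poly", gamma="scale"), param_grid,
                      scoring="roc_auc", cv=10)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best 10-fold AUC:", round(search.best_score_, 3))
```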
Article
Despite the rise of high-throughput technologies, clinical data such as age, gender and medical history guide clinical management for most diseases and examinations. To improve clinical management, available patient information should be fully exploited. This requires appropriate modeling of relevant parameters. When kernel methods are used, traditional kernel functions such as the linear kernel are often applied to the set of clinical parameters. These kernel functions, however, have their disadvantages due to the specific characteristics of clinical data, which are a mix of variable types, each with its own range. We propose a new kernel function specifically adapted to the characteristics of clinical data. The clinical kernel function provides a better representation of patients' similarity by equalizing the influence of all variables and taking into account the range r of the variables. Moreover, it is robust with respect to changes in r. Incorporated in a least squares support vector machine, the new kernel function results in significantly improved diagnosis, prognosis and prediction of therapy response. This is illustrated on four clinical data sets within gynecology, with an average increase in test area under the ROC curve (AUC) of 0.023, 0.021, 0.122 and 0.019, respectively. Moreover, when combining clinical parameters and expression data in three case studies on breast cancer, results improved overall with use of the new kernel function and when considering both data types in a weighted fashion, with a larger weight assigned to the clinical parameters. The increase in AUC with respect to a standard kernel function and/or unweighted data combination was at most 0.127, 0.042 and 0.118 for the three case studies. For clinical data consisting of variables of different types, the proposed kernel function, which takes into account the type and range of each variable, has been shown to be a better alternative for linear and non-linear classification problems.
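The abstract does not spell out the kernel formula, but one plausible form consistent with the description (per-variable similarities normalized by the variable range r and then averaged so that no single variable dominates) is sketched below. This is an assumed illustration, not the exact definition from the paper.

```python
# Hedged sketch of a "clinical kernel" in the spirit described above (assumed
# form, for illustration only): each variable is compared on its own scale and
# the per-variable similarities are averaged so no variable dominates.
import numpy as np

def clinical_kernel(x, z, ranges, nominal):
    """x, z: 1-D arrays of clinical variables for two patients.
    ranges: per-variable range r (max - min over the data) for numeric variables.
    nominal: boolean mask marking nominal (categorical) variables."""
    sims = np.empty(len(x))
    for j in range(len(x)):
        if nominal[j]:
            sims[j] = 1.0 if x[j] == z[j] else 0.0               # exact match for categories
        else:
            sims[j] = (ranges[j] - abs(x[j] - z[j])) / ranges[j]  # closeness within range
    return sims.mean()

# Toy example: [age, tumor size (cm), histologic type (coded)]
a = np.array([54.0, 2.1, 1.0])
b = np.array([61.0, 3.4, 1.0])
r = np.array([60.0, 10.0, 1.0])          # observed ranges of the numeric variables
mask = np.array([False, False, True])
print(round(clinical_kernel(a, b, r, mask), 3))
```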
Article
Full-text available
Support Vector Machine (SVM) is a machine learning method widely used in cancer studies, especially with microarray data. A common problem with microarray data is that the number of genes is much larger than the number of samples. Although SVM is capable of handling a large number of genes, better classification accuracy can be obtained using a small gene subset. This research proposed Multiple Support Vector Machine-Recursive Feature Elimination (MSVM-RFE) as a gene selection method to identify a small number of informative genes. This method is implemented in order to improve the performance of SVM during classification. The effectiveness of the proposed method has been tested on two different gene expression datasets, leukemia and lung cancer. For comparison, methods such as Random Forest and the C4.5 decision tree are also evaluated in this paper. The results show that MSVM-RFE is effective in reducing the number of genes in both datasets, thus providing better accuracy for SVM in cancer classification.
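A simplified sketch of SVM-based recursive feature elimination is shown below; it uses a single linear SVM rather than the multiple-SVM (MSVM-RFE) variant proposed in the article, and the synthetic dataset and parameters are illustrative only.

```python
# Simplified SVM-RFE sketch: features are ranked by the linear SVM weights and
# the lowest-ranked genes are eliminated recursively (single-SVM variant).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC, SVC

X, y = make_classification(n_samples=72, n_features=2000, n_informative=20,
                           random_state=0)          # n << p, as in microarray data

selector = RFE(LinearSVC(C=1.0, dual=False, max_iter=5000),
               n_features_to_select=50, step=0.1)   # drop 10% of genes per round
model = make_pipeline(selector, SVC(kernel="linear"))
scores = cross_val_score(model, X, y, cv=5)
print("mean CV accuracy with 50 selected genes:", round(scores.mean(), 3))
```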
Article
Decades ago, increased volume of data made manual analysis obsolete and prompted the use of computational tools with interactive user interfaces and a rich palette of data visualizations. Yet their classic, desktop-based architectures can no longer cope with the ever-growing size and complexity of data. Next-generation systems for explorative data analysis will be developed on client-server architectures, which already run concurrent software for data analytics but are not tailored for an engaged, interactive analysis of data and models. In explorative data analysis, the key is the responsiveness of the system and prompt construction of interactive visualizations that can guide the users to uncover interesting data patterns. In this study, we review the current software architectures for distributed data analysis and propose a list of features to be included in the next generation of frameworks for exploratory data analysis. The new generation of tools for explorative data analysis will need to address integrated data storage and processing, fast prototyping of data analysis pipelines supported by machine-proposed analysis workflows, pre-emptive analysis of data, interactivity, and user interfaces for intelligent data visualizations. The systems will rely on a mixture of concurrent software architectures to meet the challenge of seamlessly integrating explorative data interfaces on the client side with the management of concurrent data mining procedures on the servers. WIREs Data Mining Knowl Discov 2015, 5:165-180. doi: 10.1002/widm.1155
Article
Full-text available
Background: For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks.
Chapter
Full-text available
Genomic-scale transcript profiling approaches provide an unbiased view of the transcriptome of organs, tissues, and cells. Such technologies have been applied to the study of lungs and cells of patients with fibrotic lung disease and animal models of lung disease with the goal of detecting key molecules that play a significant role in pathogenesis, identifying potential drug targets, and developing biomarkers of disease presence, progression, and outcome. Genomic profiling studies have also been used to classify and distinguish different interstitial lung diseases such as IPF, nonspecific interstitial pneumonia (NSIP), lung fibrosis associated with scleroderma, and hypersensitivity pneumonitis (HP). In this chapter, we describe the progress and insights derived from applying genomic-scale transcript profiling approaches to fibrotic lung diseases as well as the potential impact of new technologies and NIH-funded projects on the field of genomics.
Article
Microarray expression data is being generated by the gigabyte all over the world, with undoubted exponential increases to come. Annotated genomic data is also rapidly pouring into public databases. Our goal is to develop automated ways of combining these two sources of information to produce insight into the operation of cells under various conditions. Our approach is to use machine-learning techniques to identify characteristics of genes that are up-regulated or down-regulated in a particular microarray experiment. We seek models that are both accurate and easy to interpret. This paper explores the effectiveness of two algorithms for this task: PFOIL (a standard machine-learning rule-building algorithm) and GORB (a new rule-building algorithm devised by us). We use a permutation test to evaluate the statistical significance of the learned models. The paper reports on experiments using actual E. coli microarray data, discussing the strengths and weaknesses of the two algorithms and demonstrating the trade-offs between accuracy and comprehensibility.
Article
Making reliable diagnoses and predictions based on high-throughput transcriptional data has attracted immense attention in the past few years. While experimental gene profiling techniques, such as microarray platforms, are advancing rapidly, there is an increasing demand for computational methods able to efficiently handle such data. In this work we propose a computational workflow for extracting diagnostic gene signatures from high-throughput transcriptional profiling data. In particular, our research was performed within the scope of the first IMPROVER challenge. The goal of that challenge was to extract and verify diagnostic signatures based on microarray gene expression data in four different disease areas: psoriasis, multiple sclerosis, chronic obstructive pulmonary disease and lung cancer. Each disease area is handled using the same three-stage algorithm. First, the data are normalized with the robust multi-array average (RMA) procedure to account for variability among different samples and data sets. Due to the vast dimensionality of the profiling data, we subsequently perform a feature pre-selection using Wilcoxon's rank sum statistic. The remaining features are then used to train an L1-regularized logistic regression model, which acts as our primary classifier. Using the four different data sets, we analyze the proposed method and demonstrate its use in extracting diagnostic signatures from microarray gene expression data.
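A minimal sketch of the last two stages of this workflow is given below (the RMA normalization step is usually performed upstream, e.g. in R/Bioconductor, and is omitted); the synthetic data, the number of pre-selected features, and the regularization strength are illustrative assumptions.

```python
# Sketch of stages 2 and 3 of the workflow described above: Wilcoxon rank-sum
# feature pre-selection followed by an L1-regularized logistic regression.
import numpy as np
from scipy.stats import ranksums
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=5000, n_informative=30,
                           random_state=0)

# Stage 2: rank features by Wilcoxon rank-sum p-value between the two classes.
pvals = np.array([ranksums(X[y == 0, j], X[y == 1, j]).pvalue
                  for j in range(X.shape[1])])
keep = np.argsort(pvals)[:200]                      # keep the 200 smallest p-values

# Stage 3: L1-regularized logistic regression on the pre-selected features.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X[:, keep], y)
print("non-zero coefficients in the signature:", int((clf.coef_ != 0).sum()))
```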
Article
Full-text available
Rapid advances in gene expression microarray technology have enabled the discovery of molecular markers used for cancer diagnosis, prognosis, and prediction. One computational challenge in using microarray data analysis to create cancer classifiers is how to effectively deal with data composed of a large number of attributes (p) and a small number of instances (n). Gene selection and classifier construction are two key issues in this area. In this article, we reviewed major methods for computational identification of cancer marker genes. We concluded that simple methods should be preferred to complicated ones for their interpretability and applicability.
Article
Identifying gene function has many useful applications. Identifying gene function based on gene expression data is much easier in prokaryotes than eukaryotes due to the relatively simple structure of prokaryotes. Recent studies have shown that there is a strong learnable correlation between gene function and gene expression. In previous work, we presented novel clustering and neural network (NN) approaches for predicting mouse gene functions from gene expression. In this paper, we build on that work to significantly improve the clustering distribution and the network prediction error by using a different clustering algorithm along with a new NN training technique. Our results show that NNs can be extremely useful in this area. We present the improved results along with comparisons.
Article
Full-text available
The recent explosion of interest in microarray technology has resulted in it becoming the preferred methodology for conducting gene expression experiments. Although the ability of an array experiment to simultaneously examine the expression of thousands of genes gives a previously unheard of level of insight to researchers, it also raises a plethora of statistical questions involving both the sheer volume of data being produced, as well as the variability inherent in this technology. In this paper we present statistical methods based on Bayesian linear models to investigate the various sources of variability present in array experiments. Data from a previously published cDNA microarray experiment is used to illustrate this methodology. A major goal of genomic research involves the determination of gene function, the discovery of which ultimately gives investigators fundamental insight into the ways in which genes act to affect the traits exhibited by an organism.
Article
Full-text available
Genomic experiments (e.g. differential gene expression, single-nucleotide polymorphism association) typically produce ranked lists of genes. We present a simple but powerful approach which uses protein-protein interaction data to detect sub-networks within such ranked lists of genes or proteins. We performed an exhaustive study of network parameters that allowed us to conclude that the average number of components and the average number of nodes per component are the parameters that best discriminate between real and random networks. A novel aspect that increases the efficiency of this strategy in finding sub-networks is that, in addition to direct connections, connections mediated by intermediate nodes are also considered to build up the sub-networks. The possibility of using such intermediate nodes makes this approach more robust to noise. It also overcomes some limitations intrinsic to experimental designs based on differential expression, in which some nodes are invariant across conditions. The proposed approach can also be used for candidate disease-gene prioritization. Here, we demonstrate the usefulness of the approach by means of several case examples that include a differential expression analysis in Fanconi Anemia, a genome-wide association study of bipolar disorder and a genome-scale study of essentiality in cancer genes. An efficient and easy-to-use web interface (available at http://www.babelomics.org) based on HTML5 technologies is also provided to run the algorithm and represent the network.
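A toy sketch of the core idea, connecting genes from a ranked list either directly or through a single intermediate node of the interaction network, might look as follows; the interaction graph and gene list are invented for illustration, and the actual tool is the web interface referenced above.

```python
# Toy sub-network detection from a ranked gene list, allowing connections
# mediated by one intermediate node. Graph and gene list are invented.
import networkx as nx

ppi = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
                ("E", "F"), ("B", "F"), ("G", "H")])
ranked_genes = ["A", "C", "F", "H"]            # top of a ranked list

sub = nx.Graph()
sub.add_nodes_from(ranked_genes)
listed = set(ranked_genes)
for u in ranked_genes:
    for v in ranked_genes:
        if u >= v:
            continue
        if ppi.has_edge(u, v):                  # direct connection
            sub.add_edge(u, v)
        else:                                   # connection via one intermediate node
            common = set(ppi.neighbors(u)) & set(ppi.neighbors(v)) - listed
            for w in common:
                sub.add_edge(u, w)
                sub.add_edge(w, v)

components = list(nx.connected_components(sub))
sizes = [len(c) for c in components]
print("components:", len(components),
      "average nodes per component:", sum(sizes) / len(sizes))
```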
Chapter
Cancer diagnosis from huge microarray gene expression data is an important and challenging bioinformatics research topic. We used a fuzzy neural network (FNN) proposed earlier for cancer classification. This FNN combines three valuable features: automatic generation of fuzzy membership functions, parameter optimization, and rule-base simplification. One major obstacle in building microarray classifiers is that the number of features (genes) is much larger than the number of objects. We therefore used a feature selection method based on the t-test to select the more significant genes before applying the FNN. In this work we used three well-known microarray databases: the lymphoma data set, the small round blue cell tumor (SRBCT) data set, and the ovarian cancer data set. In all cases we obtained 100% accuracy using fewer genes than previously published results. Our results show that the FNN classifier not only improves the accuracy of cancer classification but also helps biologists find relationships between important genes and the development of cancers.
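A sketch of the pre-selection step is given below, with a plain multilayer perceptron standing in for the fuzzy neural network, which is not reproduced here; the data, the number of selected genes, and the network size are illustrative assumptions.

```python
# Sketch of t-test gene pre-selection: genes are ranked by a two-sample t-test
# and only the most significant ones are passed to the classifier (an MLP here,
# standing in for the fuzzy neural network).
import numpy as np
from scipy.stats import ttest_ind
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=60, n_features=4000, n_informative=25,
                           random_state=0)

_, pvals = ttest_ind(X[y == 0], X[y == 1], axis=0)   # per-gene t-test p-values
top = np.argsort(pvals)[:30]                          # keep the 30 best genes

clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
print("5-fold CV accuracy:",
      round(cross_val_score(clf, X[:, top], y, cv=5).mean(), 3))
```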
Article
Full-text available
The phenotypic response of a cell results from a well-orchestrated web of complex interactions which propagate from the genetic architecture through the metabolic flux network. Rationally designing cell factories that carry out specific functional objectives by controlling this hierarchical system is a challenge. Transcriptome analysis, the most mature high-throughput measurement technology, has been readily applied in strain improvement programs in an attempt to identify genes involved in expressing a given phenotype. Unfortunately, while differentially expressed genes may provide targets for metabolic engineering, phenotypic responses are often not directly linked to transcriptional patterns. This limits the application of genome-wide transcriptional analysis for the design of cell factories. However, improved tools for integrating transcriptional data with other high-throughput measurements and known biological interactions are emerging. These tools hold significant promise for providing a framework to comprehensively dissect the regulatory mechanisms underlying cellular control and to develop more effective strategies for rewiring cellular control elements for metabolic engineering.
Article
The majority of classification algorithms are developed for the standard situation, in which it is assumed that the examples in the training set come from the same distribution as the target population and that the costs of misclassification into different classes are the same. However, these assumptions are often violated in real-world settings. For some classification methods this can often be taken care of simply with a change of threshold; for others, additional effort is required. In this paper, we explain why the standard support vector machine is not suitable for the nonstandard situation and introduce a simple procedure for adapting the support vector machine methodology to it. Theoretical justification for the procedure is provided. A simulation study illustrates that the modified support vector machine significantly improves upon the standard support vector machine in the nonstandard situation. The computational load of the proposed procedure is the same as that of the standard support vector machine, and the procedure reduces to the standard support vector machine in the standard situation.
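One common practical way to adapt an SVM to unequal misclassification costs or shifted class priors, in the spirit of the adaptation discussed above though not necessarily the authors' exact procedure, is to weight the penalty term per class, as sketched below with illustrative data.

```python
# Cost-sensitive SVM via per-class penalty weights (a common adaptation to the
# nonstandard situation, not necessarily the exact procedure of the article).
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

standard = SVC(kernel="rbf").fit(X_tr, y_tr)
# Penalize errors on the rare class 10x more, reflecting asymmetric costs.
weighted = SVC(kernel="rbf", class_weight={0: 1, 1: 10}).fit(X_tr, y_tr)

print("standard:\n", confusion_matrix(y_te, standard.predict(X_te)))
print("weighted:\n", confusion_matrix(y_te, weighted.predict(X_te)))
```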
Article
Nearest neighbor classification is a simple and yet effective technique for pattern recognition. The performance of this technique depends significantly on the distance function used to compute similarity between examples. Some techniques have been developed to learn feature weights that change the distance structure of samples in nearest neighbor classification. In this paper, we propose an approach to learning sample weights that enlarge the margin, using a gradient descent algorithm to minimize a margin-based classification loss. Experimental analysis shows that the distances trained in this way reduce the margin loss and enlarge the hypothesis margin on several datasets. Moreover, the proposed approach consistently outperforms nearest neighbor classification and some other state-of-the-art methods.
Conference Paper
Analyzing gene expression data from microarray devices has many important applications in medicine and biology, but presents significant challenges to data mining. Microarray data typically have many attributes (genes) and few examples (samples), making the process of correctly analyzing such data difficult to formulate and prone to common mistakes. For this reason it is unusually important to capture and record good practices for this form of data mining. This paper presents a process for analyzing microarray data, including pre-processing, gene selection, randomization testing, classification and clustering; this process is captured with "Clementine Application Templates". The paper describes the process in detail and includes three case studies, showing how the process is applied to 2-class classification, multi-class classification and clustering analyses for publicly available microarray datasets.
Conference Paper
This study investigates the effectiveness of the support vector machine (SVM) approach in detecting the underlying data pattern for credit card customer churn analysis. The article introduces a relatively new machine learning technique, SVM, to the customer churn problem in an attempt to provide a model with better prediction accuracy. To compare the performance of the proposed model, we used a widely adopted Artificial Intelligence (AI) method, back-propagation neural networks (BPN), as a benchmark. The results demonstrate that SVM outperforms BPN. We also examine the variability in performance with respect to various parameter values in SVM.
Article
Data mining, which is also known as knowledge discovery, is one of the most popular topics in information technology. It concerns the process of automatically extracting useful information and has the promise of discovering hidden relationships that exist in large databases. These relationships represent valuable knowledge that is crucial for many applications. This paper presents a review of works on current applications of data mining, which focus on four main application areas, including bioinformatics data, information retrieval, adaptive hypermedia and electronic commerce. How data mining can enhance functions for these four areas is described. The reader of this paper is expected to get an overview of the state-of-the-art research associated with these applications. Furthermore, we identify the limitations of current works and raise several directions for future research.
Article
Full-text available
Previous studies on tumor classification based on feature extraction from gene expression profiles (GEP) were proven to be effective, but some such methods lack biomedical meaning to some extent. To deal with this problem, we proposed a novel feature extraction method whose experimental results are biomedically interpretable and helpful for gaining insight into the structure of gene expression datasets. This method first applies a rank sum test to roughly select a set of informative genes and then adopts factor analysis to extract latent factors for tumor classification. Experiments on three pairs of cross-platform tumor datasets indicated that the proposed method can clearly improve the performance of cross-platform classification, and that only several latent factors, which can represent a large number of informative genes, are needed to obtain very high predictive accuracy on the test set. The results also suggest that a classification model trained on one dataset can successfully predict another tumor dataset of the same tumor subtype obtained on a different experimental platform.
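A minimal sketch of this two-step feature extraction (rank sum filtering followed by factor analysis) is given below; the synthetic data, the number of retained genes, and the number of latent factors are illustrative assumptions.

```python
# Sketch of the two-step feature extraction: rank sum test pre-selects
# informative genes, then factor analysis compresses them into a few latent
# factors that feed a classifier.
import numpy as np
from scipy.stats import ranksums
from sklearn.decomposition import FactorAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=3000, n_informative=40,
                           random_state=0)

# Step 1: rough gene filtering by rank sum test p-value.
pvals = np.array([ranksums(X[y == 0, j], X[y == 1, j]).pvalue
                  for j in range(X.shape[1])])
informative = np.argsort(pvals)[:150]

# Step 2: a handful of latent factors summarize the informative genes.
model = make_pipeline(FactorAnalysis(n_components=5, random_state=0),
                      SVC(kernel="linear"))
scores = cross_val_score(model, X[:, informative], y, cv=5)
print("CV accuracy with 5 latent factors:", round(scores.mean(), 3))
```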
Article
The editorial section of ACM Transactions on Computer-Human Interaction deals with data mining for understanding user needs. Data mining is the process of extracting valuable information from large amounts of data. It identifies hidden relationships, patterns, and interdependencies without relying on a priori hypotheses, so predictive rules can be generated and interesting hypotheses found. Data mining has been used in mobile and collaborative applications to analyze users' behavior, demonstrating how feedback produced by such analysis can change people's behavior in meetings. Complex social behaviors can be inferred from the analysis of simple data streams, such as conversational turns and locations recorded from mobile phones or RF tags. Interaction explanation and feedback are also important so that people can understand how and why the system created its advice, while data mining can produce useful advice and recommendations for users.
Article
Web-based instruction (WBI) programs, which have been increasingly developed in educational settings, are used by diverse learners. Therefore, individual differences are key factors for the development of WBI programs. Among various dimensions of individual differences, the study presented in this article focuses on cognitive styles. More specifically, this study investigates how cognitive styles affect students' learning patterns in a WBI program with an integrated approach, utilizing both traditional statistical and data-mining techniques. The former are applied to determine whether cognitive styles significantly affected students' learning patterns. The latter use clustering and classification methods. In terms of clustering, the K-means algorithm has been employed to produce groups of students that share similar learning patterns, and subsequently the corresponding cognitive style for each group is identified. As far as classification is concerned, the students' learning patterns are analyzed using a decision tree with which eight rules are produced for the automatic identification of students' cognitive styles based on their learning patterns. The results from these techniques appear to be consistent and the overall findings suggest that cognitive styles have important effects on students' learning patterns within WBI. The findings are applied to develop a model that can support the development of WBI programs.