Article

Support vector machine classification and validation of cancer tissue sample using microarray expression data

... SVM has been utilized with success in various applications: pattern recognition (including face and speaker recognition), bioinformatics, and DNA sequence recognition. Because of its accuracy in the field of bioinformatics, as well as its ability to manage big data [6][7][8][9][10], we chose the SVM as the classification technique for our prediction system for particular genomic sequences: the helitrons. ...
... Then, to study the impact of the coding technique on the accuracy rates, we will repeat the classification for higher FCGS orders. We will implement this method for FCGS3, FCGS4 and FCGS6. Table 5 presents the best results (ranked from highest to lowest) of helitron classification and demonstrates that high accuracy rates are obtained with the FCGS2 coding technique. ...
... Three helitron families' classification rates correspond to chromosome I and are based on FCGS2, FCGS3, FCGS4 and FCGS6. ...
Article
Full-text available
Helitrons, eukaryotic transposable elements (TEs), were discovered 18 years ago in various genomes. In the Caenorhabditis elegans (C. elegans) genome, helitron sequences are highly variable in size, ranging from 11 to 8965 base pairs (bp) from one sequence to another. These TEs are not uniformly dispersed sequences, and they have the ability to mobilize within a genome by a rolling-circle mechanism. This ability to move and reproduce in genomes enables these elements to play a major role in genomic evolution. In order to follow this evolution, we predicted helitron families (10 classes) in the C. elegans genome using the combination of features extracted from signals corresponding to DNA sequences and the Support Vector Machine (SVM) classifier. In our classification system, the features extracted from the signals were shown to be efficient for automatically predicting helitronic sequences. As a result, the Gaussian radial kernel over 100-fold cross-validation gave the best accuracy rates, ranging from 68% to 97%, with an overall mean score of 83.7%, and we successfully identified the Helitron Y1A class for specific values of C and gamma, reaching an accuracy rate of 100%. In addition, other notable helitrons (NDNAX2, NDNAX3 and Helitron_Y2) were predicted with interesting accuracy rates.
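As a rough illustration of the tuning procedure described in this abstract (Gaussian radial kernel with cross-validation over C and gamma), the following Python sketch uses scikit-learn on synthetic stand-in features; the data, grid values and 10-fold split are assumptions, not the authors' setup.

# Minimal sketch: RBF-kernel SVM tuned over C and gamma by k-fold cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.datasets import make_classification

# Stand-in for feature vectors extracted from DNA-derived signals
X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           n_classes=2, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("best (C, gamma):", search.best_params_)
print("mean CV accuracy: %.3f" % search.best_score_)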
... Dudoit [4], Furey [5], Guyon [6]. ...
... Furey [5], Pavlidis [7]. ...
... Matlab 6.5, SVMlight [13]. ...
... We sought to use random forest in our analysis of phenotypic aging, as it is capable of handling sizable data sets, considers the interactions between variables, and provides importance measures for predictors. On the other hand, SVM is a type of supervised learning that not only supports high-dimensional data but is also robust against noise and sparsity in the data (Furey et al., 2000). SVM functions by taking a set of input features or data and defining an optimal decision boundary or hyperplane that most accurately separates the input space based on assigned binary classifiers. ...
... Both random forest and SVM were applied to the variant data to determine the best genetic predictors of late aging. The ability of both the random forest algorithm and SVM to outperform other non-parametric classification methods led to our use of these predictive modeling approaches in this study (Furey et al., 2000;Lunetta et al., 2004). As depicted in Figure 1C, the training cohorts were divided into early and late agers for random forest model training, and top performing models according to the ROC-AUC were then tested for prediction of aging status in the validation cohort. ...
Article
Full-text available
Background: Recent studies investigating longevity have revealed very few convincing genetic associations with increased lifespan. This is, in part, due to the complexity of biological aging, as well as the limited power of genome-wide association studies, which assay common single nucleotide polymorphisms (SNPs) and require several thousand subjects to achieve statistical significance. To overcome such barriers, we performed comprehensive DNA sequencing of a panel of 20 genes previously associated with phenotypic aging in a cohort of 200 individuals, half of whom were clinically defined by an “early aging” phenotype, and half of whom were clinically defined by a “late aging” phenotype based on age (65–75 years) and the ability to walk up a flight of stairs or walk for 15 min without resting. A validation cohort of 511 late agers was used to verify our results. Results: We found early agers were not enriched for more total variants in these 20 aging-related genes than late agers. Using machine learning methods, we identified the most predictive model of aging status, both in our discovery and validation cohorts, to be a random forest model incorporating damaging exon variants [Combined Annotation-Dependent Depletion (CADD) > 15]. The most heavily weighted variants in the model were within poly(ADP-ribose) polymerase 1 (PARP1) and excision repair cross complementation group 5 (ERCC5), both of which are involved in a canonical aging pathway, DNA damage repair. Conclusion: Overall, this study implemented a framework to apply machine learning to identify sequencing variants associated with complex phenotypes such as aging. While the small sample size making up our cohort inhibits our ability to make definitive conclusions about the ability of these genes to accurately predict aging, this study offers a unique method for exploring polygenic associations with complex phenotypes.
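The model-selection step described above (training random forest and SVM on a training cohort and ranking candidates by ROC-AUC before testing on a validation cohort) can be sketched as follows; the synthetic data and hyperparameters are illustrative assumptions rather than the study's pipeline.

# Sketch: compare a random forest and an SVM by ROC-AUC on a held-out validation split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=400, n_features=60, n_informative=12,
                           random_state=1)          # stand-in for variant data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y,
                                            random_state=1)

models = {
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=1),
    "svm_rbf": SVC(kernel="rbf", probability=True, random_state=1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: validation ROC-AUC = {auc:.3f}")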
... In various fields, such as forecasting, economic modelling and medical applications, the neural network has widespread uses [43]. Several types of research relevant to cancer classification [32] and other bioinformatics fields [25] use ANN. Moreover, [44,45] discussed the additional utilities of ANN models. ...
Article
Full-text available
Machine Learning (ML)-based prediction and classification systems employ data and learning algorithms to forecast target values. However, improving predictive accuracy is a crucial step for informed decision-making. In the healthcare domain, data are available in the form of genetic profiles and clinical characteristics to build prediction models for complex tasks like cancer detection or diagnosis. Among ML algorithms, Artificial Neural Networks (ANNs) are considered the most suitable framework for many classification tasks. The network weights and the activation functions are the two crucial elements in the learning process of an ANN. These weights affect the prediction ability and the convergence efficiency of the network. In traditional settings, ANNs assign random weights to the inputs. This research aims to develop a learning system for reliable cancer prediction by initializing more realistic weights computed using a supervised setting instead of random weights. The proposed learning system uses hybrid and traditional machine learning techniques such as Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Random Forest (RF), k-Nearest Neighbour (kNN), and ANN to achieve better accuracy in colon and breast cancer classification. This system computes the confusion matrix-based metrics for traditional and proposed frameworks. The proposed framework attains the highest accuracy of 89.24 percent using the colon cancer dataset and 72.20 percent using the breast cancer dataset, which outperforms the other models. The results show that the proposed learning system has higher predictive accuracies than conventional classifiers for each dataset, overcoming previous research limitations. Moreover, the proposed framework is of use to predict and classify cancer patients accurately. Consequently, this will facilitate the effective management of cancer patients.
... To achieve better classification performance, the SVM maps non-linearly separable input data into a high-dimensional space in which it becomes linearly separable. The SVM maximizes the marginal distance between the various classes [5]. Various kernels are used to separate the classes. ...
Article
Full-text available
Image processing is the technique which can present information stored in the form of pixels. Plant disease detection is the technique which can detect disease from the leaf. Plant disease detection algorithms have various steps such as pre-processing, feature extraction, segmentation and classification. The KNN classifier technique is applied, which can classify input data into certain classes. The performance of the KNN classifier is compared with existing techniques, and the analysis shows that the KNN classifier has higher accuracy and lower fault detection compared to other techniques.
... If linear separation is not possible, it can be combined with a 'kernel' trick that implements a non-linear mapping to a feature space in which the linear separating hyperplane is identified [52]. The kernel technique enables higher dimensional, non-linear models to be developed [52], and is computationally efficient for datasets with high dimensionality through the use of a kernel function, K(x_i, x_j) = 〈Φ(x_i) • Φ(x_j)〉 [53], which computes the separating hyperplane without carrying out a mapping to feature space [54]. Commonly used kernels include the radial basis function and polynomial kernels. ...
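The kernel function cited above can be checked numerically: the small sketch below (an illustration, not code from the cited work) shows that a degree-2 polynomial kernel reproduces the inner product of an explicit feature map without ever constructing that map.

# Sketch: K(x, y) = <phi(x), phi(y)> computed without building phi explicitly.
import numpy as np

def poly2_kernel(x, y):
    return np.dot(x, y) ** 2                      # K(x, y) = (x . y)^2

def phi(x):
    # explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly2_kernel(x, y))          # 1.0
print(np.dot(phi(x), phi(y)))      # same value (up to floating-point rounding)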
Article
Full-text available
Irradiation of the tumour site during treatment for cancer with external-beam ionising radiation results in a complex and dynamic series of effects in both the tumour itself and the normal tissue which surrounds it. The development of a spectral model of the effect of each exposure and interaction mode between these tissues would enable label free assessment of the effect of radiotherapeutic treatment in practice. In this study Fourier-transform Infrared microspectroscopic imaging was employed to analyse an in-vitro model of radiotherapeutic treatment for prostate cancer, in which a normal cell line (PNT1A) was exposed to low-dose X-ray radiation from the scattered treatment beam, and also to irradiated cell culture medium (ICCM) from a cancer cell line exposed to a treatment relevant dose (2Gy). Various exposure modes were studied and reference was made to previously acquired data on cellular survival and DNA double strand break damage. Spectral analysis with manifold methods, linear spectral fitting, non-linear classification and non-linear regression approaches were found to accurately segregate spectra on irradiation type and provide a comprehensive set of spectral markers which differentiate on irradiation mode and cell fate. The study demonstrates that high dose irradiation, low-dose scatter irradiation and radiation-induced bystander exposure (RIBE) signalling each produce differential effects on the cell which are observable through spectroscopic analysis.
... Supervised machine learning (SML) has been used in research for the detection, classification and prognostication of cancer diseases for more than two decades. [14][15][16][17][18] We have previously shown that this method improves the diagnostic accuracy of patients with small intestinal neuroendocrine tumors (SI-NET) at the time of diagnosis, especially in patients with normal CgA levels. 19 These techniques can uncover and recognize patterns and correlations in complex collections of data from biomarkers. ...
Article
Full-text available
There is an unmet need for novel biomarkers to diagnose and monitor patients with neuroendocrine neoplasms. The EXPLAIN study explores a multi-plasma-protein and supervised machine learning (SML) strategy to improve the diagnosis of pancreatic neuroendocrine tumours (PanNET) and differentiate them from small intestinal neuroendocrine tumours (SI‐NET). At the time of diagnosis, blood samples were collected and analysed from 39 patients with PanNET, 135 with SI‐NET (WHO Grade 1–2) and 144 controls. Exclusion criteria were other malignant diseases, chronic inflammatory diseases, and reduced kidney or liver function. Proseek Oncology‐II (OLink) was used to measure 92 cancer-related plasma proteins. Chromogranin A (CgA) was analysed separately. Median age in all groups was 65–67 years, with a similar gender distribution (female: PanNET 51%, SI‐NET 42%, controls 42%). Tumour grade (G1/G2): PanNET 39/61%, SI‐NET 46/54%. Patients with liver metastases: PanNET 78%, SI‐NET 63%. The classification model of PanNET versus controls provided a sensitivity (SEN) of 0.84, specificity (SPE) of 0.98, positive predictive value (PPV) of 0.92, negative predictive value (NPV) of 0.95, and area under the ROC curve (AUROC) of 0.99; the model for the discrimination of PanNET versus SI‐NET provided a SEN of 0.61, SPE of 0.96, PPV of 0.83, NPV of 0.90 and AUROC of 0.98. These results suggest that a multi-plasma-protein strategy can significantly improve the diagnostic accuracy of PanNET and SI‐NET.
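The reported metrics (SEN, SPE, PPV, NPV, AUROC) follow directly from a confusion matrix and class scores; the sketch below shows the calculation on made-up labels and scores, purely to illustrate the definitions rather than to reproduce the study's data.

# Sketch: confusion-matrix metrics and AUROC for a binary classifier (toy labels/scores).
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])        # 1 = case, 0 = control (made up)
y_pred  = np.array([1, 1, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([.9, .8, .4, .2, .1, .3, .7, .2, .85, .6])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sen = tp / (tp + fn)        # sensitivity (recall)
spe = tn / (tn + fp)        # specificity
ppv = tp / (tp + fp)        # positive predictive value
npv = tn / (tn + fn)        # negative predictive value
auroc = roc_auc_score(y_true, y_score)
print(f"SEN={sen:.2f} SPE={spe:.2f} PPV={ppv:.2f} NPV={npv:.2f} AUROC={auroc:.2f}")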
... The molecular descriptor is utilized to represent each compound. In order to reflect the effectiveness of CapsNet, SVM [47], RF [48], gcForest [49] and forgeNet [50] are also used to screen the effective compounds in traditional Chinese medicine prescriptions for treating diseases. ROC curve and AUC value are utilized to evaluate the performance of the classifiers. ...
Article
Pneumonia, especially coronavirus disease 2019 (COVID-19), can lead to serious acute lung injury, acute respiratory distress syndrome, multiple organ failure and even death. Thus it is an urgent task to develop high-efficiency, low-toxicity and targeted drugs according to the pathogenesis of the coronavirus. In this paper, a novel disease-related compound identification model based on a capsule network (CapsNet) is proposed. According to pneumonia-related keywords, the prescriptions and active components related to the pharmacological mechanism of the disease are collected and extracted in order to construct the training set. The features of each component are extracted as the input layer of the capsule network. CapsNet is trained and utilized to identify the pneumonia-related compounds in Qingre Jiedu injection. The experimental results show that CapsNet can identify disease-related compounds more accurately than SVM, RF, gcForest and forgeNet.
... Molecular descriptors and molecular fingerprints of each ligand could be obtained, which contain 374 features. In order to better reflect the effectiveness of forgeNet, three classical classifiers (SVM [42], RF [43] and gcForest [44]) are utilized to identify the compounds associated with diseases. Five evaluation criteria of classifier performance are utilized, which are SN, SP, ACC, MCC and F1, respectively. ...
Preprint
Full-text available
Background: Acute lung injury (ALI) is a serious respiratory disease, which can lead to acute respiratory failure or death. It is closely related to the pathogenesis of New Coronavirus pneumonia (COVID-19). Many studies have shown that traditional Chinese medicine (TCM) had a good effect on its intervention, and network pharmacology could play a very important role. Results: In order to construct the "disease-gene-target-drug" interaction network more accurately, a deep learning algorithm is utilized in this paper. Two ALI-related target genes (REAL and SATA3) are considered, and the active and inactive compounds of the two corresponding target genes are collected as training data, respectively. Molecular descriptors and molecular fingerprints are utilized to characterize each compound. A forest graph embedded deep feed forward network (forgeNet) is proposed to train and identify 19 compounds in Erhuang decoction (EhD) and Dexamethasone (DXMS). Conclusions: The experimental results show that forgeNet performs better than support vector machines (SVM), random forest (RF) and gcForest.
... The classification process was implemented based on the radiological characteristics of different hepatic tumors in [46]. SVM is a binary classifier that divides the input points into two classes by constructing an N-dimensional separating hyperplane [47]. The input data points must be transformed from their original dimension into a higher dimension, since the input points may not be linearly separable in their own space. ...
Article
Full-text available
One of the leading causes of mortality worldwide is liver cancer. The earlier the detection of hepatic tumors, the lower the mortality rate. This paper introduces a computer-aided diagnosis system to extract hepatic tumors from computed tomography scans and classify them as malignant or benign. Segmenting hepatic tumors from computed tomography scans is considered a challenging task due to the fuzziness in the liver pixel range, the overlap of intensity values between the liver and neighboring organs, high noise from the computed tomography scanner, and the large variance in tumor shapes. The proposed method consists of three main stages: liver segmentation using Fast Generalized Fuzzy C-Means, tumor segmentation using dynamic thresholding, and tumor classification into malignant/benign using a support vector machine classifier. The performance of the proposed system was evaluated using three liver benchmark datasets, which are MICCAI-Sliver07, LiTS17, and 3Dircadb. The proposed computer-aided diagnosis system achieved an average accuracy of 96.75%, sensitivity of 96.38%, specificity of 95.20% and a Dice similarity coefficient of 95.13%.
... SVMs are one of the most recent ML techniques and have shown applicability to a variety of real-world problems. Since being proposed by Vapnik and Lerner in 1963 [11] and formalised into the method known today in 1995 (by Vapnik and Cortes) [10], SVMs have been applied to text categorisation [12], tissue classification [13], gene function prediction [14], handwritten digit recognition [15] and facial recognition [16]. SVMs continue to be applied in biomedicine and healthcare, with researchers utilising them for cancer classification, biomarker discovery and drug discovery [17]. ...
Article
Full-text available
Biomarkers are known to be the key driver behind targeted cancer therapies by either stratifying the patients into risk categories or identifying patient subgroups most likely to benefit. However, the ability of a biomarker to stratify patients relies heavily on the type of clinical endpoint data being collected. Of particular interest is the scenario when the biomarker involved is a continuous one where the challenge is often to identify cut-offs or thresholds that would stratify the population according to the level of clinical outcome or treatment benefit. On the other hand, there are well-established Machine Learning (ML) methods such as the Support Vector Machines (SVM) that classify data, both linear as well as non-linear, into subgroups in an optimal way. SVMs have proven to be immensely useful in data-centric engineering and recently researchers have also sought its applications in healthcare. Despite their wide applicability, SVMs are not yet in the mainstream of toolkits to be utilised in observational clinical studies or in clinical trials. This research investigates the very role of SVMs in stratifying the patient population based on a continuous biomarker across a variety of datasets. Based on the mathematical framework underlying SVMs, we formulate and fit algorithms in the context of biomarker stratified cancer datasets to evaluate their merits. The analysis reveals their superior performance for certain data-types when compared to other ML methods suggesting that SVMs may have the potential to provide a robust yet simplistic solution to stratify real cancer patients based on continuous biomarkers, and hence accelerate the identification of subgroups for improved clinical outcomes or guide targeted cancer therapies.
... The structures of regression and prediction are shown in Fig. 1(c) and 2(a) [31], where K(x_i, x) represents the kernel function. In recent years, SVM-based regression prediction has been used in the medical field [30,32], and its application in medical diagnosis is gradually increasing [33]. Based on the libsvm toolbox [34], the ¹³¹I therapeutic dose model was established, and the input parameters of the SVM were determined by the cross-validation method. ...
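A hedged sketch of SVM-based regression with cross-validated hyperparameters, in the spirit of the libsvm-based dose model mentioned above, is given below; the features, targets and parameter grid are assumed for illustration (scikit-learn's SVR wraps libsvm internally).

# Sketch: epsilon-SVR with hyperparameters chosen by cross-validation on stand-in data.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))                                    # stand-in patient covariates
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)  # stand-in dose values

pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
grid = {"svr__C": [1, 10, 100], "svr__gamma": ["scale", 0.1, 0.01],
        "svr__epsilon": [0.01, 0.1, 0.5]}
search = GridSearchCV(pipe, grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)
print("selected parameters:", search.best_params_)
print("CV mean absolute error: %.3f" % -search.best_score_)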
Preprint
Full-text available
Objective: Multiple machine learning models were used to predict the therapeutic dose of ¹³¹I radionuclide in patients with hyperthyroidism, and the results of each prediction model were compared to obtain the optimal model for dose prediction. Meanwhile, a classification model was used to classify the prognosis of existing clinical hyperthyroidism case data in order to evaluate the administration results and provide a reference for the dose given by clinicians. Methods: Based on the data of hyperthyroidism patients treated with ¹³¹I in the nuclear medicine departments of several hospitals, a prediction model was established in MATLAB. First, the prediction results of a BP neural network, a radial basis function (RBF) neural network and a support vector machine (SVM) were compared on small-sample data, and the optimal model was then selected to predict the drug dose. BP-AdaBoost, SVM and random forest were used to classify the patients after recovery and evaluate whether the dose was accurate. Results: The average errors of the BP neural network, RBF neural network and SVM models trained with small samples were 6.58%, 17.25% and 14.09%, respectively. After comparison, the BP neural network was selected to establish the prediction model. The data of 30 cases were randomly selected to verify the BP neural network, and the average error of the prediction results was 11.99%. Using the SVM, BP-AdaBoost and random forest models, 100 groups of case data were selected as the training set and 10 groups as the test set. The classification accuracies were 80%, 90% and 100%, respectively. The random forest model with the highest accuracy was selected for the large-sample prediction. When 318 groups of cases were used for training and 35 groups for testing, the classification accuracy was 97.14%. Conclusion: This study compared the prediction effects of various models on the ¹³¹I therapeutic dose in patients with hyperthyroidism and the accuracy of prognosis classification. The BP neural network and random forest achieved the best results, respectively. The two models provide a reference for clinicians when giving the dose, which has clinical practical significance.
... The SVM uses vectors (separators) to divide the training data into areas (categories) that are as far apart as possible [49]. It assigns test data to a certain category according to the area they fall into relative to a particular vector, and uses a subset of the training points when making decisions in the classification process. ...
Article
Full-text available
Because of continuous competition in the corporate industrial sector, numerous companies are always looking for strategies to ensure timely product delivery to survive against their competitors. For this reason, logistics play a significant role in the warehousing, shipments, and transportation of the products. Therefore, the high utilization of resources can improve the profit margins and reduce unnecessary storage or shipping costs. One significant issue in shipments is the Pallet Loading Problem (PLP) which can generally be solved by seeking to maximize the total number of boxes to be loaded on a pallet. In many previous studies, various solutions for the PLP have been suggested in the context of logistics and shipment delivery systems. In this paper, a novel two-phase approach is presented by utilizing a number of Machine Learning (ML) models to tackle the PLP. The dataset utilized in this study was obtained from the DHL supply chain system. According to the training and testing of various ML models, our results show that a very high (>85%) Pallet Utilization Volume (PUV) was obtained, and an accuracy of >89% was determined to predict an accurate loading arrangement of boxes on a suitable pallet. Furthermore, a comprehensive analysis of all the results on the basis of a comparison of several ML models is provided in order to show the efficacy of the proposed methodology.
... Methods include linear discriminant analysis, decision trees, random forest (RF), 63 artificial neural networks (ANNs), [64][65][66][67] and kernel-based methods like the support vector machine (SVM) classifier. [68][69][70][71] Other relevant algorithms are linear and partial regression methods, such as PLS and LDA, which have problems generalizing to larger patient datasets. 12,[60][61][62] In general, these algorithms face high dimensionality, a limited number of samples, and inter-patient spectral variability when applied to medical HSI data. ...
Article
Full-text available
New developments in instrumentation and data analysis have further improved the perspectives of hyperspectral imaging in clinical use. Thus, hyperspectral imaging can be considered as "Next Generation Imaging" for future clinical research. As a contactless, non-invasive method with short process times of just a few seconds, it quantifies predefined substance classes. Results of hyperspectral imaging may support the detection of carcinomas and the classification of different tissue structures as well as the assessment of tissue blood flow. Taken together, this method combines the principle of spectroscopy with imaging using conventional visual cameras. Compared to other optical imaging methods, hyperspectral imaging also analyses deeper layers of tissue.
... SVM is a computational algorithm that learns from experience and examples to assign labels to targets. Its basic function is to separate binary-labeled data with a boundary that maximizes the distance between the labeled classes [25]. SVM has good accuracy with limited samples [26]. ...
Article
Full-text available
Objectives: Previous researches have demonstrated that abnormal functional connectivity (FC) is associated with the pathophysiology of bipolar disorder (BD). However, inconsistent results were obtained due to different selections of regions of interest in previous researches. This study is aimed at examining voxel-wise brain-wide functional connectivity (FC) alterations in the first-episode, drug-naive patient with BD in an unbiased way. Methods: A total of 35 patients with BD and 37 age-, sex-, and education-matched healthy controls underwent resting-state functional magnetic resonance imaging (rs-fMRI). Global-brain FC (GFC) was applied to analyze the image data. Support vector machine (SVM) was adopted to probe whether GFC abnormalities could be used to identify the patients from the controls. Results: Patients with BD exhibited increased GFC in the left inferior frontal gyrus (LIFG), pars triangularis and left precuneus (PCu)/superior occipital gyrus (SOG). The left PCu belongs to the default mode network (DMN). Furthermore, increased GFC in the LIFG, pars triangularis was positively correlated with the triglycerides (TG) and low-density lipoprotein cholesterol (LDL-C) and negatively correlated with the scores of the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS) coding test and Stroop color. Increased GFC values in the left PCu/SOG can be applied to discriminate patients from controls with preferable sensitivity (80.00%), specificity (75.68%), and accuracy (77.78%). Conclusions: This study found increased GFC in the brain regions of DMN; LIFG, pars triangularis; and LSOG, which was associated with dyslipidemia and cognitive impairment in patients with BD. Moreover, increased GFC values in the left PCu/SOG may be utilized as a potential biomarker to differentiate patients with BD from controls.
... Classification is a part of machine learning, in which classification is a learning process of a target function f that maps each attribute set x to one of the predefined dependent class labels [7]. The stages in classification include training and testing. ...
Article
Full-text available
Universities routinely conduct a tracer study every year, which serves to meet accreditation data requirements and to improve teaching and curriculum development so that graduate quality can be increased. Graduate quality can be seen from how readily graduates obtain employment after leaving the university. The more readily graduates find work, the better the graduate quality is considered to be; conversely, the less readily they do, the lower the quality is considered. This study aims to classify job waiting time in order to determine how readily alumni obtain employment, using the Support Vector Machines (SVM) and Backpropagation Neural Network (BPNN) classification methods. Both classification methods, BPNN and SVM with the ANOVA kernel function, can classify the tracer-study data according to how readily alumni obtain employment (readily or not readily) with nearly equal accuracy, namely 83.33% for BPNN and 83.00% for SVM. It is hoped that by knowing the factors that classify how readily graduates obtain employment, the university can adopt relevant policies so that graduate quality will continue to improve.
... For more than a decade, ML has been a powerful tool applied to all scientific fields, including speech recognition [14], translation between languages [17], emotion recognition [18], autonomous navigation of vehicles [19], product recommendations [20] and image processing [7,21]. Notably, there is a fast-growing trend of using ML algorithms in the health care industry [22][23][24][25][26]. It has emerged as an effective way of using medical imaging and clinical data to increase the accuracy of detecting a wide range of medical diseases. ...
Article
Full-text available
Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that is increasingly applied to several medical diagnosis tasks, including a wide range of diseases. Importantly, various ML models were developed to address the complexity of Parkinson’s Disease (PD) diagnosis. PD is a neurodegenerative disease characterized by motor and non-motor disorders whose symptoms affect the daily lives of patients. Several Computer Aided Diagnosis and Detection (CADD) systems based on hand-crafted ML algorithms achieved promising results in distinguishing PD patients from Healthy Control (HC) subjects and other Parkinsonian syndrome categories using clinical data (e.g., speech and gait impairments) and medical imaging [e.g., Positron Emission Tomography (PET) and Single Photon Emission Computed Tomography (SPECT)]. Despite the good performance of hand-crafted ML algorithms, there is still a problem linked to feature extraction and selection. In fact, Deep Learning (DL) has provided an ultimate solution for the feature extraction and selection issue. A considerable number of studies on the diagnosis of PD using DL algorithms have been developed recently. This study provides an overview of the application of hand-crafted ML algorithms and DL techniques for PD diagnosis. It also introduces key concepts for understanding the application of ML methods to diagnose PD.
... Support vector machines can be used for DNA splice-site prediction [14], DNA methylation prediction [26], protein structure prediction [27], cancer classification [28], and so on. ... [11,31] and protein-protein interactions [32]. ...
Article
Full-text available
The goal of machine learning is to design algorithms that continuously improve their performance based on prior knowledge and observed data. Such algorithms can help machines extract knowledge from large amounts of data and thereby improve their performance on specific tasks. As a data-driven approach, machine learning can make effective use of the large volumes of biological data generated by high-throughput experimental technologies to achieve functional prediction and intelligent design of synthetic organisms, changing the research paradigm of synthetic biology. This article first introduces several machine learning models and methods widely used in synthetic biology, such as support vector machines, neural networks, generative adversarial networks, and deep reinforcement learning. It then introduces typical applications of machine learning methods in synthetic biology, such as promoter prediction, enzyme catalysis design, metabolic pathway construction, and gene circuit design. This review surveys machine learning methods and applications for synthetic biology, and aims to inspire readers on how to select and design machine learning methods for synthetic biology research.
... Traditional machine learning methods have great limitations when dealing with unprocessed data [21]. Machine learning researchers need considerable professional domain knowledge: they must perform complex preprocessing of the task data, design a corresponding feature extractor, convert the original image information into feature vectors, and then feed the resulting feature vectors into the corresponding classifier to output the target category. ...
Article
Full-text available
Gesture control, as a new type of interactive method, has the characteristics of rich expressiveness, convenient control, and speed. It has huge application prospects in entertainment, home furnishing, and industry. Gesture recognition is the basis of gesture control. Gesture recognition technology based on visual detection acquires gesture information in a non-contact manner, which gives the operator a better operating experience and is favored by scholars at home and abroad. In order to fully understand the existing research methods of visual gesture recognition, the basic process of visual gesture recognition is first explained. According to the principle of the recognition method, it is divided into gesture recognition based on traditional methods and gesture recognition based on deep learning, and the specific methods are analyzed and summarized in detail. Finally, the technical difficulties of visual gesture recognition are analyzed and discussed, and the development trend of vision-based gesture recognition is outlined.
... Clustering algorithms typically group genes (or samples) in clusters of similar expression profiles to identify possible functional relationships between them. Of particular importance are graphical representations of the clusters and their automatic annotation from available genome databases (Eisen et al., 1998;Furey et al., 2000;Golub et al., 1999;Pe'er et al., 2002;Wu et al., 2000;Zhou et al., 2002). Similar problems are found in the analysis of large networks, where you try to extract subnets that ...
Chapter
The possible applications of modeling and simulation in the field of bioinformatics are very extensive, ranging from understanding basic metabolic paths to exploring genetic variability. Experimental results carried out with DNA microarrays allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. In this chapter, the authors examine various methods for analyzing gene expression data, addressing the important topics of (1) selecting the most differentially expressed genes, (2) grouping them by means of their relationships, and (3) classifying samples based on gene expressions.
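One common way to detect groups of genes with similar expression patterns, as discussed in this chapter, is hierarchical clustering with a correlation-based distance; the following sketch on a synthetic expression matrix is illustrative only.

# Sketch: average-linkage hierarchical clustering of gene expression profiles.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
pattern_a = np.array([1., 1., 1., 1., -1., -1., -1., -1.])   # up early, down late
pattern_b = -pattern_a                                        # the opposite trend
# rows = genes, columns = conditions; two synthetic co-expressed groups of 20 genes
expr = np.vstack([pattern_a + rng.normal(scale=0.2, size=(20, 8)),
                  pattern_b + rng.normal(scale=0.2, size=(20, 8))])

# correlation distance is a common choice for expression profiles
tree = linkage(expr, method="average", metric="correlation")
labels = fcluster(tree, t=2, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])   # expect two groups of 20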
... In classification, the classes, two or more, (e.g., healthy individuals vs. diseased), are predefined and a classifier is built to discriminate between the classes in future applications [17,18], most notably screening and diagnosis [19]. A wide variety of supervised methods have been designed for classification, including Neural Networks [20], Support Vector Machines [21], Graphical Models [22], genetic algorithms [23], nearest neighbour classifiers and many other statistical methods such as shrunken centroids [24] and Partial Least Squares and Discriminant analysis [25]. Due to the large number of features given as input to the various classifiers, a subsequent problem is to select the subset of features that can be used efficiently by the classifier. ...
Article
Full-text available
Meta-analysis is a valuable tool for the synthesis of evidence across a wide range of study types including high-throughput experiments such as genome-wide association studies (GWAS) and gene expression studies. There are situations, though, in which we have multiple outcomes or multiple treatments, where the multivariate meta-analysis framework, which performs joint modeling of the different quantities of interest, may offer important advantages, such as increasing statistical power and allowing global tests to be performed. In this work we adapted the multivariate meta-analysis method and applied it to gene expression data. With this method we can test for pleiotropic effects, that is, for genes that influence both outcomes, or discover genes that have a change in expression not detectable with the univariate method. We tested this method on data regarding inflammatory bowel disease (IBD), with its two main forms, Crohn’s disease (CD) and Ulcerative colitis (UC), sharing many clinical manifestations, but differing in the location and extent of inflammation and in complications. The Stata code is given in the Appendix and it is available at: www.compgen.org/tools/multivariate-microarrays. • Multivariate meta-analysis method for gene expression data. • Discover genes with pleiotropic effects. • Differentially Expressed Genes (DEGs) identification in complex traits. Method name: Multivariate meta-analysis, Keywords: Multiple outcome, Pleiotropic effects, Microarrays, Meta-analysis
... SVMs have been utilized extensively in oncology for diagnosis and disease staging from radiological and tissue data (99)(100)(101)(102)(103)(104)(105)(106)(107). They have also been utilized for tumor typing from tissue microarray gene expression data, which, because of their high dimensionality, can be problematic for traditional statistical models (108)(109)(110)(111). Outside of oncology, SVMs have shown promise for neuroimaging diagnostics, including for dementia (112) and autism spectrum disorder (113)(114)(115). ...
Article
Machine learning is a branch of computer science that has the potential to transform epidemiological sciences. Amid a growing focus on "Big Data," it offers epidemiologists new tools to tackle problems for which classical methods are not well-suited. In order to critically evaluate the value of integrating machine learning algorithms and existing methods, however, it is essential to address language and technical barriers between the two fields that can make it difficult for epidemiologists to read and assess machine learning studies. Here, we provide an overview of the concepts and terminology used in machine learning literature, which encompasses a diverse set of tools with goals ranging from prediction, to classification, to clustering. We provide a brief introduction to five common machine learning algorithms and four ensemble-based approaches. We then summarize epidemiological applications of machine learning techniques in the published literature. We recommend approaches to incorporate machine learning in epidemiological research and discuss opportunities and challenges for integrating machine learning and existing epidemiological research methods.
... Therefore, the feature selection process selects the best subset of features independently before placing it in the learning algorithm to classify the dataset. Usually, the selection process evaluates each feature individually [58,59]. Thus, the relevance between features is not considered at all. ...
Article
Full-text available
In online social networks, spam profiles represent one of the most serious security threats over the Internet; if they do not stop producing bad advertisements, they can be exploited by criminals for various purposes. This article addresses the nature and the characteristics of spam profiles in a social network like Twitter to improve spam detection, based on a number of publicly available language-independent features. In order to investigate the effectiveness of these features in spam detection, four datasets are extracted for four different language contexts (i.e. Arabic, English, Korean and Spanish), and a fifth is formed by combining them all. We conduct our experiments using a set of five well-known classification algorithms in spam detection field, k-Nearest Neighbours ( k-NN), Random Forest (RF), Naive Bayes (NB), Decision Tree (DT) (J48) and Multilayer Perceptron (MLP) classifiers, along with five filter-based feature selection methods, namely, Information Gain, Chi-square, ReliefF, Correlation and Significance. The results show oscillating performance of each classifier across all datasets, but improved classification results with feature selection. In addition, detailed analysis and comparisons are carried out on two different levels: in the first level, we compare the selected features’ importance among the feature selection methods, whereas in the second level, we observe the relations and the importance of the selected features across all datasets. The findings of this article lead to a better understanding of social spam and improving detection methods by considering the various important features resulting from the different lingual contexts.
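A filter-based selection step of the kind discussed above scores each feature individually before classification; the sketch below (toy data, chi-squared scores, a naive Bayes classifier) is one possible arrangement, not the article's exact pipeline.

# Sketch: univariate (filter) feature selection followed by a classifier, evaluated by CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)
X = np.abs(X)                      # chi2 requires non-negative feature values

pipe = make_pipeline(SelectKBest(chi2, k=20), MultinomialNB())
scores = cross_val_score(pipe, X, y, cv=5)
print("accuracy with top-20 chi2 features: %.3f" % scores.mean())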
... Moreover, the model also has poor portability. Statistical machine learning methods mainly include: Hidden Markov Model (HMM) [19], Maximum Entropy Markov model (MEMM) [20], Support Vector Machine (SVM) [21], Conditional Random Fields (CRF) [22], etc. These methods mainly analyze the language information and mine the language features from the training corpus. ...
Article
Full-text available
The accumulation and explosive growth of electronic medical records (EMRs) make named entity recognition (NER) technologies critical for the meaningful use of EMR data and, in turn, the practice of evidence-based medicine. The dominant NER approaches use distributed representations of words and characters to build deep learning-based NER models. However, for the task of biomedical named entity recognition, there is a large number of complicated medical terminologies composed of multiple words. Splitting these terminologies to learn the word and character embeddings might cause semantic ambiguities. In this paper, we treat each medical terminology as a concept and propose a concept-enhanced named entity recognition model (CNER), where features from three different granularities (i.e., concept, word, and character) are combined for bio-NER. Extensive experiments are conducted on two real-world corpora: a fully labeled corpus and a partially labeled corpus. CNER achieves the highest F1 score (fully labeled corpus: precision = 88.23, recall = 88.29, and F1 = 88.26; partially labeled corpus: precision = 87.03, recall = 88.19, and F1 = 87.61), outperforming the baseline CW-BLSTM-CRF approach by 0.58% and 1.15% respectively, which demonstrates the effectiveness of the proposed approach.
... Aside from reduced performance levels, another limitation of the k-nearest neighbor algorithm is that it is computationally expensive in terms of processing time and storage requirements, as no model is actually trained and distances must be calculated for every class. Support vector classifiers are robust and have been used for cancer (Furey et al., 2000; Guyon et al., 2002), image (Chapelle et al., 1999) and audio (Guo and Li, 2003) classification, and for identifying smokers compared to non-smokers (Pariyadath et al., 2014). In general, because of their ability to operate in high-dimensional spaces, support vector classifiers have few drawbacks, with the exception of high processing times and memory consumption during the training and classification stages (Khan et al., 2010). ...
Article
Full-text available
Neuroimaging research is growing rapidly, providing expansive resources for synthesizing data. However, navigating these dense resources is complicated by the volume of research articles and variety of experimental designs implemented across studies. The advent of machine learning algorithms and text-mining techniques has advanced automated labeling of published articles in biomedical research to alleviate such obstacles. As of yet, a comprehensive examination of document features and classifier techniques for annotating neuroimaging articles has yet to be undertaken. Here, we evaluated which combination of corpus (abstract-only or full-article text), features (bag-of-words or Cognitive Atlas terms), and classifier (Bernoulli naïve Bayes, k-nearest neighbors, logistic regression, or support vector classifier) resulted in the highest predictive performance in annotating a selection of 2,633 manually annotated neuroimaging articles. We found that, when utilizing full article text, data-driven features derived from the text performed the best, whereas if article abstracts were used for annotation, features derived from the Cognitive Atlas performed better. Additionally, we observed that when features were derived from article text, anatomical terms appeared to be the most frequently utilized for classification purposes and that cognitive concepts can be identified based on similar representations of these anatomical terms. Optimizing parameters for the automated classification of neuroimaging articles may result in a larger proportion of the neuroimaging literature being annotated with labels supporting the meta-analysis of psychological constructs.
... Machine learning is widely used as a method for classification and prediction, with a growing number of applications in human health [1]. The use of machine learning in biological fields [2,3], and more specifically the microbiome research field [4][5][6][7], has grown exponentially owing to the robustness of these algorithms to high-dimensional data. However, challenges exist for large-scale meta-analysis because they often require manual curation of metadata and standardized processing of raw sequence data, resulting in variation in the results derived from chosen datasets across studies [8,9]. ...
Article
Full-text available
The use of machine learning in high-dimensional biological applications, such as the human microbiome, has grown exponentially in recent years, but algorithm developers often lack the domain expertise required for interpretation and curation of the heterogeneous microbiome datasets. We present Microbiome Learning Repo (ML Repo, available at https://knights-lab.github.io/MLRepo/), a public, web-based repository of 33 curated classification and regression tasks from 15 published human microbiome datasets. We highlight the use of ML Repo in several use cases to demonstrate its wide application, and we expect it to be an important resource for algorithm developers.
... Admittedly, SVM has some limitations: for instance, it is not statistically effective for problems with low variable dimensions and small sample sizes; it is time-consuming, requiring a substantial amount of computation to identify the optimal model; it does not perform well if the data set has considerable noise (Dosenbach et al. 2010); and the visualization of SVM's computational process may not be as clear as that of other traditional statistical methods, which is why it is sometimes described as a 'black box' (Wei and Li 2010). Although these concerns may limit its suitable applications, SVM has gained great popularity and has exhibited impressive achievements in several research areas, including biomedicine (Furey et al., 2000), education (Huang et al. 2007), management (Tay and Cao 2001), neuroscience (Amari and Wu 1999), and recently, the application of deep learning (Kim et al. 2015). The SVM approach has not been used as widely in the humanities and social sciences as it has in science, which is likely due to its lack of an intuitive probabilistic model for data interpretation and inference. ...
Article
Full-text available
Science excellence is associated not only with a student’s inherent aptitude but also a range of contextual factors. The objective of this paper was to identify the most important contextual characteristics of top performers in scientific literacy, by simultaneously considering factors at the PISA questionnaire-based student, family, and school levels. The data were based on the science scores of 380,771 PISA 2015 secondary students from 58 countries/economies, of whom 25,181 were top performers at proficiency level 5 or 6, as well as the responses of students and school principals to PISA questionnaires. Overall, 141 contextual variables (derived from the questionnaire responses) were ranked according to their relevance to top performers through a machine learning algorithm—specifically, support vector machine recursive feature elimination (SVM-RFE). An optimal set of 20 features (factors/variables) was then selected from the ranked list due to the high accuracy of these features in classifying and predicting top performers compared to non-top performers based on the support vector machine (SVM) classifier. The research findings indicate that the quality of teachers’ instructional practices, parents’ educational/occupational status, disciplinary climate, time spent on and involvement in learning, schools’ mass media facilities/equipment, the quantity of teachers in the school, and students’ self-efficacy played the most predictive roles in the target students’ superior performance in science. The features identified in this study may provide important information for the future studies on students’ performance in science literacy.
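SVM-RFE, the ranking method used in this study, repeatedly fits a linear SVM and discards the lowest-weighted features; a minimal sketch on synthetic data (not the PISA variables) is given below.

# Sketch: SVM-RFE feature ranking with a linear SVM on stand-in data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           random_state=0)
svm = SVC(kernel="linear", C=1.0)                 # RFE needs a linear kernel (weights)
rfe = RFE(estimator=svm, n_features_to_select=20, step=1)
rfe.fit(X, y)

ranked = np.argsort(rfe.ranking_)                 # best-ranked features first
print("top 20 feature indices:", ranked[:20])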
Article
Full-text available
A key mechanistic hypothesis for the evolution of division of labour in social insects is that a shared set of genes co-opted from a common solitary ancestral ground plan (a genetic toolkit for sociality) regulates caste differentiation across levels of social complexity. Using brain transcriptome data from nine species of vespid wasps, we test for overlap in differentially expressed caste genes and use machine learning models to predict castes using different gene sets. We find evidence of a shared genetic toolkit across species representing different levels of social complexity. We also find evidence of additional fine-scale differences in predictive gene sets, functional enrichment and rates of gene evolution that are related to level of social complexity, lineage and mode of colony founding. These results suggest that the concept of a shared genetic toolkit for sociality may be too simplistic to fully describe the process of the major transition to sociality.
Article
Full-text available
Image processing is the technique which can present information stored in the form of pixels. Plant disease detection is the technique which can detect disease from the leaf. Plant disease detection algorithms have various steps such as pre-processing, feature extraction, segmentation and classification. The KNN classifier technique is applied, which can classify input data into certain classes. The performance of the KNN classifier is analyzed, showing that it has higher accuracy and lower fault detection compared to other techniques.
Article
Full-text available
Cancer cell lines have been widely used for decades to study biological processes driving cancer development, and to identify biomarkers of response to therapeutic agents. Advances in genomic sequencing have made possible large-scale genomic characterizations of collections of cancer cell lines and primary tumors, such as the Cancer Cell Line Encyclopedia (CCLE) and The Cancer Genome Atlas (TCGA). These studies allow for the first time a comprehensive evaluation of the comparability of cancer cell lines and primary tumors on the genomic and proteomic level. Here we employ bulk mRNA and micro-RNA sequencing data from thousands of samples in CCLE and TCGA, and proteomic data from partner studies in the MD Anderson Cell Line Project (MCLP) and The Cancer Proteome Atlas (TCPA), to characterize the extent to which cancer cell lines recapitulate tumors. We identify dysregulation of a long non-coding RNA and microRNA regulatory network in cancer cell lines, associated with differential expression between cell lines and primary tumors in four key cancer driver pathways: KRAS signaling, NFKB signaling, IL2/STAT5 signaling and TP53 signaling. Our results emphasize the necessity for careful interpretation of cancer cell line experiments, particularly with respect to therapeutic treatments targeting these important cancer pathways.
Article
Full-text available
Acute lung injury (ALI) is a serious respiratory disease, which can lead to acute respiratory failure or death. It is closely related to the pathogenesis of New Coronavirus pneumonia (COVID-19). Many studies have shown that traditional Chinese medicine (TCM) had a good effect on its intervention, and network pharmacology could play a very important role. In order to construct the "disease-gene-target-drug" interaction network more accurately, a deep learning algorithm is utilized in this paper. Two ALI-related target genes (REAL and SATA3) are considered, and the active and inactive compounds of the two corresponding target genes are collected as training data, respectively. Molecular descriptors and molecular fingerprints are utilized to characterize each compound. A forest graph embedded deep feed forward network (forgeNet) is proposed and trained. The experimental results show that forgeNet performs better than support vector machines (SVM), random forest (RF), logistic regression (LR), Naive Bayes (NB), XGBoost, LightGBM and gcForest. forgeNet could identify 19 compounds in Erhuang decoction (EhD) and Dexamethasone (DXMS) more accurately.
Chapter
Full-text available
The history of artificial intelligence in medicine (AIM) is intimately tied to the history of AI itself, since some of the earliest work in applied AI dealt with biomedicine. This chapter first provides a brief overview of the early history of AI, but then focuses on AI in medicine (and in human biology) and provides a summary of how the field has evolved since the earliest recognition of the potential role of computers in the modeling of medical reasoning and in the support of clinical decision making. The growth of medical AI has been influenced not only by the evolution of AI itself, but also by the remarkable changes in computing and communication technologies. Accordingly, this chapter anticipates many of the topics that are covered in subsequent chapters, providing a concise overview that lays out the concepts and progression that are reflected in the rest of this volume. Keywords: Artificial intelligence history; AIM history; AI winter; Roles of knowledge and data in AIM; Modeling expertise; Expert systems; Data science; Machine learning; AIM and clinical decision support
Article
Artificial Intelligence (AI) is a branch of computer science that includes research in robotics, language recognition, image recognition, natural language processing, and expert systems. AI is poised to change medical practice, and oncology is not an exception to this trend. As a matter of fact, lung cancer has the highest morbidity and mortality worldwide. The leading cause is the difficulty of associating early pulmonary nodules with neoplastic changes, together with the numerous factors that complicate treatment choice and lead to poor prognosis. AI can effectively enhance the diagnostic efficiency of lung cancer while providing optimal treatment and evaluating prognosis, thereby reducing mortality. This review seeks to provide an overview of AI relevant to all the fields of lung cancer. We define the core concepts of AI and cover the basics of the functioning of natural language processing, image recognition, human-computer interaction and machine learning. We also discuss the most recent breakthroughs in AI technologies and their clinical application regarding diagnosis, treatment, and prognosis in lung cancer. Finally, we highlight the future challenges of AI in lung cancer and its impact on medical practice.
Chapter
Nature-inspired computing (NIC) is a fascinating computing paradigm that applies the methodology and approaches of nature that addresses various realtime complex problems ranging from how an organism finds its prey to genetic evolution. One of the unique features is that it has been provisioned with a decentralized control of computational activities naturally. In this chapter, to have a better insight of such NIC-based algorithms, the problem of identifying the breast cancer is used to exhibit their performances. Also, this chapter briefs about the application of stand-alone and hybridized approaches to identify the disease. Finally, it concludes with the experimental results and other statistical measures. The purpose of the chapter is to guide the new researcher in the area to get inspired from the conventional works and to bring out a new advanced approach that can perform further better. The three swarm algorithms are ant colony optimization, firefly, and particle swarm optimization algorithm. These swarm algorithms were used to optimize the support vector machine (SVM) which was trained to classify the malignant and benign images from the Wisconsin breast cancer dataset. With respect to the experimental results, it has been found that naïve algorithm with PSO optimization demonstrates better discriminating property of the underlying conventional classifiers.
Article
Full-text available
Tinnitus is an auditory phantom perception in the absence of an external sound stimulus. People with tinnitus often report severe constraints in their daily life. Interestingly, there are indications of differences between women and men both in the symptom profile and in the response to specific tinnitus treatments. In this paper, data from the TrackYourTinnitus (TYT) platform were analyzed to investigate whether the gender of users can be predicted. The TYT mobile health crowdsensing platform was developed to demystify the daily and momentary variations of tinnitus symptoms over time. The goal of the presented investigation is a better understanding of gender-related differences in the symptom profiles of TYT users. Based on two TYT questionnaires, four machine learning based classifiers were trained and analyzed. With respect to the provided daily answers, the gender of TYT users can be predicted with an accuracy of 81.7%. In this context, worries, difficulties in concentration, and irritability towards the family are the three most important characteristics for predicting gender. Note that, in contrast to existing studies on TYT, the daily answers to the worst-symptom question were investigated in detail for the first time. It was found that results of this question contribute significantly to the prediction of the gender of TYT users. Overall, our findings indicate gender-related differences in tinnitus and tinnitus-related symptoms. Based on evidence that gender influences the development of tinnitus, the gathered insights can be considered relevant and justify further investigations in this direction.
Article
Full-text available
Much research has been done on financial market time series over the last two decades using linear and non-linear correlations of stock returns. In this paper, we design a method of network reconstruction for the financial market using insights from machine learning tools. To do so, we analyze the time series of financial indices of the S&P 500 around several financial crises from 1998 to 2012, using a feature ranking approach in which the returns of stocks on a given day are used to predict the feature ranks of the next day. We use two different feature ranking approaches, Random Forest and Gradient Boosting, to rank the importance of each node for predicting the returns of every other node, which produces the feature ranking matrix. To construct the threshold network, we assign a threshold equal to the mean of the feature ranking matrix. The dynamics of network topology in threshold networks constructed by the new approach can identify the financial crises covered by the monitored time series. We observe that the most influential companies during the global financial crisis were in the energy and financial services sectors, while during the European debt crisis they were in communication services. The Shannon entropy calculated from the feature ranking is seen to increase over time before a market crash. The rise of entropy implies that the influences of stocks on each other are becoming equal, and it can therefore be used as a precursor of a market crash. The feature ranking technique can be an alternative way to infer a more accurate network structure for the financial market than existing methods and can be used for its further development.
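The feature-ranking reconstruction described above can be sketched roughly as follows, assuming scikit-learn and a synthetic return matrix in place of the S&P 500 data; the threshold rule (matrix mean) follows the paper's description, while the forest size and data dimensions are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
returns = rng.normal(size=(250, 8))          # synthetic daily returns: 250 days x 8 stocks

# Feature ranking matrix: row i holds the importance of every other stock
# for predicting stock i's next-day return from today's returns.
n = returns.shape[1]
ranking = np.zeros((n, n))
for i in range(n):
    X = np.delete(returns[:-1], i, axis=1)   # other stocks, day t
    y = returns[1:, i]                       # stock i, day t+1
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    ranking[i, np.arange(n) != i] = rf.feature_importances_

# Threshold network: keep links whose importance exceeds the matrix mean.
adjacency = ranking > ranking.mean()

# Shannon entropy of the normalized ranking matrix; rising entropy means
# influences are becoming more uniform across stocks.
p = ranking / ranking.sum()
entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
print(adjacency.sum(), "edges; entropy =", round(entropy, 3))
```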
Article
Full-text available
This paper addresses the development of predictive models for distinguishing pre-symptomatic infections from uninfected individuals. Our machine learning experiments are conducted on publicly available challenge studies that collected whole-blood transcriptomics data from individuals infected with HRV, RSV, H1N1, and H3N2. We address the problem of identifying discriminatory biomarkers between controls and eventual shedders in the first 32 h post-infection. Our exploratory analysis shows that the most discriminatory biomarkers exhibit a strong dependence on time over the course of the human response to infection. We visualize the feature sets to provide evidence of the rapid evolution of the gene expression profiles. To quantify this observation, we partition the data in the first 32 h into four equal time windows of 8 h each and identify all discriminatory biomarkers using sparsity-promoting classifiers and Iterated Feature Removal. We then perform a comparative machine learning classification analysis using linear support vector machines, artificial neural networks and Centroid-Encoder. We present a range of experiments on different groupings of the diseases to demonstrate the robustness of the resulting models.
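A minimal sketch of the iterated-removal idea, assuming an L1-penalized linear SVM as the sparsity-promoting classifier and synthetic expression data; it is not the authors' Iterated Feature Removal code.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))               # 60 samples x 500 genes (synthetic)
y = rng.integers(0, 2, size=60)              # control vs. eventual shedder labels
X[y == 1, :5] += 1.5                         # plant a small discriminatory signal

remaining = np.arange(X.shape[1])
selected_sets = []
for _ in range(3):                           # a few removal passes
    clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000)
    clf.fit(X[:, remaining], y)
    nonzero = remaining[np.flatnonzero(clf.coef_[0])]
    if nonzero.size == 0:
        break
    selected_sets.append(nonzero)            # one discriminatory biomarker set
    remaining = np.setdiff1d(remaining, nonzero)   # remove those features and refit

print([s.size for s in selected_sets], "features per pass")
```

Each pass keeps only the features the sparse classifier actually uses, removes them, and refits, so successive passes expose additional, weaker discriminatory sets.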
Article
Full-text available
Cancer diagnosis using machine learning algorithms is one of the main topics of research in computer-based medical science. Prostate cancer is one of the leading causes of cancer deaths worldwide. Data analysis of gene expression from microarrays using machine learning and soft computing algorithms is a useful tool for detecting prostate cancer in medical diagnosis. Even though traditional machine learning methods have been successfully applied for detecting prostate cancer, the large number of attributes combined with the small sample size of microarray data is still a challenge that limits their effectiveness for medical diagnosis. Selecting a subset of relevant features and choosing an appropriate machine learning method can exploit the information of microarray data to improve the accuracy of detection. In this paper, we propose to use a correlation feature selection (CFS) method with random committee (RC) ensemble learning to detect prostate cancer from microarray gene expression data. A set of experiments is conducted on a public benchmark dataset using a 10-fold cross-validation technique to evaluate the proposed approach. The experimental results revealed that the proposed approach attains 95.098% accuracy, which is higher than related methods on the same dataset.
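Weka's CFS and RandomCommittee implementations are not reproduced here, but the pipeline can be approximated in scikit-learn with a greedy correlation-based merit search followed by a randomized tree ensemble under 10-fold cross-validation; the synthetic data, feature count, and ensemble choice below are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(102, 2000))              # synthetic stand-in for the microarray data
y = rng.integers(0, 2, size=102)

# Feature-class correlations used by the CFS-style merit function.
r_cf = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

def merit(subset):
    """CFS-style merit: k*mean(r_cf) / sqrt(k + k*(k-1)*mean(r_ff))."""
    k = len(subset)
    rcf = r_cf[subset].mean()
    if k == 1:
        return rcf
    corr = np.abs(np.corrcoef(X[:, subset], rowvar=False))
    rff = (corr.sum() - k) / (k * (k - 1))    # mean off-diagonal feature-feature correlation
    return k * rcf / np.sqrt(k + k * (k - 1) * rff)

# Greedy forward selection of 20 features by merit.
selected = [int(r_cf.argmax())]
candidates = set(range(X.shape[1])) - set(selected)
for _ in range(19):
    best = max(candidates, key=lambda j: merit(selected + [j]))
    selected.append(best)
    candidates.remove(best)

# Randomized tree committee evaluated with 10-fold cross-validation.
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X[:, selected], y, cv=10).mean())
```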
Article
Full-text available
Biomarker selection and cancer classification play an important role in knowledge discovery using genomic data. Successful identification of gene biomarkers and biological pathways can significantly improve the accuracy of diagnosis and help machine learning models perform better on classification of different types of cancer. In this paper, we propose a LogSum + L2 penalized logistic regression model and use a coordinate descent algorithm to solve it. The results of simulations and real experiments indicate that the proposed method is highly competitive among several state-of-the-art methods. Our proposed model achieves excellent performance on group feature selection and classification problems.
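In outline (the relative weighting of the two penalties and the smoothing constant ε are assumptions, not taken from the paper), the objective being minimized combines the logistic negative log-likelihood with a LogSum term and a ridge term:

\min_{\beta_0,\boldsymbol{\beta}} \; -\frac{1}{n}\sum_{i=1}^{n}\Big[ y_i\big(\beta_0+\mathbf{x}_i^{\top}\boldsymbol{\beta}\big) - \log\big(1+e^{\beta_0+\mathbf{x}_i^{\top}\boldsymbol{\beta}}\big) \Big] \;+\; \lambda_1 \sum_{j=1}^{p} \log\big(|\beta_j|+\varepsilon\big) \;+\; \lambda_2 \lVert\boldsymbol{\beta}\rVert_2^2 .

Coordinate descent then cycles through the coefficients, updating one β_j at a time while holding the others fixed.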
Article
Full-text available
Deep learning analysis of images and text unfolds new horizons in medicine. However, analysis of transcriptomic data, the cause of biological and pathological changes, is hampered by structural complexity distinctive from images and text. Here we conduct unsupervised training on more than 20,000 human normal and tumor transcriptomic data and show that the resulting Deep-Autoencoder, DeepT2Vec, has successfully extracted informative features and embedded transcriptomes into 30-dimensional Transcriptomic Feature Vectors (TFVs). We demonstrate that the TFVs could recapitulate expression patterns and be used to track tissue origins. Trained on these extracted features only, a supervised classifier, DeepC, can effectively distinguish tumors from normal samples with an accuracy of 90% for Pan-Cancer and reach an average 94% for specific cancers. Training on a connected network, the accuracy is further increased to 96% for Pan-Cancer. Together, our study shows that deep learning with autoencoder is suitable for transcriptomic analysis, and DeepT2Vec could be successfully applied to distinguish cancers, normal tissues, and other potential traits with limited samples.
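A minimal sketch of the autoencoder-then-classifier pattern described above, assuming TensorFlow/Keras and synthetic data; the layer sizes, training settings, and variable names are assumptions, and this is not the DeepT2Vec architecture itself.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_samples, n_genes = 200, 5000
X = np.random.rand(n_samples, n_genes).astype("float32")   # stand-in expression matrix
y = np.random.randint(0, 2, size=n_samples)                 # tumor vs. normal labels

# Unsupervised stage: compress each transcriptome into a 30-dimensional feature vector.
inputs = keras.Input(shape=(n_genes,))
h = layers.Dense(512, activation="relu")(inputs)
code = layers.Dense(30, name="tfv")(h)                      # 30-d "transcriptomic feature vector"
h = layers.Dense(512, activation="relu")(code)
outputs = layers.Dense(n_genes)(h)
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

# Supervised stage: train a separate classifier on the extracted 30-d features only.
encoder = keras.Model(inputs, code)
features = encoder.predict(X, verbose=0)
clf = keras.Sequential([keras.Input(shape=(30,)),
                        layers.Dense(16, activation="relu"),
                        layers.Dense(1, activation="sigmoid")])
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
clf.fit(features, y, epochs=10, batch_size=32, verbose=0)
```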
Article
Full-text available
The main goal was to apply machine learning (ML) methods to integrated multi-transcriptomic data to identify endometrial genes capable of predicting uterine receptivity from their expression patterns in the cow. Public data from five studies were re-analyzed. In all of them, endometrial samples were obtained at day 6-7 of the estrous cycle, from cows or heifers of four different European breeds, classified as pregnant (n = 26) or not (n = 26). First, gene selection was performed through supervised and unsupervised ML algorithms. Then, the predictive ability of potential key genes was evaluated with a support vector machine classifier, using the expression levels of the samples from all the breeds but one to train the model, and the samples from that one breed to test it. Finally, the biological meaning of the key genes was explored. Fifty genes were identified, and they could predict uterine receptivity with an overall 96.1% accuracy, regardless of the animal's breed and category. Genes with higher expression in the pregnant cows were related to circadian rhythm, the Wnt receptor signaling pathway, and embryonic development. This novel and robust combination of computational tools allowed the identification of a group of biologically relevant endometrial genes that could support pregnancy in the cattle.
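The leave-one-breed-out evaluation described above maps naturally onto scikit-learn's LeaveOneGroupOut splitter; the sketch below uses synthetic data and an assumed RBF-kernel SVM with feature scaling.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(52, 50))          # 52 samples x 50 selected genes (synthetic)
y = rng.integers(0, 2, size=52)        # pregnant vs. not pregnant
breeds = rng.integers(0, 4, size=52)   # four breeds used as groups

# Train on three breeds, test on the held-out breed, for every breed in turn.
logo = LeaveOneGroupOut()
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, groups=breeds, cv=logo)
print("per-breed accuracy:", scores, "mean:", scores.mean())
```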
Article
Full-text available
Non-small cell lung cancer (NSCLC) is one of the most common lung cancers worldwide. Accurate prognostic stratification of NSCLC can become an important clinical reference when designing therapeutic strategies for cancer patients. With this clinical application in mind, we developed a deep neural network (DNN) combining heterogeneous data sources of gene expression and clinical data to accurately predict the overall survival of NSCLC patients. Based on microarray data from a cohort set (614 patients), seven well-known NSCLC biomarkers were used to group patients into biomarker- and biomarker+ subgroups. Then, by using a systems biology approach, prognosis relevance values (PRV) were then calculated to select eight additional novel prognostic gene biomarkers. Finally, the combined 15 biomarkers along with clinical data were then used to develop an integrative DNN via bimodal learning to predict the 5-year survival status of NSCLC patients with tremendously high accuracy (AUC: 0.8163, accuracy: 75.44%). Using the capability of deep learning, we believe that our prediction can be a promising index that helps oncologists and physicians develop personalized therapy and build the foundation of precision medicine in the future.
Article
Full-text available
Current treatments for Alzheimer’s disease are only symptomatic and limited to reducing the rate of mental deterioration. Mild Cognitive Impairment, a transitional stage in which the patient is not cognitively normal but does not meet the criteria for a specific dementia, is associated with a high risk of developing Alzheimer’s disease. Thus, non-invasive techniques to predict an individual’s risk of developing Alzheimer’s disease can be very helpful, considering the possibility of early treatment. Diffusion Tensor Imaging, as an indicator of cerebral white matter integrity, may detect and track early evidence of white matter abnormalities in patients developing Alzheimer’s disease. Here we performed a voxel-based analysis of fractional anisotropy in three classes of subjects: Alzheimer’s disease patients, Mild Cognitive Impairment patients, and healthy controls. We performed Support Vector Machine classification between the three groups, using Fisher Score feature selection and leave-one-out cross-validation. The bilateral intersection of the hippocampal cingulum and the parahippocampal gyrus (referred to as the parahippocampal cingulum) is the region that best discriminates Alzheimer’s disease fractional anisotropy values, resulting in an accuracy of 93% for discriminating between Alzheimer’s disease and controls, and 90% between Alzheimer’s disease and Mild Cognitive Impairment. These results suggest that pattern classification of Diffusion Tensor Imaging can help in the diagnosis of Alzheimer’s disease, especially when focusing on the parahippocampal cingulum.
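A minimal sketch of Fisher-score feature selection followed by leave-one-out SVM classification, assuming scikit-learn and synthetic fractional anisotropy features; for an unbiased estimate the scores would be recomputed inside each cross-validation fold, which this sketch omits for brevity.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 300))      # 40 subjects x 300 voxel-wise FA features (synthetic)
y = rng.integers(0, 2, size=40)     # e.g. Alzheimer's disease vs. controls

def fisher_score(X, y):
    """Per-feature Fisher score: between-class scatter over within-class scatter."""
    overall = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)

top = np.argsort(fisher_score(X, y))[::-1][:20]   # keep the 20 highest-scoring features
acc = cross_val_score(SVC(kernel="linear"), X[:, top], y, cv=LeaveOneOut()).mean()
print("leave-one-out accuracy:", acc)
```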
Chapter
Transcription profiling enables researchers to understand the activity of the genes in various experimental conditions; in human genomics, abnormal gene expression is typically correlated with clinical conditions. An important application is the detection of genes which are most involved in the development of tumors, by contrasting normal and tumor cells of the same patient. Several statistical and machine learning techniques have been applied to cancer detection; more recently, deep learning methods have been attempted, but they have typically failed to match the performance of classical algorithms. In this paper, we design a set of deep learning methods that can achieve similar performance to the best machine learning methods thanks to the use of external information or of data augmentation; we demonstrate this result by comparing the performance of the new methods against several baselines.
Chapter
Image processing techniques extract information stored in the form of pixels, and plant disease detection applies them to identify disease from leaf images. A typical plant disease detection pipeline involves preprocessing, feature extraction, segmentation, and classification. Here the k-nearest neighbor (KNN) classifier is applied to assign input data to classes; its performance is compared with existing techniques, and it is found to offer higher accuracy and fewer misdetections. This paper presents methods that use digital image processing to detect, quantify, and classify plant diseases from digital images in the visible spectrum. In plant leaf classification, a leaf is classified based on its morphological features. Classification techniques used include neural networks, genetic algorithms, support vector machines, and principal component analysis. In this paper, results are compared between the KNN classifier and the SVM classifier.
Article
Full-text available
We present the coronary artery disease (CAD) database, a comprehensive resource comprising 126 papers and 68 datasets relevant to CAD diagnosis, extracted from the scientific literature published between 1992 and 2018. These data were collected to help advance research on CAD-related machine learning and data mining algorithms, and hopefully to ultimately advance clinical diagnosis and early treatment. To aid users, we have also built a web application that presents the database through various reports.
Article
Full-text available
Background Microbiome profiles in the human body and environment niches have become publicly available due to recent advances in high-throughput sequencing technologies. Indeed, recent studies have already identified different microbiome profiles in healthy and sick individuals for a variety of diseases; this suggests that the microbiome profile can be used as a diagnostic tool in identifying the disease states of an individual. However, the high-dimensional nature of metagenomic data poses a significant challenge to existing machine learning models. Consequently, to enable personalized treatments, an efficient framework that can accurately and robustly differentiate between healthy and sick microbiome profiles is needed. Results In this paper, we propose MetaNN (i.e., classification of host phenotypes from Metagenomic data using Neural Networks), a neural network framework which utilizes a new data augmentation technique to mitigate the effects of data over-fitting. Conclusions We show that MetaNN outperforms existing state-of-the-art models in terms of classification accuracy for both synthetic and real metagenomic data. These results pave the way towards developing personalized treatments for microbiome related diseases.
Article
Full-text available
It is difficult to accurately assess axillary lymph node (ALN) metastasis preoperatively, and the diagnosis of axillary lymph node status in patients with breast cancer is invasive and has low sensitivity. This study aims to develop a mammography-based radiomics nomogram for the preoperative prediction of ALN metastasis in patients with breast cancer. The study enrolled 147 patients with clinicopathologically confirmed breast cancer and preoperative mammography. Features were extracted from each patient’s mammography images. The least absolute shrinkage and selection operator (LASSO) regression method was used to select features and build a signature in the primary cohort. The performance of the signature was assessed using support vector machines. We developed a nomogram by incorporating the signature with the clinicopathologic risk factors. The nomogram performance was estimated by its calibration ability in the primary and validation cohorts. The signature consisted of 10 selected ALN-status-related features. The AUC of the signature was 0.895 (95% CI, 0.887–0.909) in the primary cohort and 0.875 (95% CI, 0.698–0.891) in the validation cohort. The C-index of the nomogram was 0.779 (95% CI, 0.752–0.793) in the primary cohort and 0.809 (95% CI, 0.794–0.833) in the validation cohort. Our nomogram is a reliable and non-invasive tool for preoperative prediction of ALN status and can be used to optimize the current treatment strategy for breast cancer patients.
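A rough sketch of the select-with-LASSO, score-with-SVM pattern, assuming scikit-learn and a synthetic radiomic feature matrix; the regularization strength and fold count are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(147, 400))     # 147 patients x 400 mammographic radiomic features (synthetic)
y = rng.integers(0, 2, size=147)    # ALN metastasis yes/no

# L1-penalized (LASSO-style) selection of a sparse radiomic signature,
# then an SVM scored by area under the ROC curve.
pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
    SVC(kernel="rbf"),
)
auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
print("cross-validated AUC:", auc)
```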
Article
Full-text available
Hearing loss (HL) is the most common neurodegenerative disease worldwide. Despite its prevalence, clinical testing does not yield a cell or molecular based identification of the underlying etiology of hearing loss making development of pharmacological or molecular treatments challenging. A key to improving the diagnosis of inner ear disorders is the development of reliable biomarkers for different inner ear diseases. Analysis of microRNAs (miRNA) in tissue and body fluid samples has gained significant momentum as a diagnostic tool for a wide variety of diseases. In previous work, we have shown that miRNA profiling in inner ear perilymph is feasible and may demonstrate distinctive miRNA expression profiles unique to different diseases. A first step in developing miRNAs as biomarkers for inner ear disease is linking patterns of miRNA expression in perilymph to clinically available metrics. Using machine learning (ML), we demonstrate we can build disease specific algorithms that predict the presence of sensorineural hearing loss using only miRNA expression profiles. This methodology not only affords the opportunity to understand what is occurring on a molecular level, but may offer an approach to diagnosing patients with active inner ear disease.
Article
Full-text available
To answer the questions of how information about the physical world is sensed, in what form is information remembered, and how does information retained in memory influence recognition and behavior, a theory is developed for a hypothetical nervous system called a perceptron. The theory serves as a bridge between biophysics and psychology. It is possible to predict learning curves from neurological variables and vice versa. The quantitative statistical approach is fruitful in the understanding of the organization of cognitive systems. 18 references.
Article
Full-text available
Constantly improving gene expression profiling technologies are expected to provide understanding and insight into cancer-related cellular processes. Gene expression data is also expected to significantly aid in the development of efficient cancer diagnosis and classification platforms. In this work we examine three sets of gene expression data measured across sets of tumor and normal clinical samples. (A preliminary version of this work appeared in Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, 2000.)
Article
Full-text available
We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.
Article
Full-text available
Mechanistic insights to viral replication and pathogenesis generally have come from the analysis of viral gene products, either by studying their biochemical activities and interactions individually or by creating mutant viruses and analyzing their phenotype. Now it is possible to identify and catalog the host cell genes whose mRNA levels change in response to a pathogen. We have used DNA array technology to monitor the level of ≈6,600 human mRNAs in uninfected as compared with human cytomegalovirus-infected cells. The level of 258 mRNAs changed by a factor of 4 or more before the onset of viral DNA replication. Several of these mRNAs encode gene products that might play key roles in virus-induced pathogenesis, identifying them as intriguing targets for further study.
Article
Full-text available
The development and progression of cancer and the experimental reversal of tumorigenicity are accompanied by complex changes in patterns of gene expression. Microarrays of cDNA provide a powerful tool for studying these complex phenomena. The tumorigenic properties of a human melanoma cell line, UACC-903, can be suppressed by introduction of a normal human chromosome 6, resulting in a reduction of growth rate, restoration of contact inhibition, and suppression of both soft agar clonogenicity and tumorigenicity in nude mice. We used a high density microarray of 1,161 DNA elements to search for differences in gene expression associated with tumour suppression in this system. Fluorescent probes for hybridization were derived from two sources of cellular mRNA [UACC-903 and UACC-903(+6)] which were labelled with different fluors to provide a direct and internally controlled comparison of the mRNA levels corresponding to each arrayed gene. The fluorescence signals representing hybridization to each arrayed gene were analysed to determine the relative abundance in the two samples of mRNAs corresponding to each gene. Previously unrecognized alterations in the expression of specific genes provide leads for further investigation of the genetic basis of the tumorigenic phenotype of these cells.
Article
Full-text available
As a step toward understanding the complex differences between normal and cancer cells in humans, gene expression patterns were examined in gastrointestinal tumors. More than 300,000 transcripts derived from at least 45,000 different genes were analyzed. Although extensive similarity was noted between the expression profiles, more than 500 transcripts that were expressed at significantly different levels in normal and neoplastic cells were identified. These data provide insight into the extent of expression differences underlying malignancy and reveal genes that may prove useful as diagnostic or prognostic markers.
Article
Full-text available
We used reverse transcription-coupled PCR to produce a high-resolution temporal map of fluctuations in mRNA expression of 112 genes during rat central nervous system development, focusing on the cervical spinal cord. The data provide a temporal gene expression "fingerprint" of spinal cord development based on major families of inter- and intracellular signaling genes. By using distance matrices for the pair-wise comparison of these 112 temporal gene expression patterns as the basis for a cluster analysis, we found five basic "waves" of expression that characterize distinct phases of development. The results suggest functional relationships among the genes fluctuating in parallel. We found that genes belonging to distinct functional classes and gene families clearly map to particular expression profiles. The concepts and data analysis discussed herein may be useful in objectively identifying coherent patterns and sequences of events in the complex genetic signaling network of development. Functional genomics approaches such as this may have applications in the elucidation of complex developmental and degenerative disorders.
Article
Full-text available
Diploid cells of budding yeast produce haploid cells through the developmental program of sporulation, which consists of meiosis and spore morphogenesis. DNA microarrays containing nearly every yeast gene were used to assay changes in gene expression during sporulation. At least seven distinct temporal patterns of induction were observed. The transcription factor Ndt80 appeared to be important for induction of a large group of genes at the end of meiotic prophase. Consensus sequences known or proposed to be responsible for temporal regulation could be identified solely from analysis of sequences of coordinately expressed genes. The temporal expression pattern provided clues to potential functions of hundreds of previously uncharacterized genes, some of which have vertebrate homologs that may function during gametogenesis.
Article
Full-text available
We sought to create a comprehensive catalog of yeast genes whose transcript levels vary periodically within the cell cycle. To this end, we used DNA microarrays and samples from yeast cultures synchronized by three independent methods: alpha factor arrest, elutriation, and arrest of a cdc15 temperature-sensitive mutant. Using periodicity and correlation algorithms, we identified 800 genes that meet an objective minimum criterion for cell cycle regulation. In separate experiments, designed to examine the effects of inducing either the G1 cyclin Cln3p or the B-type cyclin Clb2p, we found that the mRNA levels of more than half of these 800 genes respond to one or both of these cyclins. Furthermore, we analyzed our set of cell cycle-regulated genes for known and new promoter elements and show that several known elements (or variations thereof) contain information predictive of cell cycle regulation. A full description and complete data sets are available at http://cellcycle-www.stanford.edu
Article
Full-text available
cDNA microarrays and a clustering algorithm were used to identify patterns of gene expression in human mammary epithelial cells growing in culture and in primary human breast tumors. Clusters of coexpressed genes identified through manipulations of mammary epithelial cells in vitro also showed consistent patterns of variation in expression among breast tumor samples. By using immunohistochemistry with antibodies against proteins encoded by a particular gene in a cluster, the identity of the cell type within the tumor specimen that contributed the observed gene expression pattern could be determined. Clusters of genes with coherent expression patterns in cultured cells and in the breast tumor samples could be related to specific features of biological variation among the samples. Two such clusters were found to have patterns that correlated with variation in cell proliferation rates and with activation of the IFN-regulated signal transduction pathway, respectively. Clusters of genes expressed by stromal cells and lymphocytes in the breast tumors also were identified in this analysis. These results support the feasibility and usefulness of this systematic approach to studying variation in gene expression patterns in human cancers as a means to dissect and classify solid tumors.
Article
Full-text available
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Article
Full-text available
A new method, called the Fisher kernel method, for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a hidden Markov model. The general approach of combining generative models like HMMs with discriminative methods such as support vector machines may have applications in other areas of biosequence analysis as well.
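In outline (notation assumed rather than quoted from the paper), each sequence x is mapped to the gradient of the log-likelihood of the generative model with parameters θ, and two sequences are compared through those gradients:

U_x = \nabla_{\theta}\log P(x\mid\theta), \qquad K(x,x') = U_x^{\top} F^{-1} U_{x'}, \qquad F = \mathbb{E}_x\big[U_x U_x^{\top}\big].

In practice the Fisher information matrix F is often approximated by the identity, and the resulting kernel is used inside a standard SVM.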
Article
Full-text available
Motivation: In order to extract protein sequences from nucleotide sequences, it is an important step to recognize points at which regions start that code for proteins. These points are called translation initiation sites (TIS). Results: The task of finding TIS can be modeled as a classification problem. We demonstrate the applicability of support vector machines for this task, and show how to incorporate prior biological knowledge by engineering an appropriate kernel function. With the described techniques the recognition performance can be improved by 26% over leading existing approaches. We provide evidence that existing related methods (e.g. ESTScan) could profit from advanced TIS recognition.
Article
Full-text available
Classification of patient samples is a crucial aspect of cancer diagnosis and treatment. We present a method for classifying samples by computational analysis of gene expression data. We consider the classification problem in two parts: class discovery and class prediction. Class discovery refers to the process of dividing samples into reproducible classes that have similar behavior or properties, while class prediction places new samples into already known classes. We describe a method for performing class prediction and illustrate its strength by correctly classifying bone marrow and blood samples from acute leukemia patients. We also describe how to use our predictor to validate newly discovered classes, and we demonstrate how this technique could have discovered the key distinctions among leukemias if they were not already known. This proof-of-concept experiment paves the way for a wealth of future work on the molecular classification and understanding of disease.
Article
Full-text available
We introduce and analyze a new algorithm for linear classification which combines Rosenblatt's perceptron algorithm with Helmbold and Warmuth's leave-one-out method. Like Vapnik's maximal-margin classifier, our algorithm takes advantage of data that are linearly separable with large margins. Compared to Vapnik's algorithm, however, ours is much simpler to implement, and much more efficient in terms of computation time. We also show that our algorithm can be efficiently used in very high dimensional spaces using kernel functions. We performed some experiments using our algorithm, and some variants of it, for classifying images of handwritten digits. The performance of our algorithm is close to, but not as good as, the performance of maximal-margin classifiers on the same problem, while saving significantly on computation time and programming effort.
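A minimal NumPy sketch of the voted-perceptron idea, in which every intermediate weight vector is kept along with a count of how long it survived and all of them vote at prediction time; the toy data and epoch count are assumptions.

```python
import numpy as np

def train_voted_perceptron(X, y, epochs=10):
    """y must be +1/-1. Returns a list of (weight_vector, survival_count) pairs."""
    w = np.zeros(X.shape[1])
    c = 0
    history = []
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:          # mistake: store the old vector, start a new one
                history.append((w.copy(), c))
                w = w + yi * xi
                c = 1
            else:
                c += 1
    history.append((w.copy(), c))
    return history

def predict_voted(history, x):
    """Each stored vector casts c votes with its own sign prediction."""
    s = sum(c * np.sign(w @ x) for w, c in history)
    return 1 if s >= 0 else -1

# Tiny linearly separable example.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
model = train_voted_perceptron(X, y)
print([predict_voted(model, xi) for xi in X])
```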
Article
Full-text available
An effective approach to cancer classification based upon gene expression monitoring using DNA microarrays was introduced by Golub et al. [3]. The main problem they faced was accurately assigning leukemia samples the class labels acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL). We used a Support Vector Machine (SVM) classifier to assign these labels. The motivation for the use of a SVM is that DNA microarray problems can be very high dimensional and have very few training data. This type of situation is particularly well suited for an SVM approach. We achieve slightly better performance on this (simple) classification task than Golub et al.
Article
Full-text available
A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC-dimension are given. Experimental results on optical character recognition problems demonstrate the good generalization obtained when compared with other learning algorithms.
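In modern notation (a standard formulation, assumed rather than quoted from the paper), the algorithm solves the maximal-margin problem

\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} \quad \text{subject to} \quad y_i\big(\mathbf{w}\cdot\phi(\mathbf{x}_i)+b\big) \ge 1, \quad i=1,\dots,\ell,

whose solution can be written as \mathbf{w}=\sum_i \alpha_i y_i \phi(\mathbf{x}_i), a linear combination of the supporting patterns, i.e. the training points with \alpha_i>0 that lie closest to the decision boundary.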
Article
The problem of learning linear-discriminant concepts can be solved by various mistake-driven update procedures, including the Winnow family of algorithms and the well-known Perceptron algorithm. In this paper we define the general class of “quasi-additive” algorithms, which includes Perceptron and Winnow as special cases. We give a single proof of convergence that covers a broad subset of algorithms in this class, including both Perceptron and Winnow, but also many new algorithms. Our proof hinges on analyzing a generic measure of progress construction that gives insight as to when and how such algorithms converge. Our measure of progress construction also permits us to obtain good mistake bounds for individual algorithms. We apply our unified analysis to new algorithms as well as existing algorithms. When applied to known algorithms, our method “automatically” produces close variants of existing proofs (recovering similar bounds)—thus showing that, in a certain sense, these seemingly diverse results are fundamentally isomorphic. However, we also demonstrate that the unifying principles are more broadly applicable, and analyze a new class of algorithms that smoothly interpolate between the additive-update behavior of Perceptron and the multiplicative-update behavior of Winnow.
Article
This book is the first comprehensive introduction to Support Vector Machines (SVMs), a new generation learning system based on recent advances in statistical learning theory. The book also introduces Bayesian analysis of learning and relates SVMs to Gaussian Processes and other kernel based learning methods. SVMs deliver state-of-the-art performance in real-world applications such as text categorisation, hand-written character recognition, image classification, biosequences analysis, etc. Their first introduction in the early 1990s led to a recent explosion of applications and deepening theoretical analysis, which has now established Support Vector Machines, along with neural networks, as standard tools for machine learning and data mining. Students will find the book both stimulating and accessible, while practitioners will be guided smoothly through the material required for a good grasp of the theory and application of these techniques. The concepts are introduced gradually in accessible and self-contained stages, though in each stage the presentation is rigorous and thorough. Pointers to relevant literature and web sites containing software ensure that it forms an ideal starting point for further study. Equally the book will equip the practitioner to apply the techniques and an associated web site will provide pointers to updated literature, new applications, and on-line software.
Book
This book provides the first comprehensive treatment of feed-forward neural networks from the perspective of statistical pattern recognition. After introducing the basic concepts of pattern recognition, the book describes techniques for modelling probability density functions, and discusses the properties and relative merits of the multi-layer perceptron and radial basis function network models. It also motivates the use of various forms of error functions, and reviews the principal algorithms for error function minimization. As well as providing a detailed discussion of learning and generalization in neural networks, the book also covers the important topics of data processing, feature extraction, and prior knowledge. The book concludes with an extensive treatment of Bayesian techniques and their applications to neural networks.
Article
Array technologies have made it straightforward to monitor simultaneously the expression pattern of thousands of genes. The challenge now is to interpret such massive data sets. The first step is to extract the fundamental patterns of gene expression inherent in the data. This paper describes the application of self-organizing maps, a type of mathematical cluster analysis that is particularly well suited for recognizing and classifying features in complex, multidimensional data. The method has been implemented in a publicly available computer package, GENECLUSTER, that performs the analytical calculations and provides easy data visualization. To illustrate the value of such analysis, the approach is applied to hematopoietic differentiation in four well studied models (HL-60, U937, Jurkat, and NB4 cells). Expression patterns of some 6,000 human genes were assayed, and an online database was created. GENECLUSTER was used to organize the genes into biologically relevant clusters that suggest novel hypotheses about hematopoietic differentiation, for example highlighting certain genes and pathways involved in "differentiation therapy" used in the treatment of acute promyelocytic leukemia.
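A minimal sketch of self-organizing-map clustering of expression profiles, assuming the third-party MiniSom package as a stand-in (this does not reproduce GENECLUSTER) and synthetic data with roughly the dimensions described above.

```python
import numpy as np
from minisom import MiniSom   # pip install minisom

rng = np.random.default_rng(0)
profiles = rng.normal(size=(6000, 12))   # ~6,000 genes x 12 expression conditions (synthetic)

# Fit a small 4x3 map: each map node becomes one cluster of co-expressed genes.
som = MiniSom(4, 3, input_len=12, sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(profiles)
som.train_random(profiles, num_iteration=10000)

# Assign every gene to its best-matching node (its cluster).
clusters = [som.winner(p) for p in profiles]
print("genes in node (0, 0):", sum(c == (0, 0) for c in clusters))
```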
Article
The development of cancer is the result of a series of molecular changes occurring in the cell. These events lead to changes in the expression level of numerous genes that result in different phenotypic characteristics of tumors. In this report we describe the assembly and utilization of a 5766 member cDNA microarray to study the differences in gene expression between normal and neoplastic human ovarian tissues. Several genes that may have biological relevance in the process of ovarian carcinogenesis have been identified through this approach. Analyzing the results of microarray hybridizations may provide new leads for tumor diagnosis and intervention.
Article
Comparative hybridization of cDNA arrays is a powerful tool for the measurement of differences in gene expression between two or more tissues. We optimized this technique and employed it to discover genes with potential for the diagnosis of ovarian cancer. This cancer is rarely identified in time for a good prognosis after diagnosis. An array of 21,500 unknown ovarian cDNAs was hybridized with labeled first-strand cDNA from 10 ovarian tumors and six normal tissues. One hundred and thirty-four clones are overexpressed in at least five of the 10 tumors. These cDNAs were sequenced and compared to public sequence databases. One of these, the gene HE4, was found to be expressed primarily in some ovarian cancers, and is thus a potential marker of ovarian carcinoma.
Article
Genome-wide transcript profiling was used to monitor signal transduction during yeast pheromone response. Genetic manipulations allowed analysis of changes in gene expression underlying pheromone signaling, cell cycle control, and polarized morphogenesis. A two-dimensional hierarchical clustered matrix, covering 383 of the most highly regulated genes, was constructed from 46 diverse experimental conditions. Diagnostic subsets of coexpressed genes reflected signaling activity, cross talk, and overlap of multiple mitogen-activated protein kinase (MAPK) pathways. Analysis of the profiles specified by two different MAPKs-Fus3p and Kss1p-revealed functional overlap of the filamentous growth and mating responses. Global transcript analysis reflects biological responses associated with the activation and perturbation of signal transduction pathways.
Article
A number of results have bounded generalization of a classifier in terms of its margin on the training points. There has been some debate about whether the minimum margin is the best measure of the distribution of training set margin values with which to estimate the generalization. Freund and Schapire [7] have shown how a different function of the margin distribution can be used to bound the number of mistakes of an on-line learning algorithm for a perceptron, as well as an expected error bound. Shawe-Taylor and Cristianini [13] showed that a slight generalization of their construction can be used to give a PAC-style bound on the tail of the distribution of the generalization errors that arise from a given sample size. We show that in the linear case the approach can be viewed as a change of kernel and that the algorithms arising from the approach are exactly those originally proposed by Cortes and Vapnik [4]. We generalise the basic result to function classes with bounded f...
Lockhart,D., Dong,H., Byrne,M., Follettie,M., Gallo,M., Chee,M., Mittmann,M., Wang,C., Kobayashi,M., Horton,H. and Brown,E. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnol., 14, 1675-1680.
DeRisi,J., Iyer,V. and Brown,P. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680-686.
Hastie,T., Tibshirani,R., Eisen,M., Brown,P., Ross,D., Scherf,U., Weinstein,J., Alizadeh,A., Staudt,L. and Botstein,D. (2000) Gene Shaving: a new class of clustering methods for expression arrays. Stanford University Technical Report.
Zhu,H., Cong,J., Mamtora,G., Gingeras,T. and Schenk,T. (1998) Cellular gene expression altered by human cytomegalovirus: global monitoring with oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 95, 14470-14475.