Article

Multiclass cancer diagnosis using tumor gene expression signatures

... Advances in molecular biology and microarray technology have made it possible to monitor the expression levels of thousands of genes (gene activity) simultaneously, leading to the production of massive amounts of microarray data [5]. These data play an important role in the diagnosis and classification of various cancer tissues, but their most challenging aspects can be identified as their very high dimensions compared to the small sample size, which makes design of appropriate classifiers difficult. ...
... Many attempts have been made to find classifiers with high accuracy and low computational complexity. However, most of these studies focused on data with a small number of classes (two or three different tumor types), and applied methods such as Linear Discriminant Analysis (LDA), Neural Networks, Clustering, Nearest Neighbour (NN), Support Vector Machine (SVM), Decision Tree, and Random Forest [5][6][7][8][9][10][11][12]. In this study, the 14-Tumors database is examined. ...
... This database includes information on fourteen different tumor types that have often been studied from the perspective of feature selection. Among the classification works that have been conducted on this database, the method presented in [5] can be noted where the classification problem was converted to a binary classification problem, and different classifiers were used for this purpose of which the best performance was obtained from the SVM classifier. In the binary classifiers, classification procedure is performed using One vs. ...
Preprint
Full-text available
Sparse representation of signals has achieved satisfactory results in classification applications compared to conventional methods. Microarray data, which are obtained by monitoring the expression levels of thousands of genes simultaneously, have very high dimensionality relative to the small number of samples. As a result, state-of-the-art classifiers struggle with the microarray data classification problem. Sparse representation expresses a signal as a linear combination of a small number of training samples and so provides a compact description of the signal; this has reduced computational complexity and increased classification accuracy in many applications. Using all training samples in the dictionary, however, imposes a high computational burden on the sparse coding stage for high-dimensional data. Proposed solutions to this problem can be roughly divided into two categories: selecting a subset of the training data using different criteria, or learning a concise dictionary. Another important factor in the speed and accuracy of a sparse representation-based classifier is the algorithm used to solve the associated l1-norm minimization problem. In this paper, different sparse representation-based classification methods are investigated for 14-tumors microarray data classification. Our experimental results show that good performance is obtained by selecting a subset of the original atoms and learning the associated dictionary. In addition, using the SL0 sparse coding algorithm increases the speed and, in most cases, the accuracy of the classifiers.
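The idea of sparse representation-based classification described in this abstract can be illustrated with a small sketch. This is not the paper's implementation: the l1-norm minimization (and the SL0 solver) is swapped for plain greedy matching pursuit so that the example needs only the standard library, and the function names, the toy vectors, and the `n_steps` parameter are all assumptions made for illustration. A test sample is coded over the normalized training samples (the "atoms"), and the predicted class is the one whose atoms alone reconstruct the sample with the smallest residual.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n else list(v)

def matching_pursuit(atoms, signal, n_steps=3):
    """Greedy sparse code over unit-norm atoms (stand-in for an l1 solver)."""
    coeffs = [0.0] * len(atoms)
    residual = list(signal)
    for _ in range(n_steps):
        corr = [dot(a, residual) for a in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        coeffs[k] += corr[k]
        residual = [r - corr[k] * a for r, a in zip(residual, atoms[k])]
    return coeffs

def src_predict(train_x, train_y, sample, n_steps=3):
    """Pick the class whose atoms best reconstruct the sample."""
    atoms = [normalize(x) for x in train_x]
    coeffs = matching_pursuit(atoms, sample, n_steps)
    best_label, best_err = None, float("inf")
    for label in sorted(set(train_y)):
        recon = [0.0] * len(sample)
        for c, a, y in zip(coeffs, atoms, train_y):
            if y == label:  # keep only this class's coefficients
                recon = [r + c * v for r, v in zip(recon, a)]
        err = math.sqrt(sum((s - r) ** 2 for s, r in zip(sample, recon)))
        if err < best_err:
            best_label, best_err = label, err
    return best_label

# Toy "expression profiles" for two hypothetical tumor classes.
train_x = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
train_y = ["A", "A", "B", "B"]
label_a = src_predict(train_x, train_y, [0.95, 0.05, 0.0])
label_b = src_predict(train_x, train_y, [0.05, 0.95, 0.05])
```

Using every training sample as an atom, as here, is exactly the setting the abstract identifies as computationally expensive for high-dimensional data; selecting a subset of atoms or learning a dictionary would replace `train_x` with a smaller set.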
... Several studies have identified the site of origin using machine learning methods [6][7][8][9]. For example, one study showed that a deep learning-based algorithm can identify the site of origin [9]. A few studies have reported that transcriptome-based analyses using the machine learning method can be used to identify the site of origin in metastatic carcinomas of unknown origin. ...
... A few studies have reported that transcriptome-based analyses using the machine learning method can be used to identify the site of origin in metastatic carcinomas of unknown origin [6][7][8]. In addition, gene expression profiling has been used to train a multiclass classifier based on a support vector machine algorithm [6,8], and an unsupervised cluster analysis method has been applied to evaluate the diagnostic power of a set of genes [7]. However, these studies did not examine the performance of the transcriptome-based classifier in histology-specific cohorts. ...
Article
Full-text available
Purpose There is a lack of tools for identifying the site of origin in mucinous cancer. This study aimed to evaluate the performance of a transcriptome-based classifier for identifying the site of origin in mucinous cancer. Materials And Methods Transcriptomic data of 1878 non-mucinous and 82 mucinous cancer specimens, with 7 sites of origin, namely, the uterine cervix (CESC), colon (COAD), pancreas (PAAD), stomach (STAD), uterine endometrium (UCEC), uterine carcinosarcoma (UCS), and ovary (OV), obtained from The Cancer Genome Atlas, were used as the training and validation sets, respectively. Transcriptomic data of 14 mucinous cancer specimens from a tissue archive were used as the test set. For identifying the site of origin, a set of 100 differentially expressed genes for each site of origin was selected. After removing multiple iterations of the same gene, 427 genes were chosen, and their RNA expression profiles, at each site of origin, were used to train the deep neural network classifier. The performance of the classifier was estimated using the training, validation, and test sets. Results The accuracy of the model in the training set was 0.998, while that in the validation set was 0.939 (77/82). In the test set which is newly sequenced from a tissue archive, the model showed an accuracy of 0.857 (12/14). t-SNE analysis revealed that samples in the test set were part of the clusters obtained for the training set. Conclusion Although limited by small sample size, we showed that a transcriptome-based classifier could correctly identify the site of origin of mucinous cancer.
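The figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below reproduces the reported accuracies from the stated counts and shows how merging per-site gene lists while dropping repeats shrinks the candidate set. It assumes exactly 100 genes were taken for each of the 7 sites, as the abstract states; the gene names in the toy dictionary and the helper `merge_gene_sets` are hypothetical.

```python
def accuracy(correct, total):
    """Fraction of correctly classified samples."""
    return correct / total

def merge_gene_sets(genes_by_site):
    """Union of per-site gene lists, keeping first-seen order."""
    seen = set()
    merged = []
    for site in genes_by_site:
        for gene in genes_by_site[site]:
            if gene not in seen:
                seen.add(gene)
                merged.append(gene)
    return merged

validation_acc = round(accuracy(77, 82), 3)  # reported as 0.939
test_acc = round(accuracy(12, 14), 3)        # reported as 0.857

# 7 sites x 100 genes gives 700 candidates; 427 unique genes remained,
# so 273 entries were repeats of a gene already selected for another site.
candidates = 7 * 100
repeats_removed = candidates - 427

# Toy illustration with hypothetical gene names.
toy = {"COAD": ["G1", "G2", "G3"], "STAD": ["G2", "G4"], "OV": ["G3", "G5"]}
merged = merge_gene_sets(toy)
```

With only 14 test samples, each misclassification moves the test accuracy by about 7 percentage points, which is consistent with the small-sample caveat in the conclusion.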
... Many supervised and unsupervised machine learning methods, as well as deep learning methods, have been developed for cancer classification using gene expression data. Several studies reported a higher predictive performance of the machine learning methods on the multi-class cancer classification problem [11][18][19][20]. These studies, however, differ in the methods used for feature (gene) selection. ...
... Ramaswamy et al. [19], on the other hand, used support vector machines (SVM) and a recursive feature elimination method to remove the uninformative genes. These studies concentrated on the application of machine learning methods to a multi-class classification problem. ...
... Overall, our proposed model surpassed the single 1D-CNN and the machine learning methods in the classification of common cancers among women. These findings are different from those reported in other studies [11][18][19]. These differences can be explained by variations in the type of cancers studied and the methods used for feature/gene selection. ...
Article
Full-text available
Abstract Cancer tumor classification based on morphological characteristics alone has been shown to have serious limitations. Breast, lung, colorectal, thyroid, and ovarian are the most commonly diagnosed cancers among women. Precise classification of cancers into their types is considered a vital problem for cancer diagnosis and therapy. In this paper, we proposed a stacking ensemble deep learning model based on one-dimensional convolutional neural network (1D-CNN) to perform a multi-class classification on the five common cancers among women based on RNASeq data. The RNASeq gene expression data was downloaded from Pan-Cancer Atlas using GDCquery function of the TCGAbiolinks package in the R software. We used least absolute shrinkage and selection operator (LASSO) as feature selection method. We compared the results of the new proposed model with and without LASSO with the results of the single 1D-CNN and machine learning methods which include support vector machines with radial basis function, linear, and polynomial kernels; artificial neural networks; k-nearest neighbors; bagging trees. The results show that the proposed model with and without LASSO has a better performance compared to other classifiers. Also, the results show that the machine learning methods (SVM-R, SVM-L, SVM-P, ANN, KNN, and bagging trees) with under-sampling have better performance than with over-sampling techniques. This is supported by the statistical significance test of accuracy where the p-values for differences between the SVM-R and SVM-P, SVM-R and ANN, SVM-R and KNN are found to be p = 0.003, p
... d Genomic distribution of 1025 TSSs, active in both testis and K562 cells, relative to their genomic location with respect to RefSeqGenes. e Top diseases or function annotation for the 1025 TSSs identified using the strategy illustrated in panel c tumor types from the global cancer map [50], which is correlated with the original source of gene selection from K562 cells (Additional file 1: Fig. S2c). Furthermore, 103 out of 790 genes were previously described as highly expressed during spermatogenesis [51], with 18 of them belonging to the cancer-testis antigen (CTA) group [52]. ...
... Scatter plots for the number of reads mapped at each base pair position were generated by Excel. For a heatmap analysis of the 790 genes (Additional file 1: Fig. S2c), we used the Gene Set Enrichment Analysis (GSEA) tool [109], using gene expression by the Global Cancer Map data [50]. ...
Article
Full-text available
Background Pervasive usage of alternative promoters leads to the deregulation of gene expression in carcinogenesis and may drive the emergence of new genes in spermatogenesis. However, little is known regarding the mechanisms underpinning the activation of alternative promoters. Results Here we describe how alternative cancer-testis-specific transcription is activated. We show that intergenic and intronic CTCF binding sites, which are transcriptionally inert in normal somatic cells, could be epigenetically reprogrammed into active de novo promoters in germ and cancer cells. BORIS/CTCFL, the testis-specific paralog of the ubiquitously expressed CTCF, triggers the epigenetic reprogramming of CTCF sites into units of active transcription. BORIS binding initiates the recruitment of the chromatin remodeling factor, SRCAP, followed by the replacement of H2A histone with H2A.Z, resulting in a more relaxed chromatin state in the nucleosomes flanking the CTCF binding sites. The relaxation of chromatin around CTCF binding sites facilitates the recruitment of multiple additional transcription factors, thereby activating transcription from a given binding site. We demonstrate that the epigenetically reprogrammed CTCF binding sites can drive the expression of cancer-testis genes, long noncoding RNAs, retro-pseudogenes, and dormant transposable elements. Conclusions Thus, BORIS functions as a transcription factor that epigenetically reprograms clustered CTCF binding sites into transcriptional start sites, promoting transcription from alternative promoters in both germ cells and cancer cells.
... We have previously reported the results of our analysis of oral squamous cell [12][13][14]. Methods to identify the primary tumor site using genetic information of tumor tissues by microarray analysis and next-generation sequencing have been reported recently [15][16][17]. Microarray analysis has been suggested to be useful for accurately identifying the primary tumor site in 80% of solid tumors with a confirmed primary tumor site by evaluation of gene expression [15,16]. ...
... Methods to identify the primary tumor site using genetic information of tumor tissues by microarray analysis and next-generation sequencing have been reported recently [15][16][17]. Microarray analysis has been suggested to be useful for accurately identifying the primary tumor site in 80% of solid tumors with a confirmed primary tumor site by evaluation of gene expression [15,16]. The Tissue of Origin test has been approved for use in the United States. ...
Article
Full-text available
Oral squamous cell carcinomas unusually show distant metastasis to the lung after primary treatment, which can be difficult to differentiate from primary squamous cell carcinoma of the lung. While the location and number of tumor nodules are helpful in diagnosing cases, differential diagnosis may be difficult even with histopathological examination. Therefore, we attempted to identify molecules that can facilitate accurate differential diagnosis. First, we performed a comprehensive gene expression analysis using microarray data for OSCC-LM and LSCC, and searched for genes showing significantly different expression levels. We then identified KRT13, UPK1B, and nuclear receptor subfamily 0, group B, member 1 (NR0B1) as genes that were significantly upregulated in LSCC and quantified the expression levels of these genes by real-time quantitative RT-PCR. The expression of KRT13 and UPK1B proteins was then examined by immunohistochemical staining. While OSCC-LM showed no KRT13 and UPK1B expression, some tumor cells of LSCC showed KRT13 and UPK1B expression in 10 of 12 cases (83.3%). All LSCC cases were positive for at least one of these markers. Thus, KRT13 and UPK1B might contribute to differentiating OSCC-LM from LSCC.
... We have previously reported the results of our analysis of oral squamous cell [10][11][12]. Methods to identify the primary tumor site using genetic information of tumor tissues by microarray analysis and next-generation sequencing have been reported recently [13][14][15]. Microarray analysis has been suggested to be useful for accurately identifying the primary tumor site in 80% of solid tumors with a confirmed primary tumor site by evaluation of gene expression [13,14]. ...
... Methods to identify the primary tumor site using genetic information of tumor tissues by microarray analysis and next-generation sequencing have been reported recently [13][14][15]. Microarray analysis has been suggested to be useful for accurately identifying the primary tumor site in 80% of solid tumors with a confirmed primary tumor site by evaluation of gene expression [13,14]. The Tissue of Origin test has been approved for use in the United States. ...
Preprint
Full-text available
Oral squamous cell carcinomas unusually show distant metastasis to the lung after primary treatment, which can be difficult to differentiate from primary squamous cell carcinoma of the lung. While the location and number of tumor nodules are helpful in diagnosing cases, differential diagnosis may be difficult even with histopathological examination. Therefore, we attempted to identify molecules that can facilitate accurate differential diagnosis. First, we performed a comprehensive gene expression analysis using microarray data for OSCC-LM and LSCC, and searched for genes showing significantly different expression levels. We then identified KRT13, UPK1B, and nuclear receptor subfamily 0, group B, member 1 (NR0B1) as genes that were significantly upregulated in LSCC and quantified the expression levels of these genes by real-time quantitative RT-PCR. The expression of KRT13 and UPK1B proteins was then examined by immunohistochemical staining. While OSCC-LM showed no KRT13 and UPK1B expression, some tumor cells of LSCC showed KRT13 and UPK1B expression in 10 of 12 cases (83.3%). All LSCC cases were positive for at least one of these markers. Thus, KRT13 and UPK1B might contribute to differentiating OSCC-LM from LSCC.
... In recent years, there have been many studies showing that some machine learning models based on decision trees, such as random forests, work very well in dealing with problems in the field of bioinformatics [Car06], and therefore such models have become popular in the field of bioinformatics. The use of support vector machine (SVM) [Cor95] models to predict cancer types from genetic alterations has also been found to be effective in much literature [Soh17] [Ram01]. Networked medical research using network science to abstract complex biological information into complex networks has also emerged as a typical disease research approach to understanding disease modules, identifying disease bio-markers and drug targets [Ire16]. ...
... In 2001, Ramaswamy S et al. [Ram01] proposed a machine learning approach based on gene expression values of tumour samples to classify a wide range of cancer cells. This study involved 218 tumour samples and 90 samples of normal human cells, covering up to 14 different cancers, and 16,063 genes. ...
Preprint
Full-text available
Metastatic prostate cancer is one of the most common cancers in men. In the advanced stages of prostate cancer, tumours can metastasise to other tissues in the body, which is fatal. In this thesis, we performed a genetic analysis of prostate cancer tumours at different metastatic sites using data science, machine learning and topological network analysis methods. We presented a general procedure for pre-processing gene expression datasets and pre-filtering significant genes by analytical methods. We then used machine learning models for further key gene filtering and secondary site tumour classification. Finally, we performed gene co-expression network analysis and community detection on samples from different prostate cancer secondary site types. In this work, 13 of the 14,379 genes were selected as the most metastatic prostate cancer related genes, achieving approximately 92% accuracy under cross-validation. In addition, we provide preliminary insights into the co-expression patterns of genes in gene co-expression networks.
... Another example is tumor classification. With the development of bioinformatics, cancer classification from omics data has become an important topic in genome research (Ramaswamy et al., 2001;Tibshirani et al., 2002;Menyhárt and Győrffy, 2021). ...
Thesis
With the genomic revolution and the new era of precision medicine, the identification of biomarkers that are informative (i.e. active) for a response (endpoint) is becoming increasingly important in clinical research. These biomarkers are beneficial to better understand the progression of a disease (prognostic biomarkers) and to better identify patients more likely to benefit from a given treatment (predictive biomarkers). Biomarker data (e.g. genomics, transcriptomics, and proteomics) usually have a high-dimensional nature, with the number of measured biomarkers (variables) much larger than the sample size. However, only a fraction of biomarkers is truly active, therefore raising the need for variable selection. Among various statistical learning approaches, regularized methods such as Lasso have become very popular for high-dimensional variable selection due to their statistical and numerical performance. However, their selection consistency is not guaranteed when the biomarkers are highly correlated. Throughout my PhD, several novel regularized approaches were developed to perform variable selection in this challenging context. More precisely, four methods were proposed in different statistical models (linear regression model, ANCOVA-type model, and logistic regression model). The main idea is to remove the correlations by whitening the design matrix. For one of the methods, results of the sign consistency were established under mild conditions. The proposed approaches were evaluated through simulation studies and applications on publicly available datasets. The results suggest that our approaches are more performant than compared methods for selecting prognostic and predictive biomarkers in high-dimensional (correlated) settings. Three of our methods are implemented in the R packages: WLasso, PPLasso, and WLogit, available from the CRAN (Comprehensive R Archive Network).
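The whitening idea mentioned in this abstract can be written out in generic form. The block below is a textbook-style sketch for the linear regression case, not the thesis's exact formulation: the symbols y, X, β, ε and Σ are assumptions introduced here, with Σ denoting the covariance matrix of the biomarkers (the rows of X) and assumed to be invertible.

```latex
\[
  y = X\beta + \varepsilon, \qquad
  \widetilde{X} = X\,\Sigma^{-1/2}, \qquad
  \widetilde{\beta} = \Sigma^{1/2}\beta
\]
\[
  X\beta = X\,\Sigma^{-1/2}\,\Sigma^{1/2}\beta = \widetilde{X}\,\widetilde{\beta},
  \qquad
  \operatorname{Cov}\bigl(\text{rows of } \widetilde{X}\bigr)
  = \Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2} = I .
\]
```

The fitted values are unchanged, but the transformed biomarkers are uncorrelated, which is the situation in which Lasso-type selection is better behaved. In practice Σ is unknown and has to be estimated, and sparsity is wanted in β rather than in the transformed coefficients, so the penalty has to be expressed in terms of β.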
... Gene expression corresponds to some morphological characteristics of the pathological TME [16,18], which is crucial for improving survival analysis. Most related works focus on solving the alignment problem among different modalities [2,3,4,14]. ...
Preprint
Survival prediction, utilizing pathological images and genomic profiles, is increasingly important in cancer analysis and prognosis. Despite significant progress, precise survival analysis still faces two main challenges: (1) The massive number of pixels contained in whole slide images (WSIs) complicates the processing of pathological images, making it difficult to generate an effective representation of the tumor microenvironment (TME). (2) Existing multimodal methods often rely on alignment strategies to integrate complementary information, which may lead to information loss due to the inherent heterogeneity between pathology and genes. In this paper, we propose a Multimodal Cross-Task Interaction (MCTI) framework to explore the intrinsic correlations between subtype classification and survival analysis tasks. Specifically, to capture TME-related features in WSIs, we leverage the subtype classification task to mine tumor regions. Simultaneously, multi-head attention mechanisms are applied in genomic feature extraction, adaptively performing gene grouping to obtain task-related genomic embeddings. With the joint representation of pathological images and genomic data, we further introduce a Transport-Guided Attention (TGA) module that uses optimal transport theory to model the correlation between subtype classification and survival analysis tasks, effectively transferring potential information. Extensive experiments demonstrate the superiority of our approach, with MCTI outperforming state-of-the-art frameworks on three public benchmarks. Code: https://github.com/jsh0792/MCTI.
... Although the mechanism by which Cop1 deletion induces the downregulation of a series of E3 ligase genes remains unclear, a global cancer map for gene expression signatures showed upregulation of a Ring E3 ligase gene group in human leukemia, suggesting that there might be a common regulatory mechanism for the E3 ligase complex (https://www.gsea-msigdb.org/gsea/msigdb/human/compendium) [43]. ...
Preprint
Full-text available
Cop1 encodes a ubiquitin E3 ligase that has been well preserved during evolution in both plants and metazoans. In metazoans, the C/EBP family transcription factors are targets for degradation by Cop1, and this process is regulated by the Tribbles pseudokinase family. Over-expression of Tribbles homolog 1 (Trib1) induces acute myeloid leukemia (AML) via Cop1-dependent degradation of the C/EBPα p42 isoform. Here, we induced rapid growth arrest and granulocytic differentiation of Trib1-expressing AML cells using a Cop1 conditional knockout (KO), which is associated with a transient increase in the C/EBPα p42 isoform. The growth-suppressive effect of Cop1 KO was canceled by silencing of Cebpa and reinforced by exogenous expression of the p42 isoform. Moreover, Cop1 KO improved the survival of recipients transplanted with Trib1-expressing AML cells. We further identified a marked increase in Trib1 protein expression in Cop1 KO, indicating that Trib1 is self-degraded by the Cop1 degradosome. COP1 downregulation also inhibits the proliferation of human AML cells in a TRIB1-dependent manner. Taken together, our results provide new insights into the role of Trib1/Cop1 machinery in the C/EBPα p42-dependent leukemogenic activity, and a novel idea to develop new therapeutics.
... There have been numerous studies on tissue classification using gene expression data, but studies that test the class predictor across independent datasets are not as common [55]. Typically, the classifier is trained on a subset of samples from a study and tested on the remaining portion of the unseen samples from the same pool [56][57][58]. Having samples from the same project in the training and test sets can potentially lead to overoptimistic measures of performance due to overfitting, where the model learns the noise of the training set, and in this case detects the same noise in the test set [59]. ...
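The distinction drawn here, between testing on held-out samples from the same study and testing on an independent dataset, can be made concrete with a short sketch. It is only an illustration of the splitting logic: the record layout (a dict with `study`, `tissue`, and `id` keys), the study names, and the function name are assumptions, not part of the cited work.

```python
def leave_one_study_out(samples, held_out_study):
    """Train on every study except one; test on the held-out study only."""
    train = [s for s in samples if s["study"] != held_out_study]
    test = [s for s in samples if s["study"] == held_out_study]
    return train, test

# Hypothetical sample records; only the grouping key matters here.
samples = [
    {"id": 1, "study": "study_A", "tissue": "lung"},
    {"id": 2, "study": "study_A", "tissue": "colon"},
    {"id": 3, "study": "study_B", "tissue": "lung"},
    {"id": 4, "study": "study_B", "tissue": "colon"},
    {"id": 5, "study": "study_C", "tissue": "lung"},
]

train, test = leave_one_study_out(samples, "study_B")
train_studies = {s["study"] for s in train}
test_studies = {s["study"] for s in test}
```

Because no study contributes to both sides of the split, study-specific noise learned during training cannot be rewarded at test time, which is the source of overoptimism the excerpt describes.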
Article
Full-text available
Background RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the quality of the data is variable, procured differently, and contains unwanted noise hampering the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in attempts to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origins. Results We aimed to investigate the impact of data preprocessing steps—focusing on normalization, batch effect correction, and data scaling—through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer. Conclusion By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. 
Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
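The weighted F1-score used as the performance measure in this abstract can be computed as follows. This is a minimal sketch using the common definition (per-class F1 averaged with weights proportional to each class's number of true samples); whether the authors used exactly this variant is an assumption, and the labels in the example are made up.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to class support."""
    support = Counter(y_true)          # true samples per class
    predicted = Counter(y_pred)        # predictions per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for label, n in support.items():
        precision = hits[label] / predicted[label] if predicted[label] else 0.0
        recall = hits[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)

# Toy tissue-of-origin labels (hypothetical).
y_true = ["lung", "lung", "lung", "colon"]
y_pred = ["lung", "lung", "colon", "colon"]
score = weighted_f1(y_true, y_pred)
```

In this example the lung class has F1 = 0.8 (precision 1, recall 2/3) and the colon class has F1 = 2/3 (precision 1/2, recall 1), so the weighted score is (3 x 0.8 + 1 x 2/3) / 4, about 0.767.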
... Many biological approaches for both extracellular and intracellular nanoparticles synthesis have been reported till date using microorganisms (eg. bacteria, fungi) and plants [7]. Plants provide a better platform for nanoparticles synthesis as they are free from toxic chemicals as well as provide natural capping agents. ...
Article
Full-text available
Background: Nanoscience and nanotechnology have recently been established as a new interdisciplinary science and are now among the most attractive research areas in modern materials science. Nanotechnology can be defined as the synthesis, characterization, exploration and application of nanosized materials for the development of science. Objective: To synthesize MgO nanoparticles (MgO-NPs) using Neem leaf extract and to characterize them, including their antioxidant and photocatalytic effects. Methodology: A quasi-experimental study was conducted in the Department of Pharmacology and Therapeutics, Rajshahi Medical College, Rajshahi, from January 2018 to December 2018. Prior to commencement of the study, approval was obtained from the Institutional Review Board of Rajshahi Medical College. Results: There was no significant difference between Neem extract and Mg(NO3)2. MgO-NPs vs Mg(NO3)2 showed a mean difference of -0.19 (95% CI of difference = 0.3434 to -0.03662, P < 0.05), so there was a significant difference between MgO-NPs and Mg(NO3)2. Conclusion: Synthesis of the nanoparticles was confirmed by a change of colour from yellow to yellowish brown, by UV-Vis spectroscopy, and by evaluation of their photocatalytic and antioxidant properties. In the photocatalytic study, MgO-NPs achieved 88% dye degradation; antioxidant activity, examined using the DPPH assay, was significant (P < 0.0001), with 80% DPPH scavenging activity at a concentration of 100 mg/ml.
... Classification plays an important role in areas of gene selection [2], image classification, medical diagnosis [3][4], economic analysis, risk analysis [5][6], bioinformatics analysis [7] and many others [8]. There are only a few extensive empirical studies comparing classification performance of learning algorithms. ...
Article
Full-text available
Multi-class classification is a fascinating field to study. However, evaluating the classification performance of classifiers is difficult. Class indices such as accuracy, precision, recall, F-measure, Kappa, and the area under the receiver operating characteristic curve (AUC) can be used to evaluate classification performance. These indices describe the classification results achieved on each modelled class. Several measures have been introduced in the literature to deal with this assessment, the most commonly used being accuracy. In general, these metrics were proposed for binary classification tasks, whereas multiclass classification is more difficult and is an active research area in machine learning (ML). In this paper, we compare the classification performance of nine supervised machine learning algorithms from three learner types (statistical, rule-based, and neural-based learners) by considering the accuracy, precision, recall, F-measure, and ROC area achieved on four different datasets from the UCI Machine Learning Repository. Among these, Random Forest performed best in both 10-fold cross-validation and percentage split, with overall average accuracies of 92.20% and 91.76% respectively and less variability, whereas Naïve Bayes performed worst in both 10-fold cross-validation and percentage split, with average accuracies of 79.18% and 76.92% respectively and with the highest variability after Decision Table.
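The two evaluation protocols compared in this abstract, 10-fold cross-validation and a percentage split, differ only in how sample indices are divided. The sketch below shows both under stated assumptions: the function names, the fixed seed, and the 66% training fraction are illustrative choices, since the abstract does not give the split proportion.

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds, sizes differ by at most 1
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def percentage_split(n_samples, train_fraction=0.66, seed=0):
    """Single shuffled split; train_fraction is an assumed value."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(n_samples * train_fraction)
    return idx[:cut], idx[cut:]

splits = list(kfold_indices(25, k=10))
tested = sorted(j for _, test in splits for j in test)
train_part, test_part = percentage_split(100)
```

Cross-validation averages performance over k different test folds, while a percentage split yields one estimate from one partition, so some difference between the two reported accuracies is expected.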
... Over the last years, several studies have focused on building mathematical models able to predict cancer type based on molecular characteristics, including gene expression [46][47][48] and DNA methylation [49,50] profiles as well as somatic alteration analysis [9,13,51]. ...
Article
Full-text available
Background Machine learning (ML) represents a powerful tool to capture relationships between molecular alterations and cancer types and to extract biological information. Here, we developed a plain ML model aimed at distinguishing cancer types based on genetic lesions, providing an additional tool to improve cancer diagnosis, particularly for tumors of unknown origin. Methods TCGA data from 9,927 samples spanning 32 different cancer types were downloaded from cBioportal. A vector space model type data transformation technique was designed to build consistently homogeneous new datasets containing, as predictive features, calls for somatic point mutations and copy number variations at chromosome arm-level, thus allowing the use of the XGBoost classifier models. Considering the imbalance in the dataset, due to large difference in the number of cases for each tumor, two preprocessing strategies were considered: i) setting a percentage cut-off threshold to remove less represented cancer types, ii) dividing cancer types into different groups based on biological criteria and training a specific XGBoost model for each of them. The performance of all trained models was mainly assessed by the out-of-sample balanced accuracy (BACC) and the AUC scores. Results The XGBoost classifier achieved the best performance (BACC 77%; AUC 97%) on a dataset containing the 10 most represented tumor types. Moreover, dividing the 18 most represented cancers into three different groups (endocrine-related carcinomas, other carcinomas and other cancers),such analysis models achieved 78%, 71% and 86% BACC, respectively, with AUC scores greater than 96%. In addition, the model capable of linking each group to a specific cancer type reached 81% BACC and 94% AUC. Overall, the diagnostic potential of our model was comparable/higher with respect to others already described in literature and based on similar molecular data and ML approaches. 
Conclusions A boosted ML approach able to accurately discriminate different cancer types was developed. The methodology builds datasets simpler and more interpretable than the original data, while keeping enough information to accurately train standard ML models without resorting to sophisticated Deep Learning architectures. In combination with histopathological examinations, this approach could improve cancer diagnosis by using specific DNA alterations, processed by a replicable and easy-to-use automated technology. The study encourages new investigations which could further increase the classifier’s performance, for example by considering more features and dividing tumors into their main molecular subtypes.
... Support vector machine (SVM) has recently been explored for the prediction of protein-protein interactions (Bock and Gough, 2001), protein secondary structure prediction (Hua and Sun, 2001), protein fold recognition (Ding and Dubchak, 2001), protein function classification (Cai et al., 2003a), analysis of protein solvent accessibility and other biomedical problems, including microarray gene expression data analysis (Brown et al., 2000) and cancer diagnosis (Ramaswamy et al., 2001). ...
Thesis
Full-text available
Vector-borne diseases are some of the most widely distributed and poorly controlled diseases that affect both humans and livestock globally. East Coast Fever (ECF) is a tick-borne disease caused by the protozoan parasite Theileria parva, which is transmitted by Rhipicephalus appendiculatus ticks. It is endemic in Eastern, Central, and Southern Africa, where it kills at least 1 million cattle and causes economic losses exceeding 300 million USD annually. Current control methods such as chemotherapy, vector control using acaricides, and immunization by the Infection and Treatment Method (ITM) have not been effective in mitigating ECF. Although research is ongoing, an effective subunit vaccine against the disease is still elusive. Proteins facilitate most biological processes, and their high tendency to be antigenic makes them important targets for subunit vaccine development. One of the approaches to elucidate the function of such proteins is through protein-protein interaction studies, which are based on the 'guilt by association' principle. ECF vaccine research has mainly been focused on the host-parasite protein interactions. However, transmission of T. parva is dependent on its ability to survive in the tick vector. Therefore, targeting parasite and vector antigenic proteins expressed during tick life cycle stages may lead to the discovery of potential ECF transmission-blocking vaccine candidates. This study, therefore, aimed to develop a computational model that could predict protein-protein interactions between T. parva and its tick vector, so as to unravel proteins that are vital for vector and parasite survival. Also, the study predicted potential immune-responsive epitopes in these proteins as targets for vaccine development. A machine learning approach, Support Vector Machine (SVM), was trained at an accuracy of 93.14% to classify between interacting and non-interacting proteins.
The model's performance was evaluated by the area under the curve (AUC) of a receiver operating characteristic (ROC) curve. Immunoinformatics tools were used to identify MHC class I and class II epitopes. A total of 9,917 protein-protein interactions were predicted between the tick vector and parasite proteins at a prediction value range of 1-2.8, indicating the possibility of very strong interaction. Subsequently, 261 and 1,479 epitopes were predicted in R. appendiculatus and T. parva proteins, respectively. The functional domains identified in the interacting proteins revealed that these proteins are involved in numerous cellular functions that are important for parasite and vector survival. Additionally, literature showed that most tick proteins involved in the interactions are known anti-tick vaccine candidates, but none have been studied as transmission-blocking vaccines. The study concluded that protein interaction occurs between parasite and vector proteins that play key cellular functions for the survival of the ECF parasite and vector. Further, the interacting proteins have epitopes with potential to induce a protective immune response when used to vaccinate cattle. Hence, the uptake of immune response components during tick feeding may inhibit protein function, which in turn affects parasite and vector survival. These proteins are therefore promising targets for a cocktail transmission-blocking vaccine against East Coast Fever, but appropriate screening procedures are needed to validate the best candidates. This approach offers the advantage of both controlling tick numbers and disrupting the tick vector-pathogen interface, thereby blocking T. parva transmission to the bovine host.
... 2.2 Sparse PCA. Sparse PCA has received more attention over the last several years [12]. Sparse PCA has wide applications in various fields, which include bioinformatics [15,16], clustering and feature selection [17], multivariate time series analysis [18,19], large text data analysis [20], finance data analysis [21], natural language processing, machine vision [22] and so on. The sparse PCA algorithm introduced by Shen Ning-min and Li Jing can derive a new category of sparse PCA algorithms [12]. ...
Article
Full-text available
In the last few years, applications related to pattern recognition have been quickly increasing in a number of areas. Areas to which it is applied include communication (speech recognition, data compression), business (e.g. character recognition), medicine (diagnosis, abnormality detection and harmful action), military intelligence, biometric authentication, agriculture and many more. The dimension of the data is the number of features that are measured on each observation. This paper reviews traditional and current state-of-the-art dimension reduction methods. The objective of this paper is to summarize and compare some of the well-known and recent dimension reduction methods used in various stages of a pattern recognition system.
... Classification plays an important role in areas of gene selection [2], image classification, medical diagnosis [3][4], economic analysis, risk analysis [5][6], bioinformatics analysis [7] and many others [8]. There are only a few extensive empirical studies comparing classification performance of learning algorithms. ...
Article
Full-text available
Multi-class classification is a fascinating field to study. However, evaluating the classification performance of classifiers is difficult. Class indices such as accuracy, precision, recall, F-measure, Kappa and the area under the receiver operating characteristic curve (AUC) can be used to evaluate classification performance. These indices describe the classification results achieved on each modelled class. Several measures have been introduced in the literature to deal with this assessment, the most commonly used being accuracy. In general these metrics were proposed to address binary classification tasks, whereas multiclass classification is the more difficult and currently active research area in machine learning (ML). In this paper, we compare the classification performance of nine supervised machine learning algorithms based on three learner types (statistical, rule-based and neural-based learners) by considering the accuracy, precision, recall, F-measure and ROC area achieved on four different datasets from the UCI Machine Learning Repository. Among these, Random Forest performed best in both 10-fold cross-validation and percentage split, with overall average accuracies of 92.20% and 91.76% respectively and with less variability, whereas Naïve Bayes performed worst in both 10-fold cross-validation and percentage split, with average accuracies of 79.18% and 76.92% respectively, and with higher variability, second only to Decision Table.
... In addition to transcriptomics data, single-cell technology includes more than 200 techniques that profile the genomic, epigenetic, and proteomic data in individual cells (37). By providing detailed molecular information on cells, single-cell atlases can facilitate the identification of key genes and pathways linked to healthy and disease states, patient classification, and the development of targeted therapies (29,(38)(39)(40). It's essential for machine learning models used in analyzing biological and clinical data to be accurate and robust. ...
Preprint
Full-text available
For predictive computational models to be considered reliable in crucial areas such as biology and medicine, it is essential for them to be accurate, robust, and interpretable. A sufficiently robust model should not have its output affected significantly by a slight change in the input. Also, these models should be able to explain how a decision is made. Efforts have been made to improve the robustness and interpretability of these models as independent challenges, however, the effect of robustness and interpretability on each other is poorly understood. Here, we show that predicting cell type based on single-cell RNA-seq data is more robust by adversarially training a deep learning model. Surprisingly, we find this also leads to improved model interpretability, as measured by identifying genes important for classification. We believe that adversarial training will be generally useful to improve deep learning robustness and interpretability, thereby facilitating biological discovery.
... Multiclass classification can be accomplished by using any of a variety of methods. The simplest approach is the one-versus-one approach, which involves splitting a multiclass classification problem into multiple binary classification problems [63]. ...
Article
Full-text available
Cardiovascular diseases, specifically heart failure and aortic stenosis, are considered common and deadly, with the additional risk of developing dementia in the elderly population. Early diagnosis can help prevent or alleviate these diseases and potentially reduce mortality rates. Machine learning algorithms, especially gradient boosting (GB), can effectively predict the presence of these diseases through binary classification using demographic and medical data. However, research has yet to combine data from all three diseases for multiclass classification, which is the purpose of the present study. Using a dataset collected from Chiang Rai Prachanukroh Hospital, Chiang Rai, Thailand, a GB-based model is proposed for the multiclass classification of elderly people with heart failure, aortic stenosis, and dementia, with the inclusion of feature engineering techniques for maximum accuracy. Other existing methods, including decision tree, support vector machine, k-nearest neighbors, random forest, and extra trees were applied for comparison. The Optuna framework was used with the tree-structured Parzen estimator for hyperparameter optimization. The results produced by each classifier were compared using various performance metrics, namely precision, recall, F1 score, accuracy, the area under the receiver operating characteristic curve, the area under the precision-recall curve, and the Matthews correlation coefficient. The results are presented separately for each machine learning algorithm for comparison. Based on these metrics, it can be concluded that our proposed GB-based model outperformed other comparative models after applying feature engineering techniques.
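The one-versus-one decomposition mentioned in the citing excerpt above can be illustrated with a short sketch. It is not taken from the cited study: the helper names `one_vs_one_pairs` and `vote`, the class labels, and the pairwise outcomes are hypothetical, chosen only to show how a k-class problem becomes k(k-1)/2 binary problems whose results are combined by majority vote.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(labels):
    """Return every unordered pair of distinct class labels."""
    return list(combinations(sorted(set(labels)), 2))


def vote(pairwise_winners):
    """Pick the class that wins the most pairwise contests.

    Ties are broken by whichever class was counted first.
    """
    return Counter(pairwise_winners).most_common(1)[0][0]


# Three classes give 3 * 2 / 2 = 3 binary problems.
pairs = one_vs_one_pairs(["HF", "AS", "dementia"])

# Hypothetical outputs of the three binary classifiers for one patient.
winners = ["AS", "AS", "HF"]
predicted = vote(winners)
```

With 14 classes, as in the 14-tumor data set discussed on this page, the same decomposition would need 14 * 13 / 2 = 91 binary classifiers.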
... Although direct examination of tissue-via morphology and immunohistochemistry-has long guided cancer type diagnosis, advances in sequencing technologies have facilitated new ways of characterizing cancer [9,10]. These approaches have further enabled the development of several molecular diagnostics aimed at determining tumor origin specifically [11]. ...
Article
Full-text available
Introduction: Cancers assume a variety of distinct histologies, and may originate from a myriad of sites including solid organs, hematopoietic cells, and connective tissue. Clinical decision-making based on consensus guidelines such as the National Comprehensive Cancer Network (NCCN) is often predicated on a specific histologic and anatomic diagnosis, supported by clinical features and pathologist interpretation of morphology and immunohistochemical (IHC) staining patterns. However, in patients with nonspecific morphologic and IHC findings-in addition to ambiguous clinical presentations such as recurrence versus new primary-a definitive diagnosis may not be possible, resulting in the patient being categorized as having a cancer of unknown primary (CUP). Therapeutic options and clinical outcomes are poor for patients with CUP, with a median survival of 8-11 months. Methods: Here, we describe and validate the Tempus Tumor Origin (Tempus TO) assay, an RNA-sequencing-based machine learning classifier capable of discriminating between 68 clinically relevant cancer subtypes. Model accuracy was assessed using primary and/or metastatic samples with known subtype. Results: We show that the Tempus TO model is 91% accurate when assessed on both a retrospectively held out cohort and a set of samples sequenced after model freeze that collectively contained 9210 total samples with known diagnoses. When evaluated on a cohort of CUPs, the model recapitulated established associations between genomic alterations and cancer subtype. Discussion: Combining diagnostic prediction tests (e.g., Tempus TO) with sequencing-based variant reporting (e.g., Tempus xT) may expand therapeutic options for patients with cancers of unknown primary or uncertain histology.
... Also, to improve the performance of the classifier, these researchers, e.g. Ramaswamy et al. [6] and many others have done great work. However, we have added a new dimension to it through our work to simply deal with any vast data-set. ...
... Cancer class discovery and prediction using Neighbourhood analysis was presented in [13]. Yet another work of classification of multiclass cancer diagnosis using tumor gene expression was investigated in [14] [15] resulting in the improvement of classification analysis. ...
Article
Full-text available
Tumor clustering from gene expression data has paramount implications for cancer diagnosis and treatment. The adoption of clustering techniques for bio-molecular data provides new way for cancer diagnosis and treatment. In order to perform successful cancer diagnosis and treatment, cancer class discovery using bio-molecular data is considered to be one of the most important tasks. Several single clustering approaches were performed for tumor clustering but it had several drawbacks such as stability, accuracy and robustness. In this paper to improve the tumor clustering, we employ a framework, called, Hybrid Support Vector Machine (HSVM) which incorporates PSO-based feature extraction and GA-based feature selection. Specifically, the framework represents the generation of cluster in the first stage which is performed through Markov clustering algorithm. Then, the SVM classification process is adopted to generate or classify the bio-molecular data into benign tumor or malignant tumor. Our experimental results on real datasets collected from UCI machine learning repository and cancer gene expression profile show HSVM can improve the accuracy of clustering gene expression data than other related technique. The Markov clustering algorithm employed in HSVM achieves comparatively better diagnostic performance, capable of classifying the bio-molecular data into benign tumor or malignant tumor based on gene expression data.
... Gene expression profiling studies in primary tumors have repeatedly demonstrated differences between normal and malignant tissues (1,2). It is becoming clear that expression profiles within tumors seem to be correlated with overall survival (3 -7), and a recent study suggests that expression profiling of primary tumor biopsies yields prognostic ''signatures'' that rival or may even outperform currently accepted standard measures of risk in cancer patients (8). ...
Article
Purpose: Given their accessibility, surrogate tissues, such as peripheral blood mononuclear cells (PBMC), may provide potential predictive biomarkers in clinical pharmacogenomic studies. In leukemias and lymphomas, the prognostic value of peripheral blast expression profiles is clear; however, it is unclear whether circulating mononuclear cells of patients with solid tumors might yield profiles with similar prognostic associations. Experimental Design: In this study, we evaluated the association of expression profiles in PBMCs with clinical outcomes in patients with advanced renal cell cancer. Transcriptional patterns in PBMCs of 45 renal cell cancer patients were compared with clinical outcome data at the conclusion of a phase II study of the mTOR kinase inhibitor CCI-779 to determine whether pretreatment transcriptional patterns in PBMCs were correlated with eventual patient outcomes. Results: Unsupervised hierarchical clustering of the PBMC profiles using all expressed genes identified clusters of patients with significant differences in survival. Cox proportional hazards modeling showed that the expression levels of many PBMC transcripts were predictors for the patient outcomes of time to progression and overall survival (time to death). Supervised class prediction approaches identified multivariate expression patterns in PBMCs capable of assigning favorable outcomes of time to death and time to progression in a test set of renal cancer patients, with overall performance accuracies of 72% and 85%, respectively. Conclusions: The present study provides the first example of gene expression profiling in peripheral blood, a clinically accessible surrogate tissue, for identifying patterns of gene expression associated with higher likelihoods of positive outcome in patients with a solid tumor.
... This technology has made it conceivable to relate biological cell states to gene expression patterns for studying tumor genesis, progression of diseases, cellular response, and identification of drug targets. For example, subsets of genes with amplified and declined activities have been recognized for acute lymphoblast leukemia [15,34], tumor genesis [6], prostate cancer [27], apoptosis induction [8], colon cancer [2], breast cancer [36], drug response [8], multiple tumor types [23] and lung cancer [40]. Similarly, microarrays disclose alterations in genetic makeup of an individual, its supervisory mechanisms and refined variations and might direct towards the timeline of adapted medicine to cure the diseases causing hindrance in the treatments (Trevino et al., 2007). Figure 01 establishes the utilities of microarrays in a single experiment, which was not even imagined a few years back but now has been made possible with the chip technology. ...
Article
Full-text available
Microarrays with biomolecules, cells and tissues arrested on compact substrates are noteworthy tools for biological exploration, counting on genomics and proteomics as well as cell analysis. The demand of microarray tools is the possibility of large-scale corresponding determination of a diversity of variables concurrently. Henceforth, microarray technologies fascinate the concern equally of the scientific and professional domains alike. High-throughput screening has been the foremost focus of the exploitation of microarray technologies in modern years, and has delivered the resilient driving force for expansions in this arena. DNA chip and biochip skills have been established as a magnitude of wide-reaching activity in genome exploration. In this review, the current state of microarray fabrication is reviewed and also on microarray-based analysis, microarray stages, techniques and applications.
... The datasets are taken from the UCI machine learning repository (Dua and Graff 2017), Keel repository (Alcalá-Fdez et al. 2011) and ASU feature selection repository (Li et al. 2018). The microarray GCM data is collected from (Ramaswamy et al. 2001). The data dimensions are reported in the supplement. ...
Article
Mean shift is a simple interactive procedure that gradually shifts data points towards the mode which denotes the highest density of data points in the region. Mean shift algorithms have been effectively used for data denoising, mode seeking, and finding the number of clusters in a dataset in an automated fashion. However, the merits of mean shift quickly fade away as the data dimensions increase and only a handful of features contain useful information about the cluster structure of the data. We propose a simple yet elegant feature-weighted variant of mean shift to efficiently learn the feature importance and thus, extending the merits of mean shift to high-dimensional data. The resulting algorithm not only outperforms the conventional mean shift clustering procedure but also preserves its computational simplicity. In addition, the proposed method comes with rigorous theoretical convergence guarantees and a convergence rate of at least a cubic order. The efficacy of our proposal is thoroughly assessed through experimental comparison against baseline and state-of-the-art clustering methods on synthetic as well as real-world datasets.
... To explore the potential role and specific mechanism of PD-L1 in cancer progression, we assessed the expression profiles of PD-L1 in tumours and adjacent noncancerous tissues in TCGA datasets. We analysed the expression of PD-L1 in different types of cancer by analysing the Ramaswamy multicancer dataset from the Oncomine database 18,19 (Figure 1A). We found that PD-L1 was predominantly upregulated in CRC (n = 330), lymphoma (n = 19), and cervical cancer (n = 35) datasets (Figure 1A). ...
Article
Full-text available
PD‐L1 is widely known as an immune checkpoint, and immunotherapy through the inhibition of checkpoint molecules has become an important component in the successful treatment of tumours via PD‐1/PD‐L1 signaling pathways. However, its biological functions and expression profile in colorectal cancer (CRC) are elusive. We previously found that PD‐L1 can bind to PD‐L1 and cause cell detachment. However, the detailed molecular mechanisms of how PD‐L1 binds to PD‐L1 and how it transmits signals to the cell remain unclear. In this study, we disclosed that PD‐L1 expression was dramatically upregulated in CRC compared to normal tissues. Ectopic expression of PD‐L1 inhibits cell adhesive capacity and promotes cell migration in CRC cell lines, while silencing PD‐L1 had the opposite effects and suppressed invasion and proliferation. Mechanistically, PD‐L1 was found to promote EMT through the ERK signaling molecule pathway and interacted with the 1–86 aa fragment of KRAS to transduce signals. Collectively, our study demonstrated the role of PD‐L1 after binding to PD‐L1 in CRC, thereby providing a new theoretical basis for further improving immunotherapy with anti‐PD‐L1 antibodies.
... Golub et al. 1999;Ramaswamy et al. 2001;Halder et al 2015;Wang et al. 2007 . ...
Article
Full-text available
Cancer is one of the leading causes of death in the world. In most cases, it can be treated if the disease is detected earlier. One way to diagnose cancer is to use microarray data, which, unlike imaging, does not contain harmful rays to humans. Microarrays have many genes that complicate and take time to analyse, so selecting useful genes is one of the key steps in diagnosing the disease. The proposed method in this paper has two main phases, the first of which is the selection of effective genes. In the next phase, the disease is diagnosed from the selected genes by the first phase. In the past, many algorithms such as Ridge have been proposed for this purpose, which according to the results of experiments, the accuracy of the method proposed in this paper is superior to them. In this paper, leukaemia and colon cancer datasets are used as input and evaluation of the proposed method. The precision of the
... 1av) using the randomForest R package (v4.6-14) with default settings. A filter is applied to the probabilities produced by the random forest based on sample gender, where breast, ovary and cervix probabilities are set to 0 for male samples, and prostate probabilities are set to 0 for female samples. ...
Article
Full-text available
Cancers of unknown primary (CUP) origin account for ∼3% of all cancer diagnoses, whereby the tumor tissue of origin (TOO) cannot be determined. Using a uniformly processed dataset encompassing 6756 whole-genome sequenced primary and metastatic tumors, we develop Cancer of Unknown Primary Location Resolver (CUPLR), a random forest TOO classifier that employs 511 features based on simple and complex somatic driver and passenger mutations. CUPLR distinguishes 35 cancer (sub)types with ∼90% recall and ∼90% precision based on cross-validation and test set predictions. We find that structural variant derived features increase the performance and utility for classifying specific cancer types. With CUPLR, we could determine the TOO for 82/141 (58%) of CUP patients. Although CUPLR is based on machine learning, it provides a human interpretable graphical report with detailed feature explanations. The comprehensive output of CUPLR complements existing histopathological procedures and can enable improved diagnostics for CUP patients.
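The sex-based probability filter described in the citing excerpt above can be sketched as follows. This is a minimal illustration and not CUPLR's implementation: the function name `apply_sex_filter`, the dictionary layout, and the example probabilities are hypothetical. Only the rule itself (breast, ovary and cervix set to 0 for male samples, prostate set to 0 for female samples) comes from the excerpt. The excerpt does not say whether the remaining probabilities are renormalized, so the sketch leaves them unchanged.

```python
# Cancer types that are ruled out for each sample sex, per the excerpt.
BLOCKED_BY_SEX = {
    "male": {"breast", "ovary", "cervix"},
    "female": {"prostate"},
}


def apply_sex_filter(probs, sex):
    """Set the probability of sex-incompatible cancer types to 0.

    probs: mapping of cancer type -> classifier probability (hypothetical layout).
    sex:   "male" or "female"; any other value leaves probs unchanged.
    """
    blocked = BLOCKED_BY_SEX.get(sex, set())
    return {
        cancer_type: 0.0 if cancer_type in blocked else p
        for cancer_type, p in probs.items()
    }


# Made-up probabilities for a single sample.
raw = {"breast": 0.5, "prostate": 0.3, "lung": 0.2}
filtered_male = apply_sex_filter(raw, "male")
filtered_female = apply_sex_filter(raw, "female")
```

Because the filter returns a new dictionary, the unfiltered probabilities remain available for inspection alongside the filtered ones.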
... A previous study used a support vector machine (SVM) algorithm to establish a multiclass classifier to diagnose multiple common adult malignancies. The overall classification accuracy of the classifier was 78%, far exceeding the accuracy of random classification (9%) [26]. Mu et al. reported an 18F-FDG-PET/CT-based EGFR-deep learning score that can provide decision support for NSCLC treatment with TKIs or immune checkpoint inhibitors (ICIs) [27]. ...
Article
Full-text available
Background Timely identification of epidermal growth factor receptor (EGFR) mutation and anaplastic lymphoma kinase (ALK) rearrangement status in patients with non-small cell lung cancer (NSCLC) is essential for tyrosine kinase inhibitors (TKIs) administration. We aimed to use artificial intelligence (AI) models to predict EGFR mutations and ALK rearrangement status using common demographic features, pathology and serum tumor markers (STMs). Methods In this single-center study, demographic features, pathology, EGFR mutation status, ALK rearrangement, and levels of STMs were collected from Wuhan Union Hospital. One retrospective set (N = 1089) was used to train diagnostic performance using one deep learning model and five machine learning models, as well as the stacked ensemble model for predicting EGFR mutations, uncommon EGFR mutations, and ALK rearrangement status. A consecutive testing cohort (n = 1464) was used to validate the predictive models. Results The final AI model using the stacked ensemble yielded optimal diagnostic performance with areas under the curve (AUC) of 0.897 and 0.883 for predicting EGFR mutation status and 0.995 and 0.921 for predicting ALK rearrangement in the training and testing cohorts, respectively. Furthermore, an overall accuracy of 0.93 and 0.83 in the training and testing cohorts, respectively, were achieved in distinguishing common and uncommon EGFR mutations, which were key evidence in guiding TKI selection. Conclusions In this study, driverless AI based on robust variables could help clinicians identify EGFR mutations and ALK rearrangement status and provide vital guidance in TKI selection for targeted therapy in NSCLC patients.
... Although direct examination of tissue-via morphology and immunohistochemistry-has long guided cancer type diagnosis, advances in sequencing technologies have facilitated new ways of characterizing cancer 9,10 . These approaches have further enabled the development of several molecular diagnostics aimed at determining tumor origin specifically 11 . ...
Preprint
Full-text available
Cancers assume a variety of distinct histologies and may originate from a myriad of sites including solid organs, hematopoietic cells, and connective tissue. Clinical decision making based on consensus guidelines such as NCCN is often predicated on a specific histologic and anatomic diagnosis, supported by clinical features and pathologist interpretation of morphology and immunohistochemical (IHC) staining patterns. However, in patients with nonspecific morphologic and IHC findings—in addition to ambiguous clinical presentations such as recurrence versus new primary—a definitive diagnosis may not be possible, resulting in the patient being categorized as having a cancer of unknown primary (CUP). Therapeutic options and clinical outcomes are poor for CUP patients, with a median survival of 8-11 months. Here we describe and validate the Tempus Tumor Origin (Tempus TO) assay, an RNA-seq-based machine learning classifier capable of discriminating between 68 clinically relevant cancer subtypes. We show that the Tempus TO model is 91% accurate when assessed on retrospectively and prospectively held out cohorts containing 9,210 samples with known diagnoses. When evaluated on a cohort of CUPs, the model recapitulated established associations between genomic alterations and cancer subtype. Combining diagnostic prediction tests (e.g., Tempus TO) with sequencing-based variant reporting (e.g., Tempus xT) may expand therapeutic options for patients with cancers of unknown primary or uncertain histology.
... However, high dimensionality is involved in biological and gene expression data. Thus, reduction of dimensionality of feature space was attempted by many researchers (Ramaswamy et al. 2001;Aliferis et al. 2010;Wang et al. 2005). Usual methods like Factor Analysis, Principal Component Analysis, etc. did not yield satisfactory results primarily due to small sample size for specific cancer type, different feature sets associated with each type of cancer, possible nonlinear relationships between expressions of different genes, noncompliance of assumptions, etc. Morphologic characteristics of biopsy specimens are common in diagnosis of tumor. ...
... On the other hand, it was shown that KNN outperformed ANN in the classification of cherry coffee beans [6]. PCA is also used to accelerate the training time of KNN [7]. The Support Vector Machine (SVM) has a 78 percent accuracy rate in detecting cancers [8]. SVM has also been used effectively for multiclass prediction [9], including an SVM classifier of dry beans [10]. Directed acyclic graphs can also be used to generate accurate binary predictions in SVM multiclass classification [2]. ...
Article
Full-text available
The technique of multiclass classification based on SVMs has been widely used. SVM optimization will be accomplished by examining the extraction features of Principal Component Analysis (PCA), Box-Cox Transformation, and Recursive Feature Elimination (RFE). The dataset contains 13,611 rows and 17 variables, generated from the UCI repository's multiclass dry bean data. Barbunya, Bombay, Cal, Dermas, Horoz, Seker, and Sira are just a few of the dry bean kinds available. The dataset was tested using SVM Linear kernel and SVM Radial Basis. According to the results, the combination of scale-center-BoxCox-SVM Radial extraction achieves the maximum accuracy of 93.16 percent and the shortest processing time of 6.10 minutes. The figures by bean class are 96.00 percent, 100 percent, 96.71 percent, 95.16 percent, 97.60 percent, 97.74 percent, and 91.95 percent. RFE-SVM Radial has a 91.18 percent accuracy and a processing time of 6.55 minutes. BoxCox outperforms conventional techniques in terms of prediction accuracy while requiring less training time.
... This study illustrates the utilization of NMF based on the UIK method to select optimal rank on RSS curve with leukemia (esGolub) gene expression data set in simplifying cancer subtypes (23)(24)(25). It was used in several papers on NMF and was built in the NMF package's data (7,14,26), packed into an ExpressionSet object. ...
Article
Full-text available
Background: There is a great need to develop a computational approach to analyze and exploit the information contained in gene expression data. The recent use of non-negative matrix factorization (NMF) in computational biology has shown its capability to derive essential details from large amounts of data, in particular gene expression microarrays. Objective: A common problem in NMF is finding the proper factorization rank (r). Thus, various techniques have been suggested to select the optimal value of the factorization rank (r). Method: This study employs the unit invariant knee (UIK) method to calculate the factorization rank (number of basis vectors) of the NMF of gene expression data sets. Because the UIK method requires an extremum distance estimator (EDE) that is eventually employed for inflection and identification of a knee point, this study finds the first inflection point of curvature of RSS of the proposed algorithms using the UIK method on gene expression datasets as a target matrix. Results: Computation was conducted for the UIK task using the esGolub data set in RStudio, and the distinct NMF results were then compared across different algorithms. The proposed UIK method is easy to perform, free of a priori rank value input, and does not require initial parameters that significantly influence the model’s functionality. Conclusion: This study demonstrates that the UIK method provides a credible prediction for gene expression data and a precise estimate for simulated mutational processes data with known dimensions.
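To illustrate the general idea of picking a rank from the knee of an RSS curve, here is a minimal sketch. It does not implement the UIK or EDE procedure from the abstract; it swaps in a simpler, widely used heuristic that picks the point farthest below the chord joining the curve's endpoints. The function name and the RSS values are hypothetical.

```python
def chord_knee(ranks, rss):
    """Return the rank whose RSS lies farthest below the end-to-end chord.

    Assumes ranks are increasing and RSS is higher at the first rank
    than at the last (a decreasing curve).
    """
    x0, x1 = ranks[0], ranks[-1]
    y0, y1 = rss[0], rss[-1]
    # Rescale both axes to [0, 1] so the result does not depend on units.
    xs = [(r - x0) / (x1 - x0) for r in ranks]
    ys = [(v - y1) / (y0 - y1) for v in rss]
    # The chord runs from (0, 1) to (1, 0), i.e. x + y = 1; the gap
    # 1 - x - y is proportional to the distance below that chord.
    gaps = [1 - x - y for x, y in zip(xs, ys)]
    return ranks[gaps.index(max(gaps))]

# Illustrative RSS values only, not results from any data set.
best_rank = chord_knee([2, 3, 4, 5, 6], [100.0, 40.0, 30.0, 25.0, 20.0])
```

With these illustrative numbers the steep drop ends at rank 3, which is the rank the heuristic selects.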
... Gene expression profiling as a means to classify MTUH was demonstrated to be feasible by several studies in the early 2000s [28][29][30][31][32]. Although some studies suggested outcome benefits and prognostic use of using gene expression panels to guide management, the only available randomized clinical trial did not show benefit, and the ESMO clinical guideline for CUP diagnosis and treatment does not recommend gene expression profiling [33][34][35][36]. ...
Article
Full-text available
In this study, we evaluate the impact of whole genome and transcriptome analysis (WGTA) on predictive molecular profiling and histologic diagnosis in a cohort of advanced malignancies. WGTA was used to generate reports including molecular alterations and site/tissue of origin prediction. Two reviewers analyzed genomic reports, clinical history, and tumor pathology. We used National Comprehensive Cancer Network (NCCN) consensus guidelines, Food and Drug Administration (FDA) approvals, and provincially reimbursed treatments to define genomic biomarkers associated with approved targeted therapeutic options (TTOs). Tumor tissue/site of origin was reassessed for most cases using genomic analysis, including a machine learning algorithm (Supervised Cancer Origin Prediction Using Expression [SCOPE]) trained on The Cancer Genome Atlas data. WGTA was performed on 652 cases, including a range of primary tumor types/tumor sites and 15 malignant tumors of uncertain histogenesis (MTUH). At the time WGTA was performed, alterations associated with an approved TTO were identified in 39 (6%) cases; 3 of these were not identified through routine pathology workup. In seven (1%) cases, the pathology workup either failed, was not performed, or gave a different result from the WGTA. Approved TTOs identified by WGTA increased to 103 (16%) when applying 2021 guidelines. The histopathologic diagnosis was reviewed in 389 cases and agreed with the diagnostic consensus after WGTA in 94% of non‐MTUH cases (n = 374). The remainder included situations where the morphologic diagnosis was changed based on WGTA and clinical data (0.5%), or where the WGTA was non‐contributory (5%). The 15 MTUH were all diagnosed as specific tumor types by WGTA. Tumor board reviews including WGTA agreed with almost all initial predictive molecular profile and histopathologic diagnoses. WGTA was a powerful tool to assign site/tissue of origin in MTUH. 
Current efforts focus on improving therapeutic predictive power and decreasing cost to enhance use of WGTA data as a routine clinical test.
... Recently, machine-learning technology has been intensively used for classification tasks [7]. This research has included a range of techniques including support vector machines (SVM) [8], evolutionary processes [9,10], K-nearest neighbors [11,12], decision trees [13][14][15], artificial neural network (ANN) [16][17][18] and clustering [19]. ...
Article
Full-text available
Methods such as biopsy and surgery for breast cancer diagnosis are expensive, invasive, and carry risks. The purpose of this study is to propose a new method for obtaining the best artificial neural network (ANN) model using cytology results such as clump thickness,… so that it can be applied to diagnose breast tumors as benign or malignant with minimum error and maximum reliability instead of invasive methods. The Wisconsin breast cancer database was used. We applied the genetic algorithm (GA) to determine the best structure and to train a multi-layer NN, i.e. to optimize the values of the weights and biases. The implementation in MATLAB showed that GA is able to determine the best structure for a multi-layer NN and train it properly too. Then the error back propagation algorithm (EBPA) was used to train the models proposed by GA, and we found that EBPA trains the neural networks (NNs) as well as GA, and even better in some cases. Accordingly, the EBPA was chosen as the basic algorithm for training NNs whose structure was proposed by GA. 5-fold cross-validation was used for a closer evaluation of the performance of these NNs. Based on the results obtained in different performances, the best neural network model had the structure (9-6-4-1), with an average accuracy, sensitivity and specificity of 0.974, 0.979 and 0.971, respectively. Accordingly, a new method to obtain an optimized model based on the available data is proposed, in which the structure of the NN is determined by the GA and training is done by EBPA. The performance of the proposed NN model was compared with logistic regression (LR), and the ANN performed better than LR in the diagnosis of benign or malignant tumors. Thus, the ANN model obtained by the described method can replace invasive medical methods and identify patients who do not need a biopsy and surgery, with high accuracy and sensitivity.
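The abstract above reports accuracy, sensitivity and specificity. As a reminder of how these three figures are derived from a binary confusion matrix, here is a minimal sketch. The function name and the labels are hypothetical, and it assumes both classes occur in the true labels.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Illustrative labels only (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Under k-fold cross-validation these metrics would be computed on each held-out fold and then averaged.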
... Particularly, machine learning (ML) and deep learning (DL) technologies include a wide range of algorithms to detect the most significant gene or gene sets from the dataset which might be important for cancer progression or may serve as potential diagnostic and prognostic markers in the future. However, some ML-based algorithms, such as fuzzy neural network (FNN) [12], hierarchical clustering [13], K-means clustering [14], self-organizing map (SOM) [15], and decision tree (DT) method [16], classify the expressed genes with low accuracy, while support vector machine (SVM) [17], artificial neural network (ANN) [18] and genetic algorithm [19] are noted to have higher accuracy. Here, we have described the computational intelligence-based applications used to identify and classify the genes which are significantly differentially expressed between control and CRC. ...
Chapter
Full-text available
Key challenges in cancer gene expression analysis are to collectively identify the gene (tumor suppressor, proto-oncogenes, and mismatch repair) or sets of genes that are differentially expressed significantly in cancer and normal samples. High throughput expression techniques provide quantitative data about the expression of thousands of genes per biological sample. However, such technologies have some common problems including but not limited to dataset complexity, data integration, and noise. Data generated from the latest technologies are growing at an explosive rate; therefore, it becomes essential to dig out useful information from this data and create biological knowledge. Moreover, traditional data analysis is sometimes not effective to extract valuable information from biological datasets. More recently, to overcome these issues, computational intelligence methods such as artificial intelligence (AI) and machine learning (ML) have been widely applied to study gene co-expression networks, differential gene expression analysis, pathway analysis, and predicting biomarkers and therapeutic targets in cancer. In this chapter, we attempt to describe how computational intelligence-based algorithms can contribute to this field to generate quality knowledge needed in cancer biology. Special emphasis is given on colorectal cancer (CRC) studies, wherein apart from expression biomarkers, diagnostic potential of AI-based analysis has also been investigated. We have highlighted algorithms widely used in identifying unique gene expression signatures in CRCs. AI and ML-based methodologies could help us identify high-risk genes or gene sets and their aberrant expression associated with them. In the future, this would help us improve diagnosis and prognosis for better monitoring and management of CRC. Keywords: Colorectal cancer; Artificial intelligence; Gene expression; High-throughput technologies; Machine learning
... It should be reiterated that MNALCI is not meant to replace other methods developed for liquid biopsy. On the contrary, we believe this approach could be combined with screening strategies based on other biomarkers, such as mRNA, miRNA, mutated or 5-Hydroxymethylated cfDNA, circulating proteins, etc. [5,8,55,56]. As shown in Supplementary Fig. 39, our assay was unable to discriminate common pathogenic mutations in NSCLC, CRC and PTC. ...
Article
Full-text available
As cancer is increasingly considered a metabolic disorder, it is postulated that serum metabolite profiling can be a viable approach for detecting the presence of cancer. By multiplexing mass spectrometry fingerprints from two independent nanostructured matrixes through machine learning for highly sensitive detection and high throughput analysis, we report a laser desorption/ionization (LDI) mass spectrometry-based liquid biopsy for pan-cancer screening and classification. The Multiplexed Nanomaterial-Assisted LDI for Cancer Identification (MNALCI) is applied in 1,183 individuals that include 233 healthy controls and 950 patients with liver, lung, pancreatic, colorectal, gastric, thyroid cancers from two independent cohorts. MNALCI demonstrates 93% sensitivity at 91% specificity for distinguishing cancers from healthy controls in the internal validation cohort, and 84% sensitivity at 84% specificity in the external validation cohort, with up to eight metabolite biomarkers identified. In addition, across those six different cancers, the overall accuracy for identifying the tumor tissue of origin is 92% in the internal validation cohort and 85% in the external validation cohort. The excellent accuracy and minimum sample consumption make the high throughput assay a promising solution for non-invasive cancer diagnosis. Here, the authors report a machine learning model using mass spectrometry-based liquid biopsy data for pan-cancer screening and classification.
... were decreased in several kinds of breast cancer compared to standard samples [31]. Additionally, Ramaswamy et al. [32] reported that the low mRNA level of EIF3F was found in different kinds of breast cancer (P < 0.001, fold change = −3.372). A low expression level of EIF3G was found in ductal breast carcinoma in the Richardson dataset [33]. ...
Article
Full-text available
The EIF3 gene family is essential in controlling translation initiation during the cell cycle. The significance of the EIF3 subunits as prognostic markers and therapeutic targets in breast cancer is not yet clear. We analyzed the expression of EIF3 subunits in breast cancer on the GEPIA and Oncomine databases and compared their expression in breast cancer and normal tissues using BRCA data downloaded from TCGA. Then we performed clinical survival analysis on the Kaplan–Meier Plotter database and clinicopathologic analysis on the bc-genexMiner v4.1 database. EIF3B was chosen for mutation analysis via the Cancer SEA online tool. Meanwhile, we performed immunohistochemical assays, real-time RT-PCR, and Western blotting to analyze EIF3B expression levels in breast cancer. An EIF3B knockdown cell line and a negative control cell line were established for MTT assay and cell cycle analysis to assess cell growth. Specifically, the results of TCGA and online databases demonstrated that upregulated EIF3B was associated with poorer overall survival and advanced tumor progression. We also confirmed that EIF3B was more highly expressed in breast cancer cells and tissues than in normal ones and correlated with a worse outcome. Knockdown of EIF3B expression inhibited the cell cycle and proliferation. Furthermore, EIF3B was highly mutated in breast cancer. Collectively, our results suggested EIF3B as a potential prognostic marker and therapeutic target for breast cancer.
... Moreover, we found that the risk score was also related to the expression of CIC negative regulatory markers. DNMT1 belongs to the DNA methyltransferase family, and its overexpression is identified in human T-cell, B-cell, and myeloid malignancies, indicating that DNMT1 may play a crucial role in tumor maintenance [46,47]. Nectin-3, a nectin family member, takes part in regulating the formation of adherens junctions and has been reported to be a novel biomarker in tumorigenesis. ...
Article
Full-text available
Esophageal carcinoma (ESCA) is one of the most frequent types of malignant tumor and has a dismal prognosis. This research used datasets obtained from GEO and TCGA to create a prognostic signature for forecasting the clinical outcome of ESCA patients on the basis of a circRNA-associated regulatory network. Methods. A regulatory network associated with ESCA was established based on transcriptome data of circRNAs, miRNAs, and mRNAs. Functional annotations were implemented to further explore the mechanism of ESCA. Cox relative regression method was applied to create a risk signature. Besides, the immune microenvironment of the signature was investigated by utilizing the CIBERSORT algorithm. Results. Based on 27 DEcircRNAs, 65 DEmiRNAs, and 780 DEmRNAs, the circRNA-miRNA-mRNA network was finally set up. Functional enrichment revealed that the regulatory network might participate in negative regulation of phosphorylation, the MAPK pathway, and the PI3K/AKT pathway. This study established a risk scoring signature based on seven immune-related genes (IRGs) (MARP14, RASGR1, PTK2, HMGB1, DKK1, RARB, and IGF1R), which was validated for its reliability. A stable and accurate nomogram combining immune-related risk scores with clinical features was constructed. Furthermore, we observed that the risk model was also related to immunocyte infiltration. Conclusion. Our study successfully created a circRNA-associated regulatory network and further developed an immune-related model to forecast the clinical outcome of ESCA patients as well as to assess their immune status.
... These omics techniques are pivotal aspects of the development of personalized medicine by enabling a better understanding of fine-grained molecular mechanisms [2]. In oncology, these techniques provide a more comprehensive insight into the intricacy of biological processes in cancers, giving momentum to molecular-type characterization through omics or even multiomics approaches [3]-[5]. Such a precise and robust characterization is a highly valuable asset for tumor characterization and provides significant acumen on their treatment. ...
Article
Full-text available
Precision medicine is a paradigm shift in healthcare relying heavily on genomics data. However, the complexity of biological interactions, the large number of genes and the lack of comparisons on the analysis of data, remain a tremendous bottleneck regarding clinical adoption. We introduce a novel, automatic and unsupervised framework to discover low-dimensional gene biomarkers. Our method is based on the center-based LP-Stability clustering algorithm. Our evaluation includes both mathematical and biological criteria. The recovered signature is applied to several biological tasks, including screening of biological pathways and functions, and tumor types and subtypes characterization. Quantitative comparisons among different distance metrics, commonly used clustering methods and a referential gene signature from literature, confirm the high performance of our approach. In particular, our signature, based on 27 genes, reports at least 30 times better mathematical significance (Dunn's Index) and 25% better biological significance (Enrichment in Protein-Protein Interaction) than those produced by other referential clustering methods. Our signature reports promising results on distinguishing immune-inflammatory and immune desert tumors, while reporting a high balanced accuracy of 92% on tumor types classification and averaged balanced accuracy of 68% on tumor subtypes, which represents 7% and 9% higher performance compared to the referential signature.
... The sequence of the entire human genome contains about 3 billion base pairs and tens of thousands of genes [9]. Gene expression data is essentially high-dimensional, and its feature dimension is highly correlated with the number of genes [10]. Humans possess tens of thousands of genes, resulting in unprocessed gene expression data that typically has tens of thousands of features. ...
Article
Full-text available
Background Correctly classifying the subtypes of cancer is of great significance for the in-depth study of cancer pathogenesis and the realization of personalized treatment for cancer patients. In recent years, classification of cancer subtypes using deep neural networks and gene expression data has gradually become a research hotspot. However, most classifiers may face overfitting and low classification accuracy when dealing with small sample size and high-dimensional biology data. Results In this paper, a laminar augmented cascading flexible neural forest (LACFNForest) model was proposed to complete the classification of cancer subtypes. This model is a cascading flexible neural forest using deep flexible neural forest (DFNForest) as the base classifier. A hierarchical broadening ensemble method was proposed, which ensures the robustness of classification results and avoids the waste of model structure and function as much as possible. We also introduced an output judgment mechanism to each layer of the forest to reduce the computational complexity of the model. The deep neural forest was extended to the densely connected deep neural forest to improve the prediction results. The experiments on RNA-seq gene expression data showed that LACFNForest has better performance in the classification of cancer subtypes compared to the conventional methods. Conclusion The LACFNForest model effectively improves the accuracy of cancer subtype classification with good robustness. It provides a new approach for the ensemble learning of classifiers in terms of structural design.
... [33]. The associations between expression levels and the meta-analysis of data from 7 previous studies [34][35][36][37][38][39] were analyzed using the Oncomine database (https://www.oncomine.com) [40]. ...
Article
Full-text available
Colorectal cancer (CRC) is one of the deadliest cancers in the world. Multiple factors and bio-processes are associated with tumorigenesis and metastasis of CRC, including cellular senescence and immune evasion. This study aims to identify prognostic and immune-mediating effects of INHBA in CRC. Microarray datasets were downloaded from the Gene Expression Omnibus (GEO) database to screen the differentially expressed genes (DEGs) in senescent cells and CRC tissues from the Cancer Genome Atlas (TCGA). The key factor was identified from the alternative DEG set. Enrichment analyses and functional network predictions were performed using online databases. Correlation analyses were performed to reveal the association among the key factor, immune infiltration, T cell biomarkers and immune checkpoints. Moreover, expression of the key factor and immune checkpoints was measured in tissue and blood samples from CRC patients as well as in human CRC cell lines. Results showed that Inhibin beta A (INHBA) was identified as a senescence-related factor and a prognostic predictor in CRC. Moreover, INHBA was found to be highly co-expressed with T-cell biomarkers and immune checkpoints. In conclusion, INHBA was considered a senescence-related regulator and a prognostic predictor in CRC that also mediates immune evasion.
Article
Full-text available
Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA-seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA-seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results. This article is categorized under: Algorithmic Development > Biological Data Mining; Technologies > Machine Learning. Keywords: feature selection, gene expression, machine learning, microarray, RNA-seq
Chapter
Toxicogenomics (TGX) may be defined as a toxicological subdiscipline of pharmacogenomics, which is defined as the study of interindividual variations in whole‐genome or candidate gene single‐nucleotide polymorphism maps, haplotype markers, and alterations in gene expression that might correlate with drug responses. For much of the field of toxicology, the primary focus is on determining the probability and potential exposure‐related aspects of risk. This chapter presents potential uses of TGX in this process. There are, of course, a number of considerations involved in the use of TGX as a tool in risk assessment. With persistent application of TGX to toxicology and risk assessment, it seems inevitable that researchers will learn how to successfully apply this technology to advance the field. The National Research Council (NRC) report stresses that the twenty‐first century vision for toxicity testing should remain consistent with the NRC risk assessment paradigm put forward in 1983.
Article
Full-text available
Gene expression data clustering groups genes with similar patterns into the same group, while genes exhibiting dissimilar patterns are placed into different groups. Traditional partitional gene expression data clustering partitions the entire set of genes into a finite set of clusters which might not reflect co-expression or coherent patterns across all genes belonging to a cluster. In this paper, we propose a graph-theoretic clustering algorithm called GAClust which groups co-expressed genes into the same cluster while also detecting noise genes. Clustering of genes is based on the presumption that co-expressed genes are more likely to share common biological functions. However, it has been observed that the clusters produced by traditional methods often do not reflect true biological groups or functions. To address this issue, we propose a semi-supervised algorithm, SGAClust, to produce more biologically relevant clusters. We consider both synthetic and cancer gene expression datasets to evaluate the performance of the proposed algorithms. It has been found that SGAClust outperforms the unsupervised algorithms. Additionally, we also identify potential gene biomarkers which will further help in cancer management.
Preprint
Full-text available
Deep learning is a widely popular topic in machine learning and is structured as a series of nonlinear layers that learn various levels of data representations. To implement various computer models, deep learning employs numerous layers to represent data abstractions. Deep learning approaches such as generative models, discriminative models and model transfer have transformed information processing. This article presents a comprehensive review of several deep learning algorithms: the multilayer perceptron (MLP), the self-organizing map (SOM) and deep belief networks (DBN). It first briefly introduces historical and recent state-of-the-art reviews with suitable architectures and implementation steps. Then the applications of those algorithms in fields such as speech recognition engineering, medical applications, natural language processing, material science and remote sensing are classified.
Chapter
In the early 1900s, multiple significant studies showed high incidences of cancer. During this period, studies with infectious agents produced only modest results which looked irrelevant to people. Then, in the 1980s, groundbreaking evidence that a number of viruses can cause cancer in people began to emerge. Machine learning and deep learning techniques have been widely employed in cancer detection and classification, including support vector machines (SVMs), artificial neural networks (ANNs), and convolutional neural networks (CNNs). The recurrence of cancer is also an important issue that needs to be predicted with significant accuracy. This chapter reviews the current state of the art of ANN models in the prediction of cancer recurrence.
Article
Full-text available
Many Error-Correcting Output Codes (ECOC) algorithms have been proposed based on hard coding (HC) schemes: binary coding {1, 0} or ternary coding {+1, -1, 0}. This paper introduces two novel strategies to recode the original code matrices with the mean values and the intervals of learners’ outputs, which are named Mean Value Recoding (MVR) and Interval Recoding (IR) strategies. Both strategies are designed to reduce the distance between the outputs of base learners and the target codewords, aiming to produce more accurate results compared with the HC schemes. To the best of our knowledge, it is the first time that the two concepts of soft recoding and learner dependence are injected into the ECOC framework. To verify the effectiveness of our strategies, four data-independent ECOC algorithms and two data-dependent ECOC algorithms are deployed in the experiments based on UCI data sets. The experiments are carried out using the original HC strategies and our soft recoding strategies, and the results verify that our strategies outperform the HC-based algorithms in most cases by producing balanced results among classes. In short, our strategies can improve the performance of different ECOC algorithms. Our Python code and the corresponding data sets are available for non-commercial or research use at: https://github.com/MLDMXM2017/softcoding-ECOC.
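For readers unfamiliar with the hard coding baseline mentioned above, the following minimal sketch shows standard ECOC decoding with a ternary code matrix and a Hamming-style distance. It does not implement the MVR or IR recoding strategies from the abstract; the code matrix, class labels and function name are hypothetical.

```python
# Hypothetical one-vs-one code matrix for three classes. Rows are class
# codewords, columns are binary learners; 0 means the learner was not
# trained on that class and is ignored when decoding.
CODE_MATRIX = {
    "A": [+1, +1, 0],
    "B": [-1, 0, +1],
    "C": [0, -1, -1],
}

def hamming_decode(outputs, code_matrix):
    """Return the class whose codeword disagrees least with the outputs."""
    def distance(codeword):
        # Count disagreements only at positions the codeword actually uses.
        return sum(1 for c, o in zip(codeword, outputs) if c != 0 and c != o)
    return min(code_matrix, key=lambda label: distance(code_matrix[label]))

# Hard (+1 / -1) outputs of the three binary learners for one sample.
prediction = hamming_decode([+1, +1, -1], CODE_MATRIX)
```

Here the outputs match the codeword of class "A" at every position it uses, so "A" is returned. Soft recoding schemes replace the fixed +1/-1 targets with values derived from the learners' outputs, which this sketch does not attempt.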