Figure 4 - uploaded by Bisakha Ray
Multi-modal uniform (MMU) predictive analytics approaches: (a) MMU without feature selection, (b) MMU with feature selection performed on all modalities at once, (c) MMU with feature selection performed independently on each modality.
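The three variants named in the caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the source publication's implementation: the function names, the toy data, and the scoring rule (absolute difference between class means, used here as a stand-in filter) are all assumptions made for the example.

```python
from statistics import mean


def score_features(X, y):
    """Score each column by |mean(class 1) - mean(class 0)| (a stand-in filter, assumed)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring features, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def select_columns(X, cols):
    """Keep only the listed columns of a row-major matrix."""
    return [[row[j] for j in cols] for row in X]


def concatenate(modalities):
    """Join per-modality rows sample by sample into one uniform feature matrix."""
    return [sum(rows, []) for rows in zip(*modalities)]


def mmu_no_selection(modalities):
    """(a) Pool all modalities and keep every feature."""
    return concatenate(modalities)


def mmu_joint_selection(modalities, y, k):
    """(b) Pool all modalities first, then select k features from the pooled matrix."""
    X = concatenate(modalities)
    return select_columns(X, top_k(score_features(X, y), k))


def mmu_per_modality_selection(modalities, y, k_each):
    """(c) Select k_each features within each modality, then pool the survivors."""
    reduced = [
        select_columns(M, top_k(score_features(M, y), k_each)) for M in modalities
    ]
    return concatenate(reduced)


# Hypothetical toy data: 4 subjects, 2 modalities with 2 features each.
expr = [[5.0, 4.0], [6.0, 5.0], [1.0, 1.0], [0.5, 0.0]]    # e.g. gene expression
meth = [[0.2, 0.3], [0.1, 0.4], [0.3, 0.35], [0.2, 0.9]]   # e.g. DNA methylation
y = [1, 1, 0, 0]

pooled = mmu_no_selection([expr, meth])
joint = mmu_joint_selection([expr, meth], y, k=2)
per_modality = mmu_per_modality_selection([expr, meth], y, k_each=1)
```

On this toy data, joint selection keeps both expression features because they dominate the pooled ranking, whereas per-modality selection keeps one feature from each modality. Whichever variant is used, the resulting matrix would then be passed to a single downstream predictive model.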


Source publication
Article
Full-text available
The spectrum of modern molecular high-throughput assaying includes diverse technologies such as microarray gene expression, miRNA expression, proteomics, and DNA methylation, among many others. Now that these technologies have matured and become increasingly accessible, the next frontier is to collect "multi-modal" data for the same set of subjects and...

Context in source publication

Context 1
... these approaches can capture information from and interactions among features in multiple data modalities. Figure 4 provides a pictorial description of MMU approaches. ...

Similar publications

Article
Full-text available
We consider the problem of predicting a categorical variable based on groups of inputs. Some methods have already been proposed to elaborate classification rules based on groups of variables (e.g. group lasso for logistic regression). However, to our knowledge, no tree-based approach has been proposed to tackle this issue. Here, we propose the Tree...

Citations

... Stacking strategies differ in the ways base models are trained and combined. Optimal stacking strategies have been explored in several domains [27,28]. ...
Chapter
Full-text available
This chapter first reviews areas where AI/ML and other automated decision making perform well on hard problems in the health sciences. It also summarizes the main results from the literature comparing the empirical performance of AI/ML vs. humans. The chapter then addresses the foundations of human heuristic decision making (and important related biases), and contrasts those with AI/ML biases. Finally, the chapter touches upon how hybrid human/machine intelligence can outperform either approach.
... Unlike natural image analysis [13], biomedical image analysis is complemented by additional data modalities, such as multiplexed imaging, single cell and bulk sequencing, and clinical information [14,15]. These data may aid in interpreting the deep feature representations of the H&E slide. ...
Article
Full-text available
Convolutional neural networks (CNNs) are revolutionizing digital pathology by enabling machine learning-based classification of a variety of phenotypes from hematoxylin and eosin (H&E) whole slide images (WSIs), but the interpretation of CNNs remains difficult. Most studies have considered interpretability in a post hoc fashion, e.g. by presenting example regions with strongly predicted class labels. However, such an approach does not explain the biological features that contribute to correct predictions. To address this problem, here we investigate the interpretability of H&E-derived CNN features (the feature weights in the final layer of a transfer-learning-based architecture). While many studies have incorporated CNN features into predictive models, there has been little empirical study of their properties. We show such features can be construed as abstract morphological genes (“mones”) with strong independent associations to biological phenotypes. Many mones are specific to individual cancer types, while others are found in multiple cancers especially from related tissue types. We also observe that mone-mone correlations are strong and robustly preserved across related cancers. Importantly, linear mone-based classifiers can very accurately separate 38 distinct classes (19 tumor types and their adjacent normals, AUC = 97.1%±2.8% for each class prediction), and linear classifiers are also highly effective for universal tumor detection (AUC = 99.2%±0.12%). This linearity provides evidence that individual mones or correlated mone clusters may be associated with interpretable histopathological features or other patient characteristics. In particular, the statistical similarity of mones to gene expression values allows integrative mone analysis via expression-based bioinformatics approaches. We observe strong correlations between individual mones and individual gene expression values, notably mones associated with collagen gene expression in ovarian cancer. 
Mone-expression comparisons also indicate that immunoglobulin expression can be identified using mones in colon adenocarcinoma and that immune activity can be identified across multiple cancer types, and we verify these findings by expert histopathological review. Our work demonstrates that mones provide a morphological H&E decomposition that can be effectively associated with diverse phenotypes, analogous to the interpretability of transcription via gene expression values. Our work also demonstrates mones can be interpreted without using a classifier as a proxy.
... The genetic data generated from individuals collected from multiple sources (i.e. populations) are typically heterogeneous and multimodal (Das et al., 2018; Maji, 2019; Ray et al., 2014; Zhang et al., 2019). Preserving nonlinear and multimodal information could provide complementary insights into the genetic signature of evolution, such as population ancestry, the signatures of natural selection and the architecture of complex traits [i.e. ...
Article
Full-text available
Genetic variations in a species across geographic areas typically exhibit spatial clines. There is increasing interest in inferring population genetic structure to understand the patterns of genetic variation and the evolution of a species. Here, we present the da package and propose to infer population structure using discriminant analysis (DA). We incorporate five supervised learning approaches (DAPC, LDAKPC, LFDA, LFDAKPC and KLFDA) into the da package; all belong to the same DA family but have different linear and nonlinear properties. We tested the performance and properties of these five approaches for population structure inference using both simulated and empirical data. Results showed that these five approaches preserved the same global genetic structure under each genetic scenario. Notably, genetic features produced from KLFDA and LFDA had higher correlations with under isolation‐by‐distance model and higher discriminatory power in population structure identification, with KLFDA achieving the best performance. The applications to empirical data indicated that all these methods could intuitively capture the continuous genetic gradients, while LFDA and KLFDA could discriminate nuanced population structures that the other approaches could not. These DA methods can be applied to other statistical inferences in genetics and beyond. The da package is available at https://cran.r‐project.org/web/packages/DA/index.html. We recommend that users choose among these approaches according to their scientific questions and target data.
... It is interesting to note that multimodal artificial neural networks achieve higher accuracy than single-view neural networks and other methods overall, although transcriptomics-based support vector regression also achieves good performance scores. Indeed, multiomic data integration does not always guarantee improved predictions, especially when benchmarking over gene expression (56). While any difference in accuracy generally depends on the task, our findings demonstrate that the knowledge embedded in genome-scale metabolic models is complementary to gene expression and may support its exploitation by data-driven models in a variety of scenarios. ...
Article
Full-text available
Significance: Linking genotype and phenotype is a fundamental problem in biology, key to several biomedical and biotechnological applications. Cell growth is a central phenotypic trait, resulting from interactions between environment, gene regulation, and metabolism, yet its functional bases are still not completely understood. We propose and test a machine-learning approach that integrates large-scale gene expression profiles and mechanistic metabolic models, for characterizing cell growth and understanding its driving mechanisms in Saccharomyces cerevisiae. At its core, a custom-built multimodal learning method merges experimentally generated and model-generated data. We show that our approach can leverage the advantages of both machine learning and metabolic modeling, revealing unknown interactions between biological domains, incorporating mechanistic knowledge, and therefore overcoming black-box limitations of conventional data-driven approaches.
... The problem of learning predictive models from multiomics data can be naturally considered a multimodal learning problem [13], [14]. Commonly, data from multiple modalities contain more complete and complementary information about the object than is provided by a single modality alone. ...
Article
Full-text available
Rapid advances in high-throughput sequencing technology have led to the generation of a large number of multi-omics biological datasets. Integrating data from different omics provides an unprecedented opportunity to gain insight into disease mechanisms from different perspectives. However, integrative analysis and predictive modeling from multi-omics data face three major challenges: i) heavy noise; ii) high dimensionality relative to small sample sizes; iii) data heterogeneity. Current multi-omics data integration approaches have some limitations and are susceptible to heavy noise. In this paper, we present MSPL, a robust supervised multi-omics data integration method that simultaneously identifies significant multi-omics signatures during the integration process and predicts the cancer subtypes. The proposed method not only inherits the generalization performance of self-paced learning but also leverages the properties of multi-omics data containing correlated information to interactively recommend high-confidence samples for model training. We demonstrate the capabilities of MSPL using simulated data and five multi-omics biological datasets, integrating up to three omics to identify potential biological signatures, and evaluating the performance compared to state-of-the-art methods in binary and multi-class classification problems. Our proposed model makes multi-omics data integration more systematic and expands its range of applications.
... Sometimes, more than one set of measurements is collected for the same set of samples; for example, high-throughput genomic studies involving data from multiple domains are often encountered. For the same biological sample, microarray gene expression, miRNA expression, proteomics, and DNA methylation data might be gathered [52]. Integrating multiple datasets allows you to both obtain a more accurate representation of higher-order interactions and evaluate the associated variability. ...
... It has three views: gene expression, miRNA expression and DNA methylation. According to a study performed on this dataset [45], patients are grouped into two categories: the first class is tumor stage I and the second class is tumor stages II, III and IV. ...
Article
Full-text available
Recent high-throughput omics technology has been used to assemble large biomedical omics datasets. Clustering of single omics data has proven invaluable in biomedical research. For the task of patient sub-classification, all the available omics data should be used jointly rather than treated individually. Clustering of multi-omics datasets has the potential to reveal deep insights. Here, we propose a late integration based multiobjective multi-view clustering algorithm which uses a special perturbation operator. Initially, a large number of diverse clustering solutions (called base partitionings) are generated for each omic dataset using four clustering algorithms, viz., k-means, complete linkage, spectral and fast search clustering. These base partitionings of multi-omic datasets are suitably combined using a special perturbation operator. The perturbation operator uses an ensemble technique to generate new solutions from the base partitionings. The optimal combination of multiple partitioning solutions across different views is determined after optimizing the objective functions, namely conn-XB, for checking the quality of partitionings for different views, and agreement index, for checking agreement between the views. The search capability of a multiobjective simulated annealing approach, namely AMOSA, is used for this purpose. Lastly, the non-dominated solutions of the different views are combined based on similarity to generate a single set of non-dominated solutions. The proposed algorithm is evaluated on 13 multi-view cancer datasets. An elaborate comparative study with several baseline methods and five state-of-the-art models is performed to show the effectiveness of the algorithm.
... Decisions made at the creation of data are of utmost importance because they can influence all subsequent outcomes (Perrier et al., 2017; Willoughby et al., 2014). For example, for data to be reused, early planning is required to consider correct data formats, data integrity and perhaps tools to build and manage data (Gardner et al., 2003; Ray et al., 2014). Additionally, for institutions to support and implement data management policies, and to manage the risk of file format obsolescence or degradation of information storage, they need to better understand where data are sourced, their format and how they are stored. ...
Article
Background: Building or acquiring research data management (RDM) capacity is a major challenge for health and medical researchers and academic institutes alike. Considering that RDM practices influence the integrity and longevity of data, targeting RDM services and support in recognition of needs is especially valuable in health and medical research. Objective: This project sought to examine the current RDM practices of health and medical researchers from an academic institution in Australia. Method: A cross-sectional survey was used to collect information from a convenience sample of 81 members of a research institute (68 academic staff and 13 postgraduate students). A survey was constructed to assess selected data management tasks associated with the earlier stages of the research data life cycle. Results: Our study indicates that RDM tasks associated with creating, processing and analysis of data vary greatly among researchers and are likely influenced by their level of research experience and RDM practices within their immediate teams. Conclusion: Evaluating the data management practices of health and medical researchers, contextualised by tasks associated with the research data life cycle, is an effective way of shaping RDM services and support in this group. Implications: This study recognises that institutional strategies targeted at tasks associated with the creation, processing and analysis of data will strengthen researcher capacity, instil good research practice and, over time, improve health informatics and research data quality.
... Initial global approaches to defining cancer subtypes relied on transcriptome measurements (van 't Veer et al., 2002). Although the transcriptome provides a holistic readout of cell state that is highly predictive of biological activity (Ray et al., 2014), the lability of the RNA analyte limits its application in the clinic. This limitation has led to efforts to use more stable analytes to identify cancer subtypes, most notably somatic DNA changes. ...
... Despite this, EPICC clustered pancreatic cancers together on the basis of lower-frequency shared mutations, demonstrating the robustness of the method. Such robustness is an attractive feature of transcriptomic clustering approaches, with DNA-based methods considered more fragile (Ray et al., 2014). EPICC's ability to leverage protein interaction network information to extract robust clusters from DNA mutations is a distinct advantage, particularly in clinical contexts where transcriptomic measurements are challenging to acquire. ...
Preprint
Full-text available
The grouping of cancers across tissue boundaries is central to precision oncology, but remains a difficult problem. Here we present EPICC (Experimental Protein Interaction Clustering of Cancer), a novel technique to cluster cancer patients based on DNA mutation profile, that leverages knowledge of protein-protein interactions to reduce noise and amplify biological signal. We applied EPICC to data from The Cancer Genome Atlas (TCGA), and both recapitulated known cancer clusterings, and identified new cross-tissue cancer groups that may indicate novel cancer molecular subtypes. Investigation of EPICC clusters revealed new protein modules which were recurrently mutated across cancers, and indicate new avenues for research into cancer biology. EPICC leveraged the Vodafone DreamLab citizen science platform, and we provide our results as a resource for researchers to investigate the role of protein modules in cancer.
... Such an approach is useful when working, for example, with genetic and epigenetic data where the variable count can be in the millions [47]. Feature Selection and Feature Extraction. One common use of unsupervised learning methods for data reduction is to reduce the dimensionality of a set of variables (or features) by removing redundant or irrelevant variables [48] or by combining variables into composite values. ...
Article
Full-text available
Diverse environmental and biological systems interact to influence individual differences in response to environmental stress. Understanding the nature of these complex relationships can enhance the development of methods to (1) identify risk, (2) classify individuals as healthy or ill, (3) understand mechanisms of change, and (4) develop effective treatments. The Research Domain Criteria initiative provides a theoretical framework to understand health and illness as the product of multiple interrelated systems but does not provide a framework to characterize or statistically evaluate such complex relationships. Characterizing and statistically evaluating models that integrate multiple levels (e.g. synapses, genes, and environmental factors) as they relate to outcomes that are free from prior diagnostic benchmarks represents a challenge requiring new computational tools that are capable of capturing complex relationships and identifying clinically relevant populations. In the current review, we summarize machine learning methods that can achieve these goals.