Fig. 1
A schematic overview of the COALESCE algorithm for regulatory module discovery. COALESCE predicts regulatory modules, each consisting of a gene expression bicluster (co-regulated genes and the conditions under which they are co-regulated) plus zero or more putative regulating motifs. Its primary input data are gene expression microarrays (to form biclusters) and flanking sequences (to predict motifs, although these can be omitted to output only expression biclusters). Additional supporting data types can be flexibly integrated using a Bayesian framework; for example, highly conserved sequence locations may be more likely to contain motifs, and sites occluded by nucleosomes may be less likely. COALESCE is efficient enough to integrate thousands of expression conditions and supporting data for large (>20 000 genes) metazoan genomes.


Source publication
Article
Much of a cell's regulatory response to changing environments occurs at the transcriptional level. Particularly in higher organisms, transcription factors (TFs), microRNAs and epigenetic modifications can combine to form a complex regulatory network. Part of this system can be modeled as a collection of regulatory modules: co-regulated genes, the c...

Contexts in source publication

Context 1
... 2. Evaluation of the functional consistency of S. cerevisiae expression biclusters predicted by COALESCE. Precision and recall are over gene pairs co-annotated in the Gene Ontology as described in (Myers et al., 2006). Unless noted, COALESCE was executed on ∼2200 yeast expression conditions using 2 kb of up- and downstream flanking sequence. See Supplementary Figure 1 for a plot with standard-scale axes and Supplementary Figure 2 for a comparable evaluation using human data. (a) A comparison of COALESCE with the PISA and SAMBA expression-only biclustering systems. This comprises 1870 modules integrating five runs of COALESCE, 428 modules from one run of COALESCE, ∼1000 modules integrating ∼20 runs of PISA, 492 modules from one run of SAMBA (lower recall results are not available from PISA or SAMBA), and k-means clusters with k ranging from 10 to 5000 for comparison. (b) Effects of supporting data types (evolutionary conservation and nucleosome placement) and of dataset correlation structure on COALESCE predictions. While neither supporting data nor prior knowledge of dataset correlation structure (e.g. sets of related conditions such as time courses) significantly influences overall performance, accounting for correlation structure greatly improves conciseness, achieving comparable functional accuracy using <1/3 as many modules.  ...
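For readers unfamiliar with this evaluation scheme, the gene-pair precision/recall used above can be illustrated with a minimal sketch: every pair of genes placed together in a predicted module counts as a positive call and is scored against a co-annotation standard such as the Gene Ontology. This is an illustrative simplification of the Myers et al. (2006) procedure, and the gene names and module contents below are hypothetical placeholders.

```python
from itertools import combinations

def pair_precision_recall(modules, coannotated_pairs):
    """Precision/recall over gene pairs, in the spirit of Myers et al. (2006).

    modules           -- iterable of gene sets predicted by a (bi)clustering method
    coannotated_pairs -- set of frozensets {g1, g2} co-annotated in a functional
                         standard (e.g. specific Gene Ontology terms)
    """
    predicted = set()
    for genes in modules:
        predicted.update(frozenset(p) for p in combinations(sorted(genes), 2))
    tp = predicted & coannotated_pairs
    precision = len(tp) / len(predicted) if predicted else 0.0
    recall = len(tp) / len(coannotated_pairs) if coannotated_pairs else 0.0
    return precision, recall

# Toy example with hypothetical yeast gene names.
modules = [{"YFL039C", "YDR382W", "YOR326W"}, {"YBR118W", "YPR080W"}]
standard = {frozenset({"YFL039C", "YDR382W"}),
            frozenset({"YBR118W", "YPR080W"}),
            frozenset({"YAL005C", "YLL024C"})}
print(pair_precision_recall(modules, standard))  # precision 0.5, recall ~0.67
```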
Context 2
... the genome sequence of an organism describes its complement of potential proteins, it is the controlled expression, translation and modification of these proteins that allows cells to survive and grow. At the level of transcription and mRNA stability, a complex regulatory network of transcription factors (TFs), RNA binding proteins and microRNAs governs the interactions between components of a cell's internal state and its external environment. Understanding the elements of this regulatory network and the stimuli to which it responds in higher organisms has been of increasing recent interest as a key to metazoan systems biology (Bonneau, 2008; Long et al., 2008), particularly as genetic misregulation is a major cause of human disease. One means of discovering regulatory modules is the analysis of gene expression data, since a consequence of transcriptional co-regulation is co-expression. While a wealth of assays has also been developed to explore the transcriptional regulatory network under specific experimental conditions, regulatory module prediction from microarray data is a widely studied problem that remains unsurpassed for inference of general regulatory networks, particularly when additional genomic data sources are also integrated (Bussemaker et al., 2007). Prediction of regulatory relationships has been particularly well studied in unicellular systems, where regulation often occurs based on well-defined TF binding sites and discrete activation or repression of transcription (Beer and Tavazoie, 2004; Roth et al., 1998). These assumptions have led to the current motif discovery paradigm, in which microarray data are clustered, each cluster's promoter sequences are tested for enriched motifs, and the resulting consensus sequences are matched against known TF binding sites. In many cases, however, and particularly in more complex organisms, these assumptions no longer hold, and predicting regulatory modules from expression data becomes an increasingly difficult problem. It combines the challenges of biclustering [i.e. grouping together co-expressed genes and the subset of conditions where they are co-expressed (Kloster et al., 2005; Tanay et al., 2004)] with the difficulty of de novo motif discovery from DNA sequences, where regulatory motifs can be short, degenerate and frequently present without being functional (Hannenhalli, 2008). Note that this is distinct from the related tasks of inferring regulatory networks with prior knowledge of potential regulators or regulatory motifs (e.g. Kundaje et al., 2008; Lemmens et al., 2009; Segal et al., 2003) or while omitting the process of motif discovery (e.g. Margolin et al., 2006), both of which have also been intensively studied. Most existing approaches to regulatory module discovery break the biclustering and motif discovery tasks into separate stages: first, expression data are clustered or biclustered, and afterwards, each cluster is analyzed for enrichment of sequence motifs (Elemento et al., 2007). To discover regulatory modules most effectively, though, it would be natural to perform both tasks at the same time, discovering clusters of genes that are both co-expressed and enriched for regulatory motifs.
Recent work (Halperin et al., 2009; Reiss et al., 2006) has indeed confirmed the intuition that regulatory module discovery by simultaneous analysis of expression and sequence data can be extremely effective, but this has neither been developed to incorporate heterogeneous data integration, nor has it been scaled for application to complex metazoan genomes. Here, we describe a Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction (COALESCE), which allows the discovery of regulatory motifs and modules from large collections of genomic data (Fig. 1). COALESCE takes advantage of Bayesian integration of multiple data types (primarily expression data) on a large scale (Huttenhower and Troyanskaya, 2008) to predict co-expressed gene modules, the conditions under which they are co-regulated, and the consensus binding motifs responsible for their regulation. The algorithm is practical for use with complex metazoan genomes (>25 000 genes), analyzes extremely large expression data collections (>15 000 conditions), can explicitly model dependencies between related gene expression conditions (e.g. points in a time course) and can integrate heterogeneous supporting data types in order to improve predictions (nucleosome positioning and evolutionary conservation are specifically demonstrated below). An implementation of COALESCE (including C++ source code) is provided as part of the Sleipnir software package at .princeton.edu/sleipnir, and a web interface is available at http://function.princeton.edu/coalesce. We have validated COALESCE's ability to discover functionally relevant biclusters and transcriptional motifs in synthetic data and in Saccharomyces cerevisiae, demonstrating improvements over previous methods in both expression biclustering and binding site prediction. We provide further evaluation of TF target predictions using the Yeastract (S. cerevisiae; Teixeira et al., 2006) and RegulonDB (E. coli; Gama-Castro et al., 2008) motif databases and results including regulatory modules for Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens ...
Context 3
... Figure 1; it is summarized in pseudocode in Supplementary Text 1 and described in more detail below. COALESCE receives as input a standard genes-by-conditions expression matrix, DNA sequences for each gene in the regions of interest (e.g. upstream and/or downstream of the coding region), and four parameters: a k-mer length k, maximum P-value cutoffs p_e and p_m for expression condition and motif significance, respectively, and a minimum probability cutoff p_g for inclusion of genes in modules. Each module is then computed beginning with an initial seed of the two genes maximally correlated across all expression conditions. Three steps are then iterated to modify the module until it converges: selection of significant expression conditions, selection of significant motifs and inclusion of probable genes. An expression condition is considered to be significant (and thus included in the module) if the distribution of expression values for genes currently in the module differs below threshold p_e from the genomic background (based on a standard Z-test). Similarly, a motif is significant if its frequency in gene sequences currently in the module differs below threshold p_m from background (based on a Z-test modified to use Cohen's d; Supplementary Text 1). Based on these selected features (significant conditions and motifs), each gene's probability P(g ∈ G | C, M) of inclusion in the developing module is calculated using Bayesian data integration of P(C | g ∈ G), observed from the expression data, P(M | g ∈ G), observed from the sequence data, and P(g ∈ G), a prior used to stabilize module convergence. Genes above probability p_g are included and those below are excluded. When the module converges to a final set of conditions C, motifs M and genes G, its mean expression values and motif frequencies are subtracted from the underlying data and the process is begun again with a new pair of seed genes. C++ source code for the algorithm is available at and a detailed description including pseudocode can be found in Supplementary Text 1. Integration of additional data types modifies the algorithm only minimally and is discussed below. All significance tests involving P-values are Bonferroni corrected for multiple hypotheses. For all experiments in this manuscript, P-value thresholds were fixed at 0.05, probability thresholds at 0.95 and k = 7. 2.1.1 Motifs and DNA sequences: COALESCE considers three types of motifs. The pseudocode above describes simple k-mers, each a string of k characters drawn from the alphabet {A, C, G, T}. Our implementation also considers reverse complement pairs (RCs) and probabilistic suffix trees (PSTs) in an equivalent manner. An RC is the equally weighted union of a k-mer and its reverse complement. A PST is the union of two or more arbitrary k-mers and RCs in a weighted (probabilistic) manner; such a structure can be constructed and matched against DNA sequence rapidly at runtime (Pavesi et al., 2004). Briefly, just as a Position Weight Matrix (PWM) or Position Specific Score Matrix (PSSM) contains a single column per base to be matched, a PST contains a single node in a tree for each base. A PST thus has some depth equivalent to the length of a PWM/PSSM, and the maximal length match against some sequence is thus the depth or the length of the sequence, whichever is shorter. Initially, all possible k-mers and RCs are considered, but no PSTs.
During each runtime iteration, a new PST is constructed for any pair of existing motifs m1 and m2 for which (i) Z-score(M_m1, M_m2) is small and (ii) the minimum edit distance between m1 and m2 is small (experiments here used gap penalty 1, mismatch penalty 2.1 and threshold 2.5). Each PST so constructed is treated identically to k-mers and RCs with respect to frequency calculations etc. in all future iterations, subsequent to calculation of gene-specific scores M_g,m. For a PST p with depth |p|, maximum length match |p[s, i]| for some sequence s beginning at offset i and probability p[s, i] of specifically matching position i, these are calculated ...
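As a rough illustration of the refinement loop described in this excerpt, the sketch below performs one simplified pass: Z-tests select informative conditions and motifs, and genes are then re-scored with a naive-Bayes-style posterior and retained above a probability cutoff. It is not the COALESCE implementation (that is the C++ code in Sleipnir); the motif test here omits the Cohen's d modification, the prior is a single constant, and all data shapes and names are hypothetical. Thresholds mirror the values quoted above (p_e = p_m = 0.05, p_g = 0.95).

```python
import numpy as np
from scipy.stats import norm

def one_coalesce_like_pass(expr, motif_freq, module_genes,
                           p_e=0.05, p_m=0.05, p_g=0.95, prior=0.5):
    """One simplified refinement pass in the spirit of the COALESCE loop.

    expr         -- (n_genes, n_conditions) expression matrix
    motif_freq   -- (n_genes, n_motifs) motif frequencies in flanking sequences
    module_genes -- boolean mask of genes currently in the module
    Returns (new gene mask, selected condition indices, selected motif indices).
    """
    n_genes, n_cond = expr.shape
    n_motifs = motif_freq.shape[1]

    def z_test_columns(data, mask, alpha, n_tests):
        """Bonferroni-corrected two-sided Z-test: module column means vs. background."""
        mu, sd = data.mean(axis=0), data.std(axis=0) + 1e-12
        z = (data[mask].mean(axis=0) - mu) / (sd / np.sqrt(mask.sum()))
        p = 2 * norm.sf(np.abs(z)) * n_tests
        return np.where(p < alpha)[0]

    cond_idx = z_test_columns(expr, module_genes, p_e, n_cond)
    motif_idx = z_test_columns(motif_freq, module_genes, p_m, n_motifs)

    # Naive-Bayes-style gene inclusion: combine per-feature likelihood ratios
    # for the selected conditions and motifs with a constant prior.
    log_odds = np.full(n_genes, np.log(prior / (1 - prior)))
    for data, idx in ((expr, cond_idx), (motif_freq, motif_idx)):
        for j in idx:
            in_mu = data[module_genes, j].mean()
            in_sd = data[module_genes, j].std() + 1e-12
            out_mu = data[~module_genes, j].mean()
            out_sd = data[~module_genes, j].std() + 1e-12
            log_odds += norm.logpdf(data[:, j], in_mu, in_sd) \
                      - norm.logpdf(data[:, j], out_mu, out_sd)
    posterior = 1.0 / (1.0 + np.exp(-log_odds))
    return posterior > p_g, cond_idx, motif_idx

# Toy usage with random placeholder data (for illustration only).
rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 50))
motifs = rng.poisson(2.0, size=(200, 64)).astype(float)
seed = np.zeros(200, dtype=bool)
seed[[0, 1]] = True
genes, conds, sel_motifs = one_coalesce_like_pass(expr, motifs, seed)
print(genes.sum(), len(conds), len(sel_motifs))
```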
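The motif-pair edit distance in step (ii) can be computed with a standard weighted dynamic program; a minimal sketch using the quoted gap penalty (1), mismatch penalty (2.1) and merge threshold (2.5) follows. This is an illustrative reimplementation, not the code used in the paper.

```python
def weighted_edit_distance(a, b, gap=1.0, mismatch=2.1):
    """Minimum-cost alignment of two motif strings (e.g. 7-mers or RC consensus strings)."""
    n, m = len(a), len(b)
    # dp[i][j] = cost of aligning a[:i] with b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # match / mismatch
                           dp[i - 1][j] + gap,       # gap in b
                           dp[i][j - 1] + gap)       # gap in a
    return dp[n][m]

# Two similar 7-mers fall below the 2.5 merge threshold; dissimilar ones do not.
print(weighted_edit_distance("ACGTACG", "ACGTTCG"))  # 2.0 -> candidates for a PST
print(weighted_edit_distance("ACGTACG", "TTTTTTT"))  # well above 2.5 -> not merged
```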

Citations

... Popular bi-clustering algorithms, such as Cheng and Church (CC) algorithm, FLOC (Yang et al., 2005), Plaid (Lazzeroni and Owen, 2000), OPSM (Ben-Dor et al., 2003), ISA (Bergmann et al., 2003), Spectral (Kluger et al., 2003), xMOTIFs (Murali and Kasif, 2003), and BiMax (Prelic et al., 2006) have drawn much attention in the literature. Newer algorithms, such as Bayesian Biclustering (Gu and Liu, 2007), COALESCE (Huttenhower et al., 2009), CPB (Bozdag et al., 2009), QUBIC (Li et al., 2009), and FABIA (Hochreiter et al., 2010) have not been extensively studied. Among them, the CC algorithm is the earliest and most studied one, and the newer algorithms are mostly based on the idea of the CC algorithm. ...
Article
Bi-clustering refers to the task of finding sub-matrices (indexed by a group of columns and a group of rows) within a matrix of data such that the elements of each sub-matrix (data and features) are related in a particular way, for instance, that they are similar with respect to some metric. In this paper, after analyzing the well-known Cheng and Church (CC) bi-clustering algorithm, which has been proved to be an effective tool for mining co-expressed genes, and summarizing its limitations (such as interference of random numbers in the greedy strategy and ignoring overlapping bi-clusters), we propose a novel enhancement of the adaptive bi-clustering algorithm, where a shielding complex sub-matrix is constructed to shield the bi-clusters that have been obtained and to discover the overlapping bi-clusters. In the shielding complex sub-matrix, the imaginary and the real parts are used to shield and extend the new bi-clusters, respectively, and to form a series of optimal bi-clusters. To assure that the obtained bi-clusters have no effect on the bi-clusters already produced, a unit impulse signal is introduced to adaptively detect and shield the constructed bi-clusters. Meanwhile, to effectively shield the null data (zero-size data), another unit impulse signal is set for adaptive detecting and shielding. In addition, we add a shielding factor to adjust the mean squared residue score of the rows (or columns), which contains the shielded data of the sub-matrix, to decide whether to retain them or not. We offer a thorough analysis of the developed scheme. The experimental results are in agreement with the theoretical analysis. The results obtained on a publicly available real microarray dataset show the enhancement of the bi-clusters performance thanks to the proposed method.
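The mean squared residue (MSR) referred to in this abstract is the Cheng and Church (2000) coherence score; a minimal sketch of how it is computed for a candidate bicluster is shown below (illustrative only, with a toy additive matrix).

```python
import numpy as np

def mean_squared_residue(data, rows, cols):
    """Cheng & Church (2000) mean squared residue of the sub-matrix data[rows][:, cols].

    Lower values indicate a more coherent bicluster (0 for a perfectly additive pattern).
    """
    sub = data[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    overall = sub.mean()
    residue = sub - row_means - col_means + overall
    return float((residue ** 2).mean())

# A perfectly additive bicluster has MSR 0; noise raises the score.
block = np.add.outer([0.0, 1.0, 2.0], [5.0, 6.0, 7.0, 8.0])
print(mean_squared_residue(block, [0, 1, 2], [0, 1, 2, 3]))  # 0.0
```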
... Although most of the biclustering methods work on expression or other omics data matrices (Khakabimamaghani and Ester, 2015), some of them can additionally incorporate orthogonal biological data to improve biclustering results. For example, COALESCE can optionally accept sequences and perform de novo motif search jointly with biclustering (Huttenhower et al., 2009). It searches for biclusters composed of genes whose regulatory regions are enriched by the same motifs. ...
... Various classifications of breast tumors that take into account various tumor characteristics and result in different numbers of breast cancer subtypes include the classification based on ER, PR and Her2 expression (Gradishar et al., 2017), intrinsic subtypes classification (Perou et al., 2000), WHO histopathological classification (International Agency for Research on Cancer, 2012), etc. We compared DESMOND with nine state-of-the-art biclustering methods (Bergmann et al., 2003; Cheng and Church, 2000; Hochreiter et al., 2010; Huttenhower et al., 2009; Lazzeroni and Owen, 2000; Li et al., 2009; Murali and Kasif, 2003; Rodriguez-Baena et al., 2011; Serin and Vingron, 2011) (Supplementary Table S1) chosen based on their good performance on synthetic datasets with differentially expressed biclusters (Padilha and Campello, 2017). Among those, only QUBIC was able to take into account network data. ...
Article
Motivation Identification of differentially expressed genes is necessary for unraveling disease pathogenesis. This task is complicated by the fact that many diseases are heterogeneous at the molecular level and samples representing distinct disease subtypes may demonstrate different patterns of dysregulation. Biclustering methods are capable of identifying genes that follow a similar expression pattern only in a subset of samples and hence can consider disease heterogeneity. However, identifying biologically significant and reproducible sets of genes and samples remains challenging for the existing tools. Many recent studies have shown that the integration of gene expression and protein interaction data improves the robustness of prediction and classification and advances biomarker discovery. Results Here we present DESMOND, a new method for identification of Differentially ExpreSsed gene MOdules iN Diseases. DESMOND performs network-constrained biclustering on gene expression data and identifies gene modules - connected sets of genes up- or down-regulated in subsets of samples. We applied DESMOND on expression profiles of samples from two large breast cancer cohorts and have shown that the capability of DESMOND to incorporate protein interactions allows identifying the biologically meaningful gene and sample subsets and improves the reproducibility of the results. Availability https://github.com/ozolotareva/DESMOND Supplementary information Supplementary data are available at Bioinformatics online.
... Another example in statistics-based mixed fusion is eCAVIAR, which identifies co-location signals of Single Nucleotide Polymorphisms through integration of genome-wide association studies and expression quantitative trait loci [38]. Other statistical-based mixed fusion methods include PSDF [39] and COALESCE [40] for clustering and iBAG [41] for outcome prediction, all of which use Bayesian model-based integration and parameter estimation. ...
Article
Recent years have witnessed the tendency of measuring a biological sample on multiple omics scales for a comprehensive understanding of how biological activities on varying levels are perturbed by genetic variants, environments, and their interactions. This new trend raises substantial challenges to data integration and fusion, of which the latter is a specific type of integration that applies a uniform method in a scalable manner, to solve biological problems which the multi-omics measurements target. Fusion-based analysis has advanced rapidly in the past decade, thanks to application drivers and theoretical breakthroughs in mathematics, statistics, and computer science. We will briefly address these methods from methodological and mathematical perspectives and categorize them into three types of approaches: data fusion (a narrowed definition as compared to the general data fusion concept), model fusion, and mixed fusion. We will demonstrate at least one typical example in each specific category to exemplify the characteristics, principles, and applications of the methods in general, as well as discuss the gaps and potential issues for future studies.
... To extract such knowledge from large quantities of experimental data, new methods in analytics and bioinformatics are being developed to search for correlations between the evolutionary histories, structures and functions of protein sequences (3)(4)(5)(6). Computational algorithms have also been developed for integrating various genomic and proteomic data sources to better understand rigorously regulated cellular processes (7,8). With the help of more advanced computational techniques and the general availability of high-bandwidth networking, the sharing of data in genomics and proteomics will likely play a significant role in current explorations of the big picture of life. ...
Article
The rate at which new protein and gene sequences are being discovered has grown explosively in the omics era, which has increasingly complicated the efficient characterization and analysis of their biological properties. In this study, we propose a web-based graphical database tool, SeQuery, for intuitively visualizing proteome/genome networks by integrating the sequential, structural and functional information of sequences. As a demonstration of our tool's effectiveness, we constructed a graph database of G protein-coupled receptor (GPCR) sequences by integrating data from the UniProt, GPCRdb and RCSB PDB databases. Our tool attempts to achieve two goals: (i) given the sequence of a query protein, correctly and efficiently identify whether the protein is a GPCR, and, if so, define its sequential and functional roles in the GPCR superfamily; and (ii) present a panoramic view of the GPCR superfamily and its network centralities that allows users to explore the superfamily at various resolutions. Such a bottom-up-to-top-down view can provide the users with a comprehensive understanding of the GPCR superfamily through interactive navigation of the graph database. A test of SeQuery with the GPCR2841 dataset shows that it correctly identifies 99 out of 100 queried protein sequences. The developed tool is readily applicable to other biological networks, and we aim to expand SeQuery by including additional biological databases in the near future. Database URL: http://cluster.
... As various modules are investigated, additional supporting data are often involved. For example, promoter sequences and de novo motif detection can be integrated with co-expression biclustering to identify regulatory modules [96]. Similar strategies have been implemented with the integration of other supporting data types (e.g. ...
Article
Biclustering is a powerful data mining technique that allows clustering of rows and columns, simultaneously, in a matrix-format data set. It was first applied to gene expression data in 2000, aiming to identify co-expressed genes under a subset of all the conditions/samples. During the past 17 years, tens of biclustering algorithms and tools have been developed to enhance the ability to make sense out of large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including but not limited to, genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, there is still a considerable gap between biclustering methodology development and comprehensive data interpretation, mainly because of the lack of knowledge for the selection of appropriate biclustering tools and further supporting computational techniques in specific studies. Here, we first deliver a brief introduction to the existing biclustering algorithms and tools in public domain, and then systematically summarize the basic applications of biclustering for biological data and more advanced applications of biclustering for biomedical data. This review will assist researchers to effectively analyze their big data and generate valuable biological knowledge and novel insights with higher efficiency.
... The list of biclustering algorithms includes popular ones in the literature, such as Cheng and Church (CCA) [1], Plaid [18], Spectral [19], ISA [20], Bimax [4], xMOTIFs [21], SAMBA [22], OPSM [23] and MSSRCC [24]. Furthermore, newer algorithms such as Bayesian Biclustering (BBC) [25], COALESCE [26], FABIA [27], CPB [28], QUBIC [29], LAS [30], BiBit [31] and DeBi [32] have recently proved their effectiveness in handling biclustering issues. ...
Article
Biclustering algorithms have matured from their initial applications in bioinformatics, evolving towards different approaches and bicluster definitions, which sometimes makes it hard for the analyst to determine which one of the available algorithms best fits her problem. As a way of benchmarking these algorithms, several quality measures have been proposed in the literature. Such measures cover numerical aspects related to the accuracy, the recovery power or the capability of retrieving previous biomedical knowledge. However, biclustering apparently remains an uncommon option for biomedical analysis. Here we review the impact of biclustering algorithms in biomedicine and bioinformatics with the aim of measuring and understanding non-numerical aspects of biclustering algorithms, focusing on citation-based statistics that can be relevant for their application in the domain. To achieve this, we performed analyses of the citation impact of several clustering and biclustering algorithms, and propose a methodology that can cover this aspect of biclustering usage.
... Heuristic algorithms following this model are MSR (Cheng and Church 2000), Plaid (Lazzeroni and Owen 2000), OPSM (Ben-Dor et al. 2002), ISA (Bergmann et al. 2003), Spectral (Kluger et al. 2003), xMO-TIFs (Murali and Kasif 2003) and BiMax (Prelić et al. 2006). Newer algorithms, such as Bayesian Biclustering, COALESCE (Huttenhower et al. 2009), CPB (Bozdag et al. 2009), QUBIC (Li et al. 2009), and FABIA (Hochreiter et al. 2010) also adopt this biclustering model. ...
Article
The NP-hard Bicluster Editing Problem (BEP) consists of editing a minimum number of edges of an input bipartite graph G in order to transform it into a vertex-disjoint union of complete bipartite subgraphs. Editing an edge consists of either adding it to the graph or deleting it from the graph. Applications of the BEP include data mining and analysis of gene expression data. In this work, we generate and analyze random bipartite instances for the BEP to perform empirical tests. A new reduction rule for the problem is proposed, based on the concept of critical independent sets, providing an effective reduction in the size of the instances. We also propose a set of heuristics using concepts of the metaheuristics ILS, VNS, and GRASP, including a constructive heuristic based on analyzing vertex neighborhoods, three local search procedures, and an auxiliary data structure to speed up the local search. Computational experiments show that our heuristics outperform other methods from the literature with respect to both solution quality and computational time.
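As a concrete reading of the BEP objective described in this abstract, the sketch below counts the edits implied by a fixed assignment of vertices to clusters: edges between clusters must be deleted and missing within-cluster edges must be added. Solving the BEP means minimizing this cost over all assignments, which is the NP-hard part; the graph and cluster labels here are hypothetical.

```python
def bicluster_editing_cost(edges, cluster_of_u, cluster_of_v):
    """Edits needed to turn the bipartite graph into disjoint complete bipartite
    subgraphs under a given cluster assignment: deletions of between-cluster
    edges plus additions of missing within-cluster edges."""
    edge_set = {(u, v) for u, v in edges}
    deletions = sum(1 for u, v in edge_set if cluster_of_u[u] != cluster_of_v[v])
    additions = sum(1 for u in cluster_of_u for v in cluster_of_v
                    if cluster_of_u[u] == cluster_of_v[v] and (u, v) not in edge_set)
    return deletions + additions

# Toy bipartite graph (genes vs. conditions) with a candidate two-cluster assignment.
edges = [("g1", "c1"), ("g1", "c2"), ("g2", "c1"), ("g3", "c3")]
cu = {"g1": 0, "g2": 0, "g3": 1}
cv = {"c1": 0, "c2": 0, "c3": 1}
print(bicluster_editing_cost(edges, cu, cv))  # 1: the missing edge (g2, c2) must be added
```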
... For example, yeast functional genomics with ML predictive models has been applied to the classification of unrecognized Open Reading Frame (ORF) functions (8). Bayesian data integration has been used for clustering of evolutionary conservation in yeast (9). In particular, systematic and comprehensive data from studies of genome-wide transcription of yeast as a model organism under various nutritional and growth-limitation stresses are available (12). ...
Article
It has been generally recognized that Big Data analytics presently has the most significant impact on computational inference in the life sciences, such as genome-wide association studies (GWAS) in basic research and personalized medicine, and its importance will further increase in the near future. In this work, responsive yeast genes are separated non-parametrically from experimental data obtained in chemostat cultivation under dilution-rate and nutrient limitations with basic biogenic elements (C, N, S, P), as well as the specific leucine and uracil auxotrophic limitations. Elastic net models are applied for the detection of the key responsive genes for each of the specific limitations. Bootstrap and perturbation methods are used to determine the most important responsive genes and corresponding quantiles applied to the complete data set for all of the nutritional and growth rate limitations. The model identifies the response of gene YOR348C, involved in proline metabolism, as the key signature of stress. Based on literature data, the obtained results are confirmed by the biochemistry of plants under physical and chemical stress and by the functional genomics of baker's yeast, and its important function in human tumorigenesis is also noted.
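As a loose illustration of the elastic-net gene-selection step described in this abstract, the sketch below fits a cross-validated elastic net to a placeholder samples-by-genes matrix and ranks genes by absolute coefficient. The data, shapes and response variable are hypothetical stand-ins, not the chemostat measurements used in the study.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# X: samples x genes expression matrix; y: a limitation/growth-rate response variable.
# Random placeholders stand in for real chemostat data.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 500))
y = 0.8 * X[:, 3] - 0.5 * X[:, 42] + rng.normal(scale=0.1, size=24)

# Cross-validated elastic net; nonzero coefficients flag candidate responsive genes.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
top_genes = np.argsort(np.abs(model.coef_))[::-1][:10]
print(top_genes)
```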