Article

An Algorithm for Clustering cDNA Fingerprints

Author affiliations: Tel Aviv University and Bar Ilan University

Abstract

Clustering large data sets is a central challenge in gene expression analysis. The hybridization of synthetic oligonucleotides to arrayed cDNAs yields a fingerprint for each cDNA clone. Cluster analysis of these fingerprints can identify clones corresponding to the same gene. We have developed a novel algorithm for cluster analysis that is based on graph-theoretic techniques. Unlike other methods, it does not assume that the clusters are hierarchically structured and does not require prior knowledge of the number of clusters. In tests with simulated libraries the algorithm outperformed the Greedy method and demonstrated high speed and robustness to high error rates. Good solution quality was also obtained in a blind test on real cDNA fingerprints.
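The paper itself is not available here, but the pipeline the abstract describes starts from a clone-by-probe fingerprint matrix and a similarity graph over the clones. A minimal Python sketch under illustrative assumptions (binary fingerprints and a plain Hamming-distance threshold; the paper's actual similarity criterion may differ):

```python
import itertools

import networkx as nx
import numpy as np

def fingerprint_graph(fingerprints, max_hamming=3):
    """Build a clone similarity graph: one vertex per cDNA clone, an edge
    whenever two binary fingerprints differ in at most `max_hamming`
    probe positions (an illustrative similarity criterion)."""
    G = nx.Graph()
    G.add_nodes_from(range(len(fingerprints)))
    for i, j in itertools.combinations(range(len(fingerprints)), 2):
        if np.sum(fingerprints[i] != fingerprints[j]) <= max_hamming:
            G.add_edge(i, j)
    return G

# Toy library: 6 clones over 12 probes, two planted genes with 10% noise.
rng = np.random.default_rng(0)
gene_a, gene_b = rng.integers(0, 2, 12), rng.integers(0, 2, 12)
clones = [gene_a ^ (rng.random(12) < 0.1) for _ in range(3)] + \
         [gene_b ^ (rng.random(12) < 0.1) for _ in range(3)]
G = fingerprint_graph(clones)
print(sorted(map(sorted, nx.connected_components(G))))
```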

... Finding structure in networks, known as community detection, or clustering, is an important problem with a wide range of biomedical applications such as systems biology [1], population structure studies [2] and health information systems [3]. In the past decade, computational geneticists have found a new application for clustering algorithms in the context of Identity-By-Descent (IBD) mapping [4]. ...
... We next used the PAGE study dataset to compare the algorithms on real data. First, we ran iLASH over the chromosome 1 genotype data to estimate IBD [1]. While false-negative and false-positive edges occur in local IBD graphs due to a variety of phenomena (minimum length of IBD, genotyping errors, phasing errors), our previous analysis suggests iLASH introduces negligible rates of false positives and false negatives [8], preventing high false-positive/false-negative rates in local IBD graphs. ...
... We then calculated the feature-based metric scores of the results. For each local IBD graph, we also generated and ... ([1] We chose chromosome 1 since it was the largest chromosome without any regions of low complexity in the PAGE dataset.) ...
Preprint
Full-text available
Background Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks via a process called IBD mapping. Clustering algorithms play an important role in finding these groups. We set out to analyze the fitness of commonly used, fast and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and utilized it to compare clustering algorithms in terms of statistical power. We also investigated the effectiveness of common clustering metrics as replacements for statistical power. Results We simulated 3.4 million clusters across 850 experiments with varying cluster counts and false-positive and false-negative rates. The Infomap and Markov Clustering (MCL) community detection methods have high statistical power in most of the graphs, compared to greedy methods such as Louvain and Leiden. We demonstrate that standard clustering metrics, such as modularity, cannot predict the statistical power of algorithms in IBD mapping applications, though they can help with simulating realistic benchmarks. We extend our findings to real datasets by analyzing 3 populations in the Population Architecture using Genomics and Epidemiology (PAGE) Study, with 51,000 members and 2 million shared segments on Chromosome 1, resulting in the extraction of 39 million local IBD clusters across the three populations. We used cluster properties derived in PAGE to increase the accuracy of our simulations and comparison. Conclusions Markov Clustering produces a 30% increase in statistical power compared to the current state-of-the-art approach, while reducing runtime by 3 orders of magnitude, making it computationally tractable in modern large-scale genetic datasets. We provide an efficient implementation to enable clustering at scale for IBD mapping and population-based linkage for various populations and scenarios.
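Markov Clustering, the strongest performer reported here, is compact enough to sketch. Below is a minimal dense-matrix version with commonly used default expansion and inflation parameters; these defaults are my assumption, not necessarily the settings used in the study:

```python
import numpy as np

def mcl(adj, expansion=2, inflation=2.0, iters=50, tol=1e-6):
    """Minimal dense Markov Clustering sketch. Columns are normalized to
    transition probabilities; expansion (matrix power) spreads flow and
    inflation (elementwise power) sharpens it until the flow stabilizes."""
    M = adj.astype(float) + np.eye(len(adj))   # self-loops stabilize MCL
    M /= M.sum(axis=0, keepdims=True)          # column-stochastic
    for _ in range(iters):
        last = M.copy()
        M = np.linalg.matrix_power(M, expansion)
        M = M ** inflation
        M /= M.sum(axis=0, keepdims=True)
        if np.abs(M - last).max() < tol:
            break
    # Nonzero rows of the converged matrix act as attractors; each such
    # row's support is one cluster.
    clusters = {tuple(np.flatnonzero(row > 1e-9)) for row in M if row.max() > 1e-9}
    return [set(map(int, c)) for c in clusters]

# Two triangles joined by a single edge should split into two clusters.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
print(mcl(A))
```

Inflation is the granularity knob: larger values fragment the graph into smaller, tighter clusters.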
... The goal is to minimize δ(G, H) over all possible cluster graphs. This method has been successfully used to cluster cDNA fingerprints [1], to find complexes in protein-protein interaction (PPI) data [2], to group protein sequences hierarchically into superfamily and family clusters [3], and to find families of regulatory RNA structures [4]. The problem also has motivations in computational biology [5], as it arises in the analysis of similarity between gene expression data [6]. ...
... A cluster deletion problem instance is given by a simple graph G and asks for a cluster graph. The distance δ(G, H) between two graphs G and H defined on the same vertices is given by (1). ...
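Equation (1) is not reproduced in this excerpt, but for two graphs on the same vertex set, δ(G, H) is standardly the number of edge edits separating them, i.e. the size of the symmetric difference of their edge sets. A sketch under that assumption:

```python
import networkx as nx

def delta(G, H):
    """delta(G, H) for graphs on the same vertex set: the number of edge
    additions plus deletions needed to turn G into H, i.e. the size of
    the symmetric difference of their edge sets."""
    eg = {frozenset(e) for e in G.edges()}
    eh = {frozenset(e) for e in H.edges()}
    return len(eg ^ eh)

# Path a-b-c versus the triangle on {a, b, c}: one edge edit separates them.
G = nx.path_graph(["a", "b", "c"])
H = nx.relabel_nodes(nx.complete_graph(3), {0: "a", 1: "b", 2: "c"})
print(delta(G, H))  # 1
```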
Article
Full-text available
We consider the following vertex-partition problem on graphs: given a simple graph G = (V, E), we want to partition G into a disjoint union of cliques using only edge deletions, removing as few edges as possible. This NP-hard optimization problem is referred to as Cluster Deletion (CD). In this paper, we propose an encoding of CD as a Weighted Constraint Satisfaction Problem (WCSP), a framework which has been widely used in solving hard combinatorial problems. We compare our approach with a fixed-parameter algorithm, one of the most used algorithms for solving Cluster Deletion, and show experimentally that significant results are obtained using the WCSP encoding. We compare both solution quality and running times of these algorithms on random graphs and on protein similarity graphs derived from the COG dataset.
... Two algorithms that have appeared recently in the literature using graph-theoretic ideas for one-way clustering of microarray data are HCS [24] and CAST [6]. In [24], a polynomial-time algorithm, HCS (Highly Connected Subgraphs), is presented for cluster analysis using graph-theoretic techniques. The time complexity of the algorithm is not given, but is estimated to be at least O(ne), where n is the number of vertices and e the number of edges. ...
Article
Full-text available
In this paper, a novel, fast, noise-tolerant, graph-drawing-based clustering and visualization algorithm is proposed for one-way clustering of gene expression data. The proposed divide-and-conquer algorithm works by repeatedly bisecting the similarity matrix along the x-axis and y-axis, ignoring the noise space while running crossing-minimization heuristics on the remaining parts of the matrix (the data space), and thus achieves provably good clustering under very noisy conditions. The proposed Robust Clustering (RC) algorithm does not require thresholds or a priori knowledge to achieve cluster extraction. Comparing the proposed RC algorithm with contemporary techniques using simulated data under different noise conditions shows its high immunity to noise, and comparison using real gene expression data shows promising results.
... Through the clustering of gene expression data, we can obtain a DNA clone classification. At present, the primary algorithms for clustering gene expression data include hierarchical methods [1,2,3,4], K-means [5], greedy methods [6,7], graph partitioning [8,9,10,11], probabilistic methods [8,9,10,11] and self-organizing maps [12,13]. In recent years, Figueroa et al. transformed gene expression data into a 0-1-N vector set called the fingerprint vector set, turning the clustering of gene expression data into clustering of fingerprint vectors. ...
Conference Paper
Clustering of binary fingerprints is used in the classification of gene expression data. It is known that clustering binary fingerprints with 3 bits of missing values is NP-hard. The greedy clique partition (GCP, for short) algorithm is a heuristic used for clustering binary fingerprints with missing values. In this paper, we first study the features of instances that cannot be resolved by the hash-table-based GCP. Then a new property of problem instances is given, which can further improve the linked-list-based heuristic algorithm. Finally, an empirical formula is presented for judging the accuracy and credibility of the GCP algorithm.
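The greedy clique partition idea is simple to sketch: grow a maximal clique around a seed vertex, remove it, and repeat. A minimal version follows (missing-value handling and the hash-table/linked-list refinements studied in the paper are omitted); the toy graph also illustrates the kind of instance on which the greedy choice goes wrong:

```python
import networkx as nx

def greedy_clique_partition(G):
    """GCP sketch: repeatedly grow a maximal clique around the densest
    remaining vertex, record it, and delete it from the graph."""
    H = G.copy()
    cliques = []
    while H.number_of_nodes() > 0:
        v = max(H.nodes, key=H.degree)  # seed with the highest-degree vertex
        clique = {v}
        for u in sorted(H.neighbors(v), key=H.degree, reverse=True):
            if all(H.has_edge(u, w) for w in clique):
                clique.add(u)
        cliques.append(clique)
        H.remove_nodes_from(clique)
    return cliques

# Greedy behaviour: the high-degree bridge endpoints 2 and 3 are grabbed
# first, so this returns {2, 3}, {0, 1}, {4, 5} rather than the two triangles.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
print(greedy_clique_partition(G))
```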
... Finding structure in networks, known as community detection, or clustering, has a wide range of biomedical applications. [1][2][3] Recently, clustering algorithms have been applied in the context of Identity-By-Descent (IBD) mapping 4,5 as an alternative approach for rare variant association testing that leverages genotype data in the absence of directly observed variation for genomic discovery. This method relies on shared haplotypes along the genome co-inherited identically from a recent common ancestor and utilizes them as the basis for association testing, under the assumption that the haplotypes may co-harbour recently arisen rare variation not directly captured on genotyping arrays. ...
Article
Full-text available
Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks using IBD mapping. Clustering algorithms play an important role in finding these groups accurately and at scale. We set out to analyze the fitness of commonly used, fast and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and utilized it to compare the statistical power of clustering algorithms by simulating 2.3 million clusters across 850 experiments. We found the Infomap and Markov Clustering (MCL) community detection methods to have high statistical power in most of the scenarios. They yield a 30% increase in power compared to the current state-of-the-art approach, with a three-orders-of-magnitude lower runtime. We also found that standard clustering metrics, such as modularity, cannot predict the statistical power of algorithms in IBD mapping applications. We extend our findings to real datasets by analyzing the Population Architecture using Genomics and Epidemiology (PAGE) Study dataset, with 51,000 samples and 2 million shared segments on Chromosome 1, resulting in the extraction of 39 million local IBD clusters. We demonstrate the power of our approach by recovering signals of rare genetic variation in the whole-exome sequence data of 200,000 individuals in the UK Biobank. We provide an efficient implementation to enable clustering at scale for IBD mapping for various populations and scenarios. Supplementary information: The code, along with supplementary methods and figures, is available at https://github.com/roohy/localIBDClustering.
... Clustering can be used to create categories of datasets on the one hand, and categories of users and applications on the other. The clustering of datasets can be performed by applying the highly connected subgraph algorithm [71] to the graph data. Datasets will be found in the same cluster if they are highly connected in the graph data, which would mean that the datasets within one cluster share relevant variables and methodological features. ...
Article
Full-text available
The exploitation of potential societal benefits of Earth observations is hampered by users having to engage in often tedious processes to discover data and extract information and knowledge. A concept is introduced for a transition from the current perception of data as passive objects (DPO) to a new perception of data as active subjects (DAS). This transition would greatly increase data usage and exploitation, and support the extraction of knowledge from data products. Enabling the data subjects to actively reach out to potential users would revolutionize data dissemination and sharing and facilitate collaboration in user communities. The three core elements of the transformative DAS concept are: (1) "intelligent semantic data agents" (ISDAs) that have the capabilities to communicate with their human and digital environment. Each ISDA provides a voice to the data product it represents. It has comprehensive knowledge of the represented product including quality, uncertainties, access conditions, previous uses, user feedback, etc., and it can engage in transactions with users. (2) A knowledge base that constructs extensive graphs presenting a comprehensive picture of communities of people, applications, models, tools, and resources, and provides tools for the analysis of these graphs. (3) An interaction platform that links the ISDAs to the human environment and facilitates transactions, including discovery of products, access to products and derived knowledge, modification and use of products, and the exchange of feedback on usage. This platform documents the transactions in a secure way, maintaining full provenance.
... Clustering large data sets plays an important role in gene expression analysis. In [12], cluster analysis of cDNA fingerprints is used to identify clones corresponding to the same gene. In [13], many near-optimal clusterings are used to explore the dynamics of network clusterings. ...
Article
In a network, a $k$-plex represents a subset of $n$ vertices where the degree of each vertex in the subnetwork induced by this subset is at least $n-k$. The maximum edge-weight $k$-plex partitioning problem (Max-EkPP) is to find the $k$-plex partitioning of an edge-weighted network such that the sum of edge weights is maximal. The Max-EkPP has an important role in discovering new information in large biological networks. We propose a variable neighborhood search (VNS) algorithm for solving Max-EkPP. The VNS implements a local search based on the 1-swap first-improvement strategy and an objective function that takes into account the degree of every vertex in each partition. The objective function favors feasible solutions and enables a gradual increase of the objective value when moving from slightly infeasible to barely feasible solutions. Experimental computation is performed on real metabolic networks and other benchmark instances from the literature. Compared to the previously proposed integer linear programming (ILP) approach, the VNS succeeds in finding all known optimal solutions. For all other instances, the VNS either reaches the previous best known solution or improves it. The proposed VNS is also tested on a large-scale dataset not considered up to now.
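The k-plex condition is easy to check directly, and doing so clarifies how it relaxes a clique (a clique is exactly a 1-plex). A small sketch:

```python
import networkx as nx

def is_k_plex(G, nodes, k):
    """Check the k-plex condition: in the subgraph induced by `nodes`
    (n of them), every vertex has degree at least n - k."""
    sub = G.subgraph(nodes)
    n = sub.number_of_nodes()
    return all(d >= n - k for _, d in sub.degree())

# A 4-cycle is a 2-plex (each vertex misses one other) but not a 1-plex.
C4 = nx.cycle_graph(4)
print(is_k_plex(C4, range(4), 1), is_k_plex(C4, range(4), 2))  # False True
```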
... As the precise gene sequence of the genome of every human is exclusive to them, we will all have inimitable disease liabilities and treatment responses. Personalized medicine defines the utilization of our genetic information to tailor health care interventions to an individual's requirements [48]. [Table excerpt, clustering methods with example applications: (preceding row truncated; refs [22,70,242]); Hierarchical method: improved OTU-picking using long-read 16S rRNA gene, discovery of possible gene relationships [33,71,170]; Self-organizing map: single-cell transcriptome analysis revealing dynamic changes in lncRNA expression, structuring microbial metabolic responses to multiplexed stimuli, discovering genetic ancestry [83,115,130]; Graph theory approach: threshold selection in gene co-expression networks, clustering cDNA fingerprints [91,180]; Coupled two-way: translating biosynthetic gene clusters into fungal armor and weaponry, two-way learning with one-way supervision for gene expression data [45,112,245]; Plaid model: analyzing gene data [18,76,124]; Model-based: data transformations for gene expression data [154,157,254].] ...
Article
Full-text available
The ever-improving quality and decreasing cost of sequencing a complete human genome have led to rapid adoption of genetic and genomic information at both research institutions and clinics. Biologists are taking the first steps toward identifying the locations and functions of all the genes and regulatory sites in the genomes of various organisms. As these researchers determine the nucleotide sequence of large stretches of the human genome, they are producing huge volumes of sequence data. Direct laboratory investigation of this data is expensive and difficult, making computational techniques vital. The field of pattern analysis, which aims to build computer algorithms that improve with experience, has the potential to enable computers to assist humans in the analysis of large, complex genetic and genomic data sets. Here, an overview of pattern analysis techniques for the study of genome sequencing datasets, as well as proteomic, epigenetic and metabolomic data, is provided. These techniques involve data pre-processing, feature extraction and selection, classification and clustering. The aim of this survey is to present considerations and recurring challenges in the application of pattern analysis methods, as well as of discriminative and generative modeling approaches, and to discuss future research directions of these methods for the analysis of genomic and genetic data sets.
Preprint
In a network, a $k$-plex represents a subset of $n$ vertices where the degree of each vertex in the subnetwork induced by this subset is at least $n-k$. The maximum edge-weight $k$-plex partitioning problem (Max-EkPP) is to find the $k$-plex partitioning of an edge-weighted network such that the sum of edge weights is maximal. The Max-EkPP has an important role in discovering new information in large sparse biological networks. We propose a variable neighborhood search (VNS) algorithm for solving Max-EkPP. The VNS implements a local search based on the 1-swap first-improvement strategy and an objective function that takes into account the degree of every vertex in each partition. The objective function favors feasible solutions, also enabling a gradual increase in objective value when moving from slightly infeasible to barely feasible solutions. A comprehensive experimental computation is performed on real metabolic networks and other benchmark instances from the literature. Compared to the integer linear programming method from the literature, our approach succeeds in finding all known optimal solutions. For all other instances, the VNS either reaches the previous best known solution or improves it. The proposed VNS is also tested on a large-scale dataset not previously considered in the literature.
... One reason for this choice is the great success of applications of the Highly Connected Subgraphs (HCS) clustering algorithm proposed by Hartuv and Shamir; the second reason is the lack of research on this model compared with the standard clique model. The HCS algorithm has been used [11] to cluster cDNA fingerprints [8], to find complexes in protein-protein interaction data [10], to group protein sequences hierarchically into superfamily and family clusters [13], and to find families of regulatory RNA structures [15]. ...
Article
Full-text available
Clustering is a well-known and important problem with numerous applications. The graph-based model is one of the typical cluster models. In the graph model, clusters are generally defined as cliques. However, such an approach might be too restrictive, as in some applications not all objects from the same cluster must be connected. That is why different types of clique relaxations are often considered as clusters. In our work, we consider the problem of partitioning a graph into clusters and the problem of isolating a cluster of a special type, where by a cluster we mean a highly connected subgraph. Such a clustering was first proposed by Hartuv and Shamir, and their HCS clustering algorithm has been extensively applied in practice. It was used to cluster cDNA fingerprints, to find complexes in protein-protein interaction data, to group protein sequences hierarchically into superfamily and family clusters, and to find families of regulatory RNA structures. The HCS algorithm partitions a graph into highly connected subgraphs. However, this is achieved by deleting a number of edges that is not necessarily minimum. In our work, we try to minimize the number of edge deletions. We consider these problems from the parameterized point of view, where the main parameter is the number of allowed edge deletions.
... The HCS algorithm recursively finds minimum graph cuts, leading to a partition of the graph into highly connected components or subgraphs. HCS has been applied to gene expression analysis [13]. RNSC is a partition-based algorithm that starts with a random cluster assignment and proceeds by reassigning nodes to clusters. ...
Article
Full-text available
How can complex relationships among molecular or clinico-pathological entities of neurological disorders be represented and analyzed? Graphs seem to be the current answer to the question, no matter the type of information: molecular data, brain images or neural signals. We review a wide spectrum of graph representation and graph analysis methods and their applications in the study of both the genomic and the phenotypic level of neurological disorders. We find numerous research works that create, process and analyze graphs formed from one or a few data types to gain an understanding of specific aspects of neurological disorders. Furthermore, with the increasing amount of data of various types becoming available for neurological disorders, we find that integrative analysis approaches that combine several types of data are being recognized as a way to gain a global understanding of the diseases. Although there are still not many integrative analyses of graphs, due to the complexity of the analysis, multi-layer graph analysis is a promising framework that can incorporate various data types. We describe and discuss the benefits of the multi-layer graph framework for studies of neurological disease.
... Each non-CH node determines its cluster by choosing the CH that can be reached using the least communication energy. The role of cluster head is rotated periodically among the nodes of the cluster in order to balance the load [6]. During the set-up phase, each sensor node chooses a random number between 0 and 1. ...
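The excerpt stops just before the comparison step: in LEACH the drawn number is checked against a rotating threshold, commonly written T = p / (1 - p * (r mod 1/p)), so that over one cycle every node takes a turn as cluster head. A sketch of that election rule (the exact variant used by the protocols in the following paper may differ):

```python
import random

def leach_elects_ch(p, round_no, was_ch_recently):
    """LEACH set-up phase sketch: a node becomes cluster head when its
    uniform draw falls below T = p / (1 - p * (round mod 1/p)); nodes
    that served as CH in the last 1/p rounds sit out, rotating the role."""
    if was_ch_recently:
        return False
    threshold = p / (1 - p * (round_no % round(1 / p)))
    return random.random() < threshold

# With p = 0.1, round 3: threshold ~0.143, rising to 1.0 by round 9, so
# each eligible node serves once per 10-round cycle.
random.seed(1)
print(sum(leach_elects_ch(0.1, 3, False) for _ in range(1000)))
```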
Article
Full-text available
Wireless sensor networks (WSN) are made up of sensor nodes which are usually battery-operated devices, and hence energy saving at the sensor nodes is a major design issue. To prolong the network's lifetime, minimization of energy consumption should be implemented at all layers of the network protocol stack, starting from the physical up to the application layer, including cross-layer optimization. Optimizing energy consumption is the main concern when designing and planning the operation of a WSN. Clustering is one of the techniques utilized to extend the lifetime of the network by applying data aggregation and balancing energy consumption among the sensor nodes of the network. This paper proposes new versions of the Low Energy Adaptive Clustering Hierarchy (LEACH) protocol, called Advanced Optimized Low Energy Adaptive Clustering Hierarchy (AOLEACH), Optimal Deterministic Low Energy Adaptive Clustering Hierarchy (ODLEACH), and Varying Probability Distance Low Energy Adaptive Clustering Hierarchy (VPDL), in combination with the Shuffled Frog Leaping Algorithm (SFLA), that enable selecting the best optimal adaptive cluster heads using an improved threshold energy distribution compared to LEACH, and rotating the cluster head position for uniform energy dissipation based on energy levels. The proposed algorithms optimize the lifetime of the network by increasing the first node death (FND) time and the number of alive nodes, thereby increasing the lifetime of the network.
... Existing methods include spectral clustering [24], edge-based agglomerative or divisive methods [25], multi-level graph partitioning [26], algorithms based on min-cut [27], Markov clustering [22,28], and many more [29][30][31][32][33]. The problem is also similar to that of identifying high-density subgraphs [34] and can be solved (with minor modifications) using the algorithm by Hartuv and Shamir [35,36] or the one by Hüffner et al. [34]. All the above methods have their strengths and weaknesses, but the Markov clustering approach was chosen for our work because of its previous success with biological data sets [37]. ...
Article
Full-text available
It is well understood that distinct communities of bacteria are present at different sites of the body, and that changes in the structure of these communities have strong implications for human health. Yet, challenges remain in understanding the complex interconnections between the bacterial taxa within these microbial communities and how they change during the progression of diseases. Many recent studies attempt to analyze the human microbiome using traditional ecological measures and cataloging differences in bacterial community membership. In this paper, we show how to push metagenomic analyses beyond mundane questions related to the bacterial taxonomic profiles that differentiate one sample from another. We develop tools and techniques that help us to investigate the nature of social interactions in microbial communities, and demonstrate ways of compactly capturing extensive information about these networks and visually conveying them in an effective manner. We define the concept of bacterial "social clubs", which are groups of taxa that tend to appear together in many samples. More importantly, we define the concept of "rival clubs", entire groups that tend to avoid occurring together in many samples. We show how to efficiently compute social clubs and rival clubs and demonstrate their utility with the help of examples including a smokers' dataset and a dataset from the Human Microbiome Project (HMP). The tools developed provide a framework for analyzing relationships between bacterial taxa modeled as bacterial co-occurrence networks. The computational techniques also provide a framework for identifying clubs and rival clubs and for studying differences in the microbiomes (and their interactions) of two or more collections of samples. Microbial relationships are similar to those found in social networks. In this work, we assume that strong (positive or negative) tendencies to co-occur or co-infect are likely to have biological, physiological, or ecological significance, possibly as a result of cooperation or competition. As a consequence of the analysis, a variety of biological interpretations are conjectured. In the human microbiome context, the pattern of strength of interactions between bacterial taxa is unique to body site.
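At its simplest, the "social club" notion reduces to a co-occurrence graph over taxa. A hypothetical sketch assuming a 0/1 samples-by-taxa presence matrix and a plain co-occurrence-fraction threshold; the paper's actual club and rival-club criteria are more refined, and rival clubs would use anti-co-occurrence instead:

```python
import networkx as nx
import numpy as np

def social_clubs(presence, min_frac=0.5):
    """Sketch: connect two taxa if both are present in at least `min_frac`
    of the samples, then read off connected components as candidate
    'clubs'. `presence` is a samples x taxa 0/1 matrix."""
    n_samples, n_taxa = presence.shape
    co = (presence.T @ presence) / n_samples  # fraction of samples with both
    G = nx.Graph()
    G.add_nodes_from(range(n_taxa))
    for i in range(n_taxa):
        for j in range(i + 1, n_taxa):
            if co[i, j] >= min_frac:
                G.add_edge(i, j)
    return list(nx.connected_components(G))

# Taxa 0 and 1 co-occur in most samples; taxon 2 mostly avoids them.
presence = np.array([[1, 1, 0], [1, 1, 0], [1, 1, 1], [0, 0, 1]])
print(social_clubs(presence))  # [{0, 1}, {2}]
```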
... Such studies shed light on obtaining biomarkers for classifying cancers. Clustering analysis [4][5][6][7][8] is prevalent in the analysis of microarray data. Some studies of clustering analysis have focused on biclustering of gene expression data [9][10][11]. ...
Article
Full-text available
We present a new genetic filter to identify a predictive gene subset for cancer-type classification on gene expression profiles. This approach seeks not only to maximize the correlation between selected genes and cancer types but also to minimize the inter-correlation among the selected genes. The proposed genetic filter was tested on well-known leukemia datasets, and significant improvement over previous work was obtained.
... Later, a 4-approximation algorithm was developed for a very simple and intuitive formulation called MinDisAgree (Charikar et al. 2005). This problem has drawn much attention due to applications in image processing and computational biology (Ben-Dor et al. 1999; Hartuv et al. 2000; Jain et al. 1999; Milosavljevic et al. 1995; Sharan et al. 2003; Tatusov et al. 2003; Wittkop et al. 2007), among other areas. Motivated by the large interest in the CEP, as well as its wide applicability, we propose in this work new theoretical and algorithmic developments for the unweighted variant of the problem, where there are no weights associated with edges. ...
Article
The cluster editing problem consists of transforming an input graph \(G\) into a cluster graph (a disjoint union of complete graphs) by performing a minimum number of edge editing operations. Each edge editing operation consists of either adding a new edge or removing an existing edge. In this paper we propose new theoretical results on data reduction and instance generation for the cluster editing problem, as well as two algorithms based on coupling an exact method to, respectively, a GRASP or ILS heuristic. Experimental results show that the proposed algorithms are able to find high-quality solutions in practical runtime.
... To cluster miRNAs, we calculated as a distance measure the weighted pairwise correlation of expression between miRNAs (using mean weight of each miRNA pair). We used a graph-based density clustering method termed the highly connected subgraph (HCS) method (Hartuv et al. 2000), which is optimized for homogeneous clusters within a larger heterogeneous background, with a similarity measure consisting of a weighted Pearson correlation coefficient. HCS is parameter free, except for a robust threshold (set to 0.8) to define the false-discovery rate (FDR) of the final cluster set, enabling the detection of definite clusters from the noisy background of cell lines. ...
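The similarity measure named here, a weighted Pearson correlation, is short enough to write out. A sketch with one weight per sample; how the study derived its weights is not stated in this excerpt:

```python
import numpy as np

def weighted_pearson(x, y, w):
    """Weighted Pearson correlation between two expression profiles,
    with one weight per sample (here: per cell line)."""
    w = np.asarray(w, float) / np.sum(w)
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8])
print(weighted_pearson(x, y, np.ones(4)))  # uniform weights: plain Pearson
```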
Article
Full-text available
We expanded the knowledge base for Drosophila cell line transcriptomes by deeply sequencing their small RNAs. In total, we analyzed more than 1 billion raw reads from 53 libraries across 25 cell lines. We verify reproducibility of biological replicate data sets, determine common and distinct aspects of miRNA expression across cell lines, and infer the global impact of miRNAs on cell line transcriptomes. We next characterize their commonalities and differences in endo-siRNA populations. Interestingly, most cell lines exhibit enhanced TE-siRNA production relative to tissues, suggesting this as a common aspect of cell immortalization. We also broadly extend annotations of cis-NAT-siRNA loci, identifying ones with common expression across diverse cells and tissues, as well as cell-restricted loci. Finally, we characterize small RNAs in a set of ovary-derived cell lines, including somatic cells (OSS and OSC) and a mixed germline/somatic cell population (fGS/OSS) that exhibits ping-pong piRNA signatures. Collectively, the ovary data reveal new genic piRNA loci, including unusual configurations of piRNA-generating regions. Together with the companion analysis of mRNAs described in a previous study, these small RNA data provide comprehensive information on the transcriptional landscape of diverse Drosophila cell lines. These data should encourage broader usage of fly cell lines, beyond the few that are presently in common usage.
Article
Full-text available
Histone modifications are critical for the regulation of gene expression, cell type specification, and differentiation. However, evolutionary patterns of key modifications that regulate gene expression in differentiating organisms have not been examined. Here we mapped the genomic locations of the repressive mark histone 3 lysine 27 trimethylation (H3K27me3) in four species of Drosophila, and compared these patterns to those in C. elegans. We found that patterns of H3K27me3 are highly conserved across species, but conservation is substantially weaker among duplicated genes. We further discovered that retropositions are associated with greater evolutionary changes in H3K27me3 and gene expression than tandem duplications, indicating that local chromatin constraints influence duplicated gene evolution. These changes are also associated with concomitant evolution of gene expression. Our findings reveal the strong conservation of genomic architecture governed by an epigenetic mark across distantly related species and the importance of gene duplication in generating novel H3K27me3 profiles.
... networks, the most common assumption is that clusters are groups of highly connected nodes, although recently the notion of community, intended as a set of topologically similar links, has been successfully used in Ahn et al. (2010) and Solava et al. (2012). We also observe that many different clustering techniques have been proposed for graph analysis [e.g. the minimum cut algorithm in Hartuv et al. (2000) and the survey on graph clustering in Schaeffer (2007)]. ...
Article
Full-text available
Protein-Protein Interaction (PPI) networks are powerful models to represent the pair-wise protein interactions of organisms. Clustering PPI networks can be useful for isolating groups of interacting proteins that participate in the same biological processes, or that perform together specific biological functions. Evolutionary orthologies can be inferred this way, as well as functions and properties of yet uncharacterized proteins. We present an overview of the main state-of-the-art clustering methods that have been applied to PPI networks over the last decade. We distinguish five specific categories of approaches, describe and compare their main features, and then focus on one of them, that is, population-based stochastic search. We provide an experimental evaluation, based on some validation measures widely used in the literature, of techniques in this class, which is as yet less explored than the others. In particular, we study how the capability of Genetic Algorithms (GAs) to extract clusters in PPI networks varies when different topology-based fitness functions are employed, and we compare GAs with the main techniques in the other categories. The experimental campaign shows that predictions returned by GAs are often more accurate than those produced by the contestant methods. Interesting issues still remain open about possible generalizations of GAs allowing for cluster overlapping. We point out which methods and tools described here are publicly available. Contact: pizzuti@icar.cnr.it, simona.rombo@math.unipa.it. Supplementary information: Supplementary material showing further validation results is available.
... Hartuv and Shamir [5] proposed a clustering algorithm producing so-called highly connected clusters. Their method has been successfully used to cluster cDNA fingerprints [6], to find complexes in protein-protein interaction (PPI) data [7,8], to group protein sequences hierarchically into superfamily and family clusters [9], and to find families of regulatory RNA structures [10]. Hartuv and Shamir [5] formalized the connectivity demand for a cluster as follows: the edge connectivity λ(G) of a graph G is the minimum number of edges whose deletion results in a disconnected graph, and a graph G with n vertices is called highly connected if λ(G) > n/2. ...
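The quoted definition translates directly into code, and the HCS recursion follows from it: cut along a minimum edge cut until every part satisfies λ(G) > n/2. A minimal networkx sketch; the original paper's refinements (singleton adoption, removal of low-degree vertices, iterated HCS) are left out:

```python
import networkx as nx

def is_highly_connected(G):
    """Hartuv-Shamir criterion: a graph with n vertices is highly
    connected if its edge connectivity lambda(G) exceeds n/2."""
    n = G.number_of_nodes()
    return n <= 1 or nx.edge_connectivity(G) > n / 2

def hcs(G):
    """Recursive HCS sketch: split along a minimum edge cut until every
    part is highly connected."""
    parts = list(nx.connected_components(G))
    if len(parts) > 1:  # handle each connected component separately
        return [c for p in parts for c in hcs(G.subgraph(p).copy())]
    if is_highly_connected(G):
        return [set(G.nodes())]
    H = G.copy()
    H.remove_edges_from(nx.minimum_edge_cut(G))
    return [c for p in nx.connected_components(H) for c in hcs(G.subgraph(p).copy())]

# Two 4-cliques joined by a single edge are split back into the cliques.
G = nx.disjoint_union(nx.complete_graph(4), nx.complete_graph(4))
G.add_edge(0, 4)
print(hcs(G))  # [{0, 1, 2, 3}, {4, 5, 6, 7}]
```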
Article
Full-text available
A popular clustering algorithm for biological networks which was proposed by Hartuv and Shamir [IPL 2000] identifies nonoverlapping highly connected components. We extend the approach taken by this algorithm by introducing the combinatorial optimization problem Highly Connected Deletion, which asks for removing as few edges as possible from a graph such that the resulting graph consists of highly connected components. We show that Highly Connected Deletion is NP-hard and provide a fixed-parameter algorithm and a kernelization. We propose exact and heuristic solution strategies, based on polynomial-time data reduction rules and integer linear programming with column generation. The data reduction typically identifies 75% of the edges that are deleted for an optimal solution; the column generation method can then optimally solve protein interaction networks with up to 6,000 vertices and 13,500 edges in less than a day. Additionally, we present a new heuristic that finds more clusters than the method by Hartuv and Shamir.
... It is a reduced version of a Peripheral Blood Monocytes (PBM) dataset originally used by Hartuv et al. [52] to test their clustering algorithm. The dataset contains 2329 cDNAs, each fingerprinted with 139 different oligonucleotide probes, derived from 18 genes. ...
Article
Unsupervised learning, mostly represented by data clustering methods, is an ...
... A widely used technique for microarray data analysis is clustering analysis [3,9,4,12]. In gene clustering analysis, correlated genes are grouped together. ...
Article
Recent advances in DNA microarrays offer the ability to monitor and measure the expression levels of thousands of genes simultaneously in an organism. These experiments consist of monitoring each gene many times under different conditions, or evaluating each gene under a single environment but in different types of tissues. The first is useful for identification of functionally related genes, while the second type of experiment is helpful in classification of different types of tissues and identification of those genes whose expression levels are good diagnostic indicators. Different machine learning approaches, such as supervised and some unsupervised learning, have been previously applied to classify different kinds of patient samples by identifying those genes responsible for different types of cancers. However, the main challenges in this task are the availability of a smaller number of samples compared to the huge number of genes, and the noisy nature of biological data. Moreover, many of these genes are irrelevant to the distinction of different samples and have a negative impact on the acquired classification accuracy. In this paper, I provide a survey on gene expression based cancer classification using evolutionary and non-evolutionary methods. Keywords: DNA microarray, gene expression, Naive-Bayes classifier, support vector machine, decision tree, nearest neighbor classifier, neural network, leave-one-out cross-validation (LOOCV), multi-objective evolutionary algorithm, PMBGA.
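The keyword list above names leave-one-out cross-validation, the usual protocol for these few-samples/many-genes datasets. A minimal scikit-learn sketch on synthetic stand-in data (pure noise, so accuracy should hover near chance):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 30 samples x 100 "genes" of pure noise: LOOCV accuracy should sit near 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))
y = np.repeat([0, 1], 15)
acc = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y,
                      cv=LeaveOneOut()).mean()
print(f"LOOCV accuracy: {acc:.2f}")
```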
... The main emphasis of data mining is on the individual subject rather than the population, providing an avenue for personalization [16]. Several computational techniques have been applied to gene expression classification problems, including Fisher linear discriminant analysis [17], k-nearest neighbors [18], decision trees, multi-layer perceptrons [19], support vector machines [20], self-organizing maps [4], hierarchical clustering [21], and graph-theoretic approaches [22]. ...
... An alternative strategy to deal with noise genes is sequential clustering. Included works are quality-based clustering (Heyer et al. [15]), adaptive quality-based clustering (AQC) (De Smet et al. [8]), CAST (Ben-Dor et al. [2]), gene shaving (Hastie et al. [13]), CLICK (Sharan and Shamir [25]), HCS (Hartuv et al. [11]), tight clustering (Tseng and Wong [28]), and Liang and Wang [18], among others. Taking AQC as an example, it first searches for a cluster center, and then groups the genes around the center into a cluster. ...
Article
Full-text available
Clustering has been an important tool for extracting underlying gene expression patterns from massive microarray data. However, most of the existing clustering methods cannot automatically separate noise genes, including scattered, singleton and mini-cluster genes, from other genes. Inclusion of noise genes into regular clustering processes can impede identification of gene expression patterns. The model-based clustering method has the potential to automatically separate noise genes from other genes so that major gene expression patterns can be better identified. In this paper, we propose to use the ensemble averaging method to improve the performance of the single model-based clustering method. We also propose a new density estimator for noise genes for Gaussian mixture modeling. Our numerical results indicate that the ensemble averaging method outperforms other clustering methods, such as the quality-based method and the single model-based clustering method, in clustering datasets with noise genes.
... It has numerous applications in biology as well as in many other disciplines. There are several algorithmic techniques previously used in clustering gene expression data, including hierarchical clustering [24], the k-means algorithm, self-organizing maps [67], and graph-theoretic approaches [23][6][74]. ...
Article
Full-text available
We review recent research and development in high performance computing (HPC) for computational biology and discuss the great challenges facing both biomedical scientists and IT professionals. During the last decades, research in the fields of molecular biology and biomedicine has provided the scientific community with huge amounts of data through sequencing, genome-wide annotation and gene expression profiling projects. The genetic databases have been growing exponentially, and sophisticated computer algorithms have been developed to cater for the needs of data mining, analysis and simulation. It is clear that the development of HPC technologies has become crucial for deploying the software systems that tackle various bioinformatics problems. The goal of this article is to present current research and our critical review on the construction of parallel and distributed computing systems, the design of multi-process algorithms, and the development of software systems for biocomputing tasks including phylogenetic analysis, pairwise and multiple sequence alignment, heuristic database searching, and gene clustering. We also give a brief introduction to our work on developing highly scalable and reproducible HPC algorithms and indicate the challenging problems in this context.
... This gives a 2329 × 139 data matrix. According to Hartuv et al. [45], the cDNAs in the dataset originated from 18 distinct genes, i.e., the a priori classes are known. The partition of the dataset into 18 groups was obtained by laboratory experiments at Novartis in Vienna. ...
Chapter
Full-text available
This chapter considers three clustering steps: choice of a distance function; choice of a clustering algorithm; and choice of a methodology to assess the statistical significance of clustering solutions. First, the chapter discusses the experimental set-up used for the results that are presented here. Next, it deals with distance functions; in particular, new approaches to assess the intrinsic separation ability of many standard distance functions and their use in conjunction with clustering algorithms. A section is devoted to clustering algorithms, in particular to nonnegative matrix factorization (NMF). The chapter further discusses the assessment of the statistical significance of a clustering solution. It explains the identification of the correct number of clusters in a given data set. This class of statistical methods is usually referred to as internal validation measures. Finally, the chapter deals with consensus clustering highlighting its paradigmatic nature for stability-based validation measures and its excellent discriminative power.
... To accommodate noise genes, several sequential clustering methods have been proposed in the recent literature, including quality-based clustering (Heyer et al., 1999), adaptive quality-based clustering (AQC) (De Smet et al., 2002), CAST (Ben-Dor et al., 1999), gene shaving (Hastie et al., 2000), CLICK (Sharan and Shamir, 2000), HCS (Hartuv et al., 2000), and tight clustering (Tseng and Wong, 2005). Refer to Shamir and Sharan (2002), Jiang et al. (2004), and Tseng (2005) for overviews of these methods. ...
Article
The increasing use of microarray technologies is generating a large amount of data that must be processed to extract underlying gene expression patterns. Existing clustering methods could suffer from certain drawbacks. Most methods cannot automatically separate scattered, singleton and mini-cluster genes from other genes. Inclusion of these types of genes into regular clustering processes can impede identification of gene expression patterns. In this paper, we propose a general clustering method, namely a dynamic agglomerative clustering (DAC) method. DAC can automatically separate scattered, singleton and mini-cluster genes from other genes and thus avoid possible contamination to the gene expression patterns caused by them. For DAC, the scattered gene filtering step is no longer necessary in data pre-processing. In addition, we propose a criterion for evaluating clustering results for a dataset which contains scattered, singleton and/or mini-cluster genes. DAC has been applied successfully to two real datasets for identification of gene expression patterns. Our numerical results indicate that DAC outperforms other clustering methods, such as the quality-based and model-based clustering methods, in clustering datasets which contain scattered, singleton and/or mini-cluster genes.
... CLuster Identification via Connectivity Kernels (CLICK) (Sharan et al., 2000) is appropriate for subspace and high-dimensional data clustering. A novel algorithm for cluster analysis based on graph-theoretic techniques is presented in (Hartuv et al., 2000). Unlike other methods, it does not assume that the clusters are hierarchically structured and does not require prior knowledge of the number of clusters. ...
Article
Full-text available
This paper presents an effective parameter-less graph-based clustering technique (GCEPD). GCEPD produces highly coherent clusters in terms of various cluster validity measures. The technique finds highly coherent patterns containing genes with high biological relevance. Experiments with real-life datasets establish that the method produces clusters that are significantly better than those of other similar algorithms in terms of various quality measures.
Chapter
There are many similarities in the symptoms of several types of cancer, which sometimes makes it difficult for physicians to reach an accurate diagnosis. In addition, it is a technical challenge to classify cancer cells accurately in order to differentiate one type of cancer from another. The DNA microarray technique (also called the DNA chip) has been used in the past for the classification of cancer, but it generates a large volume of noisy data that has many features and is difficult to analyze directly. This paper proposes a new method, combining the genetic algorithm, case-based reasoning, and the k-nearest neighbor classifier, which improves the performance of the classification considerably. The authors also use the well-known Mahalanobis distance of multivariate statistics as a similarity measure, which improves the accuracy. A case-based classifier approach together with the genetic algorithm has never before been applied to the classification of cancer, nor has the Mahalanobis distance. Thus, the proposed approach is a novel method for cancer classification. Furthermore, the results from the proposed method show considerably better performance than other algorithms. Experiments were done on several benchmark datasets such as the leukemia dataset, the lymphoma dataset, the ovarian cancer dataset, and the breast cancer dataset.
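The distance at the heart of this chapter is standard: d(x, y) = sqrt((x - y)^T S^{-1} (x - y)), with S the covariance matrix of the training data. A sketch, with the caveat that for many-genes/few-samples data S is singular, so a pseudo-inverse or shrinkage estimate would be needed:

```python
import numpy as np

def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance sqrt((x - y)^T S^{-1} (x - y))."""
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

# Correlated 2-D toy data; for many-genes/few-samples matrices, replace
# np.linalg.inv with np.linalg.pinv or a shrinkage covariance estimator.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
print(mahalanobis(X[0], X[1], cov_inv))
```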
Article
Ecosystem service approaches to watershed management have grown quickly, increasing the importance of understanding the streamflow response to realistic land-cover change. Previous work has investigated the relationship between watershed characteristics and streamflow in catchments around the world, but little has focused on systematic relationships between watershed characteristics and streamflow change after land-cover restoration. To address this gap, we simulate streamflow responses to restoring 10% of watershed area from agricultural land to forest and natural pasture in 29 watersheds around the world. This change is consistent with that performed in watershed-service programs. We calculate the change in a broad array of streamflow indices for each site and use a graph-connectedness approach to cluster the sites based on the sign of the index value changes. We find three primary clusters with distinct responses to restoration. Permutation tests and effect size demonstrate the difference in watershed characteristics and streamflow indices across clusters. The low-flow intensifying sites have shallower soils and smaller saturated soil volume. After restoration, simulated streamflow in these sites increases during relatively dry periods and declines during high-flow periods. The high-flow intensifying sites have larger saturated soil volume. After restoration, simulated dry-season flow in these sites decreases. The high-flow enhancing sites have larger soil hydraulic conductivities than the high-flow intensifying sites. After restoration, simulated dry-season flow in these sites decreases less than in high-flow intensifying sites. The soil depth and hydraulic conductivity appear to be the characteristics that determine clusters, as clusters are not statistically related to climate, watershed location, proximity, size and shape, elevation, or pre-existing land cover. This study provides valuable understanding of land-cover restoration and the watershed characteristics that most impact streamflow change.
Chapter
Identification of targets, generally viruses or bacteria, in a biological sample is a relevant problem in medicine. Biologists can use hybridization experiments to determine whether a specific DNA fragment, representing the virus, is present in a DNA solution. A probe is a segment of DNA or RNA, labeled with a radioactive isotope, dye, or enzyme, used to find a specific target sequence on a DNA molecule by hybridization. Selecting unique probes through hybridization experiments is a difficult task, especially when targets have a high degree of similarity, for instance in the case of closely related viruses. The non-unique probe selection problem is challenging from both a biological and a computational point of view; a plethora of methods have been proposed in the literature, ranging from evolutionary algorithms to mathematical programming approaches. In this study, we conducted a survey of the existing computational methods for probe design and selection. We introduce the biological aspects of the problem and examine several issues related to the design and selection of probes: oligonucleotide fingerprinting, the maximum distinguishing probe set, the minimum cost probe set, and non-unique probe selection.
Conference Paper
In gene expression analysis, grouping co-regulated genes is a major step in discovering genes that are likely to have related biological functions. This critical step can be done using clustering. This paper formally presents three models for iterative clustering based on average, single and complete linkage strategies. Variations of relational clustering algorithms can be built based on these models. The number of clusters need not be known in advance. Unlike centroid- and medoid-based algorithms, the proposed approach avoids minimizing a least-squares-type objective function; instead, it maximizes the average similarity between objects of the same cluster using a subset of the similarity matrix. The top k nearest, farthest or near-average entries in each row of the similarity matrix need to be identified, depending on the required linkage strategy. In order to reduce the computational complexity of this step, randomized search or genetic techniques can be used to approximate these elements; however, in our experimental studies, the exact k elements are computed. The performance of the proposed algorithms is evaluated and compared to existing techniques on two standard gene expression datasets.
Conference Paper
Identifying highly connected subgraphs in biological networks has become a powerful tool in computational biology. By definition a highly connected graph with n vertices can only be disconnected by removing more than \(\frac{n}{2}\) of its edges. This definition, however, is not suitable for bipartite graphs, which have various applications in biology, since such graphs cannot contain highly connected subgraphs. Here, we introduce a natural modification of highly connected graphs for bipartite graphs, and prove that the problem of finding such subgraphs with the maximum number of vertices in bipartite graphs is NP-hard. To address this problem, we provide an integer linear programming solution, as well as a local search heuristic. Finally, we demonstrate the applicability of our heuristic to predict protein function by identifying highly connected subgraphs in bipartite networks that connect proteins with their experimentally established functionality.
Article
An improperly tuned wavelet neural network (WNN) has been shown to exhibit unsatisfactory generalization performance. In this study, the tuning is done by an improved fuzzy C-means algorithm that utilizes a novel similarity measure. This similarity measure takes the orientation as well as the distance into account. The modified WNN was first applied to a benchmark problem, and performance comparisons with other approaches were made subsequently. Next, the feasibility of the proposed WNN for forecasting the chaotic Mackey-Glass time series and for a real-world application problem, i.e., blood glucose level prediction, was studied. An assessment analysis demonstrated that the presented WNN was superior in terms of prediction accuracy.
Article
Discovering new information about groups of genes implicated in a disease is still challenging. Microarrays are a powerful tool for analyzing gene expression: they provide an expression level for each gene under given biological conditions. In this paper, we propose a novel approach outlining relationships between genes based on their ordered expressions. First, we propose a new kind of pattern, called sequential patterns, for biologists to investigate. However, owing to the density of the expression matrix, extracting sequential patterns from microarray datasets is far from easy. Second, we propose to introduce a knowledge source during the mining task. In this way, the search space is reduced and more relevant results (from a biological point of view) are obtained. Results of various experiments on real biological data highlight the relevance of our proposal.
Article
Fuzzy C-means (FCM) partitions the observations partially into several clusters based on the principles of fuzzy theory. However, minimizing the Euclidean distance in FCM tends to detect hyper-spherically shaped clusters, which is inadequate for real-world problems. In this paper, an effective FCM algorithm that adopts a symmetry similarity measure is proposed in order to find the appropriate clusters regardless of their geometric structure and overlapping characteristics. Experimental results on several artificial and real-life datasets of different natures, and performance comparisons with other existing clustering algorithms, demonstrate its superiority.
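For reference, the Euclidean baseline this paper modifies is short enough to sketch in full (the symmetry similarity measure itself is not reproduced here). Standard FCM alternates a weighted center update with the membership update u_ik = d_ik^(-2/(m-1)) / sum_j d_jk^(-2/(m-1)):

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, tol=1e-5, seed=0):
    """Standard Euclidean fuzzy C-means: returns cluster centers and the
    fuzzy membership matrix U (n x c). m > 1 controls fuzziness."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]     # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2 / (m - 1)) *
                       np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U

# Two well-separated Gaussian blobs; centers land near (3, 3) and (-3, -3).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(3, 1, (50, 2)), rng.normal(-3, 1, (50, 2))])
centers, U = fcm(X, c=2)
print(np.round(centers).astype(int))
```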
Article
This chapter provides a survey of a classification problem involving genetic sequences, namely the problem of classifying fingerprint vectors with missing values. The main focus of the chapter is the binary clustering with missing values (BCMV) problem, a discrete classification approach developed by Figueroa, Borneman, and Jiang in 2004 for analyzing oligonucleotide fingerprints, especially in applications such as DNA clone classification. The chapter provides some basic mathematical definitions that are useful for understanding the underlying computational problems, along with a brief survey of various other classification approaches. It then gives a brief overview of several approaches for estimating missing values in genomic data, and finally discusses in detail the BCMV problem and its variations.
Conference Paper
The Cluster Editing problem asks to transform a graph into a disjoint union of cliques using a minimum number of edge modifications. Although the problem has been proven NP-complete several times, it has nevertheless attracted much research both from the theoretical and the applied side. The problem has been the inspiration for numerous algorithms in bioinformatics, aiming at clustering entities such as genes, proteins, phenotypes, or patients. In this paper, we review exact and heuristic methods that have been proposed for the Cluster Editing problem, and also applications of these algorithms for biological problems.
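One concrete example of the heuristics such a survey covers is the randomized pivot scheme of Ailon, Charikar and Newman, an expected-3-approximation for unweighted Cluster Editing. The sketch below is our own simplified rendering, with a helper to count the implied edge modifications.

```python
import random
import networkx as nx

def pivot_cluster_editing(G, seed=0):
    """Randomized pivot heuristic (Ailon, Charikar and Newman, 2008):
    pick a random pivot, cluster it with its still-unclustered neighbors,
    remove them, and repeat."""
    rng = random.Random(seed)
    remaining = set(G.nodes())
    clusters = []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot} | (set(G[pivot]) & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters

def editing_cost(G, clusters):
    """Edge modifications needed to turn G into these disjoint cliques."""
    label = {v: i for i, c in enumerate(clusters) for v in c}
    intra_pairs = sum(len(c) * (len(c) - 1) // 2 for c in clusters)
    intra_edges = sum(1 for u, v in G.edges() if label[u] == label[v])
    inter_edges = G.number_of_edges() - intra_edges
    return (intra_pairs - intra_edges) + inter_edges  # insertions + deletions

# Two triangles bridged by one edge: a single deletion suffices.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)])
clusters = pivot_cluster_editing(G)
print(clusters, editing_cost(G, clusters))
```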
Conference Paper
The classification of tumor types based on genomic information is important for improving cancer diagnosis and drug development. DNA microarray studies produce large amounts of data; expression data are highly redundant and noisy, and most genes are believed to be uninformative with respect to the studied classes, since only a fraction of genes may present distinct profiles for different classes of samples. Classification tools that can robustly identify a subset of informative genes embedded in a large dataset contaminated with high-dimensional noise are therefore important. In this paper, an integrated approach combining a support vector machine (SVM) and particle swarm optimization (PSO) is proposed for this purpose. The proposed approach can simultaneously optimize the feature subset and the classifier through a common solution-coding mechanism. As an illustration, the approach is applied to search for combinational gene signatures that predict the histologic response to chemotherapy of osteosarcoma patients. Cross-validation results show that the proposed approach outperforms existing methods in terms of classification accuracy. Further validation on an independent dataset shows misclassification of only one out of fourteen patient samples, suggesting that the selected gene signatures can reflect chemoresistance in osteosarcoma.
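A toy rendering of the wrapper idea, assuming a standard binary PSO with a sigmoid transfer function and sklearn's SVC as the classifier; the particle encoding, fitness, and parameters here are illustrative, not those used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=60, n_features=40, n_informative=5,
                           random_state=0)

def fitness(mask):
    """Cross-validated SVM accuracy on the selected feature subset."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(kernel="linear"), X[:, mask.astype(bool)],
                           y, cv=3).mean()

# Minimal binary PSO: positions are 0/1 masks; velocities pass through a
# sigmoid to give per-bit probabilities of selecting a feature.
n_particles, n_iter, w, c1, c2 = 10, 20, 0.7, 1.5, 1.5
pos = rng.integers(0, 2, (n_particles, X.shape[1]))
vel = rng.normal(0, 1, pos.shape)
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = (rng.random(pos.shape) < 1 / (1 + np.exp(-vel))).astype(int)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", np.flatnonzero(gbest),
      "accuracy:", pbest_fit.max())
```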
Conference Paper
In this study we propose an early lung cancer detection methodology using nucleus-based features. First, sputum samples from patients are labeled with Tetrakis Carboxy Phenyl Porphine (TCPP) and fluorescent images of these samples are taken. TCPP is a porphyrin that helps label lung cancer cells owing to the increased numbers of low-density lipoproteins coating the surface of cancer cells. We study the performance of well-known machine learning techniques for lung cancer detection on the Biomoda dataset. In our previous work we obtained an accuracy of 81% using 71 features related to shape, intensity and color; by adding nucleus-segmented features we improved the accuracy to 87%. Nucleus segmentation is performed using the seeded region growing method. Our results demonstrate the potential of nucleus-segmented features for detecting lung cancer.
Article
Specifying the number and locations of the translation vectors for wavelet neural networks (WNNs) is of paramount significance, as the quality of approximation may be drastically reduced if the initialization of the WNN parameters is not done judiciously. In this paper, an enhanced fuzzy C-means algorithm, specifically the modified point-symmetry-based fuzzy C-means algorithm (MPSDFCM), is proposed in order to determine optimal initial locations for the translation vectors. The proposed neural network models were then employed to approximate five different nonlinear continuous functions. The assessment analysis showed that integrating the MPSDFCM into the learning phase of WNNs leads to a significant improvement in prediction accuracy. A performance comparison with approaches reported in the literature for approximating the same benchmark piecewise function verified the superiority of the proposed strategy.
Conference Paper
Full-text available
The availability of large volumes of protein-protein interaction data has allowed the study of biological networks to unveil the complex structure and organization of the cell. Biologists have recognized that proteins interacting with each other often participate in the same biological processes, and that protein modules may often be associated with specific biological functions. Thus the detection of protein complexes is an important research problem in systems biology. In this review, recent graph-based approaches to clustering protein interaction networks are described and classified according to their common characteristics. The goal is to provide a useful guide and reference for both computer scientists and biologists.
Article
Full-text available
Detour paths provide overlay networks with improved performance and resilience. Finding good detour routes with methods that scale to millions of nodes is a challenging problem. We propose a novel approach for decentralised discovery of detour paths based on the observation that Internet paths that traverse overlapping sets of autonomous systems may benefit from the same detour nodes. We show how nodes can learn about overlap between Internet paths at the level of autonomous systems and demonstrate how they can exploit detours that other nodes have already found. Our approach is to cluster paths based on the extent to which the autonomous systems traversed overlap and gossip potential detours among nodes. We find that our centralised path clustering algorithm correctly classified over 90% of potential latency detours in a 176-node dataset drawn from PlanetLab. In our decentralised version, we detected 60% of potentially available detours with each node sampling data from only 10% of other nodes.
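A minimal sketch of the core idea, assuming paths are given as lists of AS numbers and grouped by a simple Jaccard-overlap threshold; the paper's actual clustering algorithm and gossip protocol are more elaborate.

```python
def jaccard(a, b):
    """Overlap between two AS-level paths, treated as sets of AS numbers."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cluster_paths(paths, threshold=0.5):
    """Greedy single-linkage grouping: a path joins the first cluster
    containing a path whose AS-set overlap meets the threshold.
    Illustrative only."""
    clusters = []
    for p in paths:
        for c in clusters:
            if any(jaccard(p, q) >= threshold for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

# Toy AS-level paths (lists of AS numbers).
paths = [[3356, 1299, 2914], [3356, 1299, 6453], [7018, 3257, 1239]]
print(cluster_paths(paths))  # first two paths share most of their ASes
```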
Article
The rapid development of technologies that generate arrays of gene data enables a global view of the transcription levels of hundreds of thousands of genes simultaneously. Outlier detection for gene data is important but is complicated by high dimensionality: the sparsity of data in high-dimensional space makes each point a relatively good outlier under traditional distance-based definitions, so finding outliers in high-dimensional data is more complex. In this paper, some basic outlier analysis algorithms are discussed and a new genetic algorithm is presented. The algorithm finds the best dimension projections based on a revised cell-based algorithm and provides explanations for its solutions. It can solve the outlier detection problem for gene expression data as well as for other high-dimensional data.
Conference Paper
Recent advances in biotechnology allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. Analysis of data produced by such experiments offers potential insight into gene function and regulatory mechanisms. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns; the corresponding algorithmic problem is to cluster multi-condition gene expression patterns. This paper introduces a new clustering algorithm for gene expression data, designed to avoid some of the drawbacks of existing algorithms for clustering such data. The proposed αCORR clustering algorithm is tested and verified on real biological data sets.
Article
Full-text available
We have developed a novel algorithm for cluster analysis that is based on graph theoretic techniques. A similarity graph is defined and clusters in that graph correspond to highly connected subgraphs. A polynomial algorithm to compute them efficiently is presented. Our algorithm produces a solution with some provably good properties and performs well on simulated and real data.
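The recursive scheme this abstract describes can be sketched compactly with networkx, assuming Stoer-Wagner for the minimum cut; the paper's refinements (singleton adoption, low-degree removal) are omitted.

```python
import networkx as nx

def highly_connected(G):
    """HCS criterion: edge connectivity must exceed half the vertex count."""
    n = G.number_of_nodes()
    return n <= 2 or nx.edge_connectivity(G) > n / 2

def hcs(G):
    """Recursive scheme sketched from the abstract: a highly connected
    (sub)graph is reported as a cluster; otherwise split along a minimum
    edge cut and recurse on both sides."""
    if G.number_of_nodes() > 1 and not nx.is_connected(G):
        return [c for comp in nx.connected_components(G)
                for c in hcs(G.subgraph(comp).copy())]
    if highly_connected(G):
        return [set(G.nodes())]
    _, (side_a, side_b) = nx.stoer_wagner(G)
    return hcs(G.subgraph(side_a).copy()) + hcs(G.subgraph(side_b).copy())

# Two triangles joined by a single edge split into two clusters.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)])
print(hcs(G))  # [{1, 2, 3}, {4, 5, 6}]
```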
Conference Paper
Distance functions are a fundamental ingredient of classification and clustering procedures, and this holds true also in the particular case of microarray data. In the general data mining and classification literature, functions such as Euclidean distance or Pearson correlation have gained their status of de facto standards thanks to a considerable amount of experimental validation. For microarray data, the issue of which distance function "works best" has been investigated, but no final conclusion has been reached. The aim of this paper is to shed further light on that issue. Indeed, we present an experimental study, involving several distances, assessing (a) their intrinsic separation ability and (b) their predictive power when used in conjunction with clustering algorithms. The experiments have been carried out on six benchmark microarray datasets, where the "gold solution" is known for each of them. We have used both hierarchical and K-means clustering algorithms and external validation criteria as evaluation tools. From the methodological point of view, the main result of this study is a ranking of those measures in terms of their intrinsic and clustering abilities, highlighting also the correlations between the two. Pragmatically, based on the outcomes of the experiments, one receives the indication that Minkowski, cosine and Pearson correlation distances seem to be the best choices when dealing with microarray data analysis.
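The three measures the study singles out are all available in scipy; a small example on made-up expression profiles:

```python
import numpy as np
from scipy.spatial.distance import minkowski, cosine, correlation

# Two toy expression profiles measured across five conditions.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.9, 6.2, 7.8, 10.5])

print("Minkowski (p=2, i.e. Euclidean):", minkowski(x, y, p=2))
print("Cosine distance:", cosine(x, y))                     # 1 - cosine similarity
print("Pearson correlation distance:", correlation(x, y))  # 1 - Pearson r
```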
Conference Paper
Full-text available
We significantly improve known time bounds for solving the minimum cut problem on undirected graphs. We use a "semiduality" between minimum cuts and maximum spanning tree packings combined with our previously developed random sampling techniques. We give a randomized (Monte Carlo) algorithm that finds a minimum cut in an m-edge, n-vertex graph with high probability in O(m log³ n) time. We also give a simpler randomized algorithm that finds all minimum cuts with high probability in O(n² log n) time. This variant has an optimal RNC parallelization. Both variants improve on the previous best time bound of O(n² log³ n). Other applications of the tree-packing approach are new, nearly tight bounds on the number of near-minimum cuts a graph may have and a new data structure for representing them in a space-efficient manner.
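Underlying these results is random edge contraction; the sketch below implements the elementary Karger contraction routine with repeated trials, not the paper's O(m log³ n) recursive algorithm.

```python
import random

def karger_min_cut(edges, n_nodes, trials=200, seed=0):
    """Basic Karger contraction: repeatedly contract uniformly random edges
    until two super-nodes remain; the edges crossing between them form a
    cut. Many independent trials make finding a minimum cut likely."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        parent = list(range(n_nodes))
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]   # path halving
                v = parent[v]
            return v
        remaining = n_nodes
        while remaining > 2:
            u, v = edges[rng.randrange(len(edges))]
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv                  # contract the edge
                remaining -= 1
        cut = sum(1 for u, v in edges if find(u) != find(v))
        best = cut if best is None else min(best, cut)
    return best

# Two triangles bridged by one edge: the minimum cut is 1.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
print(karger_min_cut(edges, 6))
```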
Book
Full-text available
The lack of a standard library of the data structures and algorithms of combinatorial and geometric computing severely limits the impact of this area on computer science. To address this problem, the LEDA project was launched in 1989 to build a library of the data types and algorithms of combinatorial and geometric computing. Among its many features, LEDA provides a sizable collection of data types and algorithms in a form that allows them to be used by non-experts. Sample applications include code optimization, motion planning, logic synthesis, scheduling, VLSI design, term rewriting systems, semantic nets, machine learning, image analysis, and computational biology.
Article
Full-text available
Given a set of entities, Cluster Analysis aims at finding subsets, called clusters, which are homogeneous and/or well separated. As many types of clustering and criteria for homogeneity or separation are of interest, this is a vast field. A survey is given from a mathematical programming viewpoint. Steps of a clustering study, types of clustering and criteria are discussed. Then algorithms for hierarchical, partitioning, sequential, and additive clustering are studied. Emphasis is on solution methods, i.e., dynamic programming, graph theoretical algorithms, branch-and-bound, cutting planes, column generation and heuristics.
Article
Full-text available
This paper is concerned with the physical mapping of DNA molecules using data about the hybridization of oligonucleotide probes to a library of clones. In mathematical terms, the DNA molecule corresponds to an interval on the real line, each clone to a subinterval, and each probe occurs at a finite set of points within the interval. A stochastic model for the occurrences of the probes and the locations of the clones is assumed. Given a matrix of incidences between probes and clones, the task is to reconstruct the most likely interleaving of the clones. Combinatorial algorithms are presented for solving approximations to this problem, and computational results are presented.
Article
Full-text available
Microarrays containing 1046 human cDNAs of unknown sequence were printed on glass with high-speed robotics. These 1.0-cm2 DNA "chips" were used to quantitatively monitor differential expression of the cognate human genes using a highly sensitive two-color hybridization assay. Array elements that displayed differential expression patterns under given experimental conditions were characterized by sequencing. The identification of known and novel heat shock and phorbol ester-regulated genes in human T cells demonstrates the sensitivity of the assay. Parallel gene analysis with microarrays provides a rapid and efficient method for large-scale human gene discovery.
Article
Full-text available
Large-scale sequencing of cDNAs randomly picked from libraries has proven to be a very powerful approach to discover (putatively) expressed sequences that, in turn, once mapped, may greatly expedite the process involved in the identification and cloning of human disease genes. However, the integrity of the data and the pace at which novel sequences can be identified depend to a great extent on the cDNA libraries that are used. Because, altogether, in a typical cell the mRNAs of the prevalent and intermediate frequency classes comprise as much as 50-65% of the total mRNA mass but represent no more than 1000-2000 different mRNAs, redundant identification of mRNAs of these two frequency classes is destined to become overwhelming relatively early in any such random gene discovery program, thus seriously compromising its cost-effectiveness. With the goal of facilitating such efforts, we previously developed a method to construct directionally cloned normalized cDNA libraries and applied it to generate infant brain (INIB) and fetal liver/spleen (INFLS) libraries, from which a total of 45,192 and 86,088 expressed sequence tags, respectively, have been derived. While improving the representation of the longest cDNAs in our libraries, we developed three additional methods to normalize cDNA libraries and generated over 35 libraries, most of which have been contributed to our integrated Molecular Analysis of Genomes and Their Expression (IMAGE) Consortium and thus distributed widely and used for sequencing and mapping. In an attempt to facilitate the process of gene discovery further, we have also developed a subtractive hybridization approach designed specifically to eliminate (or significantly reduce the representation of) large pools of arrayed and (mostly) sequenced clones from normalized libraries yet to be (or just partly) surveyed. Here we present a detailed description and a comparative analysis of the four methods that we developed and used to generate normalized cDNA libraries from human (15), mouse (3), and rat (2), as well as from the parasite Schistosoma mansoni (1). In addition, we describe the construction and preliminary characterization of a subtracted liver/spleen library (INFLS-SI) that resulted from the elimination (or reduction of representation) of ~5000 INFLS-IMAGE clones from the INFLS library.
Article
Full-text available
As a step toward understanding the complex differences between normal and cancer cells in humans, gene expression patterns were examined in gastrointestinal tumors. More than 300,000 transcripts derived from at least 45,000 different genes were analyzed. Although extensive similarity was noted between the expression profiles, more than 500 transcripts that were expressed at significantly different levels in normal and neoplastic cells were identified. These data provide insight into the extent of expression differences underlying malignancy and reveal genes that may prove useful as diagnostic or prognostic markers.
Article
Full-text available
The use of hybridisation of synthetic oligonucleotides to cDNAs under high stringency to characterise gene sequences has been demonstrated by a number of groups. We have used two cDNA libraries of 9 and 12 day mouse embryos (24 133 and 34 783 clones respectively) in a pilot study to characterise expressed genes by hybridisation with 110 hybridisation probes. We have identified 33 369 clusters of cDNA clones, that ranged in representation from 1 to 487 copies (0.7%). 737 were assigned to known rodent genes, and a further 13 845 showed significant homologies. A total of 404 clusters were identified as significantly differentially represented (P < 0.01) between the two cDNA libraries. This study demonstrates the utility of the fingerprinting approach for the generation of comparative gene expression profiles through the analysis of cDNAs derived from different biological materials.
Article
Full-text available
A new algorithm for the construction of physical maps from hybridization fingerprints of short oligonucleotide probes has been developed. Extensive simulations in high-noise scenarios show that the algorithm produces an essentially completely correct map in over 95% of trials. Tests for the influence of specific experimental parameters demonstrate that the algorithm is robust to both false positive and false negative experimental errors. The algorithm was also tested in simulations using real DNA sequences of C. elegans, E. coli, S. cerevisiae, and H. sapiens. To overcome the non-randomness of probe frequencies in these sequences, probes were preselected based on sequence statistics and a screening process of the hybridization data was developed. With these modifications, the algorithm produced very encouraging results.
Article
High density peptide and oligonucleotide chips are fabricated using semiconductor-based technologies. These chips have a variety of biological applications.
Chapter
The number of DNA clones to be manipulated and analysed in a variety of projects dealing with the analysis of complex genomes exceeds the potential of current methodology by orders of magnitude. By focussing first on the transcribed parts of the genome, i.e., by using cDNA or exon-trap libraries, it is possible to reduce the volume of the task considerably, presumably without sacrificing too much information. We present here an integrated series of mostly automated processes which together allow the isolation, amplification, arrayed spotting and analysis by oligonucleotide fingerprinting of > 100,000 cDNA clones in a few months with little operator involvement. The sequence information thus derived will be used to search databases for related sequences and to establish catalogues of expressed sequences. The technique is currently being applied to the analysis of cDNA derived from various human and mouse tissues and developmental stages. An expanded version of the process will allow us to analyse cDNA libraries from a range of representative human tissues, thereby giving us access to a significant fraction of the human genome.
Article
This chapter discusses the nature and purpose of clustering and classification. Classification is one of the fundamental processes in science: facts and phenomena must be ordered before we can understand them and develop unifying principles that explain their occurrence and apparent order. As classification is the ordering of objects by their similarities, it is important to recognize that classification transcends human intellectual endeavor and is a fundamental property of living organisms; the ability to perceive any two objects as more similar to each other than either is to a third is sure to have been present in the ancestors of the human species. Attempts to develop techniques for automatic classification therefore necessitate the quantification of similarity. In much classificatory work, however, it is impractical to obtain estimates of taxonomic similarity in an assemblage of objects from a sample of subjects.
Article
The output of a cluster analysis method is a collection of subsets of the object set termed clusters characterized in some manner by relative internal coherence and/or external isolation, along with a natural stratification of these identified clusters by levels of cohesive intensity. In formalizing a model of the cluster analysis methods, it is essential to consider the nature and inherent reliability of the proximity data that constitutes the input in substantive clustering applications. The proximity value scales are dichotomous. It is the practice of most authors of cluster methods to assume that the proximity values are available in the form of a real symmetric matrix, where any unjustified structure implicit in the real values is either to be ignored or axiomatically disallowed. The most desirable cluster analysis models for substantive applications should have the input proximity data expressible in a manner faithfully representing only the reliable information content of the empirically measured data.
Article
Utilizing the edge-connectivity of graphs, certain maximally connected subgraphs of a graph termed k-components and clusters are characterized and their interrelations are investigated. The cohesiveness function is described and shown to be a useful measure of the local intensity of connectivity within a graph. Sequences of cuts totally separating a graph, termed slicings, are shown to be intimately related to the k-components and clusters of a graph. An efficient algorithm is presented for determining k-components and clusters. Applications of these notions to graph coloring and numerical taxonomy are discussed.
Article
Given an undirected graph G = (V, E), it is known that its edge-connectivity λ(G) can be computed by solving O(|V|) max-flow problems. The best time bounds known for the problem are O(λ(G)|V|²), due to Matula (28th IEEE Symposium on the Foundations of Computer Science, 1987, pp. 249-251) if G is simple, and O(|E|^{3/2}|V|), due to Even and Tarjan (SIAM J. Comput., 4 (1975), pp. 507-518) if G is a multigraph. An O(|E| + min{λ(G)|V|², p|V| + |V|² log |V|}) time algorithm for computing the edge-connectivity λ(G) of a multigraph G = (V, E), where p (≤ |E|) is the number of pairs of nodes between which G has an edge, is proposed. This algorithm does not use any max-flow algorithm but consists only of |V| graph searches and edge contractions. The method is then extended to a capacitated network to compute its minimum cut capacity in O(|V||E| + |V|² log |V|) time.
Article
We describe the use of oligonucleotide fingerprinting for the generation of a normalized cDNA library from unfertilized sea urchin eggs and report the preliminary analysis of this library, which resulted in the establishment of a partial gene catalogue of the sea urchin egg. In an analysis of 21,925 cDNA clones by hybridization with 217 oligonucleotide probes, we were able to identify 6291 clusters corresponding to different transcripts, ranging in size from 1 to 265 clones. This corresponds to an average 3.5-fold normalization of the starting library. The normalized library represents about one-third of all genes expressed in the sea urchin egg. To generate sequence information for the transcripts represented by the clusters, representative clones selected from 711 clusters were sequenced. The construction and preliminary analysis of the normalized library are the first steps in the assembly of an increasingly complete collection of maternal genes expressed in the sea urchin egg, which will provide a number of insights into the early development of this well-characterized model organism.
Article
We present an algorithm for finding the minimum cut of an undirected edge-weighted graph. It is simple in every respect. It has a short and compact description, is easy to implement, and has a surprisingly simple proof of correctness. Its runtime matches that of the fastest algorithm known. The runtime analysis is straightforward. In contrast to nearly all approaches so far, the algorithm uses no flow techniques. Roughly speaking, the algorithm consists of about |V| nearly identical phases each of which is a maximum adjacency search.
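The phase structure described here translates almost directly into code; the following is our own compact Python rendering of the maximum-adjacency-search algorithm, a sketch rather than an optimized implementation.

```python
def stoer_wagner_min_cut(weights):
    """Stoer-Wagner minimum cut on an undirected weighted graph given as a
    dict-of-dicts {u: {v: w}}. Each phase runs a maximum adjacency search;
    the cut-of-the-phase isolates the last vertex added, and the two last
    vertices are then merged. Returns the weight of a global minimum cut."""
    w = {u: dict(nbrs) for u, nbrs in weights.items()}   # mutable copy
    best = float("inf")
    while len(w) > 1:
        # Maximum adjacency search from an arbitrary start vertex.
        start = next(iter(w))
        A = [start]
        conn = {v: w[start].get(v, 0) for v in w if v != start}
        while conn:
            z = max(conn, key=conn.get)          # most tightly connected
            A.append(z)
            cut_of_phase = conn.pop(z)
            for v in conn:
                conn[v] += w[z].get(v, 0)
        best = min(best, cut_of_phase)
        # Merge the last two vertices s, t of this phase.
        t, s = A[-1], A[-2]
        for v, wt in w[t].items():
            if v != s:
                w[s][v] = w[s].get(v, 0) + wt
                w[v][s] = w[s][v]
            w[v].pop(t, None)
        del w[t]
    return best

# Two triangles joined by a single unit-weight edge: minimum cut = 1.
g = {1: {2: 1, 3: 1}, 2: {1: 1, 3: 1}, 3: {1: 1, 2: 1, 4: 1},
     4: {3: 1, 5: 1, 6: 1}, 5: {4: 1, 6: 1}, 6: {4: 1, 5: 1}}
print(stoer_wagner_min_cut(g))  # 1
```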
Article
Here we describe progress on a series of molecular techniques designed to bridge the gap between genetic and molecular distances in mammals. This is an essential step in the molecular cloning of genes defined by mammalian mutations, and in the molecular analysis of large regions of mammalian genomes. We summarize approaches for the physical and molecular analysis of genetic distances and describe the experimental, statistical and computational basis of a new approach to create ordered libraries of overlapping clones from large genomes.
Article
A novel graph theoretic approach for data clustering is presented and its application to the image segmentation problem is demonstrated. The data to be clustered are represented by an undirected adjacency graph G with arc capacities assigned to reflect the similarity between the linked vertices. Clustering is achieved by removing arcs of G to form mutually exclusive subgraphs such that the largest inter-subgraph maximum flow is minimized. For graphs of moderate size (~2000 vertices), the optimal solution is obtained through partitioning a flow and cut equivalent tree of G, which can be efficiently constructed using the Gomory-Hu algorithm (1961). However, for larger graphs this approach is impractical. New theorems for subgraph condensation are derived and are then used to develop a fast algorithm which hierarchically constructs and partitions a partially equivalent tree of much reduced size. This algorithm results in an optimal solution equivalent to that obtained by partitioning the complete equivalent tree and is able to handle very large graphs with several hundred thousand vertices. The new clustering algorithm is applied to the image segmentation problem. The segmentation is achieved by effectively searching for closed contours of edge elements (equivalent to minimum cuts in G), which consist mostly of strong edges, while rejecting contours containing isolated strong edges. This method is able to accurately locate region boundaries and at the same time guarantees the formation of closed edge contours.
Article
An immediately applicable variant of the sequencing by hybridization (SBH) method is under development with the capacity to determine up to 100 million base pairs per year. The proposed method comprises six steps: (i) arraying genomic or cDNA M13 clones in 864-well plates (wells of 2 mm); (ii) preparation of DNA samples for spotting by growth of the M13 clones or by polymerase chain reaction (PCR) of the inserts using standard 96-well plates, or plates having as many as 864 correspondingly smaller wells; (iii) robotic spotting of 13,824 samples on an 8 x 12 cm nylon membrane, or correspondingly more, on up to 6 times larger filters, by offset printing with a 96 or 864 0.4 mm pin device; (iv) hybridization of dotted samples with 200-2000 32P-labeled probes comprising 16-256 10-mers having a common 8-mer, 7-mer, or 6-mer in the middle (20 probes per day, each hybridized with 250,000 dots); (v) scoring hybridization signals of 5 million sample-probe pairs per day using storage phosphor plates; and (vi) computing clone order and partial-to-complete DNA sequences using various heuristic algorithms. Genome sequencing based on a combination of this method and gel sequencing techniques may be significantly more economical than gel methods alone.
Article
DNA sequencing by hybridization (SBH) Format 1 technique is based on experiments in which thousands of short oligomers are consecutively hybridized with dense arrays of clones. In this paper we present the description of a method for obtaining hybridization signatures for individual clones that guarantees reproducibility despite a wide range of variations in experimental circumstances, a sensitive method for signature comparison at prespecified significance levels, and a clustering algorithm that correctly identifies clusters of significantly similar signatures. The methods and the algorithm have been verified experimentally on a control set of 422 signatures that originate from 9 distinct clones of known sequence. Experiments indicate that only 30 to 50 oligomer probes suffice for correct clustering. This information about the identity of clones can be used to guide both genomic and cDNA sequencing by SBH or by standard gel-based methods.
Article
Efficient procedures for managing a large number of M13 or plasmid clones have been developed. In addition to picking, clones are directly arrayed in multiwell plates by dispensing diluted transformation mixtures. Metal pin arrays are used for fast inoculations of preparative plates filled by medium or by PCR mixture. Growth of M13 clones in multiwell plates is optimized to obtain a consistently high yield, and a PCR protocol is defined for reliable amplification of several thousand M13 or plasmid inserts per day in BioOvens. Over 80,000 cDNA inserts have been amplified. The phages or amplified inserts are spotted on nylon filters using an array of pins having a flat bottom, 0.3 mm in diameter. The procedures are suitable for an automated processing of hundreds of thousands of short clones from representative cDNA and genomic libraries. Hybridization of arrayed clones with oligonucleotide and complex probes can simplify the search for new genes and accelerate large-scale sequencing.
Article
Diverse biochemical and computational procedures and facilities have been developed to hybridize thousands of DNA clones with short oligonucleotide probes and subsequently to extract valuable genetic information. This technology has been applied to 73,536 cDNA clones from infant brain libraries. By a mutual comparison of 57,419 samples that were successfully scored by 200-320 probes, 19,726 genes have been identified and sorted by their expression levels. The data indicate that an additional 20,000 or more genes may be expressed in the infant brain. Representative clones of the found genes create a valuable resource for complete sequencing and functional studies of many novel genes. These results demonstrate the unique capacity of hybridization technology to identify weakly transcribed genes and to study gene networks involved in organismal development, aging, or tumorigenesis by monitoring the expression of every gene in related tissues, whether known or still undiscovered.
Article
In this paper, a number of existing and novel techniques are considered for ordering cloned extracts from the genome of an organism based on fingerprinting data. A metric is defined for comparing the quality of the clone order for each technique. Simulated annealing is used in combination with several different objective functions. Empirical results with many simulated data sets for which the correct solution is known indicate that a simple greedy algorithm with some subsequent stochastic shuffling provides the best solution. Other techniques that attempt to weight comparisons between nonadjacent clones bias the ordering and give worse results. We show that this finding is not surprising since without detailed attempts to reconcile the data into a detailed map, only approximate maps can be obtained. Making N² pieces of data from measurements of N clones cannot improve the situation.
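A minimal sketch of the kind of greedy ordering the paper finds competitive, assuming binary fingerprints and adjacency scored by probe overlap; the subsequent stochastic shuffling step is omitted, and all names here are our own.

```python
import numpy as np

def greedy_clone_order(F):
    """Greedy ordering heuristic: start from an arbitrary clone and
    repeatedly append the unplaced clone whose binary fingerprint overlaps
    most with the last placed one. F: (n_clones, n_probes) 0/1 matrix."""
    n = F.shape[0]
    order = [0]
    unplaced = set(range(1, n))
    while unplaced:
        last = F[order[-1]]
        nxt = max(unplaced, key=lambda j: int((F[j] & last).sum()))
        order.append(nxt)
        unplaced.remove(nxt)
    return order

# Clones covering a sliding window of probes come out nearly in true order.
F = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 1]], dtype=int)
print(greedy_clone_order(F))  # [0, 1, 2, 3]
```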
Conference Paper
We describe an algorithm that determines the edge connectivity of an n-vertex m-edge graph G in O(nm) time. A refinement shows that the question as to whether a graph is k-edge connected can be determined in O(kn²) time. For dense graphs characterized by m = Ω(n²), the latter result implies that determination of whether a graph is k-edge connected for any fixed k can be accomplished in time linear in input size.
Article
Our purpose is to reconstruct the relative placement of the clones along the DNA molecule; this information is lost in the construction of the clone library. In this article we restrict ourselves to the noiseless case, in which the data are free of experimental error. We also restrict ourselves to hybridization data. However, many of our algorithmic techniques can be extended to incorporate noise and restriction fragment data. Our experiments seem to hint at the following phenomenon. For small values of m, there is simply not enough information in the data to reveal the placement of clones, and therefore all algorithms are expected to perform poorly. For very large m, the argument of section 6 shows that even the most simple-minded method, such as the greedy algorithm, is likely to find the true permutation. However, there seems to be a range of values for m, depending on the underlying placement, where the result of the c⁺ + c⁻ objective is superior to those of the greedy or traveling salesman heuristics.
Article
We have developed a novel algorithm for cluster analysis that is based on graph theoretic techniques. A similarity graph is defined and clusters in that graph correspond to highly connected subgraphs. A polynomial algorithm to compute them efficiently is presented. Our algorithm produces a clustering with some provably good properties. The application that motivated this study was gene expression analysis, where a collection of cDNAs must be clustered based on their oligonucleotide fingerprints. The algorithm has been tested intensively on simulated libraries and was shown to outperform extant methods. It demonstrated robustness to high noise levels. In a blind test on real cDNA fingerprint data the algorithm obtained very good results. Utilizing the results of the algorithm would have saved over 70% of the cDNA sequencing cost on that data set.
Stoer, M., and Wagner, F. (1997). A simple min-cut algorithm. J. ACM 44(4): 585-591.
Drmanac, R., Drmanac, S., Labat, I., Crkvenjakov, R., Vicentic, A., and Gemmell, A. (1992). Sequencing by hybridization: Towards an automated sequencing of one million M13 clones arrayed on membranes. Electrophoresis 13: 566-573.
Wu, Z., and Leahy, R. (1993). An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Trans. Pattern Anal. Machine Intelligence 15(11): 1101-1113.