Article

An Algorithm for Clustering cDNA Fingerprints

Author affiliations: Tel Aviv University and Bar Ilan University

Abstract

Clustering large data sets is a central challenge in gene expression analysis. The hybridization of synthetic oligonucleotides to arrayed cDNAs yields a fingerprint for each cDNA clone. Cluster analysis of these fingerprints can identify clones corresponding to the same gene. We have developed a novel algorithm for cluster analysis that is based on graph-theoretic techniques. Unlike other methods, it does not assume that the clusters are hierarchically structured and does not require prior knowledge of the number of clusters. In tests with simulated libraries the algorithm outperformed the Greedy method and demonstrated high speed and robustness to high error rates. Good solution quality was also obtained in a blind test on real cDNA fingerprints.
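The paper itself is not available here, but the pipeline the abstract describes starts from a clone-by-probe fingerprint matrix and a similarity graph over the clones. A minimal Python sketch under illustrative assumptions (binary fingerprints and a plain Hamming-distance threshold; the paper's actual similarity criterion may differ):

```python
import itertools

import networkx as nx
import numpy as np

def fingerprint_graph(fingerprints, max_hamming=3):
    """Build a clone similarity graph: one vertex per cDNA clone, an edge
    whenever two binary fingerprints differ in at most `max_hamming`
    probe positions (an illustrative similarity criterion)."""
    G = nx.Graph()
    G.add_nodes_from(range(len(fingerprints)))
    for i, j in itertools.combinations(range(len(fingerprints)), 2):
        if np.sum(fingerprints[i] != fingerprints[j]) <= max_hamming:
            G.add_edge(i, j)
    return G

# Toy library: 6 clones over 12 probes, two planted genes with 10% noise.
rng = np.random.default_rng(0)
gene_a, gene_b = rng.integers(0, 2, 12), rng.integers(0, 2, 12)
clones = [gene_a ^ (rng.random(12) < 0.1) for _ in range(3)] + \
         [gene_b ^ (rng.random(12) < 0.1) for _ in range(3)]
G = fingerprint_graph(clones)
print(sorted(map(sorted, nx.connected_components(G))))
```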

... Finding structure in networks, known as community detection, or clustering, is an important problem with a wide range of biomedical applications such as systems biology [1], population structure studies [2] and health information systems [3]. In the past decade, computational geneticists have found a new application for clustering algorithms in the context of Identity-By-Descent (IBD) mapping [4]. ...
... We next used the PAGE study dataset to compare the algorithms on real data. First, we ran iLASH over the chromosome 1 genotype data to estimate IBD [1]. While false-negative and false-positive edges occur in local IBD graphs due to a variety of phenomena (minimum length of IBD, genotyping errors, phasing errors), our previous analysis suggests iLASH introduces negligible rates of false positives and false negatives [8], preventing high false-positive/false-negative rates in local IBD graphs. ...
... We then calculated the feature-based metric scores of the results. For each local IBD graph, we also generated and ... ([1] We chose chromosome 1 since it was the largest chromosome without any regions of low complexity in the PAGE dataset.) ...
Preprint
Full-text available
Background Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks via a process called IBD mapping. Clustering algorithms play an important role in finding these groups. We set out to analyze the fitness of commonly used, fast and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and utilized it to compare clustering algorithms in terms of statistical power. We also investigated the effectiveness of common clustering metrics as replacements for statistical power. Results We simulated 3.4 million clusters across 850 experiments with varying cluster counts and false-positive and false-negative rates. The Infomap and Markov Clustering (MCL) community detection methods have high statistical power in most of the graphs, compared to greedy methods such as Louvain and Leiden. We demonstrate that standard clustering metrics, such as modularity, cannot predict the statistical power of algorithms in IBD mapping applications, though they can help with simulating realistic benchmarks. We extend our findings to real datasets by analyzing 3 populations in the Population Architecture using Genomics and Epidemiology (PAGE) Study, with 51,000 members and 2 million shared segments on Chromosome 1, resulting in the extraction of 39 million local IBD clusters across the three populations. We used cluster properties derived in PAGE to increase the accuracy of our simulations and comparison. Conclusions Markov Clustering produces a 30% increase in statistical power compared to the current state-of-the-art approach, while reducing runtime by 3 orders of magnitude, making it computationally tractable in modern large-scale genetic datasets. We provide an efficient implementation to enable clustering at scale for IBD mapping and population-based linkage for various populations and scenarios.
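Markov Clustering, the strongest performer reported here, is compact enough to sketch. Below is a minimal dense-matrix version with commonly used default expansion and inflation parameters; these defaults are my assumption, not necessarily the settings used in the study:

```python
import numpy as np

def mcl(adj, expansion=2, inflation=2.0, iters=50, tol=1e-6):
    """Minimal dense Markov Clustering sketch. Columns are normalized to
    transition probabilities; expansion (matrix power) spreads flow and
    inflation (elementwise power) sharpens it until the flow stabilizes."""
    M = adj.astype(float) + np.eye(len(adj))   # self-loops stabilize MCL
    M /= M.sum(axis=0, keepdims=True)          # column-stochastic
    for _ in range(iters):
        last = M.copy()
        M = np.linalg.matrix_power(M, expansion)
        M = M ** inflation
        M /= M.sum(axis=0, keepdims=True)
        if np.abs(M - last).max() < tol:
            break
    # Nonzero rows of the converged matrix act as attractors; each such
    # row's support is one cluster.
    clusters = {tuple(np.flatnonzero(row > 1e-9)) for row in M if row.max() > 1e-9}
    return [set(map(int, c)) for c in clusters]

# Two triangles joined by a single edge should split into two clusters.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
print(mcl(A))
```

Inflation is the granularity knob: larger values fragment the graph into smaller, tighter clusters.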
... The goal is to minimize δ(G, H) over all possible cluster graphs. This method has been successfully used to cluster cDNA fingerprints [1], to find complexes in protein-protein interaction (PPI) data [2], to group protein sequences hierarchically into superfamily and family clusters [3], and to find families of regulatory RNA structures [4]. The problem also has motivations in computational biology [5], as it arises in the analysis of similarity between gene expression data [6]. ...
... A cluster deletion problem instance is given by a simple graph G and asks for a cluster graph. The distance δ(G, H) between two graphs G and H defined on the same vertices is given by (1). ...
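Equation (1) is not reproduced in this excerpt, but for two graphs on the same vertex set, δ(G, H) is standardly the number of edge edits separating them, i.e. the size of the symmetric difference of their edge sets. A sketch under that assumption:

```python
import networkx as nx

def delta(G, H):
    """delta(G, H) for graphs on the same vertex set: the number of edge
    additions plus deletions needed to turn G into H, i.e. the size of
    the symmetric difference of their edge sets."""
    eg = {frozenset(e) for e in G.edges()}
    eh = {frozenset(e) for e in H.edges()}
    return len(eg ^ eh)

# Path a-b-c versus the triangle on {a, b, c}: one edge edit separates them.
G = nx.path_graph(["a", "b", "c"])
H = nx.relabel_nodes(nx.complete_graph(3), {0: "a", 1: "b", 2: "c"})
print(delta(G, H))  # 1
```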
Article
Full-text available
We consider the following vertex-partition problem on graphs: given a simple graph G = (V, E), we want to partition G into a disjoint union of cliques using only edge deletions, removing as few edges as possible. This NP-hard optimization problem is referred to as Cluster Deletion (CD). In this paper, we propose an encoding of CD as a Weighted Constraint Satisfaction Problem (WCSP), a framework which has been widely used in solving hard combinatorial problems. We compare our approach with a fixed-parameter algorithm, one of the most used algorithms for solving Cluster Deletion, and show experimentally that significant results are obtained using the WCSP encoding. We compare both solution quality and running times of these algorithms on random graphs and on protein similarity graphs derived from the COG dataset.
... Two algorithms that have appeared recently in the literature using graph-theoretic ideas for one-way clustering of microarray data are HCS [24] and CAST [6]. In [24], a polynomial-time algorithm, HCS (Highly Connected Subgraphs), is presented for cluster analysis using graph-theoretic techniques. The time complexity of the algorithm is not given, but is estimated to be at least O(ne), where n is the number of vertices and e the number of edges. ...
Article
Full-text available
In this paper, a novel, fast, noise-tolerant, graph-drawing-based clustering and visualization algorithm is proposed for one-way clustering of gene expression data. The proposed divide-and-conquer algorithm works by repeatedly bisecting the similarity matrix along the x-axis and y-axis, ignoring the noise space while running crossing-minimization heuristics on the remaining parts of the matrix (the data space), and thus achieves provably good clustering under very noisy conditions. The proposed Robust Clustering (RC) algorithm does not require thresholds or a priori knowledge to achieve cluster extraction. Comparing the proposed RC algorithm with contemporary techniques using simulated data under different noise conditions shows its high immunity to noise, and comparison using real gene expression data shows promising results.
... Through the clustering of gene expression data, we can obtain a DNA clone classification. At present, the primary algorithms for clustering gene expression data include hierarchical methods [1,2,3,4], K-means [5], greedy methods [6,7], graph partitioning [8,9,10,11], probabilistic methods [8,9,10,11] and self-organizing maps [12,13]. In recent years, Figueroa et al. transformed gene expression data into a 0-1-N vector set called the fingerprint vector set, turning the clustering of gene expression data into clustering of fingerprint vectors. ...
Conference Paper
Clustering of binary fingerprints is used in the classification of gene expression data. It is known that clustering binary fingerprints with 3 bits of missing values is NP-hard. The greedy clique partition (GCP, for short) algorithm is a heuristic used for clustering binary fingerprints with missing values. In this paper, we first study the features of instances that cannot be resolved by the hash-table-based GCP. Then a new property of problem instances is given, which can further improve the linked-list-based heuristic algorithm. Finally, an empirical formula is presented for judging the accuracy and credibility of the GCP algorithm.
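The greedy clique partition idea is simple to sketch: grow a maximal clique around a seed vertex, remove it, and repeat. A minimal version follows (missing-value handling and the hash-table/linked-list refinements studied in the paper are omitted); the toy graph also illustrates the kind of instance on which the greedy choice goes wrong:

```python
import networkx as nx

def greedy_clique_partition(G):
    """GCP sketch: repeatedly grow a maximal clique around the densest
    remaining vertex, record it, and delete it from the graph."""
    H = G.copy()
    cliques = []
    while H.number_of_nodes() > 0:
        v = max(H.nodes, key=H.degree)  # seed with the highest-degree vertex
        clique = {v}
        for u in sorted(H.neighbors(v), key=H.degree, reverse=True):
            if all(H.has_edge(u, w) for w in clique):
                clique.add(u)
        cliques.append(clique)
        H.remove_nodes_from(clique)
    return cliques

# Greedy behaviour: the high-degree bridge endpoints 2 and 3 are grabbed
# first, so this returns {2, 3}, {0, 1}, {4, 5} rather than the two triangles.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
print(greedy_clique_partition(G))
```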
... Finding structure in networks, known as community detection, or clustering, has a wide range of biomedical applications. [1][2][3] Recently, clustering algorithms have been applied in the context of Identity-By-Descent (IBD) mapping 4,5 as an alternative approach for rare variant association testing that leverages genotype data in the absence of directly observed variation for genomic discovery. This method relies on shared haplotypes along the genome co-inherited identically from a recent common ancestor and utilizes them as the basis for association testing, under the assumption that the haplotypes may co-harbour recently arisen rare variation not directly captured on genotyping arrays. ...
Article
Full-text available
Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks using IBD mapping. Clustering algorithms play an important role in finding these groups accurately and at scale. We set out to analyze the fitness of commonly used, fast and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and utilized it to compare the statistical power of clustering algorithms by simulating 2.3 million clusters across 850 experiments. We found the Infomap and Markov Clustering (MCL) community detection methods to have high statistical power in most of the scenarios. They yield a 30% increase in power compared to the current state-of-the-art approach, with a three-orders-of-magnitude lower runtime. We also found that standard clustering metrics, such as modularity, cannot predict the statistical power of algorithms in IBD mapping applications. We extend our findings to real datasets by analyzing the Population Architecture using Genomics and Epidemiology (PAGE) Study dataset, with 51,000 samples and 2 million shared segments on Chromosome 1, resulting in the extraction of 39 million local IBD clusters. We demonstrate the power of our approach by recovering signals of rare genetic variation in the whole-exome sequence data of 200,000 individuals in the UK Biobank. We provide an efficient implementation to enable clustering at scale for IBD mapping for various populations and scenarios. Supplementary information: The code, along with supplementary methods and figures, is available at https://github.com/roohy/localIBDClustering.
... Clustering can be used to create categories of datasets on the one hand, and categories of users and applications on the other. The clustering of datasets can be performed by applying the highly connected subgraph algorithm [71] to the graph data. Datasets will be found in the same cluster if they are highly connected in the graph data, which would mean that the datasets within one cluster share relevant variables and methodological features. ...
Article
Full-text available
The exploitation of potential societal benefits of Earth observations is hampered by users having to engage in often tedious processes to discover data and extract information and knowledge. A concept is introduced for a transition from the current perception of data as passive objects (DPO) to a new perception of data as active subjects (DAS). This transition would greatly increase data usage and exploitation, and support the extraction of knowledge from data products. Enabling the data subjects to actively reach out to potential users would revolutionize data dissemination and sharing and facilitate collaboration in user communities. The three core elements of the transformative DAS concept are: (1) "intelligent semantic data agents" (ISDAs) that have the capabilities to communicate with their human and digital environment. Each ISDA provides a voice to the data product it represents. It has comprehensive knowledge of the represented product including quality, uncertainties, access conditions, previous uses, user feedback, etc., and it can engage in transactions with users. (2) A knowledge base that constructs extensive graphs presenting a comprehensive picture of communities of people, applications, models, tools, and resources, and provides tools for the analysis of these graphs. (3) An interaction platform that links the ISDAs to the human environment and facilitates transactions, including discovery of products, access to products and derived knowledge, modification and use of products, and the exchange of feedback on usage. This platform documents the transactions in a secure way, maintaining full provenance.
... Clustering large data sets plays an important role in gene expression analysis. In [12], cluster analysis of cDNA fingerprints is used to identify clones corresponding to the same gene. In [13], many near-optimal clusterings are used to explore the dynamics of network clusterings. ...
Article
In a network, a $k$-plex represents a subset of $n$ vertices where the degree of each vertex in the subnetwork induced by this subset is at least $n-k$. The maximum edge-weight $k$-plex partitioning problem (Max-EkPP) is to find the $k$-plex partitioning of an edge-weighted network such that the sum of edge weights is maximal. The Max-EkPP has an important role in discovering new information in large biological networks. We propose a variable neighborhood search (VNS) algorithm for solving Max-EkPP. The VNS implements a local search based on the 1-swap first-improvement strategy and an objective function that takes into account the degree of every vertex in each partition. The objective function favors feasible solutions and enables a gradual increase of the objective value when moving from slightly infeasible to barely feasible solutions. Experimental computation is performed on real metabolic networks and other benchmark instances from the literature. Compared to the previously proposed integer linear programming (ILP) approach, the VNS succeeds in finding all known optimal solutions. For all other instances, the VNS either reaches the previous best known solution or improves it. The proposed VNS is also tested on a large-scale dataset not considered up to now.
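The k-plex condition is easy to check directly, and doing so clarifies how it relaxes a clique (a clique is exactly a 1-plex). A small sketch:

```python
import networkx as nx

def is_k_plex(G, nodes, k):
    """Check the k-plex condition: in the subgraph induced by `nodes`
    (n of them), every vertex has degree at least n - k."""
    sub = G.subgraph(nodes)
    n = sub.number_of_nodes()
    return all(d >= n - k for _, d in sub.degree())

# A 4-cycle is a 2-plex (each vertex misses one other) but not a 1-plex.
C4 = nx.cycle_graph(4)
print(is_k_plex(C4, range(4), 1), is_k_plex(C4, range(4), 2))  # False True
```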
... As the precise gene sequence of the genome of every human is exclusive to them, we will all have inimitable disease liabilities and treatment responses. Personalized medicine defines the utilization of our genetic information to tailor health care interventions to an individual's requirements [48]. [Table excerpt, clustering methods with example applications: (preceding row truncated; refs [22,70,242]); Hierarchical method: improved OTU-picking using long-read 16S rRNA gene, discovery of possible gene relationships [33,71,170]; Self-organizing map: single-cell transcriptome analysis revealing dynamic changes in lncRNA expression, structuring microbial metabolic responses to multiplexed stimuli, discovering genetic ancestry [83,115,130]; Graph theory approach: threshold selection in gene co-expression networks, clustering cDNA fingerprints [91,180]; Coupled two-way: translating biosynthetic gene clusters into fungal armor and weaponry, two-way learning with one-way supervision for gene expression data [45,112,245]; Plaid model: analyzing gene data [18,76,124]; Model-based: data transformations for gene expression data [154,157,254].] ...
Article
Full-text available
The ever-improving quality and decreasing cost of sequencing a complete human genome have led to rapid adoption of genetic and genomic information at both research institutions and clinics. Biologists are taking the first steps toward identifying the locations and functions of all the genes and regulatory sites in the genomes of various organisms. As these researchers determine the nucleotide sequence of large stretches of the human genome, they are producing huge volumes of sequence data. Direct laboratory investigation of this data is expensive and difficult, making computational techniques vital. The field of pattern analysis, which aims to build computer algorithms that improve with experience, has the potential to enable computers to assist humans in the analysis of large, complex genetic and genomic data sets. Here, an overview of pattern analysis techniques for the study of genome sequencing datasets, as well as proteomic, epigenetic and metabolomic data, is provided. These techniques involve data pre-processing, feature extraction and selection, classification and clustering. The aim of this survey is to present considerations and recurring challenges in the application of pattern analysis methods, as well as of discriminative and generative modeling approaches, and to discuss future research directions of these methods for the analysis of genomic and genetic data sets.
Preprint
In a network, a $k$-plex represents a subset of $n$ vertices where the degree of each vertex in the subnetwork induced by this subset is at least $n-k$. The maximum edge-weight $k$-plex partitioning problem (Max-EkPP) is to find the $k$-plex partitioning of an edge-weighted network such that the sum of edge weights is maximal. The Max-EkPP has an important role in discovering new information in large sparse biological networks. We propose a variable neighborhood search (VNS) algorithm for solving Max-EkPP. The VNS implements a local search based on the 1-swap first-improvement strategy and an objective function that takes into account the degree of every vertex in each partition. The objective function favors feasible solutions, also enabling a gradual increase in objective value when moving from slightly infeasible to barely feasible solutions. A comprehensive experimental computation is performed on real metabolic networks and other benchmark instances from the literature. Compared to the integer linear programming method from the literature, our approach succeeds in finding all known optimal solutions. For all other instances, the VNS either reaches the previous best known solution or improves it. The proposed VNS is also tested on a large-scale dataset not previously considered in the literature.
... One reason for this choice is the great success of applications of the Highly Connected Subgraphs (HCS) clustering algorithm proposed by Hartuv and Shamir; the second reason is the lack of research on this model compared with the standard clique model. The HCS algorithm has been used [11] to cluster cDNA fingerprints [8], to find complexes in protein-protein interaction data [10], to group protein sequences hierarchically into superfamily and family clusters [13], and to find families of regulatory RNA structures [15]. ...
Article
Full-text available
Clustering is a well-known and important problem with numerous applications. The graph-based model is one of the typical cluster models. In the graph model, clusters are generally defined as cliques. However, such an approach might be too restrictive, as in some applications not all objects from the same cluster must be connected. That is why different types of clique relaxations are often considered as clusters. In our work, we consider the problem of partitioning a graph into clusters and the problem of isolating a cluster of a special type, where by a cluster we mean a highly connected subgraph. Such a clustering was first proposed by Hartuv and Shamir, and their HCS clustering algorithm has been extensively applied in practice. It was used to cluster cDNA fingerprints, to find complexes in protein-protein interaction data, to group protein sequences hierarchically into superfamily and family clusters, and to find families of regulatory RNA structures. The HCS algorithm partitions a graph into highly connected subgraphs. However, this is achieved by deleting a number of edges that is not necessarily minimum. In our work, we try to minimize the number of edge deletions. We consider these problems from the parameterized point of view, where the main parameter is the number of allowed edge deletions.
... The HCS algorithm recursively finds minimum graph cuts, leading to a partition of the graph into highly connected components or subgraphs. HCS has been applied to gene expression analysis [13]. RNSC is a partition-based algorithm that starts with a random cluster assignment and proceeds by reassigning nodes to clusters. ...
Article
Full-text available
How can complex relationships among molecular or clinico-pathological entities of neurological disorders be represented and analyzed? Graphs seem to be the current answer to the question, no matter the type of information: molecular data, brain images or neural signals. We review a wide spectrum of graph representation and graph analysis methods and their applications in the study of both the genomic and the phenotypic level of neurological disorders. We find numerous research works that create, process and analyze graphs formed from one or a few data types to gain an understanding of specific aspects of neurological disorders. Furthermore, with the increasing amount of data of various types becoming available for neurological disorders, we find that integrative analysis approaches that combine several types of data are being recognized as a way to gain a global understanding of the diseases. Although there are still not many integrative analyses of graphs, due to the complexity of the analysis, multi-layer graph analysis is a promising framework that can incorporate various data types. We describe and discuss the benefits of the multi-layer graph framework for studies of neurological disease.
... Each non-CH node determines its cluster by choosing the CH that can be reached using the least communication energy. The role of cluster head is rotated periodically among the nodes of the cluster in order to balance the load [6]. During the set-up phase, each sensor node chooses a random number between 0 and 1. ...
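The excerpt stops just before the comparison step: in LEACH the drawn number is checked against a rotating threshold, commonly written T = p / (1 - p * (r mod 1/p)), so that over one cycle every node takes a turn as cluster head. A sketch of that election rule (the exact variant used by the protocols in the following paper may differ):

```python
import random

def leach_elects_ch(p, round_no, was_ch_recently):
    """LEACH set-up phase sketch: a node becomes cluster head when its
    uniform draw falls below T = p / (1 - p * (round mod 1/p)); nodes
    that served as CH in the last 1/p rounds sit out, rotating the role."""
    if was_ch_recently:
        return False
    threshold = p / (1 - p * (round_no % round(1 / p)))
    return random.random() < threshold

# With p = 0.1, round 3: threshold ~0.143, rising to 1.0 by round 9, so
# each eligible node serves once per 10-round cycle.
random.seed(1)
print(sum(leach_elects_ch(0.1, 3, False) for _ in range(1000)))
```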
Article
Full-text available
Wireless sensor networks (WSN) are made up of sensor nodes which are usually battery-operated devices, and hence energy saving at the sensor nodes is a major design issue. To prolong the network's lifetime, minimization of energy consumption should be implemented at all layers of the network protocol stack, starting from the physical up to the application layer, including cross-layer optimization. Optimizing energy consumption is the main concern when designing and planning the operation of a WSN. Clustering is one of the techniques utilized to extend the lifetime of the network by applying data aggregation and balancing energy consumption among the sensor nodes of the network. This paper proposes new versions of the Low Energy Adaptive Clustering Hierarchy (LEACH) protocol, called Advanced Optimized Low Energy Adaptive Clustering Hierarchy (AOLEACH), Optimal Deterministic Low Energy Adaptive Clustering Hierarchy (ODLEACH), and Varying Probability Distance Low Energy Adaptive Clustering Hierarchy (VPDL), in combination with the Shuffled Frog Leaping Algorithm (SFLA), that enable selecting the best optimal adaptive cluster heads using an improved threshold energy distribution compared to LEACH, and rotating the cluster head position for uniform energy dissipation based on energy levels. The proposed algorithms optimize the lifetime of the network by increasing the first node death (FND) time and the number of alive nodes, thereby increasing the lifetime of the network.
... Existing methods include spectral clustering [24], edge-based agglomerative or divisive methods [25], multi-level graph partitioning [26], algorithms based on min-cut [27], Markov clustering [22,28], and many more [29][30][31][32][33]. The problem is also similar to that of identifying high-density subgraphs [34] and can be solved (with minor modifications) using the algorithm by Hartuv and Shamir [35,36] or the one by Hüffner et al. [34]. All the above methods have their strengths and weaknesses, but the Markov clustering approach was chosen for our work because of its previous success with biological data sets [37]. ...
Article
Full-text available
It is well understood that distinct communities of bacteria are present at different sites of the body, and that changes in the structure of these communities have strong implications for human health. Yet, challenges remain in understanding the complex interconnections between the bacterial taxa within these microbial communities and how they change during the progression of diseases. Many recent studies attempt to analyze the human microbiome using traditional ecological measures and cataloging differences in bacterial community membership. In this paper, we show how to push metagenomic analyses beyond mundane questions related to the bacterial taxonomic profiles that differentiate one sample from another. We develop tools and techniques that help us to investigate the nature of social interactions in microbial communities, and demonstrate ways of compactly capturing extensive information about these networks and visually conveying them in an effective manner. We define the concept of bacterial "social clubs", which are groups of taxa that tend to appear together in many samples. More importantly, we define the concept of "rival clubs", entire groups that tend to avoid occurring together in many samples. We show how to efficiently compute social clubs and rival clubs and demonstrate their utility with the help of examples including a smokers' dataset and a dataset from the Human Microbiome Project (HMP). The tools developed provide a framework for analyzing relationships between bacterial taxa modeled as bacterial co-occurrence networks. The computational techniques also provide a framework for identifying clubs and rival clubs and for studying differences in the microbiomes (and their interactions) of two or more collections of samples. Microbial relationships are similar to those found in social networks. In this work, we assume that strong (positive or negative) tendencies to co-occur or co-infect are likely to have biological, physiological, or ecological significance, possibly as a result of cooperation or competition. As a consequence of the analysis, a variety of biological interpretations are conjectured. In the human microbiome context, the pattern of strength of interactions between bacterial taxa is unique to body site.
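At its simplest, the "social club" notion reduces to a co-occurrence graph over taxa. A hypothetical sketch assuming a 0/1 samples-by-taxa presence matrix and a plain co-occurrence-fraction threshold; the paper's actual club and rival-club criteria are more refined, and rival clubs would use anti-co-occurrence instead:

```python
import networkx as nx
import numpy as np

def social_clubs(presence, min_frac=0.5):
    """Sketch: connect two taxa if both are present in at least `min_frac`
    of the samples, then read off connected components as candidate
    'clubs'. `presence` is a samples x taxa 0/1 matrix."""
    n_samples, n_taxa = presence.shape
    co = (presence.T @ presence) / n_samples  # fraction of samples with both
    G = nx.Graph()
    G.add_nodes_from(range(n_taxa))
    for i in range(n_taxa):
        for j in range(i + 1, n_taxa):
            if co[i, j] >= min_frac:
                G.add_edge(i, j)
    return list(nx.connected_components(G))

# Taxa 0 and 1 co-occur in most samples; taxon 2 mostly avoids them.
presence = np.array([[1, 1, 0], [1, 1, 0], [1, 1, 1], [0, 0, 1]])
print(social_clubs(presence))  # [{0, 1}, {2}]
```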
... Such studies shed light on obtaining biomarkers for classifying cancers. Clustering analysis [4][5][6][7][8] is prevalent in the analysis of microarray data. Some studies of clustering analysis have focused on biclustering of gene expression data [9][10][11]. ...
Article
Full-text available
We present a new genetic filter to identify a predictive gene subset for cancer-type classification on gene expression profiles. This approach seeks not only to maximize the correlation between selected genes and cancer types but also to minimize the inter-correlation among the selected genes. The proposed genetic filter was tested on well-known leukemia datasets, and significant improvement over previous work was obtained.
... Later, a 4-approximation algorithm was developed for a very simple and intuitive formulation called MinDisAgree (Charikar et al. 2005). This problem has drawn much attention due to applications in image processing and computational biology (Ben-Dor et al. 1999; Hartuv et al. 2000; Jain et al. 1999; Milosavljevic et al. 1995; Sharan et al. 2003; Tatusov et al. 2003; Wittkop et al. 2007), among other areas. Motivated by the large interest in the CEP, as well as its wide applicability, we propose in this work new theoretical and algorithmic developments for the unweighted variant of the problem, where there are no weights associated with edges. ...
Article
The cluster editing problem consists of transforming an input graph \(G\) into a cluster graph (a disjoint union of complete graphs) by performing a minimum number of edge editing operations. Each edge editing operation consists of either adding a new edge or removing an existing edge. In this paper we propose new theoretical results on data reduction and instance generation for the cluster editing problem, as well as two algorithms based on coupling an exact method to, respectively, a GRASP or ILS heuristic. Experimental results show that the proposed algorithms are able to find high-quality solutions in practical runtime.
... To cluster miRNAs, we calculated as a distance measure the weighted pairwise correlation of expression between miRNAs (using mean weight of each miRNA pair). We used a graph-based density clustering method termed the highly connected subgraph (HCS) method (Hartuv et al. 2000), which is optimized for homogeneous clusters within a larger heterogeneous background, with a similarity measure consisting of a weighted Pearson correlation coefficient. HCS is parameter free, except for a robust threshold (set to 0.8) to define the false-discovery rate (FDR) of the final cluster set, enabling the detection of definite clusters from the noisy background of cell lines. ...
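The similarity measure named here, a weighted Pearson correlation, is short enough to write out. A sketch with one weight per sample; how the study derived its weights is not stated in this excerpt:

```python
import numpy as np

def weighted_pearson(x, y, w):
    """Weighted Pearson correlation between two expression profiles,
    with one weight per sample (here: per cell line)."""
    w = np.asarray(w, float) / np.sum(w)
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8])
print(weighted_pearson(x, y, np.ones(4)))  # uniform weights: plain Pearson
```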
Article
Full-text available
We expanded the knowledge base for Drosophila cell line transcriptomes by deeply sequencing their small RNAs. In total, we analyzed more than 1 billion raw reads from 53 libraries across 25 cell lines. We verify reproducibility of biological replicate data sets, determine common and distinct aspects of miRNA expression across cell lines, and infer the global impact of miRNAs on cell line transcriptomes. We next characterize their commonalities and differences in endo-siRNA populations. Interestingly, most cell lines exhibit enhanced TE-siRNA production relative to tissues, suggesting this as a common aspect of cell immortalization. We also broadly extend annotations of cis-NAT-siRNA loci, identifying ones with common expression across diverse cells and tissues, as well as cell-restricted loci. Finally, we characterize small RNAs in a set of ovary-derived cell lines, including somatic cells (OSS and OSC) and a mixed germline/somatic cell population (fGS/OSS) that exhibits ping-pong piRNA signatures. Collectively, the ovary data reveal new genic piRNA loci, including unusual configurations of piRNA-generating regions. Together with the companion analysis of mRNAs described in a previous study, these small RNA data provide comprehensive information on the transcriptional landscape of diverse Drosophila cell lines. These data should encourage broader usage of fly cell lines, beyond the few that are presently in common usage.
Article
Full-text available
Histone modifications are critical for the regulation of gene expression, cell type specification, and differentiation. However, evolutionary patterns of key modifications that regulate gene expression in differentiating organisms have not been examined. Here we mapped the genomic locations of the repressive mark histone 3 lysine 27 trimethylation (H3K27me3) in four species of Drosophila, and compared these patterns to those in C. elegans. We found that patterns of H3K27me3 are highly conserved across species, but conservation is substantially weaker among duplicated genes. We further discovered that retropositions are associated with greater evolutionary changes in H3K27me3 and gene expression than tandem duplications, indicating that local chromatin constraints influence duplicated gene evolution. These changes are also associated with concomitant evolution of gene expression. Our findings reveal the strong conservation of genomic architecture governed by an epigenetic mark across distantly related species and the importance of gene duplication in generating novel H3K27me3 profiles.
... networks, the most common assumption is that clusters are groups of highly connected nodes, although recently the notion of community, intended as a set of topologically similar links, has been successfully used in Ahn et al. (2010) and Solava et al. (2012). We also observe that many different clustering techniques have been proposed for graph analysis [e.g. the minimum cut algorithm in Hartuv et al. (2000) and the survey on graph clustering in Schaeffer (2007)]. ...
Article
Full-text available
Protein-Protein Interaction (PPI) networks are powerful models to represent the pair-wise protein interactions of organisms. Clustering PPI networks can be useful for isolating groups of interacting proteins that participate in the same biological processes, or that perform together specific biological functions. Evolutionary orthologies can be inferred this way, as well as functions and properties of yet uncharacterized proteins. We present an overview of the main state-of-the-art clustering methods that have been applied to PPI networks over the last decade. We distinguish five specific categories of approaches, describe and compare their main features, and then focus on one of them, that is, population-based stochastic search. We provide an experimental evaluation, based on some validation measures widely used in the literature, of techniques in this class, which is as yet less explored than the others. In particular, we study how the capability of Genetic Algorithms (GAs) to extract clusters in PPI networks varies when different topology-based fitness functions are employed, and we compare GAs with the main techniques in the other categories. The experimental campaign shows that predictions returned by GAs are often more accurate than those produced by the contestant methods. Interesting issues still remain open about possible generalizations of GAs allowing for cluster overlapping. We point out which methods and tools described here are publicly available. Contact: pizzuti@icar.cnr.it, simona.rombo@math.unipa.it. Supplementary information: Supplementary material showing further validation results is available.
... Hartuv and Shamir [5] proposed a clustering algorithm producing so-called highly connected clusters. Their method has been successfully used to cluster cDNA fingerprints [6], to find complexes in protein-protein interaction (PPI) data [7,8], to group protein sequences hierarchically into superfamily and family clusters [9], and to find families of regulatory RNA structures [10]. Hartuv and Shamir [5] formalized the connectivity demand for a cluster as follows: the edge connectivity λ(G) of a graph G is the minimum number of edges whose deletion results in a disconnected graph, and a graph G with n vertices is called highly connected if λ(G) > n/2. ...
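The quoted definition translates directly into code, and the HCS recursion follows from it: cut along a minimum edge cut until every part satisfies λ(G) > n/2. A minimal networkx sketch; the original paper's refinements (singleton adoption, removal of low-degree vertices, iterated HCS) are left out:

```python
import networkx as nx

def is_highly_connected(G):
    """Hartuv-Shamir criterion: a graph with n vertices is highly
    connected if its edge connectivity lambda(G) exceeds n/2."""
    n = G.number_of_nodes()
    return n <= 1 or nx.edge_connectivity(G) > n / 2

def hcs(G):
    """Recursive HCS sketch: split along a minimum edge cut until every
    part is highly connected."""
    parts = list(nx.connected_components(G))
    if len(parts) > 1:  # handle each connected component separately
        return [c for p in parts for c in hcs(G.subgraph(p).copy())]
    if is_highly_connected(G):
        return [set(G.nodes())]
    H = G.copy()
    H.remove_edges_from(nx.minimum_edge_cut(G))
    return [c for p in nx.connected_components(H) for c in hcs(G.subgraph(p).copy())]

# Two 4-cliques joined by a single edge are split back into the cliques.
G = nx.disjoint_union(nx.complete_graph(4), nx.complete_graph(4))
G.add_edge(0, 4)
print(hcs(G))  # [{0, 1, 2, 3}, {4, 5, 6, 7}]
```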
Article
Full-text available
A popular clustering algorithm for biological networks which was proposed by Hartuv and Shamir [IPL 2000] identifies nonoverlapping highly connected components. We extend the approach taken by this algorithm by introducing the combinatorial optimization problem Highly Connected Deletion, which asks for removing as few edges as possible from a graph such that the resulting graph consists of highly connected components. We show that Highly Connected Deletion is NP-hard and provide a fixed-parameter algorithm and a kernelization. We propose exact and heuristic solution strategies, based on polynomial-time data reduction rules and integer linear programming with column generation. The data reduction typically identifies 75% of the edges that are deleted for an optimal solution; the column generation method can then optimally solve protein interaction networks with up to 6,000 vertices and 13,500 edges in less than a day. Additionally, we present a new heuristic that finds more clusters than the method by Hartuv and Shamir.
... It is a reduced version of a Peripheral Blood Monocytes (PBM) dataset originally used by Hartuv et al. [52] to test their clustering algorithm. The dataset contains 2329 cDNAs, each fingerprinted with 139 different oligonucleotide probes, derived from 18 genes. ...
Article
Unsupervised learning, mostly represented by data clustering methods, is an ...
... A widely used technique for microarray data analysis is clustering analysis [3,9,4,12]. In gene clustering analysis, correlated genes are grouped together. ...
Article
Recent advances in DNA microarrays offer the ability to monitor and measure the expression levels of thousands of genes simultaneously in an organism. These experiments consist of monitoring each gene many times under different conditions, or evaluating each gene under a single environment but in different types of tissues. The first is useful for identification of functionally related genes, while the second type of experiment is helpful in classification of different types of tissues and identification of those genes whose expression levels are good diagnostic indicators. Different machine learning approaches, such as supervised and some unsupervised learning, have been previously applied to classify different kinds of patient samples by identifying those genes responsible for different types of cancers. However, the main challenges in this task are the availability of a smaller number of samples compared to the huge number of genes, and the noisy nature of biological data. Moreover, many of these genes are irrelevant to the distinction of different samples and have a negative impact on the acquired classification accuracy. In this paper, I provide a survey on gene expression based cancer classification using evolutionary and non-evolutionary methods. Keywords: DNA microarray, gene expression, Naive-Bayes classifier, support vector machine, decision tree, nearest neighbor classifier, neural network, leave-one-out cross-validation (LOOCV), multi-objective evolutionary algorithm, PMBGA.
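The keyword list above names leave-one-out cross-validation, the usual protocol for these few-samples/many-genes datasets. A minimal scikit-learn sketch on synthetic stand-in data (pure noise, so accuracy should hover near chance):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 30 samples x 100 "genes" of pure noise: LOOCV accuracy should sit near 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))
y = np.repeat([0, 1], 15)
acc = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y,
                      cv=LeaveOneOut()).mean()
print(f"LOOCV accuracy: {acc:.2f}")
```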
... The main emphasis of data mining is on the individual subject rather than the population, providing an avenue for personalization [16]. Several computational techniques have been applied to gene expression classification problems, including Fisher linear discriminant analysis [17], k-nearest neighbors [18], decision trees, multi-layer perceptrons [19], support vector machines [20], self-organizing maps [4], hierarchical clustering [21], and graph-theoretic approaches [22]. ...
... An alternative strategy to deal with noise genes is sequential clustering. Included works are quality-based clustering (Heyer et al. [15]), adaptive quality-based clustering (AQC) (De Smet et al. [8]), CAST (Ben-Dor et al. [2]), gene shaving (Hastie et al. [13]), CLICK (Sharan and Shamir [25]), HCS (Hartuv et al. [11]), tight clustering (Tseng and Wong [28]), and Liang and Wang [18], among others. Taking AQC as an example, it first searches for a cluster center, and then groups the genes around the center into a cluster. ...
Article
Full-text available
Clustering has been an important tool for extracting underlying gene expression patterns from massive microarray data. However, most of the existing clustering methods cannot automatically separate noise genes, including scattered, singleton and mini-cluster genes, from other genes. Inclusion of noise genes into regular clustering processes can impede identification of gene expression patterns. The model-based clustering method has the potential to automatically separate noise genes from other genes so that major gene expression patterns can be better identified. In this paper, we propose to use the ensemble averaging method to improve the performance of the single model-based clustering method. We also propose a new density estimator for noise genes for Gaussian mixture modeling. Our numerical results indicate that the ensemble averaging method outperforms other clustering methods, such as the quality-based method and the single model-based clustering method, in clustering datasets with noise genes.
... It has numerous applications in biology as well as in many other disciplines. There are several algorithmic techniques previously used in clustering gene expression data, including hierarchical clustering [24], the k-means algorithm, self-organizing maps [67], and graph-theoretic approaches [23][6][74]. ...
Article
Full-text available
We review recent research and development in high performance computing (HPC) for computational biology and discuss the great challenges facing both biomedical scientists and IT professionals. During the last decades, research in the fields of molecular biology and biomedicine has provided the scientific community with huge amounts of data through sequencing, genome-wide annotation and gene expression profiling projects. The genetic databases have been growing exponentially, and sophisticated computer algorithms have been developed to cater for the needs of data mining, analysis and simulation. It is clear that the development of HPC technologies has become crucial for deploying the software systems that tackle various bioinformatics problems. The goal of this article is to present current research and our critical review on the construction of parallel and distributed computing systems, the design of multi-process algorithms, and the development of software systems for biocomputing tasks including phylogenetic analysis, pairwise and multiple sequence alignment, heuristic database searching, and gene clustering. We also give a brief introduction to our work on developing highly scalable and reproducible HPC algorithms and indicate the challenging problems in this context.
... This gives a 2329 × 139 data matrix. According to Hartuv et al. [45], the cDNAs in the dataset originated from 18 distinct genes, i.e., the a priori classes are known. The partition of the dataset into 18 groups was obtained by laboratory experiments at Novartis in Vienna. ...
Chapter
Full-text available
This chapter considers three clustering steps: choice of a distance function; choice of a clustering algorithm; and choice of a methodology to assess the statistical significance of clustering solutions. First, the chapter discusses the experimental set-up used for the results that are presented here. Next, it deals with distance functions; in particular, new approaches to assess the intrinsic separation ability of many standard distance functions and their use in conjunction with clustering algorithms. A section is devoted to clustering algorithms, in particular to nonnegative matrix factorization (NMF). The chapter further discusses the assessment of the statistical significance of a clustering solution. It explains the identification of the correct number of clusters in a given data set. This class of statistical methods is usually referred to as internal validation measures. Finally, the chapter deals with consensus clustering highlighting its paradigmatic nature for stability-based validation measures and its excellent discriminative power.
... To accommodate noise genes, several sequential clustering methods have been proposed in the recent literature, including quality-based clustering (Heyer et al., 1999), adaptive quality-based clustering (AQC) (De Smet et al., 2002), CAST (Ben-Dor et al., 1999), gene shaving (Hastie et al., 2000), CLICK (Sharan and Shamir, 2000), HCS (Hartuv et al., 2000), and tight clustering (Tseng and Wong, 2005). Refer to Shamir and Sharan (2002), Jiang et al. (2004), and Tseng (2005) for overviews of these methods. ...
Article
The increasing use of microarray technologies is generating a large amount of data that must be processed to extract underlying gene expression patterns. Existing clustering methods could suffer from certain drawbacks. Most methods cannot automatically separate scattered, singleton and mini-cluster genes from other genes. Inclusion of these types of genes into regular clustering processes can impede identification of gene expression patterns. In this paper, we propose a general clustering method, namely a dynamic agglomerative clustering (DAC) method. DAC can automatically separate scattered, singleton and mini-cluster genes from other genes and thus avoid possible contamination to the gene expression patterns caused by them. For DAC, the scattered gene filtering step is no longer necessary in data pre-processing. In addition, we propose a criterion for evaluating clustering results for a dataset which contains scattered, singleton and/or mini-cluster genes. DAC has been applied successfully to two real datasets for identification of gene expression patterns. Our numerical results indicate that DAC outperforms other clustering methods, such as the quality-based and model-based clustering methods, in clustering datasets which contain scattered, singleton and/or mini-cluster genes.
... CLuster Identification via Connectivity Kernels (CLICK) (Sharan et al., 2000) is appropriate for subspace and high-dimensional data clustering. A novel algorithm for cluster analysis based on graph-theoretic techniques is presented in (Hartuv et al., 2000). Unlike other methods, it does not assume that the clusters are hierarchically structured and does not require prior knowledge of the number of clusters. ...
Article
Full-text available
This paper presents an effective parameter-less graph-based clustering technique (GCEPD). GCEPD produces highly coherent clusters in terms of various cluster validity measures. The technique finds highly coherent patterns containing genes with high biological relevance. Experiments with real-life datasets establish that the method produces clusters that are significantly better than those of other similar algorithms in terms of various quality measures.
Chapter
There are many similarities in the symptoms of several types of cancer, which sometimes makes it difficult for physicians to reach an accurate diagnosis. In addition, it is a technical challenge to classify cancer cells accurately in order to differentiate one type of cancer from another. The DNA microarray technique (also called the DNA chip) has been used in the past for the classification of cancer, but it generates a large volume of noisy data that has many features and is difficult to analyze directly. This paper proposes a new method, combining the genetic algorithm, case-based reasoning, and the k-nearest neighbor classifier, which improves the performance of the classification considerably. The authors also use the well-known Mahalanobis distance of multivariate statistics as a similarity measure, which improves the accuracy. A case-based classifier approach together with the genetic algorithm has never before been applied to the classification of cancer, nor has the Mahalanobis distance. Thus, the proposed approach is a novel method for cancer classification. Furthermore, the results from the proposed method show considerably better performance than other algorithms. Experiments were done on several benchmark datasets such as the leukemia dataset, the lymphoma dataset, the ovarian cancer dataset, and the breast cancer dataset.
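The distance at the heart of this chapter is standard: d(x, y) = sqrt((x - y)^T S^{-1} (x - y)), with S the covariance matrix of the training data. A sketch, with the caveat that for many-genes/few-samples data S is singular, so a pseudo-inverse or shrinkage estimate would be needed:

```python
import numpy as np

def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance sqrt((x - y)^T S^{-1} (x - y))."""
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

# Correlated 2-D toy data; for many-genes/few-samples matrices, replace
# np.linalg.inv with np.linalg.pinv or a shrinkage covariance estimator.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
print(mahalanobis(X[0], X[1], cov_inv))
```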
Article
Ecosystem service approaches to watershed management have grown quickly, increasing the importance of understanding the streamflow response to realistic land-cover change. Previous work has investigated the relationship between watershed characteristics and streamflow in catchments around the world, but little has focused on systematic relationships between watershed characteristics and streamflow change after land-cover restoration. To address this gap, we simulate streamflow responses to restoring 10% of watershed area from agricultural land to forest and natural pasture in 29 watersheds around the world. This change is consistent with that performed in watershed-service programs. We calculate the change in a broad array of streamflow indices for each site and use a graph-connectedness approach to cluster the sites based on the sign of the index value changes. We find three primary clusters with distinct responses to restoration. Permutation tests and effect size demonstrate the difference in watershed characteristics and streamflow indices across clusters. The low-flow intensifying sites have shallower soils and smaller saturated soil volume. After restoration, simulated streamflow in these sites increases during relatively dry periods and declines during high-flow periods. The high-flow intensifying sites have larger saturated soil volume. After restoration, simulated dry-season flow in these sites decreases. The high-flow enhancing sites have larger soil hydraulic conductivities than the high-flow intensifying sites. After restoration, simulated dry-season flow in these sites decreases less than in high-flow intensifying sites. The soil depth and hydraulic conductivity appear to be the characteristics that determine clusters, as clusters are not statistically related to climate, watershed location, proximity, size and shape, elevation, or pre-existing land cover. This study provides valuable understanding of land-cover restoration and the watershed characteristics that most impact streamflow change.
Chapter
Identification of targets, generally viruses or bacteria, in a biological sample is a relevant problem in medicine. Biologists can use hybridization experiments to determine whether a specific DNA fragment, representing the virus, is present in a DNA solution. A probe is a segment of DNA or RNA, labeled with a radioactive isotope, dye, or enzyme, used to find a specific target sequence on a DNA molecule by hybridization. Selecting unique probes through hybridization experiments is a difficult task, especially when targets have a high degree of similarity, for instance in the case of closely related viruses. The non-unique probe selection problem is challenging from both a biological and a computational point of view; a plethora of methods have been proposed in the literature, ranging from evolutionary algorithms to mathematical programming approaches. In this study, we conducted a survey of the existing computational methods for probe design and selection. We introduce the biological aspects of the problem and examine several issues related to the design and selection of probes: oligonucleotide fingerprinting, the maximum distinguishing probe set, the minimum cost probe set, and non-unique probe selection.
Conference Paper
In gene expression analysis, grouping co-regulated genes is a major step in discovering genes that are likely to have related biological functions. This critical step can be done using clustering. This paper formally presents three models for iterative clustering based on average, single and complete linkage strategies. Variations of relational clustering algorithms can be built based on these models. The number of clusters need not be known in advance. Unlike centroid- and medoid-based algorithms, the proposed approach avoids minimizing a least-squares-type objective function; instead, it maximizes the average similarity between objects of the same cluster using a subset of the similarity matrix. The top k nearest, farthest or near-average entries in each row of the similarity matrix need to be identified, depending on the required linkage strategy. In order to reduce the computational complexity of this step, randomized search or genetic techniques can be used to approximate these elements; however, in our experimental studies, the exact k elements are computed. The performance of the proposed algorithms is evaluated and compared to existing techniques on two standard gene expression datasets.
Conference Paper
Identifying highly connected subgraphs in biological networks has become a powerful tool in computational biology. By definition a highly connected graph with n vertices can only be disconnected by removing more than \(\frac{n}{2}\) of its edges. This definition, however, is not suitable for bipartite graphs, which have various applications in biology, since such graphs cannot contain highly connected subgraphs. Here, we introduce a natural modification of highly connected graphs for bipartite graphs, and prove that the problem of finding such subgraphs with the maximum number of vertices in bipartite graphs is NP-hard. To address this problem, we provide an integer linear programming solution, as well as a local search heuristic. Finally, we demonstrate the applicability of our heuristic to predict protein function by identifying highly connected subgraphs in bipartite networks that connect proteins with their experimentally established functionality.
Article
An improperly tuned wavelet neural network (WNN) has been shown to exhibit unsatisfactory generalization performance. In this study, the tuning is done by an improved fuzzy C-means algorithm that utilizes a novel similarity measure. This similarity measure takes the orientation as well as the distance into account. The modified WNN was first applied to a benchmark problem, and performance comparisons with other approaches were made subsequently. Next, the feasibility of the proposed WNN for forecasting the chaotic Mackey-Glass time series and for a real-world application problem, i.e., blood glucose level prediction, was studied. An assessment analysis demonstrated that the presented WNN was superior in terms of prediction accuracy.
Article
Discovering new information about groups of genes implicated in a disease is still challenging. Microarrays are a powerful tool for analyzing gene expression: they provide an expression level for each gene under given biological conditions. In this paper, we propose a novel approach outlining relationships between genes based on their ordered expressions. First, we propose a new kind of pattern, called sequential patterns, for biologists to investigate. However, owing to the density of the expression matrix, extracting sequential patterns from microarray datasets is far from easy. Second, we propose to introduce a knowledge source during the mining task. In this way, the search space is reduced and more relevant results (from a biological point of view) are obtained. Results of various experiments on real biological data highlight the relevance of our proposal.
Article
Fuzzy C-means (FCM) partitions the observations partially into several clusters based on the principles of fuzzy theory. However, minimizing the Euclidean distance in FCM tends to detect hyper-spherically shaped clusters, which is inadequate for real-world problems. In this paper, an effective FCM algorithm that adopts a symmetry similarity measure is proposed in order to find the appropriate clusters regardless of their geometric structure and overlapping characteristics. Experimental results on several artificial and real-life datasets of different natures, and performance comparisons with other existing clustering algorithms, demonstrate its superiority.
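For reference, the Euclidean baseline this paper modifies is short enough to sketch in full (the symmetry similarity measure itself is not reproduced here). Standard FCM alternates a weighted center update with the membership update u_ik = d_ik^(-2/(m-1)) / sum_j d_jk^(-2/(m-1)):

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, tol=1e-5, seed=0):
    """Standard Euclidean fuzzy C-means: returns cluster centers and the
    fuzzy membership matrix U (n x c). m > 1 controls fuzziness."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]     # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2 / (m - 1)) *
                       np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U

# Two well-separated Gaussian blobs; centers land near (3, 3) and (-3, -3).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(3, 1, (50, 2)), rng.normal(-3, 1, (50, 2))])
centers, U = fcm(X, c=2)
print(np.round(centers).astype(int))
```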
Article
This chapter provides a survey of a classification problem involving genetic sequences, namely the problem of classifying fingerprint vectors with missing values. The main focus of the chapter is the binary clustering with missing values (BCMV) problem, a discrete classification approach developed by Figueroa, Borneman, and Jiang in 2004 for analyzing oligonucleotide fingerprints, especially in applications such as DNA clone classification. The chapter provides some basic mathematical definitions that are useful for understanding the underlying computational problems, along with a brief survey of various other classification approaches. It then gives a brief overview of several approaches for estimating missing values in genomic data, and finally discusses in detail the BCMV problem and its variations.
Conference Paper
The Cluster Editing problem asks to transform a graph into a disjoint union of cliques using a minimum number of edge modifications. Although the problem has been proven NP-complete several times, it has nevertheless attracted much research both from the theoretical and the applied side. The problem has been the inspiration for numerous algorithms in bioinformatics, aiming at clustering entities such as genes, proteins, phenotypes, or patients. In this paper, we review exact and heuristic methods that have been proposed for the Cluster Editing problem, and also applications of these algorithms for biological problems.
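One concrete example of the heuristics such a survey covers is the randomized pivot scheme of Ailon, Charikar and Newman, an expected-3-approximation for unweighted Cluster Editing. The sketch below is our own simplified rendering, with a helper to count the implied edge modifications.

```python
import random
import networkx as nx

def pivot_cluster_editing(G, seed=0):
    """Randomized pivot heuristic (Ailon, Charikar and Newman, 2008):
    pick a random pivot, cluster it with its still-unclustered neighbors,
    remove them, and repeat."""
    rng = random.Random(seed)
    remaining = set(G.nodes())
    clusters = []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot} | (set(G[pivot]) & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters

def editing_cost(G, clusters):
    """Edge modifications needed to turn G into these disjoint cliques."""
    label = {v: i for i, c in enumerate(clusters) for v in c}
    intra_pairs = sum(len(c) * (len(c) - 1) // 2 for c in clusters)
    intra_edges = sum(1 for u, v in G.edges() if label[u] == label[v])
    inter_edges = G.number_of_edges() - intra_edges
    return (intra_pairs - intra_edges) + inter_edges  # insertions + deletions

# Two triangles bridged by one edge: a single deletion suffices.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)])
clusters = pivot_cluster_editing(G)
print(clusters, editing_cost(G, clusters))
```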
Conference Paper
The classification of tumor types based on genomic information is important for improving cancer diagnosis and drug development. DNA microarray studies produce large amounts of data; expression data are highly redundant and noisy, and most genes are believed to be uninformative with respect to the studied classes, since only a fraction of genes may present distinct profiles for different classes of samples. Classification tools that can robustly identify a subset of informative genes embedded in a large dataset contaminated with high-dimensional noise are therefore important. In this paper, an integrated approach combining a support vector machine (SVM) and particle swarm optimization (PSO) is proposed for this purpose. The proposed approach can simultaneously optimize the feature subset and the classifier through a common solution-coding mechanism. As an illustration, the approach is applied to search for combinational gene signatures that predict the histologic response to chemotherapy of osteosarcoma patients. Cross-validation results show that the proposed approach outperforms existing methods in terms of classification accuracy. Further validation on an independent dataset shows misclassification of only one out of fourteen patient samples, suggesting that the selected gene signatures can reflect chemoresistance in osteosarcoma.
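A toy rendering of the wrapper idea, assuming a standard binary PSO with a sigmoid transfer function and sklearn's SVC as the classifier; the particle encoding, fitness, and parameters here are illustrative, not those used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=60, n_features=40, n_informative=5,
                           random_state=0)

def fitness(mask):
    """Cross-validated SVM accuracy on the selected feature subset."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(kernel="linear"), X[:, mask.astype(bool)],
                           y, cv=3).mean()

# Minimal binary PSO: positions are 0/1 masks; velocities pass through a
# sigmoid to give per-bit probabilities of selecting a feature.
n_particles, n_iter, w, c1, c2 = 10, 20, 0.7, 1.5, 1.5
pos = rng.integers(0, 2, (n_particles, X.shape[1]))
vel = rng.normal(0, 1, pos.shape)
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = (rng.random(pos.shape) < 1 / (1 + np.exp(-vel))).astype(int)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", np.flatnonzero(gbest),
      "accuracy:", pbest_fit.max())
```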
Conference Paper
In this study we propose an early lung cancer detection methodology using nucleus-based features. First, sputum samples from patients are labeled with Tetrakis Carboxy Phenyl Porphine (TCPP) and fluorescent images of these samples are taken. TCPP is a porphyrin that helps label lung cancer cells owing to the increased numbers of low-density lipoproteins coating the surface of cancer cells. We study the performance of well-known machine learning techniques for lung cancer detection on the Biomoda dataset. In our previous work we obtained an accuracy of 81% using 71 features related to shape, intensity and color; by adding nucleus-segmented features we improved the accuracy to 87%. Nucleus segmentation is performed using the seeded region growing method. Our results demonstrate the potential of nucleus-segmented features for detecting lung cancer.
Article
Specifying the number and locations of the translation vectors for wavelet neural networks (WNNs) is of paramount significance, as the quality of approximation may be drastically reduced if the initialization of the WNN parameters is not done judiciously. In this paper, an enhanced fuzzy C-means algorithm, specifically the modified point-symmetry-based fuzzy C-means algorithm (MPSDFCM), is proposed in order to determine optimal initial locations for the translation vectors. The proposed neural network models were then employed to approximate five different nonlinear continuous functions. The assessment analysis showed that integrating the MPSDFCM into the learning phase of WNNs leads to a significant improvement in prediction accuracy. A performance comparison with approaches reported in the literature for approximating the same benchmark piecewise function verified the superiority of the proposed strategy.
Conference Paper
Full-text available
The availability of large volumes of protein-protein interaction data has allowed the study of biological networks to unveil the complex structure and organization of the cell. Biologists have recognized that proteins interacting with each other often participate in the same biological processes, and that protein modules may often be associated with specific biological functions. Thus the detection of protein complexes is an important research problem in systems biology. In this review, recent graph-based approaches to clustering protein interaction networks are described and classified according to their common characteristics. The goal is to provide a useful guide and reference for both computer scientists and biologists.
Article
Full-text available
Detour paths provide overlay networks with improved performance and resilience. Finding good detour routes with methods that scale to millions of nodes is a challenging problem. We propose a novel approach for decentralised discovery of detour paths based on the observation that Internet paths that traverse overlapping sets of autonomous systems may benefit from the same detour nodes. We show how nodes can learn about overlap between Internet paths at the level of autonomous systems and demonstrate how they can exploit detours that other nodes have already found. Our approach is to cluster paths based on the extent to which the autonomous systems traversed overlap and gossip potential detours among nodes. We find that our centralised path clustering algorithm correctly classified over 90% of potential latency detours in a 176-node dataset drawn from PlanetLab. In our decentralised version, we detected 60% of potentially available detours with each node sampling data from only 10% of other nodes.
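A minimal sketch of the core idea, assuming paths are given as lists of AS numbers and grouped by a simple Jaccard-overlap threshold; the paper's actual clustering algorithm and gossip protocol are more elaborate.

```python
def jaccard(a, b):
    """Overlap between two AS-level paths, treated as sets of AS numbers."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cluster_paths(paths, threshold=0.5):
    """Greedy single-linkage grouping: a path joins the first cluster
    containing a path whose AS-set overlap meets the threshold.
    Illustrative only."""
    clusters = []
    for p in paths:
        for c in clusters:
            if any(jaccard(p, q) >= threshold for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

# Toy AS-level paths (lists of AS numbers).
paths = [[3356, 1299, 2914], [3356, 1299, 6453], [7018, 3257, 1239]]
print(cluster_paths(paths))  # first two paths share most of their ASes
```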
Article
The rapid development of technologies that generate arrays of gene data enables a global view of the transcription levels of hundreds of thousands of genes simultaneously. Outlier detection for gene data is important but is complicated by high dimensionality: the sparsity of data in high-dimensional space makes each point a relatively good outlier under traditional distance-based definitions, so finding outliers in high-dimensional data is more complex. In this paper, some basic outlier analysis algorithms are discussed and a new genetic algorithm is presented. The algorithm finds the best dimension projections based on a revised cell-based algorithm and provides explanations for its solutions. It can solve the outlier detection problem for gene expression data as well as for other high-dimensional data.
Conference Paper
Recent advances in biotechnology allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. Analysis of data produced by such experiments offers potential insight into gene function and regulatory mechanisms. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns; the corresponding algorithmic problem is to cluster multi-condition gene expression patterns. This paper introduces a new clustering algorithm for gene expression data, designed to avoid some of the drawbacks of existing algorithms for clustering such data. The proposed αCORR clustering algorithm is tested and verified on real biological data sets.
Article
Full-text available
We have developed a novel algorithm for cluster analysis that is based on graph theoretic techniques. A similarity graph is defined and clusters in that graph correspond to highly connected subgraphs. A polynomial algorithm to compute them efficiently is presented. Our algorithm produces a solution with some provably good properties and performs well on simulated and real data.
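The recursive scheme this abstract describes can be sketched compactly with networkx, assuming Stoer-Wagner for the minimum cut; the paper's refinements (singleton adoption, low-degree removal) are omitted.

```python
import networkx as nx

def highly_connected(G):
    """HCS criterion: edge connectivity must exceed half the vertex count."""
    n = G.number_of_nodes()
    return n <= 2 or nx.edge_connectivity(G) > n / 2

def hcs(G):
    """Recursive scheme sketched from the abstract: a highly connected
    (sub)graph is reported as a cluster; otherwise split along a minimum
    edge cut and recurse on both sides."""
    if G.number_of_nodes() > 1 and not nx.is_connected(G):
        return [c for comp in nx.connected_components(G)
                for c in hcs(G.subgraph(comp).copy())]
    if highly_connected(G):
        return [set(G.nodes())]
    _, (side_a, side_b) = nx.stoer_wagner(G)
    return hcs(G.subgraph(side_a).copy()) + hcs(G.subgraph(side_b).copy())

# Two triangles joined by a single edge split into two clusters.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)])
print(hcs(G))  # [{1, 2, 3}, {4, 5, 6}]
```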
Conference Paper
Distance functions are a fundamental ingredient of classification and clustering procedures, and this holds true also in the particular case of microarray data. In the general data mining and classification literature, functions such as Euclidean distance or Pearson correlation have gained their status of de facto standards thanks to a considerable amount of experimental validation. For microarray data, the issue of which distance function "works best" has been investigated, but no final conclusion has been reached. The aim of this paper is to shed further light on that issue. Indeed, we present an experimental study, involving several distances, assessing (a) their intrinsic separation ability and (b) their predictive power when used in conjunction with clustering algorithms. The experiments have been carried out on six benchmark microarray datasets, where the "gold solution" is known for each of them. We have used both hierarchical and K-means clustering algorithms and external validation criteria as evaluation tools. From the methodological point of view, the main result of this study is a ranking of those measures in terms of their intrinsic and clustering abilities, highlighting also the correlations between the two. Pragmatically, based on the outcomes of the experiments, one receives the indication that Minkowski, cosine and Pearson correlation distances seem to be the best choices when dealing with microarray data analysis.
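The three measures the study singles out are all available in scipy; a small example on made-up expression profiles:

```python
import numpy as np
from scipy.spatial.distance import minkowski, cosine, correlation

# Two toy expression profiles measured across five conditions.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.9, 6.2, 7.8, 10.5])

print("Minkowski (p=2, i.e. Euclidean):", minkowski(x, y, p=2))
print("Cosine distance:", cosine(x, y))                     # 1 - cosine similarity
print("Pearson correlation distance:", correlation(x, y))  # 1 - Pearson r
```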
Conference Paper
Full-text available
We significantly improve known time bounds for solving the minimum cut problem on undirected graphs. We use a "semiduality" between minimum cuts and maximum spanning tree packings combined with our previously developed random sampling techniques. We give a randomized (Monte Carlo) algorithm that finds a minimum cut in an m-edge, n-vertex graph with high probability in O(m log³ n) time. We also give a simpler randomized algorithm that finds all minimum cuts with high probability in O(n² log n) time. This variant has an optimal RNC parallelization. Both variants improve on the previous best time bound of O(n² log³ n). Other applications of the tree-packing approach are new, nearly tight bounds on the number of near-minimum cuts a graph may have and a new data structure for representing them in a space-efficient manner.
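Underlying these results is random edge contraction; the sketch below implements the elementary Karger contraction routine with repeated trials, not the paper's O(m log³ n) recursive algorithm.

```python
import random

def karger_min_cut(edges, n_nodes, trials=200, seed=0):
    """Basic Karger contraction: repeatedly contract uniformly random edges
    until two super-nodes remain; the edges crossing between them form a
    cut. Many independent trials make finding a minimum cut likely."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        parent = list(range(n_nodes))
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]   # path halving
                v = parent[v]
            return v
        remaining = n_nodes
        while remaining > 2:
            u, v = edges[rng.randrange(len(edges))]
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv                  # contract the edge
                remaining -= 1
        cut = sum(1 for u, v in edges if find(u) != find(v))
        best = cut if best is None else min(best, cut)
    return best

# Two triangles bridged by one edge: the minimum cut is 1.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
print(karger_min_cut(edges, 6))
```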
Book
Full-text available
The lack of a standard library of the data structures and algorithms of combinatorial and geometric computing severely limits the impact of this area on computer science. To address this problem, the LEDA project was launched in 1989 to build a library of the data types and algorithms of combinatorial and geometric computing. Among its many features, LEDA provides a sizable collection of data types and algorithms in a form that allows them to be used by non-experts. Sample applications include code optimization, motion planning, logic synthesis, scheduling, VLSI design, term rewriting systems, semantic nets, machine learning, image analysis, and computational biology.
Article
Full-text available
Given a set of entities, Cluster Analysis aims at finding subsets, called clusters, which are homogeneous and/or well separated. As many types of clustering and criteria for homogeneity or separation are of interest, this is a vast field. A survey is given from a mathematical programming viewpoint. Steps of a clustering study, types of clustering and criteria are discussed. Then algorithms for hierarchical, partitioning, sequential, and additive clustering are studied. Emphasis is on solution methods, i.e., dynamic programming, graph theoretical algorithms, branch-and-bound, cutting planes, column generation and heuristics.
Article
Full-text available
This paper is concerned with the physical mapping of DNA molecules using data about the hybridization of oligonucleotide probes to a library of clones. In mathematical terms, the DNA molecule corresponds to an interval on the real line, each clone to a subinterval, and each probe occurs at a finite set of points within the interval. A stochastic model for the occurrences of the probes and the locations of the clones is assumed. Given a matrix of incidences between probes and clones, the task is to reconstruct the most likely interleaving of the clones. Combinatorial algorithms are presented for solving approximations to this problem, and computational results are presented.
Article
Full-text available
Microarrays containing 1046 human cDNAs of unknown sequence were printed on glass with high-speed robotics. These 1.0-cm2 DNA "chips" were used to quantitatively monitor differential expression of the cognate human genes using a highly sensitive two-color hybridization assay. Array elements that displayed differential expression patterns under given experimental conditions were characterized by sequencing. The identification of known and novel heat shock and phorbol ester-regulated genes in human T cells demonstrates the sensitivity of the assay. Parallel gene analysis with microarrays provides a rapid and efficient method for large-scale human gene discovery.
Article
Full-text available
Large-scale sequencing of cDNAs randomly picked from libraries has proven to be a very powerful approach to discover (putatively) expressed sequences that, in turn, once mapped, may greatly expedite the process involved in the identification and cloning of human disease genes. However, the integrity of the data and the pace at which novel sequences can be identified depend to a great extent on the cDNA libraries that are used. Because, altogether, in a typical cell the mRNAs of the prevalent and intermediate frequency classes comprise as much as 50-65% of the total mRNA mass but represent no more than 1000-2000 different mRNAs, redundant identification of mRNAs of these two frequency classes is destined to become overwhelming relatively early in any such random gene discovery program, thus seriously compromising its cost-effectiveness. With the goal of facilitating such efforts, we previously developed a method to construct directionally cloned normalized cDNA libraries and applied it to generate infant brain (INIB) and fetal liver/spleen (INFLS) libraries, from which a total of 45,192 and 86,088 expressed sequence tags, respectively, have been derived. While improving the representation of the longest cDNAs in our libraries, we developed three additional methods to normalize cDNA libraries and generated over 35 libraries, most of which have been contributed to our integrated Molecular Analysis of Genomes and Their Expression (IMAGE) Consortium and thus distributed widely and used for sequencing and mapping. In an attempt to facilitate the process of gene discovery further, we have also developed a subtractive hybridization approach designed specifically to eliminate (or significantly reduce the representation of) large pools of arrayed and (mostly) sequenced clones from normalized libraries yet to be (or just partly) surveyed. Here we present a detailed description and a comparative analysis of the four methods that we developed and used to generate normalized cDNA libraries from human (15), mouse (3), and rat (2), as well as from the parasite Schistosoma mansoni (1). In addition, we describe the construction and preliminary characterization of a subtracted liver/spleen library (INFLS-SI) that resulted from the elimination (or reduction of representation) of ~5000 INFLS-IMAGE clones from the INFLS library.
Article
Full-text available
As a step toward understanding the complex differences between normal and cancer cells in humans, gene expression patterns were examined in gastrointestinal tumors. More than 300,000 transcripts derived from at least 45,000 different genes were analyzed. Although extensive similarity was noted between the expression profiles, more than 500 transcripts that were expressed at significantly different levels in normal and neoplastic cells were identified. These data provide insight into the extent of expression differences underlying malignancy and reveal genes that may prove useful as diagnostic or prognostic markers.
Article
Full-text available
The use of hybridisation of synthetic oligonucleotides to cDNAs under high stringency to characterise gene sequences has been demonstrated by a number of groups. We have used two cDNA libraries of 9 and 12 day mouse embryos (24 133 and 34 783 clones respectively) in a pilot study to characterise expressed genes by hybridisation with 110 hybridisation probes. We have identified 33 369 clusters of cDNA clones, that ranged in representation from 1 to 487 copies (0.7%). 737 were assigned to known rodent genes, and a further 13 845 showed significant homologies. A total of 404 clusters were identified as significantly differentially represented (P < 0.01) between the two cDNA libraries. This study demonstrates the utility of the fingerprinting approach for the generation of comparative gene expression profiles through the analysis of cDNAs derived from different biological materials.
Article
Full-text available
A new algorithm for the construction of physical maps from hybridization fingerprints of short oligonucleotide probes has been developed. Extensive simulations in high-noise scenarios show that the algorithm produces an essentially completely correct map in over 95% of trials. Tests for the influence of specific experimental parameters demonstrate that the algorithm is robust to both false positive and false negative experimental errors. The algorithm was also tested in simulations using real DNA sequences of C. elegans, E. coli, S. cerevisiae, and H. sapiens. To overcome the non-randomness of probe frequencies in these sequences, probes were preselected based on sequence statistics and a screening process of the hybridization data was developed. With these modifications, the algorithm produced very encouraging results.
Article
High density peptide and oligonucleotide chips are fabricated using semiconductor-based technologies. These chips have a variety of biological applications.
Chapter
The number of DNA clones to be manipulated and analysed in a variety of projects dealing with the analysis of complex genomes exceeds the potential of current methodology by orders of magnitude. By focussing first on the transcribed parts of the genome, i.e., by using cDNA or exon-trap libraries, it is possible to reduce the volume of the task considerably, presumably without sacrificing too much information. We present here an integrated series of mostly automated processes which together allow the isolation, amplification, arrayed spotting and analysis by oligonucleotide fingerprinting of > 100,000 cDNA clones in a few months with little operator involvement. The sequence information thus derived will be used to search databases for related sequences and to establish catalogues of expressed sequences. The technique is currently being applied to the analysis of cDNA derived from various human and mouse tissues and developmental stages. An expanded version of the process will allow us to analyse cDNA libraries from a range of representative human tissues, thereby giving us access to a significant fraction of the human genome.
Article
This chapter discusses the nature and purpose of clustering and classification. Classification is one of the fundamental processes in science: facts and phenomena must be ordered before we can understand them and develop unifying principles that explain their occurrence and apparent order. As classification is the ordering of objects by their similarities, it is important to recognize that classification transcends human intellectual endeavor and is a fundamental property of living organisms; the ability to perceive any two objects as more similar to each other than either is to a third is sure to have been present in the ancestors of the human species. Attempts to develop techniques for automatic classification therefore necessitate the quantification of similarity. In much classificatory work, however, it is impractical to obtain estimates of taxonomic similarity in an assemblage of objects from a sample of subjects.
Article
The output of a cluster analysis method is a collection of subsets of the object set termed clusters characterized in some manner by relative internal coherence and/or external isolation, along with a natural stratification of these identified clusters by levels of cohesive intensity. In formalizing a model of the cluster analysis methods, it is essential to consider the nature and inherent reliability of the proximity data that constitutes the input in substantive clustering applications. The proximity value scales are dichotomous. It is the practice of most authors of cluster methods to assume that the proximity values are available in the form of a real symmetric matrix, where any unjustified structure implicit in the real values is either to be ignored or axiomatically disallowed. The most desirable cluster analysis models for substantive applications should have the input proximity data expressible in a manner faithfully representing only the reliable information content of the empirically measured data.
Article
Utilizing the edge-connectivity of graphs, certain maximally connected subgraphs of a graph termed k-components and clusters are characterized and their interrelations are investigated. The cohesiveness function is described and shown to be a useful measure of the local intensity of connectivity within a graph. Sequences of cuts totally separating a graph, termed slicings, are shown to be intimately related to the k-components and clusters of a graph. An efficient algorithm is presented for determining k-components and clusters. Applications of these notions to graph coloring and numerical taxonomy are discussed.
Article
Given an undirected graph G = (V, E), it is known that its edge-connectivity λ(G) can be computed by solving O(|V|) max-flow problems. The best time bounds known for the problem are O(λ(G)|V|²), due to Matula (28th IEEE Symposium on the Foundations of Computer Science, 1987, pp. 249-251) if G is simple, and O(|E|^{3/2}|V|), due to Even and Tarjan (SIAM J. Comput., 4 (1975), pp. 507-518) if G is a multigraph. An O(|E| + min{λ(G)|V|², p|V| + |V|² log |V|}) time algorithm for computing the edge-connectivity λ(G) of a multigraph G = (V, E), where p (≤ |E|) is the number of pairs of nodes between which G has an edge, is proposed. This algorithm does not use any max-flow algorithm but consists only of |V| graph searches and edge contractions. The method is then extended to a capacitated network to compute its minimum cut capacity in O(|V||E| + |V|² log |V|) time.
Article
We describe the use of oligonucleotide fingerprinting for the generation of a normalized cDNA library from unfertilized sea urchin eggs and report the preliminary analysis of this library, which resulted in the establishment of a partial gene catalogue of the sea urchin egg. In an analysis of 21,925 cDNA clones by hybridization with 217 oligonucleotide probes, we were able to identify 6291 clusters corresponding to different transcripts, ranging in size from 1 to 265 clones. This corresponds to an average 3.5-fold normalization of the starting library. The normalized library represents about one-third of all genes expressed in the sea urchin egg. To generate sequence information for the transcripts represented by the clusters, representative clones selected from 711 clusters were sequenced. The construction and preliminary analysis of the normalized library are the first steps in the assembly of an increasingly complete collection of maternal genes expressed in the sea urchin egg, which will provide a number of insights into the early development of this well-characterized model organism.
Article
We present an algorithm for finding the minimum cut of an undirected edge-weighted graph. It is simple in every respect. It has a short and compact description, is easy to implement, and has a surprisingly simple proof of correctness. Its runtime matches that of the fastest algorithm known. The runtime analysis is straightforward. In contrast to nearly all approaches so far, the algorithm uses no flow techniques. Roughly speaking, the algorithm consists of about |V| nearly identical phases each of which is a maximum adjacency search.
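The phase structure described here translates almost directly into code; the following is our own compact Python rendering of the maximum-adjacency-search algorithm, a sketch rather than an optimized implementation.

```python
def stoer_wagner_min_cut(weights):
    """Stoer-Wagner minimum cut on an undirected weighted graph given as a
    dict-of-dicts {u: {v: w}}. Each phase runs a maximum adjacency search;
    the cut-of-the-phase isolates the last vertex added, and the two last
    vertices are then merged. Returns the weight of a global minimum cut."""
    w = {u: dict(nbrs) for u, nbrs in weights.items()}   # mutable copy
    best = float("inf")
    while len(w) > 1:
        # Maximum adjacency search from an arbitrary start vertex.
        start = next(iter(w))
        A = [start]
        conn = {v: w[start].get(v, 0) for v in w if v != start}
        while conn:
            z = max(conn, key=conn.get)          # most tightly connected
            A.append(z)
            cut_of_phase = conn.pop(z)
            for v in conn:
                conn[v] += w[z].get(v, 0)
        best = min(best, cut_of_phase)
        # Merge the last two vertices s, t of this phase.
        t, s = A[-1], A[-2]
        for v, wt in w[t].items():
            if v != s:
                w[s][v] = w[s].get(v, 0) + wt
                w[v][s] = w[s][v]
            w[v].pop(t, None)
        del w[t]
    return best

# Two triangles joined by a single unit-weight edge: minimum cut = 1.
g = {1: {2: 1, 3: 1}, 2: {1: 1, 3: 1}, 3: {1: 1, 2: 1, 4: 1},
     4: {3: 1, 5: 1, 6: 1}, 5: {4: 1, 6: 1}, 6: {4: 1, 5: 1}}
print(stoer_wagner_min_cut(g))  # 1
```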
Article
Here we describe progress on a series of molecular techniques designed to bridge the gap between genetic and molecular distances in mammals. This is an essential step in the molecular cloning of genes defined by mammalian mutations, and in the molecular analysis of large regions of mammalian genomes. We summarize approaches for the physical and molecular analysis of genetic distances and describe the experimental, statistical and computational basis of a new approach to create ordered libraries of overlapping clones from large genomes.
Article
A novel graph theoretic approach for data clustering is presented and its application to the image segmentation problem is demonstrated. The data to be clustered are represented by an undirected adjacency graph G with arc capacities assigned to reflect the similarity between the linked vertices. Clustering is achieved by removing arcs of G to form mutually exclusive subgraphs such that the largest inter-subgraph maximum flow is minimized. For graphs of moderate size (~2000 vertices), the optimal solution is obtained through partitioning a flow and cut equivalent tree of G, which can be efficiently constructed using the Gomory-Hu algorithm (1961). However, for larger graphs this approach is impractical. New theorems for subgraph condensation are derived and are then used to develop a fast algorithm which hierarchically constructs and partitions a partially equivalent tree of much reduced size. This algorithm results in an optimal solution equivalent to that obtained by partitioning the complete equivalent tree and is able to handle very large graphs with several hundred thousand vertices. The new clustering algorithm is applied to the image segmentation problem. The segmentation is achieved by effectively searching for closed contours of edge elements (equivalent to minimum cuts in G), which consist mostly of strong edges, while rejecting contours containing isolated strong edges. This method is able to accurately locate region boundaries and at the same time guarantees the formation of closed edge contours.
Article
An immediately applicable variant of the sequencing by hybridization (SBH) method is under development with the capacity to determine up to 100 million base pairs per year. The proposed method comprises six steps: (i) arraying genomic or cDNA M13 clones in 864-well plates (wells of 2 mm); (ii) preparation of DNA samples for spotting by growth of the M13 clones or by polymerase chain reaction (PCR) of the inserts using standard 96-well plates, or plates having as many as 864 correspondingly smaller wells; (iii) robotic spotting of 13,824 samples on an 8 x 12 cm nylon membrane, or correspondingly more, on up to 6 times larger filters, by offset printing with a 96 or 864 0.4 mm pin device; (iv) hybridization of dotted samples with 200-2000 32P-labeled probes comprising 16-256 10-mers having a common 8-mer, 7-mer, or 6-mer in the middle (20 probes per day, each hybridized with 250,000 dots); (v) scoring hybridization signals of 5 million sample-probe pairs per day using storage phosphor plates; and (vi) computing clone order and partial-to-complete DNA sequences using various heuristic algorithms. Genome sequencing based on a combination of this method and gel sequencing techniques may be significantly more economical than gel methods alone.
Article
DNA sequencing by hybridization (SBH) Format 1 technique is based on experiments in which thousands of short oligomers are consecutively hybridized with dense arrays of clones. In this paper we present the description of a method for obtaining hybridization signatures for individual clones that guarantees reproducibility despite a wide range of variations in experimental circumstances, a sensitive method for signature comparison at prespecified significance levels, and a clustering algorithm that correctly identifies clusters of significantly similar signatures. The methods and the algorithm have been verified experimentally on a control set of 422 signatures that originate from 9 distinct clones of known sequence. Experiments indicate that only 30 to 50 oligomer probes suffice for correct clustering. This information about the identity of clones can be used to guide both genomic and cDNA sequencing by SBH or by standard gel-based methods.
Article
Efficient procedures for managing a large number of M13 or plasmid clones have been developed. In addition to picking, clones are directly arrayed in multiwell plates by dispensing diluted transformation mixtures. Metal pin arrays are used for fast inoculations of preparative plates filled by medium or by PCR mixture. Growth of M13 clones in multiwell plates is optimized to obtain a consistently high yield, and a PCR protocol is defined for reliable amplification of several thousand M13 or plasmid inserts per day in BioOvens. Over 80,000 cDNA inserts have been amplified. The phages or amplified inserts are spotted on nylon filters using an array of pins having a flat bottom, 0.3 mm in diameter. The procedures are suitable for an automated processing of hundreds of thousands of short clones from representative cDNA and genomic libraries. Hybridization of arrayed clones with oligonucleotide and complex probes can simplify the search for new genes and accelerate large-scale sequencing.
Article
Diverse biochemical and computational procedures and facilities have been developed to hybridize thousands of DNA clones with short oligonucleotide probes and subsequently to extract valuable genetic information. This technology has been applied to 73,536 cDNA clones from infant brain libraries. By a mutual comparison of 57,419 samples that were successfully scored by 200-320 probes, 19,726 genes have been identified and sorted by their expression levels. The data indicate that an additional 20,000 or more genes may be expressed in the infant brain. Representative clones of the found genes create a valuable resource for complete sequencing and functional studies of many novel genes. These results demonstrate the unique capacity of hybridization technology to identify weakly transcribed genes and to study gene networks involved in organismal development, aging, or tumorigenesis by monitoring the expression of every gene in related tissues, whether known or still undiscovered.
Article
In this paper, a number of existing and novel techniques are considered for ordering cloned extracts from the genome of an organism based on fingerprinting data. A metric is defined for comparing the quality of the clone order for each technique. Simulated annealing is used in combination with several different objective functions. Empirical results with many simulated data sets for which the correct solution is known indicate that a simple greedy algorithm with some subsequent stochastic shuffling provides the best solution. Other techniques that attempt to weight comparisons between nonadjacent clones bias the ordering and give worse results. We show that this finding is not surprising since without detailed attempts to reconcile the data into a detailed map, only approximate maps can be obtained. Making N² pieces of data from measurements of N clones cannot improve the situation.
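A minimal sketch of the kind of greedy ordering the paper finds competitive, assuming binary fingerprints and adjacency scored by probe overlap; the subsequent stochastic shuffling step is omitted, and all names here are our own.

```python
import numpy as np

def greedy_clone_order(F):
    """Greedy ordering heuristic: start from an arbitrary clone and
    repeatedly append the unplaced clone whose binary fingerprint overlaps
    most with the last placed one. F: (n_clones, n_probes) 0/1 matrix."""
    n = F.shape[0]
    order = [0]
    unplaced = set(range(1, n))
    while unplaced:
        last = F[order[-1]]
        nxt = max(unplaced, key=lambda j: int((F[j] & last).sum()))
        order.append(nxt)
        unplaced.remove(nxt)
    return order

# Clones covering a sliding window of probes come out nearly in true order.
F = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 1]], dtype=int)
print(greedy_clone_order(F))  # [0, 1, 2, 3]
```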
Conference Paper
We describe an algorithm that determines the edge connectivity of an n-vertex m-edge graph G in O(nm) time. A refinement shows that the question as to whether a graph is k-edge connected can be determined in O(kn²) time. For dense graphs characterized by m = Ω(n²), the latter result implies that determination of whether a graph is k-edge connected for any fixed k can be accomplished in time linear in input size.
Article
Our purpose is to reconstruct the relative placement of the clones along the DNA molecule; this information is lost in the construction of the clone library. In this article we restrict ourselves to the noiseless case, in which the data are free of experimental error. We also restrict ourselves to hybridization data. However, many of our algorithmic techniques can be extended to incorporate noise and restriction fragment data. Our experiments seem to hint at the following phenomenon. For small values of m, there is simply not enough information in the data to reveal the placement of clones, and therefore all algorithms are expected to perform poorly. For very large m, the argument of section 6 shows that even the most simple-minded method, such as the greedy algorithm, is likely to find the true permutation. However, there seems to be a range of values for m, depending on the underlying placement, where the result of the c⁺ + c⁻ objective is superior to those of the greedy or traveling salesman heuristics.
Article
We have developed a novel algorithm for cluster analysis that is based on graph theoretic techniques. A similarity graph is defined and clusters in that graph correspond to highly connected subgraphs. A polynomial algorithm to compute them efficiently is presented. Our algorithm produces a clustering with some provably good properties. The application that motivated this study was gene expression analysis, where a collection of cDNAs must be clustered based on their oligonucleotide fingerprints. The algorithm has been tested intensively on simulated libraries and was shown to outperform extant methods. It demonstrated robustness to high noise levels. In a blind test on real cDNA fingerprint data the algorithm obtained very good results. Utilizing the results of the algorithm would have saved over 70% of the cDNA sequencing cost on that data set.
Stoer, M., and Wagner, F. (1997). A simple min-cut algorithm. J. ACM 44(4): 585-591.
Drmanac, R., Drmanac, S., Labat, I., Crkvenjakov, R., Vicentic, A., and Gemmell, A. (1992). Sequencing by hybridization: Towards an automated sequencing of one million M13 clones arrayed on membranes. Electrophoresis 13: 566-573.
Wu, Z., and Leahy, R. (1993). An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Trans. Pattern Anal. Machine Intelligence 15(11): 1101-1113.