Conference Paper

Research on Greedy Clique Partition-GCP Algorithm

Abstract

Clustering of binary fingerprints is used in the classification of gene expression data. Clustering binary fingerprints with 3 bits of missing values is known to be NP-hard. The greedy clique partition (GCP) algorithm is a heuristic for clustering binary fingerprints with missing values. In this paper, we first study the features of instances that cannot be resolved by the hash-table-based GCP. We then give a new property of problem instances, which can further improve the linked-list-based heuristic algorithm. Finally, we present an empirical formula for judging the accuracy and credibility of the GCP algorithm.
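The clique-partition idea behind this abstract can be illustrated with a minimal sketch. It assumes fingerprints are strings over `0`, `1` and a missing-value symbol `N` (the symbol and all function names here are hypothetical), and that two fingerprints are compatible when they agree at every position where both are resolved. The first-fit clique growth below is a simplification chosen for brevity; it is not the GCP algorithm itself, nor its hash-table or linked-list implementation.

```python
from typing import List

MISSING = "N"  # assumed symbol for an unresolved bit


def compatible(a: str, b: str) -> bool:
    """True if a and b agree wherever both bits are resolved."""
    return all(x == y or MISSING in (x, y) for x, y in zip(a, b))


def greedy_clique_partition(fps: List[str]) -> List[List[int]]:
    """Partition fingerprint indices into groups of pairwise-compatible items.

    Each group is a clique in the compatibility graph. Compatibility is not
    transitive when bits are missing, so every candidate is checked against
    all current members of the clique, not just the seed.
    """
    unassigned = list(range(len(fps)))
    clusters: List[List[int]] = []
    while unassigned:
        seed = unassigned.pop(0)
        clique = [seed]
        for i in list(unassigned):  # iterate over a copy while removing
            if all(compatible(fps[i], fps[j]) for j in clique):
                clique.append(i)
                unassigned.remove(i)
        clusters.append(clique)
    return clusters


if __name__ == "__main__":
    print(greedy_clique_partition(["10N1", "1011", "0N00", "0100", "N0N1"]))
```

Because the result depends on the order in which fingerprints are visited, a sketch like this can produce more clusters than the minimum clique partition, which is one reason heuristics for this problem need an accuracy estimate.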

... Step 4: Using maximal clique [20] search methods [15][16][17][18][19] among the selected codes, set of 2D orthogonal codes are formed with desired cross-correlation and auto-correlation constraints. ...
Article
Full-text available
In this paper, an algorithm for construction of multiple sets of two dimensional (2D) or matrix unipolar (optical) orthogonal codes has been proposed. Representations of these 2D codes in difference of positions representation (DoPR) have also been discussed along-with conventional weighted positions representation (WPR) of the code. This paper also proposes less complex methods for calculation of auto-correlation as well as cross-correlation constraints within set of matrix codes. The multiple sets of matrix codes provide flexibility for selection of optical orthogonal codes set in wavelength-hopping time-spreading (WHTS) optical code division multiple access (CDMA) system.
... The cross-correlation of the uni-polar code X with code parameters $(n_1, w_1, \lambda_{a1})$ and code Y with parameters $(n_2, w_2, \lambda_{a2})$ is equal to maximum common DoP elements between any two rows of EdoP matrices along-with first column having zero elements of code X and code Y respectively. 12 [41][42][43][44][45][46] or the algorithm proposed here as follows. ...
Article
Full-text available
This paper proposes an algorithm to search a family of multiple sets of minimum correlated one dimensional uni-polar (optical) orthogonal codes (1-DUOC) or optical orthogonal codes (OOC) with fixed as well as variable code parameters. The cardinality of each set is equal to the upper bound. The codes within a set can be searched for general values of code length, code weight, auto-correlation constraint and cross-correlation constraint. Each set forms a maximal clique of the codes within a given range of correlation properties. These one-dimensional uni-polar orthogonal codes can find their application as signature sequences for spectral spreading purpose in incoherent optical code division multiple access (CDMA) systems.
Article
Full-text available
Clustering is one of the main mathematical challenges in large-scale gene expression analysis. We describe a clustering procedure based on a sequential k-means algorithm with additional refinements that is able to handle high-throughput data in the order of hundreds of thousands of data items measured on hundreds of variables. The practical motivation for our algorithm is oligonucleotide fingerprinting, a method for simultaneous determination of expression level for every active gene of a specific tissue, although the algorithm can be applied as well to other large-scale projects like EST clustering and qualitative clustering of DNA-chip data. As a pairwise similarity measure between two p-dimensional data points, x and y, we introduce mutual information that can be interpreted as the amount of information about x in y, and vice versa. We show that for our purposes this measure is superior to commonly used metric distances, for example, Euclidean distance. We also introduce a modified version of mutual information as a novel method for validating clustering results when the true clustering is known. The performance of our algorithm with respect to experimental noise is shown by extensive simulation studies. The algorithm is tested on a subset of 2029 cDNA clones coming from 15 different genes from a cDNA library derived from human dendritic cells. Furthermore, the clustering of these 2029 cDNA clones is demonstrated when the entire set of 76,032 cDNA clones is processed.
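To make the similarity measure in this abstract concrete, the following sketch computes the standard empirical (plug-in) mutual information, in bits, between two equal-length vectors of discrete values. The abstract does not say how the fingerprints are discretized or which estimator is used, so the function name and the assumption of already-discretized inputs are illustrative only.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Empirical mutual information I(x; y) in bits for discrete vectors."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    count_x = Counter(x)           # marginal counts of x
    count_y = Counter(y)           # marginal counts of y
    count_xy = Counter(zip(x, y))  # joint counts of (x_i, y_i)
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[a] * count_y[b]))
    return mi


if __name__ == "__main__":
    print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # identical vectors
    print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent vectors
```

Unlike Euclidean distance, this quantity is unchanged when the labels of one vector are consistently relabelled, so it rewards any consistent relationship between two profiles and not only positional agreement.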
Article
Full-text available
The use of hybridisation of synthetic oligonucleotides to cDNAs under high stringency to characterise gene sequences has been demonstrated by a number of groups. We have used two cDNA libraries of 9 and 12 day mouse embryos (24 133 and 34 783 clones respectively) in a pilot study to characterise expressed genes by hybridisation with 110 hybridisation probes. We have identified 33 369 clusters of cDNA clones, that ranged in representation from 1 to 487 copies (0.7%). 737 were assigned to known rodent genes, and a further 13 845 showed significant homologies. A total of 404 clusters were identified as significantly differentially represented (P < 0.01) between the two cDNA libraries. This study demonstrates the utility of the fingerprinting approach for the generation of comparative gene expression profiles through the analysis of cDNAs derived from different biological materials.
Article
Full-text available
Novel DNA microarray technologies enable the monitoring of expression levels of thousands of genes simultaneously. This allows a global view on the transcription levels of many (or all) genes when the cell undergoes specific conditions or processes. Analyzing gene expression data requires the clustering of genes into groups with similar expression patterns. We have developed a novel clustering algorithm, called CLICK, which is applicable to gene expression analysis as well as to other biological applications. No prior assumptions are made on the structure or the number of the clusters. The algorithm utilizes graph-theoretic and statistical techniques to identify tight groups of highly similar elements (kernels), which are likely to belong to the same true cluster. Several heuristic procedures are then used to expand the kernels into the full clustering. CLICK has been implemented and tested on a variety of biological datasets, ranging from gene expression, cDNA oligo-fingerprinting to protein sequence similarity. In all those applications it outperformed extant algorithms according to several common figures of merit. CLICK is also very fast, allowing clustering of thousands of elements in minutes, and over 100,000 elements in a couple of hours on a regular workstation.
Article
Full-text available
Thorough assessments of fungal diversity are currently hindered by technological limitations. Here we describe a new method for identifying fungi, oligonucleotide fingerprinting of rRNA genes (OFRG). OFRG sorts arrayed rRNA gene (ribosomal DNA [rDNA]) clones into taxonomic clusters through a series of hybridization experiments, each using a single oligonucleotide probe. A simulated annealing algorithm was used to design an OFRG probe set for fungal rDNA. Analysis of 1,536 fungal rDNA clones derived from soil generated 455 clusters. A pairwise sequence analysis showed that clones with average sequence identities of 99.2% were grouped into the same cluster. To examine the accuracy of the taxonomic identities produced by this OFRG experiment, we determined the nucleotide sequences for 117 clones distributed throughout the tree. For all but two of these clones, the taxonomic identities generated by this OFRG experiment were consistent with those generated by a nucleotide sequence analysis. Eighty-eight percent of the clones were affiliated with Ascomycota, while 12% belonged to Basidiomycota. A large fraction of the clones were affiliated with the genera Fusarium (404 clones) and Raciborskiomyces (176 clones). Smaller assemblages of clones had high sequence identities to the Alternaria, Ascobolus, Chaetomium, Cryptococcus, and Rhizoctonia clades.
Article
Full-text available
Oligonucleotide fingerprinting is a powerful DNA array based method to characterize cDNA and ribosomal RNA gene (rDNA) libraries and has many applications including gene expression profiling and DNA clone classification. We are especially interested in the latter application. A key step in the method is the cluster analysis of fingerprint data obtained from DNA array hybridization experiments. Most of the existing approaches to clustering use (normalized) real intensity values and thus do not treat positive and negative hybridization signals equally (positive signals are much more emphasized). In this paper, we consider a discrete approach. Fingerprint data are first normalized and binarized using control DNA clones. Because there may exist unresolved (or missing) values in this binarization process, we formulate the clustering of (binary) oligonucleotide fingerprints as a combinatorial optimization problem that attempts to identify clusters and resolve the missing values in the fingerprints simultaneously. We study the computational complexity of this clustering problem and a natural parameterized version, and present an efficient greedy algorithm based on MINIMUM CLIQUE PARTITION on graphs. The algorithm takes advantage of some unique properties of the graphs considered here, which allow us to efficiently find the maximum cliques as well as some special maximal cliques. Our experimental results on simulated and real data demonstrate that the algorithm runs faster and performs better than some popular hierarchical and graph-based clustering methods. The results on real data from DNA clone classification also suggest that this discrete approach is more accurate than clustering methods based on real intensity values, in terms of separating clones that have different characteristics with respect to the given oligonucleotide probes.
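The idea of identifying clusters and resolving missing values simultaneously can be sketched as follows. The sketch assumes binarized fingerprints are strings over `0`, `1` and a missing-value symbol `N` (a hypothetical encoding). Once a set of fingerprints is pairwise compatible, every column contains at most one resolved value, so the cluster can be merged into a single consensus fingerprint that fills in the unresolved bits. The function name is illustrative and is not taken from the paper.

```python
from typing import List, Optional

MISSING = "N"  # assumed symbol for an unresolved bit


def consensus(cluster: List[str]) -> Optional[str]:
    """Merge equal-length fingerprints column by column.

    Returns the consensus fingerprint, or None if some column holds both
    a 0 and a 1 (the fingerprints do not form a compatible cluster).
    A column with no resolved bit stays MISSING in the consensus.
    """
    merged = []
    for column in zip(*cluster):
        resolved = {bit for bit in column if bit != MISSING}
        if len(resolved) > 1:
            return None  # conflicting resolved bits in this column
        merged.append(resolved.pop() if resolved else MISSING)
    return "".join(merged)


if __name__ == "__main__":
    print(consensus(["10N1", "1011", "N0N1"]))
    print(consensus(["10", "01"]))
```

Assigning the consensus to every member of a cluster resolves each missing bit for which at least one member carries a value, which is what ties the choice of clusters to the choice of resolved values.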
Article
Full-text available
Technologies for generating high-density arrays of cDNAs and oligonucleotides are developing rapidly, and changing the landscape of biological and biomedical research. They enable, for the first time, a global, simultaneous view on the transcription levels of many thousands of genes, when the cell undergoes specific conditions or processes. For several organisms that had their genomes completely sequenced, the full set of genes can already be monitored this way today. The potential of such technologies is tremendous: The information obtained by monitoring gene expression levels in different developmental stages, tissue types, clinical conditions and different organisms can help understanding gene function and gene networks, and assist in the diagnostic of disease conditions and of effects of medical treatments. Undoubtedly, other applications will emerge in coming years. A key step in the analysis of gene expression data is the identification of groups of genes that manifest...
Article
DNA sequencing by hybridization (SBH) Format 1 technique is based on experiments in which thousands of short oligomers are consecutively hybridized with dense arrays of clones. In this paper we present the description of a method for obtaining hybridization signatures for individual clones that guarantees reproducibility despite a wide range of variations in experimental circumstances, a sensitive method for signature comparison at prespecified significance levels, and a clustering algorithm that correctly identifies clusters of significantly similar signatures. The methods and the algorithm have been verified experimentally on a control set of 422 signatures that originate from 9 distinct clones of known sequence. Experiments indicate that only 30 to 50 oligomer probes suffice for correct clustering. This information about the identity of clones can be used to guide both genomic and cDNA sequencing by SBH or by standard gel-based methods.
Article
Diverse biochemical and computational procedures and facilities have been developed to hybridize thousands of DNA clones with short oligonucleotide probes and subsequently to extract valuable genetic information. This technology has been applied to 73,536 cDNA clones from infant brain libraries. By a mutual comparison of 57,419 samples that were successfully scored by 200-320 probes, 19,726 genes have been identified and sorted by their expression levels. The data indicate that an additional 20,000 or more genes may be expressed in the infant brain. Representative clones of the found genes create a valuable resource for complete sequencing and functional studies of many novel genes. These results demonstrate the unique capacity of hybridization technology to identify weakly transcribed genes and to study gene networks involved in organismal development, aging, or tumorigenesis by monitoring the expression of every gene in related tissues, whether known or still undiscovered.
Article
A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression. The output is displayed graphically, conveying the clustering and the underlying expression data simultaneously in a form intuitive for biologists. We have found in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data. Thus patterns seen in genome-wide expression experiments can be interpreted as indications of the status of cellular processes. Also, coexpression of genes of known function with poorly characterized or novel genes may provide a simple means of gaining leads to the functions of many genes for which information is not available currently.
Article
Array technologies have made it straightforward to monitor simultaneously the expression pattern of thousands of genes. The challenge now is to interpret such massive data sets. The first step is to extract the fundamental patterns of gene expression inherent in the data. This paper describes the application of self-organizing maps, a type of mathematical cluster analysis that is particularly well suited for recognizing and classifying features in complex, multidimensional data. The method has been implemented in a publicly available computer package, GENECLUSTER, that performs the analytical calculations and provides easy data visualization. To illustrate the value of such analysis, the approach is applied to hematopoietic differentiation in four well studied models (HL-60, U937, Jurkat, and NB4 cells). Expression patterns of some 6,000 human genes were assayed, and an online database was created. GENECLUSTER was used to organize the genes into biologically relevant clusters that suggest novel hypotheses about hematopoietic differentiation, for example, highlighting certain genes and pathways involved in "differentiation therapy" used in the treatment of acute promyelocytic leukemia.
Article
Sequencing by hybridization (SBH) applied to dense arrays of clones achieves an unprecedented throughput of analyzed samples that can be performed inexpensively. The adaptation of this technology to an industrial setting has already produced 1 million partial cDNA sequences per month and a database of more than 4 million individual cDNA clones in the facility. This chapter discusses the present protocols and the potential application of this technology to the fields of gene discovery and gene function. SBH technology applied on an array of clones provides a progressive approach from expression screening to complete sequencing of complex libraries. The same approach is applicable for mutation detection and complete sequencing of many genes amplified by polymerase chain reaction (PCR) from thousands of individual samples.
Article
DNA microarray technologies together with rapidly increasing genomic sequence information is leading to an explosion in available gene expression data. Currently there is a great need for efficient methods to analyze and visualize these massive data sets. A self-organizing map (SOM) is an unsupervised neural network learning algorithm which has been successfully used for the analysis and organization of large data files. We have here applied the SOM algorithm to analyze published data of yeast gene expression and show that SOM is an excellent tool for the analysis and visualization of gene expression profiles.
Article
Clustering large data sets is a central challenge in gene expression analysis. The hybridization of synthetic oligonucleotides to arrayed cDNAs yields a fingerprint for each cDNA clone. Cluster analysis of these fingerprints can identify clones corresponding to the same gene. We have developed a novel algorithm for cluster analysis that is based on graph theoretic techniques. Unlike other methods, it does not assume that the clusters are hierarchically structured and does not require prior knowledge on the number of clusters. In tests with simulated libraries the algorithm outperformed the Greedy method and demonstrated high speed and robustness to high error rate. Good solution quality was also obtained in a blind test on real cDNA fingerprints.
Conference Paper
In DNA clone classification, a key step is the cluster analysis of fingerprint data. A good algorithm, GCP (greedy clique partition), was presented recently. In this paper, a highly efficient implementation of GCP is presented, and a minor defect of GCP is pointed out. Our experimental results on simulated data demonstrate that GCP is more efficient and more accurate than other algorithms such as UPGMA, CLUSTER and CLICK.
Article
We approach the class discovery and leaf ordering problems using spectral graph partitioning methodologies. For class discovery or clustering, we present a min-max cut hierarchical clustering method and show it produces subtypes quite close to human expert labeling on the lymphoma dataset with 6 classes. On optimal leaf ordering for displaying the gene expression data, we present a sequential ordering method that can be computed in $O(n^2)$ time which also preserves the cluster structure. We also show that well-known statistical methods such as the F-statistic test and principal component analysis are very useful in gene expression analysis.
Article
We present CLIFF, an algorithm for clustering biological samples using gene expression microarray data. This clustering problem is difficult for several reasons, in particular the sparsity of the data, the high dimensionality of the feature (gene) space, and the fact that many features are irrelevant or redundant. Our algorithm iterates between two computational processes, feature filtering and clustering. Given a reference partition that approximates the correct clustering of the samples, our feature filtering procedure ranks the features according to their intrinsic discriminability, relevance to the reference partition, and irredundancy to other relevant features, and uses this ranking to select the features to be used in the following round of clustering. Our clustering algorithm, which is based on the concept of a normalized cut, clusters the samples into a new reference partition on the basis of the selected features. On a well-studied problem involving 72 leukemia samples and 7130 genes, we demonstrate that CLIFF outperforms standard clustering approaches that do not consider the feature selection issue, and produces a result that is very close to the original expert labeling of the sample set. Contact: epxing@cs.berkeley.edu