Conference Paper

αCORR: A novel algorithm for clustering gene expression data

Authors:
  • Alexandria University, Faculty of Engineering, Alexandria, Egypt

Abstract

Recent advances in biotechnology allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. Analysis of data produced by such experiments offers potential insight into gene function and regulatory mechanisms. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. The corresponding algorithmic problem is to cluster multi-condition gene expression patterns. This paper introduces a new clustering algorithm for gene expression data. The proposed algorithm is designed to avoid some of the drawbacks of existing algorithms for clustering gene expression data. The proposed αCORR clustering algorithm is tested and verified on real biological data sets.


... The algorithm produces a hierarchical dendrogram from the k-nearest table, which is continuously updated by the arrival of a new object. An iterative algorithm based on the average linkage strategy is also proposed in Sharara and Ismail (2007). ...
... Symbolic approach can be used in classification as found in Mahfouz et al. (2016). The most relevant research studies to our focus are by Mahfouz and Ismail (2012), Maji and Pal (2007), Sharara and Ismail (2007), and Xenaki et al. (2016). ...
... However, for large datasets this would be an extensive parameter fine-tuning process and not at all practical. The techniques most closely related to the proposed initialisation procedure are Sharara and Ismail (2007) and Maji and Pal (2007). Sharara and Ismail (2007) specify a threshold α on the average similarity within each cluster. ...
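The idea of a threshold α on the average similarity within a cluster can be illustrated with a small sketch. This is not the αCORR algorithm itself, whose details are not given on this page. It is a hypothetical greedy procedure that assumes Pearson correlation as the similarity measure and assigns each gene to the existing cluster with the highest average correlation, provided that average is at least α. The names `pearson` and `threshold_clusters` are illustrative only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: treat as uncorrelated (assumption)
    return sxy / sqrt(sxx * syy)

def threshold_clusters(profiles, alpha):
    """Greedy sketch: put each gene into the cluster with the highest
    average correlation to its current members, if that average is at
    least alpha; otherwise open a new cluster. Returns index lists."""
    clusters = []
    for g, profile in enumerate(profiles):
        best, best_avg = None, alpha
        for members in clusters:
            avg = sum(pearson(profile, profiles[m]) for m in members) / len(members)
            if avg >= best_avg:
                best, best_avg = members, avg
        if best is None:
            clusters.append([g])
        else:
            best.append(g)
    return clusters

genes = [
    [1.0, 2.0, 3.0, 4.0],  # rising
    [2.0, 4.1, 6.0, 8.2],  # rising, different scale
    [4.0, 3.0, 2.0, 1.0],  # falling
    [8.0, 6.1, 4.0, 2.0],  # falling, different scale
]
print(threshold_clusters(genes, alpha=0.8))  # [[0, 1], [2, 3]]
```

In this sketch the number of clusters is not fixed in advance; it follows from α. The result depends on the order in which genes are visited, which an iterative refinement step would need to address.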
Article
Reduced bio-basis is the minimal set of fixed-length sub-sequences of a biological sequence with maximum information. Sequence data are not numerical, so centroid-based clustering algorithms are not directly applicable. The main contribution of this paper is to show how to apply centroid-based algorithms to biological sequences. The average similarity between a sub-sequence and the other sub-sequences in a cluster is reduced to a similarity between the sub-sequence and an artificial centre, formed in a similar way to the centre of symbolic objects. After applying the hard version of the proposed symbolic clustering algorithm, a possibilistic membership is computed for each sub-sequence, which adds strong outlier-rejection capability to the algorithm. Well-studied issues for the centroid-based approach, such as parallelism or scalability, can be applied to the proposed approach. Experimental results on several real datasets show that the proposed approach, in several respects, is superior to traditional methods.
... Other algorithms are based on maximizing the average similarity between rows/columns of a bicluster, such as the correlation-based algorithms BISOFT [4] and BCCA [15]. In BISOFT, after all biclusters are generated, memberships are assigned to genes only using a simple formula. ...
... 2) The complexity analysis for the proposed algorithm along with the experimental results shows that the algorithm compares favorably to FLOC. Also it has much lower complexity than algorithms based on average similarity [4] due to its randomized search approach. 3) Unlike PBC, EPBC, our iterative approach allows constraints to be put on produced biclusters. ...
Conference Paper
Biclustering is a powerful data mining technique for analyzing gene expression data from microarray technology: it identifies groups of genes that are co-regulated and co-expressed under a subset of conditions. Possibilistic biclustering algorithms can give much insight into the different biological processes that each gene might participate in and the conditions under which its participation is most effective. This paper proposes an iterative algorithm that is able to produce k possibly overlapping semi-possibilistic (or soft) biclusters satisfying input constraints. Several previous possibilistic approaches are sensitive to their input parameters and initial conditions; in addition, they do not allow constraints to be put on the residue of produced biclusters and can work only as a refinement step after applying hard biclustering. Our semi-possibilistic approach allows discovering overlapping biclusters with meaningful memberships while reducing the effect of very small memberships that may participate in iterations of possibilistic approaches. An experimental study on Yeast and Human datasets shows that our algorithm can offer substantial improvements in the quality of the output biclusters over several previously proposed biclustering algorithms.
... Correlation-based algorithms such as BISOFT [2] and BCCA [15] start with an initial bicluster and iteratively add a new row/column to the current bicluster such that the added row/column satisfies the criterion of keeping the average homogeneity within the bicluster above a prespecified threshold for each dimension. In BISOFT, after all biclusters are generated, memberships are assigned to genes only using a simple formula. ...
Conference Paper
In contrast to hard biclustering, possibilistic biclustering not only clusters a group of genes together with a group of conditions but also has outlier-rejection capabilities and can give insight into the degree to which the participation of a row or a column is most effective. Several previous possibilistic approaches are based on computing the zeros of an objective function. However, they are sensitive to their input parameters and initial conditions, and they do not allow constraints on biclusters. This paper proposes an iterative algorithm that is able to produce k possibly overlapping semi-possibilistic (soft) biclusters satisfying input constraints. The proposed algorithm alternates between a depth-first search and a breadth-first search to effectively minimize the underlying objective function. It allows constraints, is applicable to any acceptable (dis)similarity measure for the type of the input dataset, and is not sensitive to initial conditions. Experimental results show the ability of our algorithm to offer substantial improvements over several previously proposed biclustering algorithms.
Conference Paper
In gene expression analysis, grouping co-regulated genes is a major step in discovering genes which are likely to have related biological functions. This critical step can be done using clustering. This paper formally presents three models for iterative clustering based on average, single and complete linkage strategies. Variations of relational clustering algorithms can be built based on these models. The number of clusters need not be known in advance. Unlike centroid- and medoid-based algorithms, the proposed approach avoids minimizing a least-squares type objective function; instead, it maximizes the average similarity between objects of the same cluster using a subset of the similarity matrix. The top k nearest, farthest or near-average entries in each row of the similarity matrix need to be identified, depending on the required linkage strategy. To reduce the computational complexity of this step, a randomized search or genetic technique can be used to approximate these elements; however, in our experimental studies, the exact k elements are computed. The performance of the proposed algorithms is evaluated and compared to existing techniques on two standard gene expression datasets.
Article
Biclustering is a key step in analyzing gene expression data: it identifies patterns in which a subset of genes is co-related across a subset of conditions. This paper proposes a new distance-based possibilistic biclustering algorithm (DPBC), in which the average distances between rows and between columns of the bicluster are minimized and, at the same time, the size of the bicluster is maximized by computing the zeros of the derivative of an appropriate objective function. The proposed algorithm uses the possibilistic clustering paradigm, similar to the existing possibilistic biclustering algorithm PBC. Whereas PBC is based on residue, our approach is applicable to any accepted definition of distance between pairs of rows or columns. An experimental study on the human dataset and several artificial datasets with different noise levels shows that the DPBC algorithm can offer substantial improvements over previously proposed algorithms.
Article
Biclustering is a very useful data mining technique for identifying patterns where different genes are co-related based on a subset of conditions in gene expression analysis. Association rule mining is an efficient approach to biclustering, as in the BIMODULE algorithm, but it is sensitive to the values given to its input parameters and to the discretization procedure used in the preprocessing step. In addition, when noise is present, classical association rule miners discover multiple small fragments of the true bicluster but miss the true bicluster itself. This paper formally presents a generalized noise-tolerant bicluster model, termed µBicluster. An iterative algorithm based on the proposed model, termed BIDENS, is introduced that can discover a set of k possibly overlapping biclusters simultaneously. Our model uses a more flexible method to partition the dimensions in order to preserve meaningful and significant biclusters. The proposed algorithm allows discovering biclusters that are hard for BIMODULE to discover. An experimental study on yeast and human gene expression data and several artificial datasets shows that our algorithm offers substantial improvements over several previously proposed biclustering algorithms.
Conference Paper
Microarrays have become a standard tool for investigating gene functions, resulting in vast amounts of data exhibiting a high level of complexity, and consequently a need for better means of analysis. One of the steps in the analysis of such data is the detection of groups of co-regulated genes, which are likely to have related biological functions. Several clustering methods have been proposed to detect those groups of genes, but they have a major limitation imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. Biclustering, on the other hand, seeks to find sub-matrices, that is, subgroups of genes and subgroups of columns, where the genes exhibit highly correlated activities for every condition. This paper proposes BISOFT, a new semi-fuzzy algorithm for biclustering gene expression data. The design of the algorithm is based on an extension to the αCORR clustering algorithm [10], in addition to introducing new ideas for discovering hidden overlapping biclusters in the underlying data. The proposed biclustering algorithm is thoroughly tested and verified on classical data sets from different biological sources.
Article
Motivation: We describe a new approach to the analysis of gene expression data coming from DNA array experiments, using an unsupervised neural network. DNA array technologies allow monitoring thousands of genes rapidly and efficiently. One of the interests of these studies is the search for correlated gene expression patterns, and this is usually achieved by clustering them. The Self-Organising Tree Algorithm, (SOTA) (Dopazo,J. and Carazo,J.M. (1997) J. Mol. Evol., 44, 226-233), is a neural network that grows adopting the topology of a binary tree. The result of the algorithm is a hierarchical cluster obtained with the accuracy and robustness of a neural network. Results: SOTA clustering confers several advantages over classical hierarchical clustering methods. SOTA is a divisive method: the clustering process is performed from top to bottom, i.e. the highest hierarchical levels are resolved before going to the details of the lowest levels. The growing can be stopped at the desired hierarchical level. Moreover, a criterion to stop the growing of the tree, based on the approximate distribution of probability obtained by randomisation of the original data set, is provided. By means of this criterion, a statistical support for the definition of clusters is proposed. In addition, obtaining average gene expression patterns is a built-in feature of the algorithm. Different neurons defining the different hierarchical levels represent the averages of the gene expression patterns contained in the clusters. Since SOTA runtimes are approximately linear with the number of items to be classified, it is especially suitable for dealing with huge amounts of data. The method proposed is very general and applies to any data providing that they can be coded as a series of numbers and that a computable measure of similarity between data items can be used. Availability: A server running the program can be found at: http://bioinfo.cnio.es/sotarray.
Article
A high-capacity system was developed to monitor the expression of many genes in parallel. Microarrays prepared by high-speed robotic printing of complementary DNAs on glass were used for quantitative expression measurements of the corresponding genes. Because of the small format and high density of the arrays, hybridization volumes of 2 microliters could be used that enabled detection of rare transcripts in probe mixtures derived from 2 micrograms of total cellular messenger RNA. Differential expression measurements of 45 Arabidopsis genes were made by means of simultaneous, two-color fluorescence hybridization.
Article
Technologies to measure whole-genome mRNA abundances and methods to organize and display such data are emerging as valuable tools for systems-level exploration of transcriptional regulatory networks. For instance, it has been shown that mRNA data from 118 genes, measured at several time points in the developing hindbrain of mice, can be hierarchically clustered into various patterns (or 'waves') whose members tend to participate in common processes. We have previously shown that hierarchical clustering can group together genes whose cis-regulatory elements are bound by the same proteins in vivo. Hierarchical clustering has also been used to organize genes into hierarchical dendrograms on the basis of their expression across multiple growth conditions. The application of Fourier analysis to synchronized yeast mRNA expression data has identified cell-cycle periodic genes, many of which have expected cis-regulatory elements. Here we apply a systematic set of statistical algorithms, based on whole-genome mRNA data, partitional clustering and motif discovery, to identify transcriptional regulatory sub-networks in yeast, without any a priori knowledge of their structure or any assumptions about their dynamics. This approach uncovered new regulons (sets of co-regulated genes) and their putative cis-regulatory elements. We used statistical characterization of known regulons and motifs to derive criteria by which we infer the biological significance of newly discovered regulons and motifs. Our approach holds promise for the rapid elucidation of genetic network architecture in sequenced organisms in which little biology is known.
Article
Recent advances in biotechnology allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. Analysis of data produced by such experiments offers potential insight into gene function and regulatory mechanisms. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. The corresponding algorithmic problem is to cluster multicondition gene expression patterns. In this paper we describe a novel clustering algorithm that was developed for analysis of gene expression data. We define an appropriate stochastic error model on the input, and prove that under the conditions of the model, the algorithm recovers the cluster structure with high probability. The running time of the algorithm on an n-gene dataset is O(n^2 [log(n)]^c). We also present a practical heuristic based on the same algorithmic ideas. The heuristic was implemented and its performance is demonstrated on simulated data and on real gene expression data, with very promising results.
Article
We compare several commonly used expression-based gene clustering algorithms using a figure of merit based on the mutual information between cluster membership and known gene attributes. By studying various publicly available expression data sets we conclude that enrichment of clusters for biological function is, in general, highest at rather low cluster numbers. As a measure of dissimilarity between the expression patterns of two genes, no method outperforms Euclidean distance for ratio-based measurements, or Pearson distance for non-ratio-based measurements at the optimal choice of cluster number. We show the self-organized-map approach to be best for both measurement types at higher numbers of clusters. Clusters of genes derived from single- and average-linkage hierarchical clustering tend to produce worse-than-random results. [The algorithm described is available at http://llama.med.harvard.edu, under Software.]
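The quantity underlying such a figure of merit can be sketched as follows. The abstract does not give the exact definition used in the paper, so this is only the plain mutual information (in bits) between a cluster label and a single categorical gene attribute. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(clusters, attributes):
    """Mutual information, in bits, between two equal-length label lists."""
    n = len(clusters)
    joint = Counter(zip(clusters, attributes))
    count_c = Counter(clusters)
    count_a = Counter(attributes)
    mi = 0.0
    for (c, a), n_ca in joint.items():
        # p(c, a) * log2(p(c, a) / (p(c) * p(a))), written with raw counts
        mi += (n_ca / n) * log2(n_ca * n / (count_c[c] * count_a[a]))
    return mi

cluster_of_gene = [0, 0, 1, 1]
informative = ["ribosome", "ribosome", "cell cycle", "cell cycle"]
uninformative = ["ribosome", "cell cycle", "ribosome", "cell cycle"]

print(mutual_information(cluster_of_gene, informative))    # 1.0
print(mutual_information(cluster_of_gene, uninformative))  # 0.0
```

A clustering that lines up perfectly with the attribute reaches the entropy of the attribute (1 bit here), while a clustering that is independent of it scores 0.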
Article
The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics. This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation. The software used in the experiments is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/. Enlarged colour plots are provided in the Supplementary Material, which is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/.
Article
Introduction. Novel DNA microarray technologies enable the monitoring of expression levels of thousands of genes simultaneously. This allows for the first time a global view on the transcription levels of many (or all) genes when the cell undergoes specific conditions or processes. The potential of such technologies for functional genomics is tremendous: Measuring gene expression levels in different developmental stages, different body tissues, different clinical conditions and different organisms is instrumental in understanding gene function, gene networks, biological processes and effects of medical treatments. A first key step in the analysis of gene expression data is the identification of groups of genes that manifest similar expression patterns over several conditions. The corresponding algorithmic problem is to cluster multi-conditional gene expression patterns. A clustering problem consists of n elements and a characteristic ve
Article
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Article
Two separation indices are considered for partitions P = {X1, …, Xk} of a finite data set X in a general inner product space. Both indices increase as the pairwise distances between the subsets Xi become large compared to the diameters of the Xi. Maximally separated partitions P' are defined, and it is shown that as the indices of P' increase without bound, the characteristic functions of the Xi' in P' are approximated more and more closely by the membership functions in fuzzy partitions which minimize certain fuzzy extensions of the k-means squared error criterion function.
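A separation index of this kind can be sketched as the ratio of the smallest distance between two subsets to the largest subset diameter. The abstract mentions two indices without defining them, so the following is only one commonly used form, assuming Euclidean distance and single-link distance between subsets. The function names are illustrative.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def diameter(cluster):
    """Largest pairwise distance inside one subset (0 for a singleton)."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)

def set_distance(a, b):
    """Smallest distance between a point of a and a point of b."""
    return min(dist(p, q) for p in a for q in b)

def separation_index(partition):
    """Minimum between-subset distance divided by maximum diameter."""
    max_diam = max(diameter(c) for c in partition)
    min_sep = min(set_distance(a, b) for a, b in combinations(partition, 2))
    if max_diam == 0:
        return float("inf")  # every subset is a single point
    return min_sep / max_diam

compact = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
stretched = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]
print(separation_index(compact))    # 5.0
print(separation_index(stretched))  # 0.2
```

The same four points give a high index when grouped into two tight, distant pairs and a low index when each subset spans the whole data set.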
Article
The problem of comparing two different partitions of a finite set of objects reappears continually in the clustering literature. We begin by reviewing a well-known measure of partition correspondence often attributed to Rand (1971), discuss the issue of correcting this index for chance, and note that a recent normalization strategy developed by Morey and Agresti (1984) and adopted by others (e.g., Milligan and Cooper 1985) is based on an incorrect assumption. Then, the general problem of comparing partitions is approached indirectly by assessing the congruence of two proximity matrices using a simple cross-product measure. They are generated from corresponding partitions using various scoring rules. Special cases derivable include traditionally familiar statistics and/or ones tailored to weight certain object pairs differentially. Finally, we propose a measure based on the comparison of object triples having the advantage of a probabilistic interpretation in addition to being corrected for chance (i.e., assuming a constant value under a reasonable null hypothesis) and bounded between ±1.
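The chance correction of the Rand index discussed here is commonly known as the adjusted Rand index. A minimal sketch of the usual pair-counting formula follows, assuming at least two objects and flat label lists as input. It does not cover the triple-based measure that the abstract also proposes.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same objects."""
    n = len(labels_a)
    # Pairs of objects placed together in both partitions, in A, and in B.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # expected value under chance
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (pairs_joint - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

Relabelling the clusters does not change the score, and agreement that is worse than expected by chance yields a negative value.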
Article
Complete sequences of genomes and comprehensive sets of cDNA sequences open the way to a huge range of biological problems. There is a need for analytical methods that can deal with the large number of sequences in the data banks and, ideally, we would like to analyse all sequences together. Gel-based sequencing is a serial process that analyses one sequence at a time. Oligonucleotide arrays, or 'DNA chips', are miniature, parallel analytical devices, which could bring to sequence analysis and molecular genetics many of the advantages that semiconductor devices brought to computing.
Article
A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression. The output is displayed graphically, conveying the clustering and the underlying expression data simultaneously in a form intuitive for biologists. We have found in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data. Thus patterns seen in genome-wide expression experiments can be interpreted as indications of the status of cellular processes. Also, coexpression of genes of known function with poorly characterized or novel genes may provide a simple means of gaining leads to the functions of many genes for which information is not available currently.
Article
Array technologies have made it straightforward to monitor simultaneously the expression pattern of thousands of genes. The challenge now is to interpret such massive data sets. The first step is to extract the fundamental patterns of gene expression inherent in the data. This paper describes the application of self-organizing maps, a type of mathematical cluster analysis that is particularly well suited for recognizing and classifying features in complex, multidimensional data. The method has been implemented in a publicly available computer package, GENECLUSTER, that performs the analytical calculations and provides easy data visualization. To illustrate the value of such analysis, the approach is applied to hematopoietic differentiation in four well studied models (HL-60, U937, Jurkat, and NB4 cells). Expression patterns of some 6,000 human genes were assayed, and an online database was created. GENECLUSTER was used to organize the genes into biologically relevant clusters that suggest novel hypotheses about hematopoietic differentiation, for example, highlighting certain genes and pathways involved in "differentiation therapy" used in the treatment of acute promyelocytic leukemia.
Article
Clustering large data sets is a central challenge in gene expression analysis. The hybridization of synthetic oligonucleotides to arrayed cDNAs yields a fingerprint for each cDNA clone. Cluster analysis of these fingerprints can identify clones corresponding to the same gene. We have developed a novel algorithm for cluster analysis that is based on graph theoretic techniques. Unlike other methods, it does not assume that the clusters are hierarchically structured and does not require prior knowledge on the number of clusters. In tests with simulated libraries the algorithm outperformed the Greedy method and demonstrated high speed and robustness to high error rate. Good solution quality was also obtained in a blind test on real cDNA fingerprints.
An algorithm for clustering of cDNAs for gene expression analysis using short oligonucleotide fingerprints
  • E Hartuv
  • A Schmitt
  • J Lange
  • S Meier-Ewert
  • R Shamir
E. Hartuv, A. Schmitt, J. Lange, S. Meier-Ewert, and R. Shamir, "An algorithm for clustering of cDNAs for gene expression analysis using short oligonucleotide fingerprints," Genomics, vol. 66, pp. 249-256, 2000.
1-4244-1509-8/07/$25.00 ©2007 IEEE