Article

Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data


Abstract

Motivation: Because co-expressed genes are likely to share the same biological function, cluster analysis of gene expression profiles has been applied for gene function discovery. Most existing clustering methods ignore known gene functions in the process of clustering. Results: To take advantage of accumulating gene functional annotations, we propose incorporating known gene functions into a new distance metric, which shrinks a gene expression-based distance towards 0 if and only if the two genes share a common gene function. A two-step procedure is used. First, the shrinkage distance metric is used in any distance-based clustering method, e.g. K-medoids or hierarchical clustering, to cluster the genes with known functions. Second, while keeping the clustering results from the first step for the genes with known functions, the expression-based distance metric is used to cluster the remaining genes of unknown function, assigning each of them either to one of the clusters obtained in the first step or to a new cluster. A simulation study and an application to gene function prediction in yeast demonstrate the advantage of our proposal over the standard method.
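The two-step procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact form of the shrinkage, so the multiplicative rule `(1 - lam) * d`, the parameter `lam`, the use of 1 minus the Pearson correlation as the expression-based distance, and the threshold rule in `assign_unknown` are all assumptions made for illustration, not the authors' formulas.

```python
import math

def correlation_distance(x, y):
    """Expression-based distance: 1 minus the Pearson correlation.
    Assumes neither profile is constant (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def shrinkage_distance(x, y, funcs_x, funcs_y, lam=0.5):
    """Step 1 metric: shrink the expression distance towards 0 only when
    the two genes share at least one annotated function.
    The multiplicative form and lam are hypothetical choices."""
    d = correlation_distance(x, y)
    if set(funcs_x) & set(funcs_y):
        return (1.0 - lam) * d
    return d

def assign_unknown(profile, medoids, threshold):
    """Step 2: place a gene of unknown function in the nearest step-1
    cluster using the plain expression distance, or return None to signal
    that it should go to a new cluster (hypothetical threshold rule)."""
    dists = [correlation_distance(profile, m) for m in medoids]
    best = min(range(len(medoids)), key=dists.__getitem__)
    return best if dists[best] <= threshold else None
```

The matrix of `shrinkage_distance` values for the annotated genes could then be passed to any distance-based method, such as K-medoids or hierarchical clustering, as the abstract suggests.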


... For certain Bioinformatics applications, weak supervised information is available in the form of which pairs of proteins or genes are related [10]. This incomplete supervision may be incorporated into semi-supervised clustering algorithms formulated as pair-wise constraints [11], [12]. ...
... Noisy features may deteriorate the clustering performance. Therefore, feature selection to remove redundant variables is recommended to improve the clustering results [10]. To this aim, genes (features) are ranked by the interquartile range (IQR). ...
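As a rough illustration of ranking genes by interquartile range, the following Python sketch uses only the standard library. The excerpt does not say how many genes are kept or which quantile convention is used, so `top_k` and the default method of `statistics.quantiles` are assumptions.

```python
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def rank_genes_by_iqr(expression, top_k):
    """expression maps a gene name to its expression values across samples.
    Returns the top_k gene names with the largest IQR (top_k is a
    hypothetical cut-off, not taken from the excerpt)."""
    ranked = sorted(expression, key=lambda g: iqr(expression[g]), reverse=True)
    return ranked[:top_k]
```

Genes whose expression barely varies across samples receive a small IQR and fall to the bottom of the ranking, which is why the measure is used as a simple noise filter.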
Article
Clustering algorithms such as k-means depend heavily on choosing an appropriate distance metric that accurately reflects the object proximities. A wide range of dissimilarities may be defined, and they often lead to different clustering results. Choosing the best dissimilarity is an ill-posed problem, and learning a general distance from the data is a complex task, particularly for high-dimensional problems. Therefore, an appealing approach is to learn an ensemble of dissimilarities. In this paper, we have developed a semi-supervised clustering algorithm that learns a linear combination of dissimilarities considering incomplete knowledge in the form of pairwise constraints. The minimization of the loss function is based on a robust and efficient quadratic optimization algorithm. In addition, a regularization term controls the complexity of the learned distance metric to avoid overfitting. The algorithm has been applied to the identification of tumor samples using gene expression profiles, where domain experts often provide incomplete knowledge in the form of pairwise constraints. We report that the proposed algorithm outperforms a standard semi-supervised clustering technique available in the literature and clustering results based on a single dissimilarity. The improvement is particularly relevant for applications with high levels of noise. © 2022, Universidad Internacional de la Rioja. All rights reserved.
... Fang et al. (2006) allowed genes to be assigned to already known biological processes by utilizing the knowledge of GO annotations. Huang and Pan (2006) used GO annotations with the K-medoids algorithm; this allows unknown genes to be assigned to clusters that contain genes with known functions, but it does not allow genes with known functions to be assigned to other functional clusters. In the work of Brameier and Wiuf (2007), a co-clustering algorithm is used to cluster yeast genes based on microarray expression profiles and GO annotations; because it assigns membership values to genes at random, multiple runs of the algorithm do not yield the same final clustering result. ...
... These help to find interesting biological interpretations, which is usually a very time-consuming analysis process. In Huang and Pan (2006), the GO annotation-based K-medoids algorithm cannot capture new functions of previously annotated genes, because it restricts genes with identified functions from being assigned to different clusters. But GO-FRC permits genes with known functions to be assigned to other biological processes. ...
Article
The products of gene expression work together in the cells of each living organism to carry out different biological processes. Many proteins take on different roles in the functioning of the cell depending on the environment of the organism. In this paper, we propose a gene ontology (GO) annotation-based semi-supervised clustering algorithm called GO Fuzzy relational clustering (GO-FRC), in which one gene may be assigned to multiple clusters, which is the most biologically relevant behavior of genes. In the clustering process, GO-FRC utilizes useful biological knowledge, available in the form of the Gene Ontology, as prior knowledge along with the gene expression data. The prior knowledge helps to improve the coherence of the groups with respect to the knowledge field. The proposed GO-FRC has been tested on two yeast (Saccharomyces cerevisiae) expression profile datasets (the Eisen and DREAM 5 yeast datasets) and compared with other state-of-the-art clustering algorithms. Experimental results imply that GO-FRC is able to produce more biologically relevant clusters using only a small number of GO annotations.
... Several studies have highlighted the importance of presenting microarray data in the framework of documented biological pathways [4,5]. Typically, microarray gene expression experiments produce long lists of genes that are differentially expressed in two different circumstances. ...
... Such studies rapidly generate large quantities of gene expression data, the handling of which represents a major challenge for biologists. Indeed, the importance of presenting microarray data in the framework of documented biological pathways has often been noted [4,5]. In biology, a pathway is a set of interactions or functional relationships between the physical and genetic components of a cell that operate in concert to fulfil a biological requirement. ...
... For more general purposes, it has been proposed to adjust the distance metric used in hierarchical clustering by a term that quantifies similarity of GO or KEGG annotations between pairs of genes, with a tuning parameter allowing for a flexible trade-off between knowledge-based and data-driven analysis [15,16]. Annotation-based adjustments have also been proposed for use in k-means/k-medoid clustering [17,18,19] and mixture models [20]. ...
Preprint
Genome-wide expression profiling is a cost-efficient and widely used method to characterize heterogeneous populations of cells, tissues, biopsies, or other biological specimen. The exploratory analysis of such datasets typically relies on generic unsupervised methods, e.g. principal component analysis or hierarchical clustering. However, generic methods fail to exploit the significant amount of knowledge that exists about the molecular functions of genes. Here, I introduce GO-PCA, an unsupervised method that incorporates prior knowledge about gene functions in the form of gene ontology (GO) annotations. GO-PCA aims to discover and represent biological heterogeneity along all major axes of variation in a given dataset, while suppressing heterogeneity due to technical biases. To this end, GO-PCA combines principal component analysis (PCA) with nonparametric GO enrichment analysis, and uses the results to generate expression signatures based on small sets of functionally related genes. I first applied GO-PCA to expression data from diverse lineages of the human hematopoietic system, and obtained a small set of signatures that captured known cell characteristics for most lineages. I then applied the method to expression profiles of glioblastoma (GBM) tumor biopsies, and obtained signatures that were strongly associated with multiple previously described GBM subtypes. Surprisingly, GO-PCA discovered a cell cycle-related signature that exhibited significant differences between the Proneural and the prognostically favorable GBM CpG Island Methylator (G-CIMP) subtypes, suggesting that the G-CIMP subtype is characterized in part by lower mitotic activity. Previous expression-based classifications have failed to separate these subtypes, demonstrating that GO-PCA can detect heterogeneity that is missed by other methods. 
My results show that GO-PCA is a powerful and versatile expression-based method that facilitates exploration of large-scale expression data, without requiring additional types of experimental data. The low-dimensional representation generated by GO-PCA lends itself to interpretation, hypothesis generation, and further analysis.
... Nonetheless, this method is a de facto standard for the secondary analysis of differentially expressed genes in an expression experiment and as shown in the study, provides quick and useful inferences on the higher order biological process involved in defense response of GR978 against rice blast. Recently, there have been efforts by computational biologists and mathematicians to incorporate biological knowledge into distance-based clustering analysis methods, which may address the shortcomings mentioned (Huang and Pan 2006). ...
Preprint
A blast-resistant rice mutant, GR978, generated by gamma-irradiation of indica cultivar IR64, was used to characterize the disease resistance transcriptome of rice to gain a better understanding of genes or chromosomal regions contributing to broad-spectrum disease resistance. GR978 was selected from the IR64 mutant collection at IRRI. To facilitate phenotypic characterization of the collection, a set of controlled vocabularies (CV) documenting mutant phenotypes in ∼3,700 entries was developed. In collaboration with the Tos17 rice mutant group at the National Institute of Agrobiological Sciences, Japan, a merged CV set with 91 descriptions that map onto public ontology databases (PO, TO, OBO) is implemented in the IR64 mutant database. To better characterize the disease resistance transcriptome of rice, gene expression data from a blast-resistant cultivar, SHZ-2, were incorporated in the analysis. Disease resistance transcriptome parameters, including differentially expressed genes (DEGs), regions of correlated gene expression (RCEs), and associations between DEGs and RCEs, were determined statistically within and between genotypes using MAANOVA, correlation, and fixed ratio analysis. Twelve DEGs were found within the inferred physical location of the recessive gene locus on a ∼3.8 Mb region of chromosome 12 defined by genetic analysis of GR978. Highly expressed DEGs (≥ 2-fold difference) in GR978 or SHZ-2, and those in common between the two, are mostly defense-response related, suggesting that most of the DEGs participate in causing the resistance phenotype. Comparing RCEs between SHZ-2 and GR978 showed that most RCEs between genotypes did not overlap. However, an 8-gene RCE on chromosome 11 was in common between SHZ-2 and GR978. Gene annotations and GO enrichment analysis showed a high association with resistance response. This region has no DEGs, nor is it associated with known blast resistance QTLs. 
Association analyses between RCEs and DEGs show that there was no enrichment of DEGs in the RCEs, either within a genotype or across genotypes. Association analysis of blast-resistance QTL (Bl-QTL) regions (assembled from published literature; data courtesy of R. Wisser, pers. comm., Cornell University) with DEGs and RCEs showed that while Bl-QTLs are not significantly associated with DEGs, they are associated with genotype-specific RCEs; GR978-RCEs are enriched within Bl-QTLs. The analysis suggested that examining patterns of correlated gene expression in a chromosomal context (rather than the expression levels of individual genes) can yield additional insights into the causal relationship between gene expression and phenotype. Based on these results, we put forward a hypothesis that QTLs with small or moderate effects are represented by genomic regions in which the genes show correlated expression. It implies that gene expression within such a region is regulated by a common mechanism, and that coordinated expression of the region contributes to phenotypic effects. This hypothesis is testable by co-segregation analysis of the expression patterns in well-characterized backcross and recombinant inbred lines.
... Adryan and Schuh (2004) have developed a GO-Cluster program that incorporates the hierarchy structure of the GO database as a model for cluster analysis and also gives the visualization of gene expression data at any level of the gene ontology tree. Huang and Pan (2006) have included the gene function in distance metric and showed the advantage of using it over K-medoids (partitional) and hierarchical algorithms. In Ovaska et al. (2008), a fast gene ontology-based clustering has been built which demonstrates hierarchical clustering and a heat map visualization with the help of gene expression data and GO annotations. ...
Article
Gene expression data clustering places genes with similar patterns in the same group and genes with dissimilar patterns in different groups. Traditional partitional gene expression data clustering partitions the entire set of genes into a finite set of clusters, which might not reflect co-expression or coherent patterns across all genes belonging to a cluster. In this paper, we propose a graph-theoretic clustering algorithm called GAClust, which groups co-expressed genes into the same cluster while also detecting noise genes. Clustering of genes is based on the presumption that co-expressed genes are more likely to share common biological functions. However, it has been observed that the clusters produced by traditional methods often do not reflect true biological groups or functions. To address this issue, we propose a semi-supervised algorithm, SGAClust, to produce more biologically relevant clusters. We consider both synthetic and cancer gene expression datasets to evaluate the performance of the proposed algorithms. We find that SGAClust outperforms the unsupervised algorithms. Additionally, we identify potential gene biomarkers, which will further help in cancer management.
... With the rapid growth of our domain knowledge, methods that can incorporate prior information are worthy of further study in the near future. Currently there are several methods available for performing clustering while taking prior knowledge into account (Dotan-Cohen et al., 2007; Huang & Pan, 2006; Tari et al., 2009; Verbanck et al., 2013), as well as methods for integrative analysis of multi-omics data assisted by prior knowledge (de Tayrac et al., 2009; Tong et al., 2020; J. Yan et al., 2018). ...
Article
Integrative analysis of multi‐omics data has drawn much attention from the scientific community due to the technological advancements which have generated various omics data. Leveraging these multi‐omics data potentially provides a more comprehensive view of the disease mechanism or biological processes. Integrative multi‐omics clustering is an unsupervised integrative method specifically used to find coherent groups of samples or features by utilizing information across multi‐omics data. It aims to better stratify diseases and to suggest biological mechanisms and potential targeted therapies for the diseases. However, applying integrative multi‐omics clustering is both statistically and computationally challenging due to various reasons such as high dimensionality and heterogeneity. In this review, we summarized integrative multi‐omics clustering methods into three general categories: concatenated clustering, clustering of clusters, and interactive clustering, based on when and how the multi‐omics data are processed for clustering. We further classified the methods into different approaches under each category based on the main statistical strategy used during clustering. In addition, we have provided recommended practices tailored to four real‐life scenarios to help researchers to strategize their selection in integrative multi‐omics clustering methods for their future studies. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification; Applications of Computational Statistics > Genomics/Proteomics/Genetics; Statistical and Graphical Methods of Data Analysis > Analysis of High Dimensional Data
... In addition, we considered the performance evaluation on biological homogeneity, using the solution yielded by our method to obtain one single clustering solution from the Pareto front. In Fig. 5, we show for each GO-based measure the number of significant GO terms measured by the cumulative hypergeometric distribution of Eq. (15). We consider the gene-annotation tables based on the biological process GO terms at the 5% significance level. ...
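Eq. (15) is not reproduced in this excerpt, so the following Python sketch shows the standard cumulative hypergeometric tail commonly used for GO term enrichment, which is assumed (not confirmed) to be what the cited equation expresses. The argument names `N`, `K`, `n`, and `k` are illustrative.

```python
from math import comb

def hypergeom_enrichment_pvalue(N, K, n, k):
    """Probability of seeing at least k annotated genes in a cluster.

    N: total number of genes in the background
    K: background genes annotated with the GO term
    n: genes in the cluster
    k: cluster genes annotated with the GO term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# A GO term would be called significant for a cluster at the 5% level
# when the returned value is below 0.05.
```

In practice the resulting p-values are usually corrected for multiple testing across GO terms; the excerpt does not say whether such a correction was applied.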
Article
Using a priori biological knowledge of relationships and genetic functions for gene similarity, from repositories such as the Gene Ontology (GO), has shown good results in multi-objective gene clustering algorithms. In this scenario, and to obtain useful clustering results, it would be helpful to know which measure of biological similarity between genes should be employed to yield meaningful clusters that have both similar expression patterns (co-expression) and biological homogeneity. In this paper, we studied the influence of the four most used GO-based semantic similarity measures on the performance of a multi-objective gene clustering algorithm. We used four publicly available datasets and carried out comparative studies based on performance metrics for the multi-objective optimization field and clustering performance indexes. In most of the cases, the Jiang-Conrath and Wang similarities stand out in terms of multi-objective metrics. In clustering properties, the Resnik similarity achieves the best values of compactness and separation, and therefore of co-expression of groups of genes. Meanwhile, in biological homogeneity, the Wang similarity reports a greater number of significant GO terms. However, statistical, visual, and biological significance tests showed that none of the GO-based semantic similarity measures stands out above the rest in significantly improving the performance of the multi-objective gene clustering algorithm.
... A further supposition often made is that genes with similar expression profiles share a common biological function. Most clustering algorithms use a matrix of pairwise distance measures as inputs based on correlation (Pan, 2006;Huang and Pan, 2006;Tseng, 2007), mutual information (also referred to as relevance networks) (Butte and Kohane, 2000;Luo et al., 2008) or entropy (Basso et al., 2005;Meyer et al., 2008). ...
... In some applications, partial supervision information is available. For example, in the task of clustering genes in DNA microarray data [25], [26], there often exists prior knowledge about the relationships between some subset of genes or gene expression profiles [27], [28], [29], [30]. Gene expression data of different cancer subtypes usually lie on multiple clusters [31], and each cluster can be well approximated by a low-dimensional subspace [3], [32]. ...
Preprint
Subspace clustering refers to the problem of segmenting high dimensional data drawn from a union of subspaces into the respective subspaces. In some applications, partial side-information to indicate "must-link" or "cannot-link" in clustering is available. This leads to the task of subspace clustering with side-information. However, in prior work the supervision value of the side-information for subspace clustering has not been fully exploited. To this end, in this paper, we present an enhanced approach for constrained subspace clustering with side-information, termed Constrained Sparse Subspace Clustering plus (CSSC+), in which the side-information is used not only in the stage of learning an affinity matrix but also in the stage of spectral clustering. Moreover, we propose to estimate clustering accuracy based on the partial side-information and discuss the potential connection to the true clustering accuracy. We conduct experiments on three cancer gene expression datasets to validate the effectiveness of our proposals.
... In gene function analysis, the domain knowledge, like gene ontology (GO) (Ashburner et al., 2000) and KEGG pathway (Kanehisa and Goto, 2000), is often utilized as complementary information to the high-throughput experimental data in various computation tasks, such as gene clustering (Huang and Pan, 2006), disease-related gene identification (Schlicker et al., 2010) and protein-protein-interaction prediction (Mahdavi and Lin, 2007). In order to quantitatively measure the functional similarity between two genes, the annotation terms associated with the genes are extracted and their semantic correlation is computed. ...
Article
Motivation: Benefiting from high-throughput experimental technologies, whole-genome analysis of microRNAs (miRNAs) has become increasingly common for uncovering important regulatory roles of miRNAs and identifying miRNA biomarkers for disease diagnosis. As complementary information to the high-throughput experimental data, domain knowledge like the Gene Ontology and KEGG pathway is usually used to guide gene function analysis. However, functional annotation for miRNAs is scarce in the public databases. Until now, only a few methods have been proposed for measuring the functional similarity between miRNAs based on public annotation data, and these methods cover a very limited number of miRNAs, so they are not applicable to large-scale miRNA analysis. Results: In this paper, we propose a new method to measure the functional similarity for miRNAs, called miRGOFS, which has two notable features: I) it adopts a new GO semantic similarity metric which considers both common ancestors and descendants of GO terms; II) it computes similarity between GO sets in an asymmetric manner, and weights each GO term by its statistical significance. The miRGOFS-based predictor achieves an F1 of 61.2% on a benchmark data set of miRNA localization, and AUC values of 87.7% and 81.1% on two benchmark sets of miRNA-disease association, respectively. Compared with the existing functional similarity measurements of miRNAs, miRGOFS has the advantages of higher accuracy and larger coverage of human miRNAs (over 1000 miRNAs). Availability: http://www.csbio.sjtu.edu.cn/bioinf/MiRGOFS/. Contact: yangyang@cs.sjtu.edu.cn or hbshen@sjtu.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.
... The linear combination technique has likewise been used in other bioinformatics problems, for example, gene clustering with various sources of information (or constraints), including Gene Ontology, metabolic networks, and gene expression. In this case, once the data sets are integrated, a variety of clustering models can be used, e.g., hierarchical clustering [27], Gaussian mixture models [28], k-medoids [29], and Markov random fields [30]. However, this methodology has roughly three basic disadvantages in document clustering. ...
Article
Clustering of biomedical documents has become a vital research topic due to its importance in clinical and telemedicine applications. The clustering of medical documents is considered a major issue because of their unstructured nature. This paper focuses on developing an efficient document clustering approach for medical documents to be utilized in telemedicine applications. Most existing models utilize n-gram techniques for phrase identification and term-, concept-, or semantic-based models for clustering applications. However, n-gram does not perform well when the original document has been modified, while only hybrid models provide relatively improved clustering. The proposed document clustering approach is named the enriched semantic smoothing model and has been developed on the concept of the MeSH ontology. As the semantic smoothing model is not effective in handling the density of general words, an improved model with the term frequency and inverse gravity moment (TF-IGM) factor and improved background elimination is used. Unlike term frequency and inverse document frequency (TF-IDF), TF-IGM precisely measures the class distinguishing power of a term by making use of the fine-grained term distribution across different classes of text in documents. The modified n-gram technique, which detects cases of substitution and deletion in the documents and averts them, improves phrase identification. The clustering efficiency of the k-means and hierarchical clustering algorithms is improved by utilizing the proposed model. The experiments are performed on MeSH ontology-based PubMed documents, with similarity measures and cluster validity indexes used for comparisons. The results show that the proposed approach to medical document clustering is highly accurate and thus improves clinical practice and telemedicine.
... Hierarchical clustering based on a distance matrix is commonly used for the aggregation of samples with similar properties [23,24]. In this study, several types of distance matrices were computed based on nucleotide deletion percentages in the preS region for hierarchical clustering, and the cosine distance matrix finally showed the best performance (Fig. 4a), better than Manhattan, Euclidean, maximum distance matrix or correlation matrix. ...
Article
In order to investigate if deletion patterns of the preS region can predict liver disease advancement, the preS region of the hepatitis B virus (HBV) genome in 45 chronic hepatitis B (CHB) and 94 HBV-related hepatocellular carcinoma (HCC) patients was sequenced by next-generation sequencing (NGS) and the percentages of nucleotide deletion in the preS region were analysed. Hierarchical clustering and heatmaps based on deletion percentages of preS revealed different deletion patterns between CHB and HCC patients. Intergenotype comparison also indicated divergence in preS deletions between HBV genotype B and C. No significant difference was found in preS deletion patterns between sera and matched adjacent non-tumour tissues. Based on hierarchical clustering, HCC patients were classed into two groups with different preS deletion patterns and different clinical features. Finally, the support vector machine (SVM) model was trained on preS nucleotide deletion percentages and used to predict HCC versus CHB patients. The prediction performance was assessed with fivefold cross-validation and independent cohort validation. The median area under the curve (AUC) was 0.729 after repeating SVM 500 times with fivefold cross-validations. After parameter optimization, the SVM model was used to predict an independent cohort with 51 CHB patients and 72 HCC patients and the AUC was 0.727. In conclusion, the use of the NGS method revealed a prominent divergence in preS deletion patterns between disease groups and virus genotypes, but not between different tissue types. Quantitative NGS data combined with a machine learning method could be a powerful approach for prediction of the status of different diseases.
... Biological pathways are graphical representations of common knowledge about genes and their interactions on biological processes. This valuable information has been used to cluster related genes using gene expression [25][26][27][28] and should be used to identify disease subtypes as well. Clinical data and biological knowledge are complementary to gene expression and can leverage disease subtyping. ...
Conference Paper
One main challenge in modern medicine is the discovery of molecular disease subtypes characterized by relevant clinical differences, such as survival. However, clustering high-dimensional expression data is challenging due to noise and the curse of high-dimensionality. This article describes a disease subtyping pipeline that is able to exploit the important information available in pathway databases and clinical variables. The pipeline consists of a new feature selection procedure and existing clustering methods. Our procedure partitions a set of patients using the set of genes in each pathway as clustering features. To select the best features, this procedure estimates the relevance of each pathway and fuses relevant pathways. We show that our pipeline finds subtypes of patients with more distinctive survival profiles than traditional subtyping methods by analyzing a TCGA colon cancer gene expression dataset. Here we demonstrate that our pipeline improves three different clustering methods: k-means, SNF, and hierarchical clustering.
... In some applications, for example, in the task of clustering genes in DNA microarray data, there often exists prior knowledge about the relationships between some subset of genes or gene expression profiles [61], [62], [63], [64]. This prior knowledge essentially provides partial side-information to indicate "must-link" or "cannot-link" constraints in clustering. ...
Article
Subspace clustering refers to the problem of segmenting data drawn from a union of subspaces. State of the art approaches for solving this problem follow a two-stage approach. In the first step, an affinity matrix is learned from the data using sparse or low-rank minimization techniques. In the second step, the segmentation is found by applying spectral clustering to this affinity. While this approach has led to state-of-the-art results in many applications, it is sub-optimal because it does not exploit the fact that the affinity and the segmentation depend on each other. In this paper, we propose a joint optimization framework --- Structured Sparse Subspace Clustering (S$^3$C) --- for learning both the affinity and the segmentation. The proposed S$^3$C framework is based on expressing each data point as a structured sparse linear combination of all other data points, where the structure is induced by a norm that depends on the unknown segmentation. Moreover, we extend the proposed S$^3$C framework into Constrained Structured Sparse Subspace Clustering (CS$^3$C) in which available partial side-information is incorporated into the stage of learning the affinity. We show that both the structured sparse representation and the segmentation can be found via a combination of an alternating direction method of multipliers with spectral clustering. Experiments on a synthetic data set, the Extended Yale B data set, the Hopkins 155 motion segmentation database, and three cancer data sets demonstrate the effectiveness of our approach.
... The technique has been used in the modeling of biological networks and biochemical pathways [2,116,161]. Another example is the use of fuzzy clustering with prior knowledge-based distance measures [83,228]. Moreover, a fuzzy formulation of prior knowledge has been used by Khan et al. to automatically adjust parameters of contrast enhancement and segmentation algorithms [100,101]. ...
Thesis
Multidimensional imaging techniques provide powerful ways to examine various kinds of scientific questions. The routinely produced datasets in the terabyte-range, however, can hardly be analyzed manually and require an extensive use of automated image analysis. The present thesis introduces a new concept for the estimation and propagation of uncertainty involved in image analysis operators and new segmentation algorithms that are suitable for terabyte-scale analyses of 3D+t microscopy images.
... For more general purposes, it has been proposed to adjust the distance metric used in hierarchical clustering by a term that quantifies similarity of GO or KEGG annotations between pairs of genes, with a tuning parameter allowing for a flexible trade-off between knowledge-based and data-driven analysis [16,17]. Annotation-based adjustments have also been proposed for use in k-means/k-medoid clustering [18][19][20] and mixture models [21]. ...
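The trade-off described in this snippet can be illustrated with a minimal sketch. It assumes a convex combination of the two distances and uses the Jaccard distance between annotation sets as the knowledge-based term; both choices, and the names `annotation_distance`, `adjusted_distance` and `weight`, are illustrative assumptions rather than the formulation of the cited works.

```python
def annotation_distance(terms_a, terms_b):
    """Jaccard distance between two sets of annotation terms (assumed measure)."""
    union = terms_a | terms_b
    if not union:
        # Neither gene is annotated: treated as maximally dissimilar (assumption).
        return 1.0
    return 1.0 - len(terms_a & terms_b) / len(union)


def adjusted_distance(expr_dist, terms_a, terms_b, weight):
    """Blend a data-driven distance with a knowledge-based one.

    weight = 0 gives a purely data-driven distance,
    weight = 1 gives a purely annotation-based distance.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    knowledge = annotation_distance(terms_a, terms_b)
    return (1.0 - weight) * expr_dist + weight * knowledge


if __name__ == "__main__":
    # Hypothetical GO term identifiers, used only for illustration.
    gene_a = {"GO:0000001", "GO:0000002"}
    gene_b = {"GO:0000002", "GO:0000003"}
    print(adjusted_distance(0.8, gene_a, gene_b, weight=0.5))
```

The resulting pairwise values can be fed to any distance-based method; the expression distance should be scaled to a range comparable with the annotation distance before blending.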
Article
Full-text available
Method: Genome-wide expression profiling is a widely used approach for characterizing heterogeneous populations of cells, tissues, biopsies, or other biological specimen. The exploratory analysis of such data typically relies on generic unsupervised methods, e.g. principal component analysis (PCA) or hierarchical clustering. However, generic methods fail to exploit prior knowledge about the molecular functions of genes. Here, I introduce GO-PCA, an unsupervised method that combines PCA with nonparametric GO enrichment analysis, in order to systematically search for sets of genes that are both strongly correlated and closely functionally related. These gene sets are then used to automatically generate expression signatures with functional labels, which collectively aim to provide a readily interpretable representation of biologically relevant similarities and differences. The robustness of the results obtained can be assessed by bootstrapping. Results: I first applied GO-PCA to datasets containing diverse hematopoietic cell types from human and mouse, respectively. In both cases, GO-PCA generated a small number of signatures that represented the majority of lineages present, and whose labels reflected their respective biological characteristics. I then applied GO-PCA to human glioblastoma (GBM) data, and recovered signatures associated with four out of five previously defined GBM subtypes. My results demonstrate that GO-PCA is a powerful and versatile exploratory method that reduces an expression matrix containing thousands of genes to a much smaller set of interpretable signatures. In this way, GO-PCA aims to facilitate hypothesis generation, design of further analyses, and functional comparisons across datasets.
... Indeed, after the extraction of profiles of differentially expressed genes, a functional study of these gene groups is carried out to identify genes that share similar biological functions represented by GO annotations. This functional analysis rests on the hypothesis, commonly accepted among biologists, that co-expressed genes often share similar functions [HP06]. 9. http://www.w3.org/TR/owl-features/ ...
Article
Bioinformatic analyses of transcriptomic data aim to identify genes with variations in their expression level in different tissue samples, for example tissues from healthy versus sick patients, and to characterize these genes on the basis of their functional annotation. In this thesis, I present four contributions for taking into account domain knowledge in these methods. Firstly, I define a new semantic and functional similarity measure which optimally exploits functional annotations from Gene Ontology (GO). Then, I show, thanks to a rigorous evaluation method, that this measure is efficient for the functional classification of genes. In the third contribution, I propose a differential approach with fuzzy assignment for building differential expression profiles (DEPs). I define an algorithm for analyzing overlaps between functional clusters and reference sets such as DEPs here, in order to point out genes that have both similar functional annotation and similar variations in expression. This method is applied to experimental data produced from samples of healthy tissue, colorectal tumor and cancerous cultured cell line. Finally, the similarity measure IntelliGO is generalized to another structured vocabulary that, like GO, is organized as a rooted directed acyclic graph, with an application concerning the semantic reduction of attributes before mining.
... GO terms have been used to derive information about the biological similarity of a pair of genes. This similarity was used as a modified distance metric for clustering [93]. Using a similar idea in a later publication, similarity measures were used to assign prior probabilities for genes to belong in specific clusters [94] using an expectation maximisation model. ...
Article
Full-text available
We summarise various ways of performing dimensionality reduction on high-dimensional microarray data. Many different feature selection and feature extraction methods exist and they are being widely used. All these methods aim to remove redundant and irrelevant features so that classification of new instances will be more accurate. A popular source of data is microarrays, a biological platform for gathering gene expressions. Analysing microarrays can be difficult due to the size of the data they provide. In addition the complicated relations among the different genes make analysis more difficult and removing excess features can improve the quality of the results. We present some of the most popular methods for selecting significant features and provide a comparison between them. Their advantages and disadvantages are outlined in order to provide a clearer idea of when to use each one of them for saving computational time and resources.
... Among the clustering algorithms, K-means and its many variants are widely used in bioinformatics [4][5][6]. One of the drawbacks is that the clustering results depend on the initial choice of centroids, hence the experimental results are less reproducible. ...
Article
Full-text available
K-means and its many variants are widely used in bioinformatics. However, one of the drawbacks is that the clustering results depend on the initial choice of centroids, hence the experimental results are less reproducible. In addition, poor handling of non-spherical clusters is another weakness of the K-means algorithm. Thus, we investigate spectral clustering on temporal gene expression data in this paper. Experimental results show that, when combined with an appropriate mother wavelet, spectral clustering is able to improve both clustering accuracy and stability for temporal data, as compared with the K-means approach.
... These approaches make use of a smaller and more easily obtained set of labeled samples to guide clustering strategy. They have been applied in many application domains, including text clustering [25], gene expression analysis [26], and image processing [27]. However, to the best of our knowledge, semi-supervised learning has not yet been applied in full text mining of real biomedical publications. ...
Article
Full-text available
Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, the Cosine Coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and an algorithm are introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. In constructing a directed relationship network and distribution matrix for the clustering results, it can be noted that overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers.
... Recently, biological annotations (e.g. GO, KEGG pathways, etc.) have been used for cluster validation (Bolshakova, et al., 2005) and, even more importantly, biological information (Huang, et al., 2006;Pan, 2006) or phenotypic information (Jia, et al., 2005) have been used as a constitutive part of clustering algorithms. ...
Article
Full-text available
This chapter describes the basic concepts and application of a family of methods for class discovery, generically known as clustering, applied to microarray data. Although many clustering methods exist, only a few have been extensively used for microarray data analysis (among them I will review hierarchical clustering, k-means, SOM, SOTA and model-based clustering). Key aspects in clustering such as the determination of the number of clusters and the reliability of the partition obtained are also discussed. Particular cases, such as the clustering of time series data, are also presented. Finally, the functional interpretation of clustering results, a key step in microarray data analysis, is also discussed.
... Many studies used prior knowledge in clustering genes [7][8][9][10][11][12][13]. These methods are referred to as semi-supervised clustering approaches. ...
Article
Full-text available
Background: Simple clustering methods such as hierarchical clustering and k-means are widely used for gene expression data analysis; but they are unable to deal with noise and high dimensionality associated with the microarray gene expression data. Consensus clustering appears to improve the robustness and quality of clustering results. Incorporating prior knowledge in clustering process (semi-supervised clustering) has been shown to improve the consistency between the data partitioning and domain knowledge. Methods: We proposed semi-supervised consensus clustering (SSCC) to integrate the consensus clustering with semi-supervised clustering for analyzing gene expression data. We investigated the roles of consensus clustering and prior knowledge in improving the quality of clustering. SSCC was compared with one semi-supervised clustering algorithm, one consensus clustering algorithm, and k-means. Experiments on eight gene expression datasets were performed using h-fold cross-validation. Results: Using prior knowledge improved the clustering quality by reducing the impact of noise and high dimensionality in microarray data. Integration of consensus clustering with semi-supervised clustering improved performance as compared to using consensus clustering or semi-supervised clustering separately. Our SSCC method outperformed the others tested in this paper.
... In this situation, one seeks to cluster the remaining genes using the information from the labeled genes. Several clustering methods have been developed for the specific problem of analyzing partially labeled microarray data [18][19][20][21][22][23][24][25][26] . These methods are specifically designed for microarray data and will not be described in this review; see the references for details. ...
Article
Cluster analysis methods seek to partition a data set into homogeneous subgroups. They are useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable nor is anything known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as ‘semi-supervised clustering’ methods) that can be applied in these situations. The majority of these methods are modifications of the popular k-means clustering method, and several of them will be described in detail. A brief description of some other semi-supervised clustering algorithms is also provided. Conflict of interest: The authors have declared no conflicts of interest for this article.
... In [26] a similar approach is presented, where a graph is used based on the GO structure. The work of [27] proposed shrinking the distances between pairs of genes sharing a common annotation. In fact, the distance measure between two genes can be modified to be a linear combination of the similarity of their expression profiles and their functional similarity [28][29][30]. ...
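As a rough illustration of the shrinkage idea mentioned in this snippet, the sketch below multiplies an expression-based distance by `1 - shrinkage` whenever two genes share at least one annotation and leaves it unchanged otherwise. The multiplicative form, the parameter name `shrinkage` and the function name are assumptions made for illustration; the cited work may define the shrinkage differently.

```python
def shrink_distances(dist, annotations, shrinkage):
    """Return a copy of `dist` with distances shrunk towards 0 for gene pairs
    that share at least one annotation (assumed multiplicative shrinkage).

    dist        : square list-of-lists of expression-based distances
    annotations : list of sets, one set of function labels per gene
    shrinkage   : value in [0, 1]; 0 leaves distances unchanged,
                  1 collapses distances between co-annotated genes to 0
    """
    if not 0.0 <= shrinkage <= 1.0:
        raise ValueError("shrinkage must lie in [0, 1]")
    n = len(dist)
    shrunk = [row[:] for row in dist]  # do not modify the input matrix
    for i in range(n):
        for j in range(n):
            if i != j and annotations[i] & annotations[j]:
                shrunk[i][j] = (1.0 - shrinkage) * dist[i][j]
    return shrunk


if __name__ == "__main__":
    distances = [[0.0, 2.0, 4.0],
                 [2.0, 0.0, 6.0],
                 [4.0, 6.0, 0.0]]
    # Hypothetical function labels; the third gene has no known function.
    functions = [{"f1"}, {"f1", "f2"}, set()]
    for row in shrink_distances(distances, functions, shrinkage=0.5):
        print(row)
```

Because genes without annotations never share a function, their distances stay purely expression-based, which matches the intent of clustering unknown genes on expression alone.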
Article
Full-text available
It is a common practice in bioinformatics to validate each group returned by a clustering algorithm through manual analysis, according to a-priori biological knowledge. This procedure helps find functionally related patterns to propose hypotheses for their behavior and the biological processes involved. Therefore, this knowledge is used only as a second step, after the data have simply been clustered according to their expression patterns. Thus, it could be very useful to be able to improve the clustering of biological data by incorporating prior knowledge into the cluster formation itself, in order to enhance the biological value of the clusters. A novel training algorithm for clustering is presented, which evaluates the biological internal connections of the data points while the clusters are being formed. Within this training algorithm, the calculation of distances among data points and neuron centroids includes a new term based on information from well-known metabolic pathways. The standard self-organizing map (SOM) training versus the biologically-inspired SOM (bSOM) training were tested with two real data sets of transcripts and metabolites from Solanum lycopersicum and Arabidopsis thaliana species. Classical data mining validation measures were used to evaluate the clustering solutions obtained by both algorithms. Moreover, a new measure that takes into account the biological connectivity of the clusters was applied. The results of bSOM show important improvements in the convergence and performance for the proposed clustering method in comparison to standard SOM training, in particular, from the application point of view. Analyses of the clusters obtained with bSOM indicate that including biological information during training can certainly increase the biological value of the clusters found with the proposed method. It is worth highlighting that this has effectively improved the results, which can simplify their further analysis. The algorithm is available as a web-demo at http://fich.unl.edu.ar/sinc/web-demo/bsom-lite/. The source code and the data sets supporting the results of this article are available at http://sourceforge.net/projects/sourcesinc/files/bsom.
... The linear combination strategy has also been used in other bioinformatics problems, such as gene clustering with multiple data (or constraints), including Gene Ontology, metabolic networks, and gene expression. In this case, once the data sets are integrated, we can use a variety of clustering models, e.g., hierarchical clustering [13], Gaussian mixture model [14], k-medoids [15], and Markov random fields [16]. However, this strategy has roughly three underlying drawbacks in document clustering. ...
Article
For clustering biomedical documents, we can consider three different types of information: the local-content (LC) information from documents, the global-content (GC) information from the whole MEDLINE collections, and the medical subject heading (MeSH)-semantic (MS) information. Previous methods for clustering biomedical documents are not necessarily effective at integrating different types of information, since they have used only one or two types. Recently, the performance of MEDLINE document clustering has been enhanced by linearly combining both the LC and MS information. However, the simple linear combination could be ineffective because of the limitation of the representation space for combining different types of information (similarities) with different reliability. To overcome the limitation, we propose a new semisupervised spectral clustering method, i.e., SSNCut, for clustering over the LC similarities, with two types of constraints: must-link (ML) constraints on document pairs with high MS (or GC) similarities and cannot-link (CL) constraints on those with low similarities. We empirically demonstrate the performance of SSNCut on MEDLINE document clustering, by using 100 data sets of MEDLINE records. Experimental results show that SSNCut outperformed a linear combination method and several well-known semisupervised clustering methods, with statistically significant differences. Furthermore, the performance of SSNCut with constraints from both MS and GC similarities outperformed that from only one type of similarities. Another interesting finding was that ML constraints worked more effectively than CL constraints, since CL constraints include around 10% incorrect ones, whereas this number was only 1% for ML constraints.
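The way must-link and cannot-link constraints are derived from a side-information similarity can be sketched as follows. This is only an illustration of the thresholding step: the function name `build_constraints` and the thresholds `high` and `low` are hypothetical, and the sketch does not implement SSNCut itself.

```python
from itertools import combinations


def build_constraints(similarity, high, low):
    """Derive pairwise constraints from a symmetric similarity matrix.

    Pairs with similarity >= high become must-link (ML) constraints,
    pairs with similarity <= low become cannot-link (CL) constraints,
    and all other pairs are left unconstrained.
    """
    if low >= high:
        raise ValueError("low must be smaller than high")
    must_link, cannot_link = [], []
    for i, j in combinations(range(len(similarity)), 2):
        s = similarity[i][j]
        if s >= high:
            must_link.append((i, j))
        elif s <= low:
            cannot_link.append((i, j))
    return must_link, cannot_link


if __name__ == "__main__":
    # Hypothetical semantic similarities between four documents.
    sim = [[1.0, 0.9, 0.1, 0.5],
           [0.9, 1.0, 0.2, 0.4],
           [0.1, 0.2, 1.0, 0.95],
           [0.5, 0.4, 0.95, 1.0]]
    ml, cl = build_constraints(sim, high=0.8, low=0.2)
    print("must-link:", ml)
    print("cannot-link:", cl)
```

A constrained clustering method would then run on the content-based similarities while respecting (or penalizing violations of) these pairs.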
... In the last few years, several methods have been introduced with that aim, since integrating a biological similarity measure into a clustering method can lead to potential enhancement in the performance of the clustering [44], [45], as a result of a good correlation between biological similarity and gene co-expression [46]. In [47] the proposal is to shrink the distances between pairs of genes that share a common annotation. In fact, the similarity measure between genes can combine expression profiles and functional similarity [48] [49]. ...
Article
Full-text available
Biology is in the middle of a data explosion. The technical advances achieved by the genomics, metabolomics, transcriptomics and proteomics technologies in recent years have significantly increased the amount of data that are available for biologists to analyze different aspects of an organism. However, *omics data sets have several additional problems: they have inherent biological complexity and may have significant amounts of noise as well as measurement artifacts. The need to extract information from such databases has once again become a challenge. This requires novel computational techniques and models to automatically perform data mining tasks such as integration of different data types, clustering and knowledge discovery, among others. In this article, we will present a novel integrated computational intelligence approach for biological data mining that involves neural networks and evolutionary computation. We propose the use of self-organizing maps for the identification of coordinated pattern variations; a new training algorithm that can include a priori biological information to obtain more biologically meaningful clusters; a validation measure that can assess the biological significance of the clusters found; and finally, an evolutionary algorithm for the inference of unknown metabolic pathways involving the selected clusters.
... In several fields of modern medical research, such as genetics [10,11] and DNA research [12,13], data are collected not only as scalars but as distance matrices using some generalized measure of distance, for example Euclidean distance or the Sørensen index often used for ecological community data [14]. In image analysis, the measurement under study is typically a high-dimensional object, for example the circumference of a tumor in two dimensions. ...
Article
Full-text available
Distance matrix data are occurring ever more frequently in medical research, particularly in fields such as genetics, DNA research, and image analysis. We propose a non-parametric permutation method for assessing agreement when the data under study are distance matrices. We apply agglomerative hierarchical clustering and accompanying dendrograms to visualize the internal structure of the matrix observations. The accompanying test is based on random permutations of the elements within individual matrix observations and the corresponding matrix mean of these permutations. We compare the within-matrix element sum of squares (WMESS) for the observed mean against the WMESS for the permutation means. The methodology is exemplified using simulations and real data from magnetic resonance imaging. Copyright © 2013 John Wiley & Sons, Ltd.
... A further supposition often made is that genes with similar expression profiles share a common biological function. Most clustering algorithms use a matrix of pairwise distance measures as inputs based on correlation (Pan, 2006;Huang and Pan, 2006;Tseng, 2007), mutual information (also referred to as relevance networks) (Butte and Kohane, 2000;Luo et al., 2008) or entropy (Basso et al., 2005;Meyer et al., 2008). ...
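A correlation-based pairwise distance matrix of the kind mentioned in this snippet can be sketched as below. The sketch assumes the common convention of one minus the Pearson correlation as the distance; the cited works may use other transformations, and the function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


def correlation_distance_matrix(profiles):
    """Symmetric matrix of 1 - Pearson correlation between expression profiles."""
    n = len(profiles)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = 1.0 - pearson(profiles[i], profiles[j])
    return dist


if __name__ == "__main__":
    # Toy expression profiles: genes 0 and 1 are co-expressed, gene 2 is anti-correlated.
    genes = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
    for row in correlation_distance_matrix(genes):
        print([round(v, 6) for v in row])
```

With this convention the distance ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated).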
Article
Gene regulatory networks are collections of genes that interact, whether directly or indirectly, with each other and with other substances in the cell. Such gene-to-gene interactions play an important role in a variety of biological processes, as they regulate the rate and degree to which genes are transcribed and proteins are created. By measuring gene expression over time, it may be possible to reverse engineer, or infer, the structure of the gene network involved in a particular cellular process. With the development of microarray and next-generation sequencing technologies, it has become possible to conduct longitudinal experiments to measure the expression of thousands of genes simultaneously over time. However, due to the high dimensionality of gene expression data, the limited number of biological replicates and time points typically measured, and the complexity of biological systems themselves, the problem of reverse engineering networks from transcriptomic data demands a specialized suite of appropriate statistical tools and methodologies. Two methods are proposed that use directed graphical models of stochastic processes, known as dynamic Bayesian networks, and first-order linear models to represent gene regulatory networks. In the first method, an algorithm is developed based on a hierarchical Bayesian framework for a Gaussian state space model. Hyperparameters are estimated using an empirical Bayes procedure, and parameter posterior distributions determine the presence or absence of gene-to-gene interactions. In the second method, a simulation-based approach known as Approximate Bayesian Computing based on Markov Chain Monte Carlo sampling is modified to the context of gene regulatory networks. Because no likelihood calculation is required, this method permits inference even for networks where no distributional assumptions are made. 
The performance of the proposed approaches is investigated via simulations, and both methods are applied to real longitudinal expression data. The two methods, while not comparable, are complementary, and help illustrate the need for a variety of network inference methods adapted for different contexts.
Chapter
Microarray technology is a powerful tool to analyze thousands of gene expression values with a single experiment. Due to the huge amount of data, most recent studies are focused on the analysis and the extraction of useful and interesting information from microarray data. Examples of applications include detecting genes highly correlated to diseases, selecting genes which show a similar behavior under specific conditions, building models to predict the disease outcome based on genetic profiles, and inferring regulatory networks. This chapter presents a review of four popular data mining techniques (i.e., Classification, Feature Selection, Clustering and Association Rule Mining) applied to microarray data. It describes the main characteristics of microarray data in order to understand the critical issues which are introduced by gene expression values analysis. Each technique is analyzed and examples of pertinent literature are reported. Finally, prospects of data mining research on microarray data are provided.
Chapter
Nowadays, a huge amount of high throughput molecular data are available for analysis and provide novel and useful insights into complex biological systems, through the acquisition of a high-resolution picture of their molecular status in defined experimental conditions. In this context, microarrays are a powerful tool to analyze thousands of gene expression values with a single experiment. A number of approaches have been developed for detecting genes highly correlated to diseases, selecting genes that exhibit a similar behavior under specific conditions, building models to predict disease outcome based on genetic profiles, and inferring regulatory networks. This paper discusses popular and recent data mining techniques (i.e., Feature Selection, Clustering, Classification, and Association Rule Mining) applied to microarray data. The main characteristics of microarray data and preprocessing procedures are presented to understand the critical issues introduced by gene expression values analysis. Each technique is analyzed, and relevant examples of pertinent literature are reported. Moreover, real use cases exploiting analytic pipelines that use these methods are also introduced. Finally, future directions of data mining research on microarray data are envisioned.
Chapter
This chapter describes the numerical validation techniques. The procedure for evaluating clustering algorithms and their results is known as cluster validation (CV). Most CV algorithms can be classified into three classes: external criteria, internal criteria and relative criteria. The chapter introduces four indices in external criteria: Rand index (RI), adjusted Rand index (ARI), Jaccard index (JI) and normalised mutual information (NMI). It also introduces two indices in internal criteria: figure of merit (FOM) and CLEST. Relative criteria can be classified into model-based indices, fuzzy validity indices and crisp validity indices. Model-based indices include minimum message length (MML), minimum description length (MDL), Bayesian information criterion (BIC) and Akaike's information criterion (AIC). Fuzzy validity indices include partition coefficient (PC), partition entropy (PE) and Xie–Beni (XB) index. Crisp validity indices include Calinski–Harabasz (CH) index, Dunn's index (DI), Davies–Bouldin (DB) index, object-based validation (OBV-LDA), and the validity index (VI).
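Two of the external criteria named above, the Rand index and the Jaccard index, can both be computed by counting pairs of objects. The sketch below uses the standard pair-counting definitions; the function names are illustrative, and returning 0.0 when the Jaccard denominator is zero is a convention chosen here, not one taken from the chapter.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Count object pairs by whether they are co-clustered in each partition.

    Returns (ss, sd, ds, dd): same/same, same/different,
    different/same and different/different (reference/predicted).
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have equal length")
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            ss += 1
        elif same_true:
            sd += 1
        elif same_pred:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_index(labels_true, labels_pred):
    """Fraction of object pairs on which the two partitions agree."""
    ss, sd, ds, dd = pair_counts(labels_true, labels_pred)
    total = ss + sd + ds + dd
    if total == 0:
        raise ValueError("at least two objects are required")
    return (ss + dd) / total


def jaccard_index(labels_true, labels_pred):
    """Like the Rand index, but ignoring pairs separated in both partitions."""
    ss, sd, ds, _ = pair_counts(labels_true, labels_pred)
    denom = ss + sd + ds
    return ss / denom if denom else 0.0  # convention for the degenerate case


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    predicted = [0, 0, 1, 2]
    print(rand_index(reference, predicted))
    print(jaccard_index(reference, predicted))
```

The adjusted Rand index additionally corrects the Rand index for chance agreement; it is not implemented in this sketch.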
Chapter
DNA Microarrays allow for monitoring the expression level of thousands of genes simultaneously across a collection of related samples. Supervised learning algorithms such as k-NN or SVM (Support Vector Machines) have been applied to the classification of cancer samples with encouraging results. However, the classification algorithms are not able to discover new subtypes of diseases considering the gene expression profiles. In this chapter, the author reviews several supervised clustering algorithms suitable to discover new subtypes of cancer. Next, he introduces a semi-supervised clustering algorithm that learns a linear combination of dissimilarities from the a priori knowledge provided by human experts. A priori knowledge is formulated in the form of equivalence constraints. The minimization of the error function is based on a quadratic optimization algorithm. An L2 norm regularizer is included that penalizes the complexity of the family of distances and avoids overfitting. The proposed method has been applied to several benchmark data sets and to complex human cancer problems using the gene expression profiles. The experimental results suggest that considering a linear combination of heterogeneous dissimilarities helps to improve both classification and clustering algorithms based on a single similarity.
Conference Paper
Identifying condition-specific co-expressed gene groups is critical for gene functional and regulatory analysis. However, given that genes with critical functions (such as transcription factors) may not co-express with their target genes, it is insufficient to uncover gene functional associations only from gene expression data. In this paper, we propose a novel integrative biclustering approach to build high quality biclusters from gene expression data, and to identify critical missing genes in biclusters based on Gene Ontology as well. Our approach delivers a complete inter- and intra-bicluster functional relationship, thus provides biologists a clear picture for gene functional association study. We experimented with the Yeast cell cycle and Arabidopsis cold-response gene expression datasets. Experimental results show that a clear inter- and intra-bicluster relationship is identified, and the biological significance of the biclusters is considerably improved.
Article
Gene association networks have become one of the most important approaches to modelling of biological processes by means of gene expression data. According to the literature, co-expression-based methods are the main approaches to identification of gene association networks because such methods can identify gene expression patterns in a dataset and can determine relations among genes. These methods usually have two fundamental drawbacks. Firstly, they depend on the quality of the input dataset for construction of reliable models because of their sensitivity to data noise. Secondly, these methods require that the user select a threshold to determine whether a relation is biologically relevant. Due to these shortcomings, such methods may ignore some relevant information. We present a novel fuzzy approach named FyNE (Fuzzy NEtworks) for modelling of gene association networks. FyNE has two fundamental features. Firstly, it can deal with data noise using a fuzzy-set-based protocol. Secondly, the proposed approach can incorporate prior biological knowledge into the modelling phase, through a fuzzy aggregation function. These features help to gain some insights into doubtful gene relations. The performance of FyNE was tested in four different experiments. Firstly, the improvement offered by FyNE over the results of a co-expression-based method in terms of identification of gene networks was demonstrated on different datasets from different organisms. Secondly, the results produced by FyNE showed its low sensitivity to noisy data in a randomness experiment. Additionally, FyNE could infer gene networks with a biological structure in a topological analysis. Finally, the validity of our proposed method was confirmed by comparing its performance with that of some representative methods for identification of gene networks.
Article
The wealth of genetic information available in life sciences now allows the engineering of DNA and RNA molecules, proteins, cells, tissues and even entire organisms with desired properties. This is due to large scale sequencing projects that determined the DNA sequence of the entire human genome and of the genomes of several animal species, microorganisms and pathogens. At the same time microarray technology advanced to a stage that permits measuring the expression levels of thousands of genes simultaneously in one single experiment. In addition, specific bioinformatic tools have been developed to determine transcriptional “signatures” of various cell types, tissues and entire organisms, both in the normal and pathological state. Furthermore, various data mining strategies are being used for gene discovery and for a systematic and genome wide understanding of complex gene networks. Here we provide an overview of data mining strategies for microarray gene expression data, including (i) data preprocessing, (ii) methods of cluster analysis and (iii) the retrieval of information from knowledge-based databases and their integration into microarray data analysis workflows. Our analysis shows examples of mining strategies for antigen presenting dendritic cells (DC) that were treated with transforming growth factor β type 1 (TGF-β1).
Article
The original DEA models deal only with quantitative data because algebraic operations on qualitative data are meaningless. This chapter provides a framework for dealing with qualitative data in DEA through fuzzy numbers. First, fuzzy numbers are used to represent qualitative data. Then sets of two-level mathematical programming models are applied to implement the fuzzy extension principle on the crisp DEA model, finding α-cuts of the leveled fuzzy efficiency based on crisp observations and α-cuts of the fuzzy factors. An adequate number of α-cuts determines the fuzzy efficiency. Furthermore, to provide persuasive fuzzy numbers representing qualitative data, DEA models are used as experts to integrate objective production data and subjective information and to generate possible values of the qualitative data. Based on these possible values, the shapes of the fuzzy numbers are determined. To make fuzzy efficiency more readable for most decision-makers, the K-medoids clustering method with Hausdorff distances is applied to convert these efficiencies into qualitative efficiencies. Finally, a case of university performance evaluation demonstrates the framework.
Article
Full-text available
Gene expression is the process by which information from a gene is used to produce gene products. Microarray technology enables us to analyse the expression levels of many genes, whereas clustering techniques help us analyse gene expression. Clustering means grouping data into several disjoint groups in which objects in one group are dissimilar to those in the other groups. Clustering algorithms such as k-means [13] and hierarchical clustering [14] have their own limitations in dealing with large data sets. Hence, in this paper we propose an efficient algorithm that can deal with large data sets.
Chapter
In this chapter, an informal treatment is given of the basic ideas of what, in our view, abstraction should be. In particular, the notion of abstraction is linked to that of information hiding, and the nature of abstraction as an intensional property of system descriptions is suggested. Moreover, grounding abstraction on a configuration space (containing all the possible descriptions of a system, given a set of sensors) allows a clear distinction to be made among the cognate notions of abstraction, approximation and reformulation. Finally, the relationships between abstraction and generalization are discussed in depth.
Chapter
In contrast to conventional clustering algorithms, where a single dataset is used to produce a clustering solution, we introduce herein a MapReduce approach for clustering datasets generated in multiple-experiment settings. It is inspired by the map-reduce functions commonly used in functional programming and consists of two distinct phases. Initially, the selected clustering algorithm is applied (mapped) to each experiment separately. This produces a list of different clustering solutions, one per experiment. These are further transformed (reduced) by partitioning the cluster centers into a single clustering solution. The obtained partition is not disjoint in terms of the different participating genes, and it is further analyzed and refined by applying formal concept analysis.
Article
Recently, many studies have been presented to improve the clustering performance of gene expression data by incorporating Gene Ontology into the clustering process. In particular, Kustra et al. showed a larger performance improvement by exploiting the Biological Process Ontology compared with typical expression-based clustering. This paper extends the work of Kustra et al. by performing extensive experiments on ways of incorporating GO structures. To this end, we used three ontological distance measures (Lin's, Resnik's, Jiang's) and three GO structures (BP, CC, MF) for the yeast expression data. Across all test cases, we found that clustering performance was remarkably improved by incorporating GO; in particular, Resnik's distance measure based on the Biological Process Ontology was the best.
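As an illustration of the kind of ontology-based measure named in this abstract, the following is a minimal sketch of Resnik's semantic similarity, which scores two terms by the information content of their most informative common ancestor. The toy ontology, the annotation counts and all function names are hypothetical and are not taken from the paper; how the paper converts such a similarity into a distance for clustering is not stated here and is left out.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (a small DAG).
PARENTS = {
    "root": set(),
    "metabolism": {"root"},
    "transport": {"root"},
    "glycolysis": {"metabolism"},
    "tca_cycle": {"metabolism"},
}

# Hypothetical number of genes annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0,
    "metabolism": 2,
    "transport": 4,
    "glycolysis": 1,
    "tca_cycle": 1,
}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t), where p(t) is the share of annotations at t or below it."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(n for t, n in DIRECT_COUNTS.items() if term in ancestors(t))
    return -math.log(covered / total)


def resnik_similarity(term_a, term_b):
    """Information content of the most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)
```

In this toy example, two sibling terms under "metabolism" share an ancestor covering half of all annotations, so their similarity is log 2, while terms that only share the root have similarity 0.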
Article
In this paper, a novel feature selection algorithm, which is governed by biological knowledge, is developed. Because gene expression data are high dimensional and redundant, dimensionality reduction is of prime concern. We employ the Clustering Large Applications based on RANdomized Search (CLARANS) algorithm for attribute clustering and dimensionality reduction based on gene ontology (GO) study. Feature selection with unsupervised learning is a difficult problem, with neither class labels present nor any guidance available to the search. Determination of the optimal number of clusters is another major issue, and has an impact on the resulting output. The use of GO analysis helps in the automated selection of biologically meaningful partitions. Tools such as Eisen plot and cluster profiles of these clusters help establish their coherence. Important representative features (or genes) are extracted from each correlated set of genes in such partitions. The algorithm is implemented on high-dimensional Yeast cell-cycle, Human Multiple Tissues, and Leukemia microarray data. In the second pass, clustering on the reduced gene space validates preservation of the inherent behavior of the original high-dimensional expression profiles. While the reduced gene set forms a biologically meaningful gene space, it simultaneously leads to a decrease in computational burden. External validation of the reduced subspace, using various well-known classifiers, establishes the effectiveness of the proposed methodology.
Article
Nowadays, a huge amount of high-throughput molecular data is available for analysis and provides novel and useful insights into complex biological systems, through the acquisition of a high-resolution picture of their molecular status in defined experimental conditions. In this context, microarrays are a powerful tool to analyze thousands of gene expression values with a single experiment. A number of approaches have been developed for detecting genes highly correlated to diseases, selecting genes that exhibit a similar behavior under specific conditions, building models to predict disease outcome based on genetic profiles, and inferring regulatory networks. This paper discusses popular and recent data mining techniques (i.e., Feature Selection, Clustering, Classification, and Association Rule Mining) applied to microarray data. The main characteristics of microarray data and preprocessing procedures are presented to understand the critical issues introduced by gene expression values analysis. Each technique is analyzed, and relevant examples of pertinent literature are reported. Moreover, real use cases exploiting analytic pipelines that use these methods are also introduced. Finally, future directions of data mining research on microarray data are envisioned. Copyright © 2014, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Article
The totipotent zygote gives rise to cells with differing identities during mouse preimplantation development. Many studies have focused on analyzing the spatio-temporal dependencies during these lineage decision processes, and much has been learnt by tracing transgenic marker gene expression up to the blastocyst stage and by analyzing the effects of genetic manipulations (knockout/overexpression) on embryo development. However, until recently, it has not been possible to get broader overviews of the gene expression networks that distinguish one cell from another within the same embryo. With the advent of whole genome amplification methodology and microfluidics-based quantitative RT–PCR, it became possible to generate transcriptomes of single cells. Here we review the current state of the art of single-cell transcriptomics applied to mouse preimplantation embryo blastomeres and summarize findings made by pioneering studies in recent years. Furthermore, we use the PluriNetWork and ExprEssence to investigate cell transitions based on published data.
Article
Full-text available
The wound healing process is well-understood on the cellular and tissue level; however, its complex molecular mechanisms are not yet uncovered in their entirety. Viewing wounds as perturbed molecular networks provides the tools for analyzing and optimizing the healing process. It helps to answer specific questions that lead to better understanding of the complexity of the process. What are the molecular pathways involved in wound healing? How do these pathways interact with each other during the different stages of wound healing? Is it possible to grasp the entire mechanism of regulatory interactions in the healing of a wound? Networks are structures composed of nodes connected by links. A network describing the state of a cell taking part in the healing process may contain nodes representing genes, proteins, microRNAs, metabolites, and drug molecules. The links connecting nodes represent interactions such as binding, regulation, co-expression, chemical reaction, and others. Both nodes and links can be weighted by numbers related to molecular concentration and the intensity of intermolecular interactions. Proceeding from data and from molecular profiling experiments, different types of networks are built to characterize the stages of the healing process. Network nodes having a higher degree of connectivity and centrality usually play more important roles for the functioning of the system they describe. We describe here the algorithms and software packages for building, manipulating and analyzing networks proceeding from information available from a literature or database search or directly extracted from experimental gene expression, metabolic, and proteomic data. Network analysis identifies genes/proteins most differentiated during the healing process, and their organization in functional pathways or modules, and their distribution into gene ontology categories of biological processes, molecular functions, and cellular localization. 
We provide an example of how network analysis can be used to reach a better understanding of the regulation of key wound healing mediators and the microRNAs that regulate them. Univariate statistical tests widely used in clinical studies are not enough to improve understanding and optimize the processes of wound healing. Network methods of analysis of patients' "omics" data, such as transcriptomes, proteomes, and others, can provide a better insight into the healing processes and help in the development of better treatment practices. We review several articles that are examples of this emergent approach to the study of wound healing. Network analysis has the potential to contribute considerably to the better understanding of the molecular mechanisms of wound healing and to the discovery of means to control and optimize that process.
Conference Paper
Despite the complete sequencing of the human genome, most gene functions are still unknown. Microarray techniques provide a fast and reliable means of analyzing gene expression and understanding gene function. In this context, clustering gene expression data is an essential step for gene function discovery, as groups of genes with similar expression potentially have the same biological function. In this work, we analyze the use of external biological knowledge, such as that provided in ontologies, to improve the functional grouping of gene expression measured from microarray data sets. We propose here the application of semi-supervised clustering algorithms that optimize an objective function for clustering functionally related genes. These algorithms demonstrated improvements in finding functionally related genes relative to a previously proposed model-based approach.
Article
This paper presents an application of Fuzzy Clustering of Large Applications based on Randomized Search (FCLARANS) for attribute clustering and dimensionality reduction in gene expression data. Domain knowledge based on gene ontology and differential gene expressions are employed in the process. The use of domain knowledge helps in the automated selection of biologically meaningful partitions. Gene ontology (GO) study helps in detecting biologically enriched and statistically significant clusters. Fold-change is measured to select the differentially expressed genes as the representatives of these clusters. Tools like Eisen plot and cluster profiles of these clusters help establish their coherence. Important representative features (or genes) are extracted from each enriched gene partition to form the reduced gene space. While the reduced gene set forms a biologically meaningful attribute space, it simultaneously leads to a decrease in computational burden. External validation of the reduced subspace, using various well-known classifiers, establishes the effectiveness of the proposed methodology on four sets of publicly available microarray gene expression data.
Article
Clustering algorithms depend strongly on the dissimilarity used to evaluate the sample proximities. In real applications, several dissimilarities are available that may come from different object representations or data sources. Each dissimilarity usually provides complementary information about the problem. Therefore, they should be integrated in order to reflect the object proximities accurately. In many applications, user feedback or a priori knowledge about the problem provides pairs of similar and dissimilar examples. In this paper, we address the problem of learning a linear combination of dissimilarities using side information in the form of equivalence constraints. The minimization of the error function is based on a quadratic optimization algorithm. A smoothing term is included that penalizes the complexity of the family of distances and avoids overfitting. The experimental results suggest that the proposed method outperforms a standard metric learning algorithm and improves classification and clustering results based on a single dissimilarity and data source.
Article
Full-text available
The rapid development of microarray technologies has raised many challenging problems in experiment design and data analysis. Although many numerical algorithms have been successfully applied to analyze gene expression data, the effects of variations and uncertainties in measured gene expression levels across samples and experiments have been largely ignored in the literature. In this article, in the context of hierarchical clustering algorithms, we introduce a statistical resampling method to assess the reliability of gene clusters identified from any hierarchical clustering method. Using the clustering trees constructed from the resampled data, we can evaluate the confidence value for each node in the observed clustering tree. A majority-rule consensus tree can be obtained, showing clusters that only occur in a majority of the resampled trees. We illustrate our proposed methods with applications to two published data sets. Although the methods are discussed in the context of hierarchical clustering methods, they can be applied with other cluster-identification methods for gene expression data to assess the reliability of any gene cluster of interest.
Book
Full-text available
This is a book, not a book review.
Article
Full-text available
The Munich Information Center for Protein Sequences (MIPS‐GSF), Neuherberg, Germany, provides protein sequence‐related information based on whole‐genome analysis. The main focus of the work is directed toward the systematic organization of sequence‐related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome‐specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein–protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein‐associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).
Article
Full-text available
We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.
Article
Full-text available
Technologies to measure whole-genome mRNA abundances and methods to organize and display such data are emerging as valuable tools for systems-level exploration of transcriptional regulatory networks. For instance, it has been shown that mRNA data from 118 genes, measured at several time points in the developing hindbrain of mice, can be hierarchically clustered into various patterns (or 'waves') whose members tend to participate in common processes. We have previously shown that hierarchical clustering can group together genes whose cis-regulatory elements are bound by the same proteins in vivo. Hierarchical clustering has also been used to organize genes into hierarchical dendrograms on the basis of their expression across multiple growth conditions. The application of Fourier analysis to synchronized yeast mRNA expression data has identified cell-cycle periodic genes, many of which have expected cis-regulatory elements. Here we apply a systematic set of statistical algorithms, based on whole-genome mRNA data, partitional clustering and motif discovery, to identify transcriptional regulatory sub-networks in yeast, without any a priori knowledge of their structure or any assumptions about their dynamics. This approach uncovered new regulons (sets of co-regulated genes) and their putative cis-regulatory elements. We used statistical characterization of known regulons and motifs to derive criteria by which we infer the biological significance of newly discovered regulons and motifs. Our approach holds promise for the rapid elucidation of genetic network architecture in sequenced organisms in which little biology is known.
Article
Full-text available
Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.
Article
Full-text available
We introduce a general technique for making statistical inference from clustering tools applied to gene expression microarray data. The approach utilizes an analysis of variance model to achieve normalization and estimate differential expression of genes across multiple conditions. Statistical inference is based on the application of a randomization technique, bootstrapping. Bootstrapping has previously been used to obtain confidence intervals for estimates of differential expression for individual genes. Here we apply bootstrapping to assess the stability of results from a cluster analysis. We illustrate the technique with a publicly available data set and draw conclusions about the reliability of clustering results in light of variation in the data. The bootstrapping procedure relies on experimental replication. We discuss the implications of replication and good design in microarray experiments.
Article
Full-text available
Motivation: Hierarchical clustering is one of the major analytical tools for gene expression data from microarray experiments. A major problem in the interpretation of the output from these procedures is assessing the reliability of the clustering results. We address this issue by developing a mixture model-based approach for the analysis of microarray data. Within this framework, we present novel algorithms for clustering genes and samples. One of the byproducts of our method is a probabilistic measure for the number of true clusters in the data. Results: The proposed methods are illustrated by application to microarray datasets from two cancer studies; one in which malignant melanoma is profiled (Bittner et al., Nature, 406, 536-540, 2000), and the other in which prostate cancer is profiled (Dhanasekaran et al., 2001, submitted).
Article
Full-text available
This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets. EMMIX-GENE is available at http://www.maths.uq.edu.au/~gjm/emmix-gene/
Article
Full-text available
Genome sequencing has led to the discovery of tens of thousands of potential new genes. Six years after the sequencing of the well-studied yeast Saccharomyces cerevisiae and the discovery that its genome encodes approximately 6,000 predicted proteins, more than 2,000 have not yet been characterized experimentally, and determining their functions seems far from a trivial task. One crucial constraint is the generation of useful hypotheses about protein function. Using a new approach to interpret microarray data, we assign likely cellular functions with confidence values to these new yeast proteins. We perform extensive genome-wide validations of our predictions and offer visualization methods for exploration of the large numbers of functional predictions. We identify potential new members of many existing functional categories including 285 candidate proteins involved in transcription, processing and transport of non-coding RNA molecules. We present experimental validation confirming the involvement of several of these proteins in ribosomal RNA processing. Our methodology can be applied to a variety of genomics data types and organisms.
Article
Full-text available
Current methods for the functional analysis of microarray gene expression data make the implicit assumption that genes with similar expression profiles have similar functions in cells. However, among genes involved in the same biological pathway, not all gene pairs show high expression similarity. Here, we propose that transitive expression similarity among genes can be used as an important attribute to link genes of the same biological pathway. Based on large-scale yeast microarray expression data, we use the shortest-path analysis to identify transitive genes between two given genes from the same biological process. We find that not only functionally related genes with correlated expression profiles are identified but also those without. In the latter case, we compare our method to hierarchical clustering, and show that our method can reveal functional relationships among genes in a more precise manner. Finally, we show that our method can be used to reliably predict the function of unknown genes from known genes lying on the same shortest path. We assigned functions for 146 yeast genes that are considered as unknown by the Saccharomyces Genome Database and by the Yeast Proteome Database. These genes constitute around 5% of the unknown yeast ORFome.
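The idea of transitive expression similarity can be sketched with a small graph search. In the sketch below, genes are nodes, an edge joins two genes whose absolute Pearson correlation exceeds a threshold, and the edge weight is one minus that absolute correlation. The threshold, the weighting, the toy profiles and all names are assumptions made for illustration and are not the parameters of the paper.

```python
import heapq
import math

# Hypothetical toy expression profiles: geneA and geneC are only moderately
# correlated with each other, but both correlate strongly with geneB.
PROFILES = {
    "geneA": [1, 2, 3, 4, 5, 6],
    "geneB": [4, 3, 5, 10, 9, 11],
    "geneC": [3, 1, 2, 6, 4, 5],
    "geneD": [1, -1, 1, -1, 1, -1],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def build_graph(profiles, min_abs_corr=0.8):
    """Connect gene pairs with |r| >= min_abs_corr; edge weight is 1 - |r|."""
    genes = list(profiles)
    graph = {g: {} for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = abs(pearson(profiles[g], profiles[h]))
            if r >= min_abs_corr:
                graph[g][h] = graph[h][g] = 1.0 - r
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the list of genes on the path, or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nbr, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(nbr, math.inf):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]
```

With these toy profiles, geneA and geneC have no direct edge at the 0.8 threshold, yet `shortest_path(build_graph(PROFILES), "geneA", "geneC")` links them through geneB, which plays the role of a transitive gene.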
Article
Full-text available
Motivation: The biologic significance of results obtained through cluster analyses of gene expression data generated in microarray experiments has been demonstrated in many studies. In this article we focus on the development of a clustering procedure based on the concept of Bayesian model-averaging and a precise statistical model of expression data. Results: We developed a clustering procedure based on the Bayesian infinite mixture model and applied it to clustering gene expression profiles. Clusters of genes with similar expression patterns are identified from the posterior distribution of clusterings defined implicitly by the stochastic data-generation model. The posterior distribution of clusterings is estimated by a Gibbs sampler. We summarized the posterior distribution of clusterings by calculating posterior pairwise probabilities of co-expression and used the complete linkage principle to create clusters. This approach has several advantages over usual clustering procedures. The analysis allows for incorporation of a reasonable probabilistic model for generating data. The method does not require specifying the number of clusters, and the resulting optimal clustering is obtained by averaging over models with all possible numbers of clusters. Expression profiles that are not similar to any other profile are automatically detected, the method incorporates experimental replicates, and it can be extended to accommodate missing data. This approach represents a qualitative shift in the model-based cluster analysis of expression data because it allows for incorporation of uncertainties involved in the model selection in the final assessment of confidence in similarities of expression profiles. We also demonstrated the importance of incorporating the information on experimental variability into the clustering model.
Availability: The MS Windows(TM) based program implementing the Gibbs sampler and supplemental material is available at http://homepages.uc.edu/~medvedm/BioinformaticsSupplement.htm Contact: medvedm@email.uc.edu
Article
Full-text available
Motivation: With the advent of microarray chip technology, large data sets are emerging containing the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While the use of hierarchical clustering (UPGMA) with correlation 'distance' has been the most common in the microarray studies, there are many more choices of clustering algorithms in pattern recognition and statistics literature. At the moment there do not seem to be any clear-cut guidelines regarding the choice of a clustering algorithm to be used for grouping genes based on their expression profiles. Results: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performances on a well-known publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of these six clustering methods with these validation measures. While the 'best' method is dependent on the exact validation strategy and the number of clusters to be used, overall Diana appears to be a solid performer. Interestingly, the performance of correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to be on opposite extremes, depending on what validation measure one employs. Next it is shown that the group means produced by Diana are the closest and those produced by UPGMA are the farthest from a model profile based on a set of hand-picked genes. Availability: S+ codes for the partial least squares based clustering are available from the authors upon request. All other clustering methods considered have S+ implementation in the library MASS. 
S+ codes for calculating the validation measures are available from the authors upon request. The sporulation data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation
Article
Full-text available
DNA microarrays can be used to identify gene expression changes characteristic of human disease. This is challenging, however, when relevant differences are subtle at the level of individual genes. We introduce an analytical strategy, Gene Set Enrichment Analysis, designed to detect modest but coordinate changes in the expression of groups of functionally related genes. Using this approach, we identify a set of genes involved in oxidative phosphorylation whose expression is coordinately decreased in human diabetic muscle. Expression of these genes is high at sites of insulin-mediated glucose disposal, activated by PGC-1alpha and correlated with total-body aerobic capacity. Our results associate this gene set with clinically important variation in human metabolism and illustrate the value of pathway relationships in the analysis of genomic profiling experiments.
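The gene-set idea in this abstract can be illustrated with a Kolmogorov-Smirnov-style running sum over a ranked gene list: the sum rises when a gene belongs to the set and falls otherwise, and the largest deviation from zero serves as the enrichment score. The sketch below uses a simple normalised step size, which is an assumption; the exact weighting and the permutation-based significance test of the published method are not reproduced, and all names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Maximum signed deviation of a running sum over a ranked gene list.

    Each gene in the set adds 1 / (number of hits); every other gene
    subtracts 1 / (number of misses), so the sum returns to zero at the end.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A set concentrated at the top of the ranking gives a score near +1, a set concentrated at the bottom gives a score near -1, and a set scattered evenly through the list stays close to 0.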
Article
Full-text available
We have developed an algorithm for inferring the degree of similarity between genes by using the graph-based structure of Gene Ontology (GO). We applied this knowledge-based similarity metric to a clique-finding algorithm for detecting sets of related genes with biological classifications. We also combined it with an expression-based distance metric to produce a co-cluster analysis, which accentuates genes with both similar expression profiles and similar biological characteristics and identifies gene clusters that are more stable and biologically meaningful. These algorithms are demonstrated in the analysis of MPRO cell differentiation time series experiments.
Article
Today, the characterization of clinical phenotypes by gene-expression patterns is widely used in clinical research. If the investigated phenotype is complex from the molecular point of view, new challenges arise and these have not been addressed systematically. For instance, the same clinical phenotype can be caused by various molecular disorders, such that one observes different characteristic expression patterns in different patients. In this paper we describe a novel algorithm called Structured Analysis of Microarrays (StAM), which accounts for molecular heterogeneity of complex clinical phenotypes. Our algorithm goes beyond established methodology in several aspects: in addition to the expression data, it exploits functional annotations from the Gene Ontology database to build biologically focussed classifiers. These are used to uncover potential molecular disease subentities and associate them to biological processes without compromising overall prediction accuracy. Availability: Bioconductor-compliant R package. Complete analyses are available at http://compdiag.molgen.mpg.de/supplements/lottaz05.
Article
Motivation: The analysis of genome-scale data from different high throughput techniques can be used to obtain lists of genes ordered according to their different behaviours under distinct experimental conditions corresponding to different phenotypes (e.g. differential gene expression between diseased samples and controls, different response to a drug, etc.). The order in which the genes appear in the list is a consequence of the biological roles that the genes play within the cell, which account, at molecular scale, for the macroscopic differences observed between the phenotypes studied. Typically, two steps are followed for understanding the biological processes that differentiate phenotypes at molecular level: first, genes with significant differential expression are selected on the basis of their experimental values and subsequently, the functional properties of these genes are analysed. Instead, we present a simple procedure which combines experimental measurements with available biological information in a way that genes are simultaneously tested in groups related by common functional properties. The method proposed constitutes a very sensitive tool for selecting genes with significant differential behaviour in the experimental conditions tested. Results: We propose the use of a method to scan ordered lists of genes. The method allows the understanding of the biological processes operating at molecular level behind the macroscopic experiment from which the list was generated. This procedure can be useful in situations where it is not possible to obtain statistically significant differences based on the experimental measurements (e.g. low prevalence diseases, etc.). Two examples demonstrate its application in two microarray experiments and the type of information that can be extracted.
Article
The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics. This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation. The software used in the experiments is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/. Enlarged colour plots are provided in the Supplementary Material, which is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/.
Article
Independent of the platform and the analysis methods used, the result of a microarray experiment is, in most cases, a list of differentially expressed genes. An automatic ontological analysis approach has been recently proposed to help with the biological interpretation of such results. Currently, this approach is the de facto standard for the secondary analysis of high throughput experiments and a large number of tools have been developed for this purpose. We present a detailed comparison of 14 such tools using the following criteria: scope of the analysis, visualization capabilities, statistical model(s) used, correction for multiple comparisons, reference microarrays available, installation issues and sources of annotation data. This detailed analysis of the capabilities of these tools will help researchers choose the most appropriate tool for a given type of analysis. More importantly, in spite of the fact that this type of analysis has been generally adopted, this approach has several important intrinsic drawbacks. These drawbacks are associated with all tools discussed and represent conceptual limitations of the current state-of-the-art in ontological analysis. We propose these as challenges for the next generation of secondary data analysis tools. Contact: sod@cs.wayne.edu
Article
Accurate and rapid identification of perturbed pathways through the analysis of genome-wide expression profiles facilitates the generation of biological hypotheses. We propose a statistical framework for determining whether a specified group of genes for a pathway has a coordinated association with a phenotype of interest. Several issues on proper hypothesis-testing procedures are clarified. In particular, it is shown that the differences in the correlation structure of each set of genes can lead to a biased comparison among gene sets unless a normalization procedure is applied. We propose statistical tests for two important but different aspects of association for each group of genes. This approach has more statistical power than currently available methods and can result in the discovery of statistically significant pathways that are not detected by other methods. This method is applied to data sets involving diabetes, inflammatory myopathies, and Alzheimer's disease, using gene sets we compiled from various public databases. In the case of inflammatory myopathies, we have correctly identified the known cytotoxic T lymphocyte-mediated autoimmunity in inclusion body myositis. Furthermore, we predicted the presence of dendritic cells in inclusion body myositis and of an IFN-α/β response in dermatomyositis, neither of which was previously described. These predictions have been subsequently corroborated by immunohistochemistry. Keywords: microarrays, gene ontology, normalization, correlated data, inflammatory myopathies
Article
Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications. We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach has superior performance on our synthetic data sets, consistently selecting the correct model and the number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model. We also explored the validity of the Gaussian mixture assumption on different transformations of real data. We also assessed the degree to which these real gene expression data sets fit multivariate Gaussian distributions both before and after subjecting them to commonly used data transformations. Suitably chosen transformations seem to result in reasonable fits. MCLUST is available at http://www.stat.washington.edu/fraley/mclust. The software for the diagonal model is under development. Contact: kayee@cs.washington.edu. Supplementary information: http://www.cs.washington.edu/homes/kayee/model.
Article
Many traditional multivariate techniques such as ordination, clustering, classification and discriminant analysis are now routinely used in most fields of application. However, the past decade has seen considerable new developments, particularly in computational ...
Article
Kaufman and Rousseeuw (1990) proposed a clustering algorithm Partitioning Around Medoids (PAM) which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common context that many elements do not belong well to any cluster. Based on our experience in clustering gene expression data, we have noticed that PAM does have problems recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this paper, we propose to partition around medoids by maximizing a criterion "Average Silhouette" defined by Kaufman and Rousseeuw (1990). We also propose a fast-to-compute approximation of "Average Silhouette". We implement these two new partitioning around medoids algorithms and illustrate their performance relative to existing partitioning methods in simulations.
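As a rough illustration of the criterion named in this abstract, the sketch below computes the standard average silhouette width of a given partition directly from a distance matrix. It only evaluates the criterion; it does not reproduce the authors' search over medoids or their fast approximation. The function name `average_silhouette` and the toy data are illustrative assumptions.

```python
def average_silhouette(dist, labels):
    """Average silhouette width of a partition.

    dist   -- square list of lists of pairwise distances (any metric)
    labels -- cluster label of each item, same order as dist
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab_i in enumerate(labels):
        own = clusters[lab_i]
        if len(own) == 1:
            # Usual convention: a singleton contributes s(i) = 0.
            continue
        # a: mean distance from i to the other members of its own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance from i to the members of another cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != lab_i
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


# Toy data (assumption): four points on a line, absolute difference as distance.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
good = average_silhouette(dist, [0, 0, 1, 1])  # the natural grouping
bad = average_silhouette(dist, [0, 1, 0, 1])   # a deliberately poor grouping
```

A partition that matches the structure of the data scores close to 1, while a poor one scores near or below 0, which is why the quantity can serve as an objective when choosing medoids.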
Article
A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression. The output is displayed graphically, conveying the clustering and the underlying expression data simultaneously in a form intuitive for biologists. We have found in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data. Thus patterns seen in genome-wide expression experiments can be interpreted as indications of the status of cellular processes. Also, coexpression of genes of known function with poorly characterized or novel genes may provide a simple means of gaining leads to the functions of many genes for which information is not available currently.
Article
Array technologies have made it straightforward to monitor simultaneously the expression pattern of thousands of genes. The challenge now is to interpret such massive data sets. The first step is to extract the fundamental patterns of gene expression inherent in the data. This paper describes the application of self-organizing maps, a type of mathematical cluster analysis that is particularly well suited for recognizing and classifying features in complex, multidimensional data. The method has been implemented in a publicly available computer package, GENECLUSTER, that performs the analytical calculations and provides easy data visualization. To illustrate the value of such analysis, the approach is applied to hematopoietic differentiation in four well studied models (HL-60, U937, Jurkat, and NB4 cells). Expression patterns of some 6,000 human genes were assayed, and an online database was created. GENECLUSTER was used to organize the genes into biologically relevant clusters that suggest novel hypotheses about hematopoietic differentiation, for example by highlighting certain genes and pathways involved in "differentiation therapy" used in the treatment of acute promyelocytic leukemia.
Article
Ascertaining the impact of uncharacterized perturbations on the cell is a fundamental problem in biology. Here, we describe how a single assay can be used to monitor hundreds of different cellular functions simultaneously. We constructed a reference database or "compendium" of expression profiles corresponding to 300 diverse mutations and chemical treatments in S. cerevisiae, and we show that the cellular pathways affected can be determined by pattern matching, even among very subtle profiles. The utility of this approach is validated by examining profiles caused by deletions of uncharacterized genes: we identify and experimentally confirm that eight uncharacterized open reading frames encode proteins required for sterol metabolism, cell wall function, mitochondrial respiration, or protein synthesis. We also show that the compendium can be used to characterize pharmacological perturbations by identifying a novel target of the commonly used drug dyclonine.
Article
Microarray technologies are emerging as a promising tool for genomic studies. The challenge now is how to analyze the resulting large amounts of data. Clustering techniques have been widely applied in analyzing microarray gene-expression data. However, normal mixture model-based cluster analysis has not been widely used for such data, although it has a solid probabilistic foundation. Here, we introduce and illustrate its use in detecting differentially expressed genes. In particular, we do not cluster gene-expression patterns but a summary statistic, the t-statistic. The method is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle-ear infection. Three clusters were found, two of which contain more than 95% genes with almost no altered gene-expression levels, whereas the third one has 30 genes with more or less differential gene-expression levels. Our results indicate that model-based clustering of t-statistics (and possibly other summary statistics) can be a useful statistical tool to exploit differential gene expression for microarray data.
Article
Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems, such as the classification of tumors. An important statistical problem associated with tumor classification is the identification of new tumor classes using gene-expression profiles. Two essential aspects of this clustering problem are: to estimate the number of clusters, if any, in a dataset; and to allocate tumor samples to these clusters, and assess the confidence of cluster assignments for individual samples. Here we address the first of these problems. We have developed a new prediction-based resampling method, Clest, to estimate the number of clusters in a dataset. The performance of the new and existing methods was compared using simulated data and gene-expression data from four recently published cancer microarray studies. Clest was generally found to be more accurate and robust than the six existing methods considered in the study. Focusing on prediction accuracy in conjunction with resampling produces accurate and robust estimates of the number of clusters.
Article
Recent developments in microarray technology enable researchers to study simultaneously the expression of thousands of genes from one cell line or tissue sample. This new technology is often used to assess changes in mRNA expression upon a specified transfection for a cell line in order to identify target genes. For such experiments, the range of differential expression is moderate, and teasing out the modified genes is challenging and calls for detailed modeling. The aim of this paper is to propose a methodological framework for studies that investigate differential gene expression through microarray technology that is based on a fully Bayesian mixture approach (Richardson and Green, 1997). A case study that investigated those genes that were differentially expressed in two cell lines (normal and modified by a gene transfection) is provided to illustrate the performance and usefulness of this approach.
Article
In this article, we propose a method for clustering that produces tight and stable clusters without forcing all points into clusters. The methodology is general but was initially motivated from cluster analysis of microarray experiments. Most current algorithms aim to assign all genes into clusters. For many biological studies, however, we are mainly interested in identifying the most informative, tight, and stable clusters of sizes, say, 20-60 genes for further investigation. We want to avoid the contamination of tightly regulated expression patterns of biologically relevant genes due to other genes whose expressions are only loosely compatible with these patterns. "Tight clustering" has been developed specifically to address this problem. It applies K-means clustering as an intermediate clustering engine. Early truncation of a hierarchical clustering tree is used to overcome the local minimum problem in K-means clustering. The tightest and most stable clusters are identified in a sequential manner through an analysis of the tendency of genes to be grouped together under repeated resampling. We validated this method in a simulated example and applied it to analyze a set of expression profiles in the study of embryonic stem cells.
Article
In microarray expression data analysis, it is well accepted that biological knowledge-guided clustering techniques show more advantages than pure mathematical techniques. In this paper, Gene Ontology is introduced to guide the clustering process, and thus a new algorithm capturing both expression pattern similarities and biological function similarities is developed. Our algorithm was validated on two well-known public data sets and the results were compared with some previous works. It is shown that our method has advantages in both the quality of clusters and the precision of biological annotations. Furthermore, the clustering results can be adjusted according to different stringency requirements. It is expected that our algorithm can be extended to other biological knowledge, for example, metabolic networks.
Article
Prediction of biological functions of genes is an important issue in basic biology research and has applications in drug discoveries and gene therapies. Previous studies have shown either gene expression data or protein-protein interaction data alone can be used for predicting gene functions. In particular, clustering gene expression profiles has been widely used for gene function prediction. In this paper, we first propose a new method for gene function prediction using protein-protein interaction data, which will facilitate combining prediction results based on clustering gene expression profiles. We then propose a new method to combine the prediction results based on either source of data by weighting on the evidence provided by each. Using protein-protein interaction data downloaded from the GRID database and published gene expression profiles from 300 microarray experiments for the yeast S. cerevisiae, we show that this new combined analysis provides improved predictive performance over that of using either data source alone in a cross-validated analysis of the MIPS gene annotations. Finally, we propose a logistic regression method that is flexible enough to combine information from any number of data sources while maintaining computational feasibility.
Article
Cluster analysis of gene expression profiles has been widely applied to clustering genes for gene function discovery. Many approaches have been proposed. The rationale is that the genes with the same biological function or involved in the same biological process are more likely to co-express, hence they are more likely to form a cluster with similar gene expression patterns. However, most existing methods, including model-based clustering, ignore known gene functions in clustering. To take advantage of accumulating gene functional annotations, we propose incorporating known gene functions as prior probabilities in model-based clustering. In contrast to a global mixture model applicable to all the genes in the standard model-based clustering, we use a stratified mixture model: one stratum corresponds to the genes of unknown function while each of the other ones corresponds to the genes sharing the same biological function or pathway; the genes from the same stratum are assumed to have the same prior probability of coming from a cluster while those from different strata are allowed to have different prior probabilities of coming from the same cluster. We derive a simple EM algorithm that can be used to fit the stratified model. A simulation study and an application to gene function prediction demonstrate the advantage of our proposal over the standard method. Contact: weip@biostat.umn.edu
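The contrast drawn in this abstract between a global and a stratified mixture can be written out as follows. This is a sketch of the general form implied by the description, not the authors' exact notation: the symbols K (number of clusters), h(j) (stratum of gene j), π (mixing proportions acting as prior probabilities) and φ_k (component density with parameters θ_k) are assumptions introduced here.

```latex
% Global mixture: every gene j shares the same prior probabilities \pi_k
f(x_j) = \sum_{k=1}^{K} \pi_k \, \phi_k(x_j; \theta_k),
\qquad \sum_{k=1}^{K} \pi_k = 1 .

% Stratified mixture: gene j in stratum h(j) has stratum-specific priors,
% while the component densities \phi_k are shared across strata
f_{h(j)}(x_j) = \sum_{k=1}^{K} \pi_{h(j),k} \, \phi_k(x_j; \theta_k),
\qquad \sum_{k=1}^{K} \pi_{h,k} = 1 \ \text{for each stratum } h .
```

Under this reading, the stratified model reduces to the global one when the priors are equal across strata, that is, when π_{h,k} = π_k for every h.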
Article
It has been increasingly recognized that incorporating prior knowledge into cluster analysis can result in more reliable and meaningful clusters. In contrast to the standard model-based clustering with a global mixture model, which does not use any prior information, a stratified mixture model was recently proposed to incorporate gene functions or biological pathways as priors in model-based clustering of gene expression profiles: various gene functional groups form the strata in a stratified mixture model. Albeit useful, the stratified method may be less efficient than the global analysis if the strata are non-informative to clustering. We propose a weighted method that aims to strike a balance between a stratified analysis and a global analysis: it weights between the clustering results of the stratified analysis and that of the global analysis; the weight is determined by data. More generally, the weighted method can take advantage of the hierarchical structure of most existing gene functional annotation systems, such as MIPS and Gene Ontology (GO), and facilitate choosing appropriate gene functional groups as priors. We use simulated data and real data to demonstrate the feasibility and advantages of the proposed method.
Article
Currently the practice of using existing biological knowledge in analyzing high throughput genomic and proteomic data is mainly for the purpose of validations. Here we take a different approach of incorporating biological knowledge into statistical analysis to improve statistical power and efficiency. Specifically, we consider how to fuse biological information into a mixture model to analyze microarray data. In contrast to a standard mixture model where it is assumed that all the genes come from the same (marginal) distribution, including an equal prior probability of having an event, such as having differential expression or being bound by a transcription factor (TF), our proposed mixture model allows the genes in different groups to have different distributions while the grouping of the genes reflects biological information. Using a list of about 800 putative cell cycle-regulated genes as prior biological knowledge, we analyze genome-wide location data to detect binding sites of TF Fkh1. We find that our proposal improves over the standard approach, resulting in reduced false discovery rates (FDR), and hence it is a useful alternative to the current practice.
Article
Motivation: Large scale gene expression data are often analysed by clustering genes based on gene expression data alone, though a priori knowledge in the form of biological networks is available. The use of this additional information promises to improve exploratory analysis considerably. Results: We propose constructing a distance function which combines information from expression data and biological networks. Based on this function, we compute a joint clustering of genes and vertices of the network. This general approach is elaborated for metabolic networks. We define a graph distance function on such networks and combine it with a correlation-based distance function for gene expression measurements. A hierarchical clustering and an associated statistical measure are computed to arrive at a reasonable number of clusters. Our method is validated using expression data of the yeast diauxic shift. The resulting clusters are easily interpretable in terms of the biochemical network and the gene expression data and suggest that our method is able to automatically identify processes that are relevant under the measured conditions.
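To make the idea of a combined distance concrete, here is a minimal sketch that mixes a correlation-based expression distance with a shortest-path distance on a network. The abstract does not give the authors' actual function, so the weighted average, the `weight` and `max_hops` parameters, the hop cap, and all function names are assumptions chosen purely for illustration.

```python
from collections import deque
from math import sqrt


def correlation_distance(x, y):
    """1 - Pearson correlation, in [0, 2]; constant profiles count as uncorrelated."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0
    return 1.0 - sxy / sqrt(sxx * syy)


def graph_distance(adj, source, target):
    """Edges on a shortest path found by breadth-first search; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        for neighbour in adj.get(node, ()):
            if neighbour == target:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


def combined_distance(expr, adj, g1, g2, weight=0.5, max_hops=4):
    """Weighted average of both distances, each rescaled to [0, 1] (assumed form)."""
    d_expr = correlation_distance(expr[g1], expr[g2]) / 2.0
    hops = graph_distance(adj, g1, g2)
    # Unreachable or distant pairs are capped at the maximum network distance.
    d_net = 1.0 if hops is None else min(hops, max_hops) / max_hops
    return weight * d_expr + (1.0 - weight) * d_net


# Toy data (assumption): three genes on a linear pathway A - B - C.
expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
d_ab = combined_distance(expr, adj, "A", "B")
d_ac = combined_distance(expr, adj, "A", "C")
```

The resulting pairwise values can be fed to any distance-based method such as hierarchical clustering; co-expressed genes that are also neighbours in the network end up closest.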
Hastie,T., Tibshirani,R. and Friedman,J. (2001) The Elements of Statistical Learning. Data mining, Inference, and Prediction. Springer, New York.
Huang,D. et al. (2006) Combining gene annotations and gene expression data in model-based clustering: a weighted method. OMICS, (in press).
Hughes,T.R. et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.
Eisen,M. et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868.
Hastie,T. and Tibshirani,R. (1995) Discriminant analysis by mixture modelling. J. R. Statist. Soc. B., 58, 155-176.
Tamayo,P. et al. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 2907-2912.
  • Carlin