Figure 1 - uploaded by David A. Pelta
Main protein structures 

Source publication
Conference Paper
Comparing protein structures, either to infer biological functionality or to assess protein structure predictions, is an essential component of proteomic research. In this paper we extend our previous work on the use of the Universal Similarity Metric (USM) and Generalized Fuzzy Contact maps. More specifically we compare the impact that generaliz...

Context in source publication

Context 1
... some cases, a protein structure may be composed of a set of three-dimensional chain structures. Figure 1 graphically shows the previous description. ...

Citations

... Compression of amino acid sequences, besides reducing storage space, can be exploited to predict and uncover structure of proteins [24,12], namely through fusions and duplications [25], which are possibly linked with new functionalities [26]. Protein classification [27] and domain identification [28] are other examples. ...
Article
Advancement of protein sequencing technologies has led to the production of a huge volume of data that needs to be stored and transmitted. This challenge can be tackled by compression. In this paper, we propose AC, a state-of-the-art method for lossless compression of amino acid sequences. The proposed method works based on the cooperation between finite-context models and substitutional tolerant Markov models. Compared to several general-purpose and specific-purpose protein compressors, AC provides the best bit-rates. This method can also compress the sequences nine times faster than its competitor, paq8l. In addition, employing AC, we analyze the compressibility of a large number of sequences from different domains. The results show that viruses are the most difficult sequences to be compressed. Archaea and bacteria are the second most difficult ones, and eukaryota are the easiest sequences to be compressed.
... Besides storage purposes, the compression of amino acid sequences is very important to predict and uncover structure [1,14], namely through fusions and duplications in proteins, that may be linked with new functionalities [15]. More examples are protein classification [16] and domain identification [17]. ...
Conference Paper
Amino acid sequences are known to be very hard to compress. In this paper, we propose a lossless compressor for efficient compression of amino acid sequences (AC). The compressor uses a cooperation between multiple context and substitutional tolerant context models. The cooperation between models is balanced with weights that benefit the models with better performance, according to a forgetting function specific for each model. We have shown consistently better compression results than other approaches, using low computational resources. The compressor implementation is freely available, under license GPLv3, at https://github.com/pratas/ac.
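As a rough illustration of the finite-context modelling idea mentioned in this abstract, the sketch below estimates the average code length (bits per symbol) of an amino acid sequence under a single adaptive order-k model. It is not the AC implementation: the function name `fcm_bits_per_symbol`, the 20-letter alphabet, and the additive smoothing parameter `alpha` are assumptions made for this example, and the model mixing, weights, and forgetting functions described above are left out.

```python
import math
from collections import defaultdict

# Hypothetical 20-letter amino acid alphabet (assumption for this sketch).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def fcm_bits_per_symbol(seq: str, order: int = 2, alpha: float = 1.0) -> float:
    """Average code length of `seq` under an adaptive order-`order` model.

    Each symbol is predicted from the preceding `order` symbols using
    counts seen so far, with additive smoothing `alpha`. The sequence is
    assumed to contain only symbols from ALPHABET.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]
        p = (counts[ctx][sym] + alpha) / (totals[ctx] + alpha * len(ALPHABET))
        bits += -math.log2(p)
        # Update the model only after coding the symbol (adaptive coding).
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits / len(seq) if seq else 0.0
```

A highly repetitive sequence should score well below the log2(20) ≈ 4.32 bits per symbol of a uniform model, while a sequence with little local structure stays close to it.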
... That the basic principle behind the Lempel-Ziv compression algorithms has been so successful in identifying evolutionary relationships may mean that the differences uncovered through the use of compression are somehow natural to the evolutionary process. This speculation is further supported by the exploitation of distance metrics based on compression for protein classification [57–60] and genome segmentation [61]. Kocsor et al. [58] showed that using compression based approaches can be more accurate for protein classification than the commonly used Smith-Waterman alignment algorithm or Hidden Markov Models. ...
... Kocsor et al. [58] showed that using compression based approaches can be more accurate for protein classification than the commonly used Smith-Waterman alignment algorithm or Hidden Markov Models. Pelta [59] showed that the compression of protein contact maps can be used for protein classification and [62] showed that the UNIX compress algorithm can be used for protein domain identification. ...
Article
Data compression at its base is concerned with how information is organized in data. Understanding this organization can lead to efficient ways of representing the information and hence data compression. In this paper we review the ways in which ideas and approaches fundamental to the theory and practice of data compression have been used in the area of bioinformatics. We look at how basic theoretical ideas from data compression, such as the notions of entropy, mutual information, and complexity have been used for analyzing biological sequences in order to discover hidden patterns, infer phylogenetic relationships between organisms and study viral populations. Finally, we look at how inferred grammars for biological sequences have been used to uncover structure in biological sequences.
... The benchmark Chew-Kedem (CK) dataset is adopted for our experiments. The CK dataset is a non-homologous, medium-size protein data set of 35 proteins from 5 different protein families (3 different protein classes) that was introduced in [11], and further extensively studied in several other studies [25], [48], [49]. It is important for our experiments to use an evolutionarily unrelated (non-homologous) protein dataset so that the structural features can be considered independently distributed, which is essential for the experimental quality of the statistical analysis of protein structures. ...
... In previous work we compared fuzzy contact map representations against crisp contact map representations [5] when the protein comparison was done through the universal similarity metric (USM) [6]. It was shown that the pairwise similarity values obtained through USM coupled with fuzzy contact maps made it possible to correctly cluster a set of proteins in terms of their structural class. ...
... This confirms that the fitness values are good enough to distinguish proteins of the same classes both with the crisp (classical) function and with the fuzzy ones. A similar conclusion was also obtained in [4], [5] but with a different type of analysis. ...
Conference Paper
The comparison of protein structures is an important problem in bioinformatics, and soft computing techniques were recently introduced to achieve a better representation and, potentially, better solving strategies. We focus here on the generalized maximum fuzzy contact map overlap model to analyze the impact of the fuzzy contact map's definition and the relation between the crisp and fuzzy costs. Surprisingly, we detected some situations where solving the fuzzy model gave better results in terms of crisp values than solving the crisp model directly.
... CD has been used for classification and data mining in [10] and it was obtained independently in [11,12] in the realm of table compression. UCD has been used in [13–15] to classify protein structures. Those studies, although groundbreaking, seem to be only an initial assessment of the power of the new methodology and leave open fundamental experimental questions that need to be addressed in order to establish how appropriate the use of the methodology is for classification of biological data. ...
Article
Background: Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and they seem to be confined only to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity and universality is its most novel striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness is tested on various data sets yielding a first massive quantitative estimate that the USM methodology and its approximations are of value. Despite the rich theory developed around USM, its experimental assessment has limitations: only a few data compressors have been tested in conjunction with USM and mostly at a qualitative level, no comparison among UCD, NCD and CD is available and no comparison of USM with existing methods, both based on alignments and not, seems to be available. Results: We experimentally test the USM methodology by using 25 compressors, all three of its known approximations and six data sets of relevance to Molecular Biology. This offers the first systematic and quantitative experimental assessment of this methodology, that naturally complements the many theoretical and the preliminary experimental results available. Moreover, we compare the USM methodology both with methods based on alignments and not. We may group our experiments into two sets. 
The first one, performed via ROC (Receiver Operating Curve) analysis, aims at assessing the intrinsic ability of the methodology to discriminate and classify biological sequences and structures. A second set of experiments aims at assessing how well two commonly available classification algorithms, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor Joining), can use the methodology to perform their task, their performance being evaluated against gold standards and with the use of well known statistical indexes, i.e., the F-measure and the partition distance. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of USM on biological data. The main ones are reported next. Conclusion: UCD and NCD are indistinguishable, i.e., they yield nearly the same values of the statistical indexes we have used, across experiments and data sets, while CD is almost always worse than both. UPGMA seems to yield better classification results with respect to NJ, i.e., better values of the statistical indexes (10% difference or above), on a substantial fraction of experiments, compressors and USM approximation choices. The compression program PPMd, based on PPM (Prediction by Partial Matching), for generic data and Gencompress for DNA, are the best performers among the compression algorithms we have used, although the difference in performance, as measured by statistical indexes, between them and the other algorithms depends critically on the data set and may not be as large as expected. PPMd used with UCD or NCD and UPGMA, on sequence data is very close, although worse, in performance with the alignment methods (less than 2% difference on the F-measure). Yet, it scales well with data set size and it can work on data other than sequences. 
In summary, our quantitative analysis naturally complements the rich theory behind USM and supports the conclusion that the methodology is worth using because of its robustness, flexibility, scalability, and competitiveness with existing techniques. In particular, the methodology applies to all biological data in textual format. The software and data sets are available under the GNU GPL at the supplementary material web page.
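To make the compressor-dependence discussed in this abstract concrete, the following minimal sketch computes a pairwise dissimilarity matrix with the standard normalized compression distance formula, (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}, under three general-purpose compressors from the Python standard library. These are stand-ins chosen for availability; they are not the 25 compressors evaluated in the paper (PPMd and Gencompress are not included), and the names `COMPRESSORS`, `ncd`, and `ncd_matrix` are assumptions of this example.

```python
import bz2
import lzma
import zlib

# Stand-in compressors from the standard library (assumption: the paper's
# own compressor set is different and much larger).
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lzma.compress,
}


def ncd(x: bytes, y: bytes, compress) -> float:
    """Normalized compression distance of x and y for a given compressor."""
    cx, cy = len(compress(x)), len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(items, compress):
    """Pairwise NCD values; row i, column j holds ncd(items[i], items[j])."""
    n = len(items)
    return [[ncd(items[i], items[j], compress) for j in range(n)]
            for i in range(n)]
```

The resulting matrix could then be handed to a hierarchical clustering routine such as an average-linkage (UPGMA-style) implementation. Running the same inputs through each entry of `COMPRESSORS` gives a quick feel for how much the values shift with the compressor.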
... A methodological study of its application to protein sequence classification has been published recently by Kocsor et al. (2006); they show that a compression based distance combined with a BLAST score has a performance even slightly better than that of the Smith-Waterman algorithm (SSEARCH). Krasnogor and Pelta (2004); Pelta et al. (2005) have used a slightly more general approximation of the Universal Similarity Metric to compare protein structures. The formula they use (and we use in this paper) is $\mathrm{USM}(x, y) = \dfrac{\max\{C(xy) - C(x),\; C(yx) - C(y)\}}{\max\{C(x),\; C(y)\}}$. ...
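The USM approximation quoted above can be sketched in a few lines, taking C(·) to be the length in bytes of the compressed input. The use of `zlib` as the compressor and the function names `compressed_size` and `usm` are assumptions for illustration; the cited works do not necessarily use this compressor.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """C(data): length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def usm(x: bytes, y: bytes) -> float:
    """Approximate USM(x, y) = max{C(xy) - C(x), C(yx) - C(y)} / max{C(x), C(y)}."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # x followed by y
    cyx = compressed_size(y + x)  # y followed by x
    return max(cxy - cx, cyx - cy) / max(cx, cy)
```

To compare protein structures this way, each structure first has to be serialized to bytes, for example as a textual contact map. Values near 0 suggest that one input adds little new information given the other, and values near 1 suggest the inputs share little.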
Article
Kolmogorov complexity has inspired several alignment-free distance measures, based on the comparison of lengths of compressions, which have been applied successfully in many areas. One of these measures, the so-called Universal Similarity Metric (USM), has been used by Krasnogor and Pelta to compare simple protein contact maps, showing that it yielded good clustering on four small datasets. We report an extensive test of this metric using a much larger and representative protein dataset: the domain dataset used by Sierk and Pearson to evaluate seven protein structure comparison methods and two protein sequence comparison methods. One result is that Krasnogor-Pelta method has less domain discriminant power than any one of the methods considered by Sierk and Pearson when using these simple contact maps. In another test, we found that the USM based distance has low agreement with the CATH tree structure for the same benchmark of Sierk and Pearson. In any case, its agreement is lower than the one of a standard sequential alignment method, SSEARCH. Finally, we manually found lots of small subsets of the database that are better clustered using SSEARCH than USM, to confirm that Krasnogor-Pelta's conclusions were based on datasets that were too small.
... VI.2]. This Universal Similarity Metric has been used successfully for instance to compute phylogenetic trees based on whole mitochondrial genomes [24,6], to cluster SARS virus [6], to compare protein structures [21,32], to reconstruct phylogenies from metabolic pathways [33], to classify languages [24], musical pieces [7,6,25], and images [36], to detect plagiarism in student assignments [4], and to cluster Russian literature [6]. Actually, since Kolmogorov complexities are non-computable in the Turing sense, the Universal Similarity Metric was not used in these applications as it stands, but approximations of it. ...
... Finally, and since $K(x, y) = K(x) + K(y|x^*) = K(y) + K(x|y^*)$, again up to constant additive precision [26], the conditional complexity $K(x|y^*)$ can be approximated by $C(xy) - C(y)$, and $K(y|x^*)$ can be approximated by $C(yx) - C(x)$. This leads to the following approximation of the Universal Similarity Metric [21,32], which is the one we use in this paper: ...
Article
TOPS diagrams are concise descriptions of the structural topology of proteins, and their comparison usually relies on a structural alignment of the corresponding vertex ordered and vertex and edge labelled graphs. Such an approach involves checking for the existence of subgraph isomorphisms, which is an NP complete problem even for this kind of graphs. Therefore, although there exist several algorithms for the alignment-based comparison of TOPS diagrams that are fast in practice, they have an exponential worst case complexity. Moreover, the alignment-based comparison of TOPS diagrams assumes conservation of contiguity between homologous TOPS diagram segments. In this paper, we explore the alignment-free comparison of TOPS diagrams. We consider on the one hand similarity and dissimilarity measures based on subword composition of the sequences of secondary structure elements, thus neglecting contact map information, and on the other hand the Universal Similarity Metric from Kolmogorov complexity theory. Effectiveness of these alignment-free methods for TOPS diagrams comparison is assessed by cluster validation techniques.
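One simple member of the family of subword-composition measures mentioned in this abstract is the Euclidean distance between normalized k-mer frequency vectors. The sketch below assumes that each protein has already been reduced to a string of secondary structure element symbols; the two-letter encoding in the usage note ("H" for helix, "E" for strand) and the function names are hypothetical, and the paper may use different measures or encodings.

```python
import math
from collections import Counter


def kmer_counts(s: str, k: int) -> Counter:
    """Count every overlapping subword of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def composition_distance(a: str, b: str, k: int = 2) -> float:
    """Euclidean distance between the k-mer frequency vectors of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    na = sum(ca.values()) or 1  # avoid division by zero for short inputs
    nb = sum(cb.values()) or 1
    words = ca.keys() | cb.keys()
    # Counter returns 0 for subwords that are missing from one string.
    return math.sqrt(sum((ca[w] / na - cb[w] / nb) ** 2 for w in words))
```

For example, `composition_distance("HEHEHE", "HEHEHE")` is 0.0, while two strings with no 2-mers in common, such as `"HHHH"` and `"EEEE"`, are at distance √2. No alignment is computed, so the cost is linear in the string lengths.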
... Fuzzy contact maps have been used successfully in protein structure comparison. Pelta et al. [40] paired fuzzy contact maps with the Universal Similarity Metric (USM) [32] to cluster the Chew-Kedem dataset into the correct protein families (globins, alpha-beta, tim-barrels, all beta, and alpha). ...
... It is of interest how fuzzy contact maps have performed well in the structural comparison of proteins and as such, validated the use of fuzzy contact maps as a representation of protein structure [40,16]. When considering substructure elements, there has been research into the relationship between similarity in contact maps and similarity in structure, in terms of helix-to-helix orientation [9]. ...
... Here, we must also acknowledge the fact that there are innumerable ways to construct fuzzy contact maps from a set of contact maps. For our methodology, we chose the membership function shape based on the results of previous research on fuzzy contact maps [40]. The threshold ranges were based on extending the CASP [6] standard threshold of 8Å to a fuzzy membership function. ...
Article
One approach to protein structure prediction is to first predict, from sequence, a thresholded and binary 2D representation of a protein's topology known as a contact map. The predicted contact map can be used as distance constraints to construct a 3D structure. We focus on the latter half of the process for helix pairs and present an approach that aims to obtain a set of non-binary distance constraints from contact maps. We extend the definition of “in contact” by incorporating fuzzy logic to construct fuzzy contact maps. Then, template-based retrieval and distance geometry bound smoothing were applied to obtain distance constraints in the form of a distance map. From the distance map, we can calculate the helix pair structure. Our experimental results indicate that distance constraints close to the true distance map could be predicted at various noise levels and the resulting structure was highly correlated to the predicted distance map.
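The difference between a crisp and a fuzzy contact map can be sketched as follows. The crisp map uses the 8 Å threshold mentioned in the citation context above. The fuzzy membership function is a linear ramp whose breakpoints (6.5 Å and 9.5 Å) are hypothetical values chosen for this example, not the membership function used in the cited work.

```python
import math


def crisp_contact(d: float, threshold: float = 8.0) -> float:
    """1.0 if two residues are within `threshold` angstroms, else 0.0."""
    return 1.0 if d <= threshold else 0.0


def fuzzy_contact(d: float, full: float = 6.5, zero: float = 9.5) -> float:
    """Degree of contact in [0, 1] from a linear ramp.

    Membership is 1 up to `full`, 0 from `zero` onwards, and linear in
    between. Both breakpoints are assumptions made for this sketch.
    """
    if d <= full:
        return 1.0
    if d >= zero:
        return 0.0
    return (zero - d) / (zero - full)


def contact_map(coords, membership):
    """Apply `membership` to every pairwise distance between 3D points.

    The diagonal (distance 0) is included and therefore always 1.0.
    """
    n = len(coords)
    return [[membership(math.dist(coords[i], coords[j])) for j in range(n)]
            for i in range(n)]
```

With these breakpoints a pair of residues exactly 8 Å apart has membership 0.5 in the fuzzy map, whereas the crisp map counts it as a full contact. `math.dist` requires Python 3.8 or later.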