Main protein structures

Source publication

Protein Structure Comparison through Fuzzy Contact Maps and the Universal Similarity Metric.

Conference Paper

Full-text available

Jan 2005

Comparing protein structures, either to infer biological functionality or to assess protein structure predictions, is an essential component of proteomic research. In this paper we extend our previous work on the use of the Universal Similarity Metric (USM) and Generalized Fuzzy Contact maps. More specifically we compare the impact that generaliz...

Context 1

... some cases, a protein structure may be composed of a set of three-dimensional chain structures. Figure 1 graphically shows the previous description. ...

Figure 1. Functional networks of β cells contain high-degree hubs. The...

Figure 2. β-cell functional networks exhibit positive assortative...

Figure 3. Assortativity and the largest adjacency-matrix eigenvalue of...

Figure 4. Hub cells carry most of the cross-correlation content of a...

Autopoietic influence hierarchies in pancreatic β cells

Preprint

Full-text available

Jan 2021

β cells are biologically essential for humans and other vertebrates. Because their functionality arises from cell-cell interactions, they are also a model system for collective organization among cells. There are currently two contradictory pictures of this organization: the hub-cell idea pointing at leaders who coordinate the others, and the elect...

AC: A Compression Tool for Amino Acid Sequences

Article

Feb 2019
Interdiscipl Sci Comput Life Sci

Advancement of protein sequencing technologies has led to the production of a huge volume of data that needs to be stored and transmitted. This challenge can be tackled by compression. In this paper, we propose AC, a state-of-the-art method for lossless compression of amino acid sequences. The proposed method works based on the cooperation between finite-context models and substitutional tolerant Markov models. Compared to several general-purpose and specific-purpose protein compressors, AC provides the best bit-rates. This method can also compress the sequences nine times faster than its competitor, paq8l. In addition, employing AC, we analyze the compressibility of a large number of sequences from different domains. The results show that viruses are the most difficult sequences to be compressed. Archaea and bacteria are the second most difficult ones, and eukaryota are the easiest sequences to be compressed.

Compression of Amino Acid Sequences

Conference Paper

Full-text available

Jun 2018

Amino acid sequences are known to be very hard to compress. In this paper, we propose a lossless compressor for efficient compression of amino acid sequences (AC). The compressor uses a cooperation between multiple context and substitutional tolerant context models. The cooperation between models is balanced with weights that benefit the models with better performance, according to a forgetting function specific for each model. We have shown consistently better compression results than other approaches, using low computational resources. The compressor implementation is freely available, under license GPLv3, at https://github.com/pratas/ac.

Data Compression Concepts and Algorithms and Their Applications to Bioinformatics

Article

Full-text available

Jan 2010
Entropy

Data compression at its base is concerned with how information is organized in data. Understanding this organization can lead to efficient ways of representing the information and hence data compression. In this paper we review the ways in which ideas and approaches fundamental to the theory and practice of data compression have been used in the area of bioinformatics. We look at how basic theoretical ideas from data compression, such as the notions of entropy, mutual information, and complexity have been used for analyzing biological sequences in order to discover hidden patterns, infer phylogenetic relationships between organisms and study viral populations. Finally, we look at how inferred grammars for biological sequences have been used to uncover structure in biological sequences.

MULTI-REGIONAL ANALYSIS OF CONTACT MAPS FOR PROTEIN STRUCTURE PREDICTION

Article

Full-text available

May 2009

On Using Fuzzy Contact Maps for Protein Structure Comparison

Conference Paper

Full-text available

Jun 2007

The comparison of protein structures is an important problem in bioinformatics, and soft computing techniques were recently introduced for achieving a better representation and, potentially, for getting better solving strategies. We focus here on the generalized maximum fuzzy contact map overlap model for analyzing the impact of the fuzzy contact map's definition, and the relation between the crisp and fuzzy costs. Surprisingly, we detected some situations where solving the fuzzy model gave better results in terms of crisp values than solving the crisp model directly.

Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment

Article

Full-text available

Feb 2007
BMC BIOINFORMATICS

Background: Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and they seem to be confined only to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity and universality is its most novel striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness are tested on various data sets yielding a first massive quantitative estimate that the USM methodology and its approximations are of value. Despite the rich theory developed around USM, its experimental assessment has limitations: only a few data compressors have been tested in conjunction with USM and mostly at a qualitative level, no comparison among UCD, NCD and CD is available and no comparison of USM with existing methods, both based on alignments and not, seems to be available.

Results: We experimentally test the USM methodology by using 25 compressors, all three of its known approximations and six data sets of relevance to Molecular Biology. This offers the first systematic and quantitative experimental assessment of this methodology, that naturally complements the many theoretical and the preliminary experimental results available. Moreover, we compare the USM methodology both with methods based on alignments and not. We may group our experiments into two sets. The first one, performed via ROC (Receiver Operating Curve) analysis, aims at assessing the intrinsic ability of the methodology to discriminate and classify biological sequences and structures. A second set of experiments aims at assessing how well two commonly available classification algorithms, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor Joining), can use the methodology to perform their task, their performance being evaluated against gold standards and with the use of well known statistical indexes, i.e., the F-measure and the partition distance. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of USM on biological data. The main ones are reported next.

Conclusion: UCD and NCD are indistinguishable, i.e., they yield nearly the same values of the statistical indexes we have used, across experiments and data sets, while CD is almost always worse than both. UPGMA seems to yield better classification results with respect to NJ, i.e., better values of the statistical indexes (10% difference or above), on a substantial fraction of experiments, compressors and USM approximation choices. The compression program PPMd, based on PPM (Prediction by Partial Matching), for generic data and Gencompress for DNA, are the best performers among the compression algorithms we have used, although the difference in performance, as measured by statistical indexes, between them and the other algorithms depends critically on the data set and may not be as large as expected. PPMd used with UCD or NCD and UPGMA, on sequence data is very close, although worse, in performance with the alignment methods (less than 2% difference on the F-measure). Yet, it scales well with data set size and it can work on data other than sequences. In summary, our quantitative analysis naturally complements the rich theory behind USM and supports the conclusion that the methodology is worth using because of its robustness, flexibility, scalability, and competitiveness with existing techniques. In particular, the methodology applies to all biological data in textual format. The software and data sets are available under the GNU GPL at the supplementary material web page.
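
The abstract above notes that USM can only be approximated via data compression. A minimal sketch of one such approximation follows, assuming the commonly cited normalized compression form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length in bytes. The use of zlib as the compressor and the helper names `compressed_size` and `ncd` are assumptions made for illustration; they are not taken from the paper, which evaluates 25 compressors of its own choosing.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in for C)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression dissimilarity between two byte strings.

    Uses the assumed form (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Values near 0 suggest similar inputs; values near 1 suggest unrelated ones.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical toy data: two unrelated pseudo-random amino acid strings.
    rng = random.Random(0)
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    seq_a = "".join(rng.choices(alphabet, k=1000)).encode()
    seq_b = "".join(rng.choices(alphabet, k=1000)).encode()

    print("NCD(a, a):", ncd(seq_a, seq_a))
    print("NCD(a, b):", ncd(seq_a, seq_b))
```

With a real compressor the value for identical inputs is small but usually not exactly zero, which is one reason the choice of compressor matters in the experiments described above.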

Compression Ratios Based on the Universal Similarity Metric Still Yield Protein Distances far from CATH Distances

Article

Full-text available

Mar 2006

Kolmogorov complexity has inspired several alignment-free distance measures, based on the comparison of lengths of compressions, which have been applied successfully in many areas. One of these measures, the so-called Universal Similarity Metric (USM), has been used by Krasnogor and Pelta to compare simple protein contact maps, showing that it yielded good clustering on four small datasets. We report an extensive test of this metric using a much larger and representative protein dataset: the domain dataset used by Sierk and Pearson to evaluate seven protein structure comparison methods and two protein sequence comparison methods. One result is that the Krasnogor-Pelta method has less domain discriminant power than any one of the methods considered by Sierk and Pearson when using these simple contact maps. In another test, we found that the USM-based distance has low agreement with the CATH tree structure for the same benchmark of Sierk and Pearson. In any case, its agreement is lower than that of a standard sequential alignment method, SSEARCH. Finally, we manually found many small subsets of the database that are better clustered using SSEARCH than USM, to confirm that Krasnogor-Pelta's conclusions were based on datasets that were too small.

Alignment-free comparison of TOPS strings

Article

Full-text available

TOPS diagrams are concise descriptions of the structural topology of proteins, and their comparison usually relies on a structural alignment of the corresponding vertex ordered and vertex and edge labelled graphs. Such an approach involves checking for the existence of subgraph isomorphisms, which is an NP-complete problem even for this kind of graph. Therefore, although there exist several algorithms for the alignment-based comparison of TOPS diagrams that are fast in practice, they have an exponential worst case complexity. Moreover, the alignment-based comparison of TOPS diagrams assumes conservation of contiguity between homologous TOPS diagram segments. In this paper, we explore the alignment-free comparison of TOPS diagrams. We consider on the one hand similarity and dissimilarity measures based on subword composition of the sequences of secondary structure elements, thus neglecting contact map information, and on the other hand the Universal Similarity Metric from Kolmogorov complexity theory. Effectiveness of these alignment-free methods for TOPS diagram comparison is assessed by cluster validation techniques.

A Computational Approach to Predicting Distance Maps from Contact Maps

Article

Tony Kuo

Predicting helix pair structure from fuzzy contact maps

Article

Jan 2014
APPL SOFT COMPUT

One approach to protein structure prediction is to first predict, from sequence, a thresholded and binary 2D representation of a protein's topology known as a contact map. The predicted contact map can be used as distance constraints to construct a 3D structure. We focus on the latter half of the process for helix pairs and present an approach that aims to obtain a set of non-binary distance constraints from contact maps. We extend the definition of “in contact” by incorporating fuzzy logic to construct fuzzy contact maps. Then, template-based retrieval and distance geometry bound smoothing were applied to obtain distance constraints in the form of a distance map. From the distance map, we can calculate the helix pair structure. Our experimental results indicate that distance constraints close to the true distance map could be predicted at various noise levels and the resulting structure was highly correlated to the predicted distance map.
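
The difference between a thresholded (crisp) contact map and a fuzzy one, as described in the abstract above, can be illustrated with a small sketch. Everything specific here is an assumption for illustration: the 8.0 Å crisp threshold, the linear membership function falling from 1 at 6.0 Å to 0 at 10.0 Å, and the function names are hypothetical and are not the definitions used in the paper.

```python
import math


def fuzzy_contact(distance: float, full: float = 6.0, zero: float = 10.0) -> float:
    """Degree of contact in [0, 1] for a residue pair at the given distance.

    Assumed linear membership: 1 at or below `full`, 0 at or above `zero`,
    and a straight-line decrease in between.
    """
    if distance <= full:
        return 1.0
    if distance >= zero:
        return 0.0
    return (zero - distance) / (zero - full)


def crisp_contact_map(coords, threshold: float = 8.0):
    """Binary map: 1 if two residues lie within `threshold`, else 0."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def fuzzy_contact_map(coords, full: float = 6.0, zero: float = 10.0):
    """Real-valued map holding the degree of contact for every residue pair."""
    n = len(coords)
    return [
        [fuzzy_contact(math.dist(coords[i], coords[j]), full, zero) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Hypothetical residue coordinates (in angstroms) placed on a line.
    residues = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (8.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
    for row in crisp_contact_map(residues):
        print(row)
    for row in fuzzy_contact_map(residues):
        print([round(value, 2) for value in row])
```

The crisp map discards how close a pair is to the threshold, while the fuzzy map keeps a graded value that can later be turned into non-binary distance constraints.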
