Matching Strategies. Schematic overview of the three matching strategies. 1a, one-to-many matching; 1b, many-to-one matching; 1c, the two superimposed. Lines represent template searches; arrows, matches; bold lines, correct matches; other lines, incorrect matches; X’s, no match. Purple spheres are residues in both the source and target template and match; red spheres, residues in the query template and target match; blue spheres, residues in the target template and query match. doi:10.1371/journal.pone.0002136.g001 


Source publication
Article
Full-text available
Function prediction frequently relies on comparing genes or gene products to search for relevant similarities. Because the number of protein structures with unknown function is mushrooming, however, we asked here whether such comparisons could be improved by focusing narrowly on the key functional features of protein structures, as defined by the E...

Citations

... ETA then searches already annotated protein structures, the targets, for those that match the query 3D template ( Fig. 1 and Movie S1). False positive matches are common but can be recognized because they typically (i) involve unimportant residues in the target (39), (ii) are not reciprocated back to the query (40), and (iii) point to multiple proteins that each bear unrelated functions. With appropriate specificity filters to eliminate these false positives, ETA identified enzyme activity down to the first three Enzyme Commission (EC) levels with 92% accuracy (40), as well as in nonenzymes (41) in large-scale Structural Genomics retrospective controls. ...
... False positive matches are common but can be recognized because they typically (i) involve unimportant residues in the target (39), (ii) are not reciprocated back to the query (40), and (iii) point to multiple proteins that each bear unrelated functions. With appropriate specificity filters to eliminate these false positives, ETA identified enzyme activity down to the first three Enzyme Commission (EC) levels with 92% accuracy (40), as well as in nonenzymes (41) in large-scale Structural Genomics retrospective controls. The prediction of substrate specificity remains an open question and further requires accurate identification of the fourth and last EC level (42) presumably by adding a more discriminating use of 3D template residues than is sufficient to specify a general chemical process (43). ...
... Each position is geometrically represented by the 3D Cartesian coordinates of the selected residue's alpha carbon atom. ETA templates use both the residue labels in the query structure (native templates) and a combination of variations that were observed at least twice in the multiple sequence alignment (variations) for the identified positions (40). Further, the paired-distance algorithm (40) searches the query template against a "target" library of proteins with known functions for geometric similarity and, in doing so, identifies geometric matches whose residue labels agree with those in the query template, subject to the criterion that each pair of residues in the template and the matched region is within a distance of 2.5 Å. ...
Article
Full-text available
Significance Many proteins solved by Structural Genomics have low sequence identity to other proteins and cannot be assigned functions. To address this problem, we present a computational approach that creates structural motifs of a few evolutionarily important residues, and these motifs probe local geometric and evolutionary similarities in other protein structures to detect functional similarities. This approach does not require prior knowledge of functional mechanisms and is highly accurate in computational benchmarks when annotations rely on homologs with low sequence identity. We further demonstrate the accuracy of this approach using biochemical and mutagenesis studies to validate two predictions of unannotated proteins.
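The paired-distance search quoted in the citation context above can be illustrated with a small sketch. It rests on assumptions that are not confirmed by the source: the 2.5 Å criterion is read here as a tolerance on the difference between corresponding inter-residue distances, the function name `find_matches` and the data layout are hypothetical, and the exhaustive backtracking search stands in for whatever search the published algorithm actually uses.

```python
import math

def find_matches(template, target, tol=2.5):
    """Enumerate placements of a template in a target structure.

    template: list of (allowed_labels, xyz) tuples, one per template position,
              where allowed_labels is a set of residue types (native label
              plus accepted variations) and xyz is an alpha carbon coordinate.
    target:   list of (label, xyz) tuples, one per target residue.
    tol:      assumed tolerance in angstroms on pairwise distance differences.

    Returns a list of tuples of target indices, one index per template position.
    """
    matches = []

    def extend(assigned):
        k = len(assigned)
        if k == len(template):
            matches.append(tuple(assigned))
            return
        allowed, t_xyz = template[k]
        for j, (label, xyz) in enumerate(target):
            # A target residue is used at most once and must carry an allowed label.
            if j in assigned or label not in allowed:
                continue
            # Every distance to an already placed residue must agree within tol.
            consistent = all(
                abs(math.dist(template[i][1], t_xyz)
                    - math.dist(target[assigned[i]][1], xyz)) <= tol
                for i in range(k)
            )
            if consistent:
                extend(assigned + [j])

    extend([])
    return matches

# Toy data (made up for illustration): a three-residue template and a small target.
template = [({"H"}, (0, 0, 0)), ({"D", "E"}, (5, 0, 0)), ({"S"}, (0, 5, 0))]
target = [("G", (10, 10, 10)), ("H", (1, 1, 1)), ("E", (6, 1, 1)),
          ("S", (1, 6, 1)), ("S", (30, 30, 30))]
hits = find_matches(template, target)
```

Comparing internal distances rather than raw coordinates makes the check independent of how the two structures are oriented, which is one plausible reason for a distance-based formulation.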
... ET predictions have been extensively validated experimentally (Onrust et al., 1997; Rajagopalan et al., 2006; Ribes-Zamora et al., 2007; Rodriguez et al., 2010; Shenoy et al., 2006; Sowa et al., 2000, 2001) and through large-scale retrospective predictions of functional sites (Yao et al., 2003) and protein functions (Venner et al., 2010). These studies point to a number of general and consistent observations in well-structured protein domains: (i) sequence positions may be ranked by evolutionary importance; (ii) most important sequence residues cluster structurally (Madabushi et al., 2002); (iii) these structural clusters predict functional sites (Yao et al., 2003), such that (iv) small structure–function motifs called 3D templates based on these clusters can predict protein function on a genomic scale (Erdin et al., 2010; Kristensen et al., 2008; Venner et al., 2010; Ward et al., 2008). The evolutionary principles that give rise to these useful patterns remain unclear. ...
Article
Full-text available
The constraints under which sequence, structure, and function co-evolve are not fully understood. Bringing this mutual relationship to light can reveal the molecular basis of binding, catalysis and allostery, thereby identifying function and rationally guiding protein redesign. Underlying these relationships are the epistatic interactions that occur when the consequences of a mutation to a protein are determined by the genetic background in which it occurs. Based on prior data, we hypothesize that epistatic forces operate most strongly between residues nearby in the structure, resulting in smooth evolutionary importance across the structure. Methods and Results: We find that when residue scores of evolutionary importance are distributed smoothly between nearby residues, functional site prediction accuracy improves. Accordingly, we designed a novel measure of evolutionary importance that focuses on the interaction between pairs of structurally neighboring residues. This measure that we term pair-interaction Evolutionary Trace (piET) yields greater functional site overlap and better structure-based proteome-wide functional predictions. Conclusions: Our data show that the structural smoothness of evolutionary importance is a fundamental feature of the co-evolution of sequence, structure, and function. Mutations operate on individual residues, but selective pressure depends in part on the extent to which a mutation perturbs interactions with neighboring residues. In practice, this principle led us to redefine the importance of a residue in terms of the importance of its epistatic interactions with neighbors, yielding better annotation of functional residues, motivating experimental validation of a novel functional site in LexA, and refining protein function prediction.
... We systematically tested this protocol for enzymes [17,23,24] and non-enzymes [24] using Enzyme Commission (EC) numbers [25] and Gene Ontology terms [26] as functional classifications. The accuracy was 92% and 94% for enzymes and non-enzymes respectively, with sensitivity near 50% in both [23,24]. To raise sensitivity, we then pooled together all ETA matches into a network of protein structures [27] and let functional information diffuse globally within it from proteins of known function to unannotated ones. ...
... Match filtering: In the next round, ETA eliminates self-matches and matches with RMSD greater than 2 Å. The remaining matches are fed into a Support Vector Machine (SVM) trained for enzymes [23]. Each match is represented by a six- or seven-dimensional vector, depending on the template size. ...
Article
Full-text available
Annotating protein function with both high accuracy and sensitivity remains a major challenge in structural genomics. One proven computational strategy has been to group a few key functional amino acids into templates and search for these templates in other protein structures, so as to transfer function when a match is found. To this end, we previously developed Evolutionary Trace Annotation (ETA) and showed that diffusing known annotations over a network of template matches on a structural genomic scale improved predictions of function. In order to further increase sensitivity, we now let each protein contribute multiple templates rather than just one, and also let the template size vary. Retrospective benchmarks in 605 Structural Genomics enzymes showed that multiple templates increased sensitivity by up to 14% when combined with single template predictions even as they maintained the accuracy over 91%. Diffusing function globally on networks of single and multiple template matches marginally increased the area under the ROC curve over 0.97, but in a subset of proteins that could not be annotated by ETA, the network approach recovered annotations for the most confident 20-23 of 91 cases with 100% accuracy. We improve the accuracy and sensitivity of predictions by using multiple templates per protein structure when constructing networks of ETA matches and diffusing annotations.
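The match-filtering step quoted in the citation context above (removal of self-matches and of matches with RMSD greater than 2 Å) might look like the following sketch. The `Match` record, its field names, and the `prefilter` function are hypothetical, the RMSD values are assumed to be precomputed, and the subsequent SVM stage is not reproduced because the source does not describe its features in enough detail.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    query: str    # identifier of the structure that supplied the template
    target: str   # identifier of the structure in which the template was found
    rmsd: float   # precomputed RMSD of the matched residues, in angstroms

def prefilter(matches, max_rmsd=2.0):
    """Drop self-matches and matches whose RMSD exceeds max_rmsd."""
    return [m for m in matches if m.query != m.target and m.rmsd <= max_rmsd]

# Made-up identifiers and values, for illustration only.
candidates = [
    Match("Q", "Q", 0.0),    # self-match, removed
    Match("Q", "T1", 1.4),   # kept
    Match("Q", "T2", 2.6),   # RMSD too large, removed
]
kept = prefilter(candidates)
```

Only the matches that survive this cheap filter would then be passed to a more expensive classifier.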
... It then seeks a match that is similar both geometrically and evolutionarily in protein structures with known function. These Evolutionary Trace Annotation (ETA) templates usually overlap with catalytic sites (Ward et al., 2008) and identify function with 87% accuracy at 61% coverage (Kristensen et al., 2008). Using a network in which structures form the nodes and ETA matches form the edges helps to overcome limitations from sparse functional data. ...
... The "Add new node to Network" menu option queries the ETA Server (Ward et al., 2009), which opens in a browser and suggests an ET template that the user may customize. This template is then matched against proteins in the network and matches are filtered as described previously (Ward et al., 2008). Modified networks may be saved and later reloaded. ...
... 2f2gB's direct matches are well below the reliable homology range with sequence identities of 16% with 1rtwD and 14% with 1z72B. As observed previously (Ward et al., 2008), the direct matches share overall structural similarity: many of the proteins in this cluster belong to the CATH heme oxygenase superfamily. ...
Article
Most proteins lack experimentally validated functions. To address this problem, we implemented the Evolutionary Trace Annotation (ETA) method in the Cytoscape network visualization environment. The result is the ETAscape plugin, which builds a structural genomics network based on local structural and evolutionary similarities among proteins and then globally diffuses known annotations across the resulting network. The plugin displays these novel functional annotations, their confidence, the molecular basis for individual matches and the set of matches that lead to a prediction. Availability: The ETA Network Plugin is available publicly for download at http://mammoth.bcm.tmc.edu/networks/.
... In practice, ET residues have remarkable structural and functional properties:
- They cluster together spatially in the protein structure (3)
- These clusters map out on the protein surface possible functional sites for catalysis or ligand binding (4)
- Internal clusters of ET residues presumably form the folding core of the protein, and, in some cases, play a critical role in allosteric regulation and specificity (5)
- Mutations directed to ET residues will alter function in a variety of ways (6)(7)(8)
- Mimicry of ET residues leads to peptides with functional properties (9)
- And in silico mimicry of top-ranked ET residues identifies functional similarity (10,11)
For example, this early version of ET detected functional residues and directed mutational studies into the molecular basis of G protein signaling (12)(13)(14). One hundred mutations of the Galpha-protein confirmed prior ET predictions of binding sites to the G beta gamma subunits and to the G protein-coupled receptor (15). ...
... A series of technical studies developed these ideas into an Evolutionary Trace Annotation (ETA) pipeline to predict the function of novel protein structures. ET rankings proved useful to define small structure-function motifs called 3D-templates (27), to identify meaningful geometric and evolutionary matches of these templates to other protein structures based on reciprocity (10), and voting plurality (28) in order to infer function in enzymes and non-enzymes alike (10,11). ETA was extensively benchmarked; for example, its positive predictive value was 93% (10) in 1218 SG enzymes (whose functions were described to the first three digits of the Enzyme Commission classification, EC numbers). ...
Article
Full-text available
The evolutionary trace (ET) is the single most validated approach to identify protein functional determinants and to target mutational analysis, protein engineering and drug design to the most relevant sites of a protein. It applies to the entire proteome; its predictions come with a reliability score; and its results typically reach significance in most protein families with 20 or more sequence homologs. In order to identify functional hot spots, ET scans a multiple sequence alignment for residue variations that correlate with major evolutionary divergences. In case studies this enables the selective separation, recoding, or mimicry of functional sites and, on a large scale, this enables specific function predictions based on motifs built from select ET-identified residues. ET is therefore an accurate, scalable and efficient method to identify the molecular determinants of protein function and to direct their rational perturbation for therapeutic purposes. Public ET servers are located at: http://mammoth.bcm.tmc.edu/.
... The ET method used the structure and an alignment of 140 sequences from 80 different sources (including animals, plants, bacteria, and viruses). To assay the effect of each mutation, we also determined if they belonged to any functional sites and how the local sequence (20 closest residues) and structural environment (residues within 1 nm) of each mutant change to accommodate sequence alterations during evolution [Ward et al., 2008]. ...
Article
Desmosterolosis, a rare disorder of cholesterol biosynthesis, is caused by mutations in DHCR24, the gene encoding the enzyme 24-dehydrocholesterol reductase (DHCR24). To date, desmosterolosis has been described in only two patients. Here we report on a third patient with desmosterolosis who presented after delivery with relative macrocephaly, mild arthrogryposis, and dysmorphic facial features. Brain MRI revealed hydrocephalus, thickening of the tectum and massa intermedia, mildly effaced gyral pattern, underopercularization, and a thin corpus callosum. The diagnosis of desmosterolosis was established by detection of significant elevation of plasma desmosterol levels and reduced enzyme activity of DHCR24 upon expression of the patient's DHCR24 cDNA in yeast. The patient was found to be a compound heterozygote for c.281G>A (p.R94H) and c.1438G>A (p.E480K) mutations. Structural and evolutionary analyses showed that residue R94 resides at the flavin adenine dinucleotide (FAD) binding site and is strictly conserved throughout evolution, while residue E480 is less conserved, but the charge shift substitution is accompanied by drastic changes in the local protein environment of that residue. We compare the phenotype of our patient with previously reported cases.
... The second imposes plurality, so that a function is passed to a protein only if that function recurs more often than any other in all of its hits [60]. And the third filter requires hit reciprocity, so that if the template of protein A has a hit on protein B, the reverse is also true: the template of protein B will hit protein A [61]. With all of these filters applied together, the positive predictive value (PPV) up to the third digit of EC numbers rose to 92% in a large-scale control over more than 1200 SG proteins. ...
Article
Genomic centers discover increasingly many protein sequences and structures, but not necessarily their full biological functions. Thus, currently, less than one percent of proteins have experimentally verified biochemical activities. To fill this gap, function prediction algorithms apply metrics of similarity between proteins on the premise that those sufficiently alike in sequence, or structure, will perform identical functions. Although high sensitivity is elusive, network analyses that integrate these metrics together hold the promise of rapid gains in function prediction specificity.
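The reciprocity and plurality filters described in the citation context above can be sketched as follows. The function names, the dictionary layout, the example labels, and the order in which the two filters are applied are assumptions made for illustration; the source states only that the filters are applied together. A tie between the two most frequent functions is treated here as no prediction, following the requirement that the winning function recur more often than any other.

```python
from collections import Counter

def reciprocal_hits(hits):
    """Keep only hits that are returned in both directions.

    hits: dict mapping each protein to the set of proteins its template hits.
    """
    return {
        query: {t for t in targets if query in hits.get(t, set())}
        for query, targets in hits.items()
    }

def plurality_function(targets, annotations):
    """Return the function seen strictly more often than any other, else None.

    annotations: dict mapping annotated proteins to a function label.
    """
    counts = Counter(annotations[t] for t in targets if t in annotations)
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no function recurs more often than every other
    return ranked[0][0]

# Made-up example: A hits B, C and D, but only B and C hit A back.
hits = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": set()}
annotations = {"B": "3.1.1", "C": "3.1.1", "D": "2.7.1"}
mutual = reciprocal_hits(hits)
prediction = plurality_function(mutual["A"], annotations)
```

In this toy case the unreciprocated hit on D is discarded before voting, so the prediction for A rests on B and C alone.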
... Building on these computational and experimental studies that demonstrate evolutionary identification of functional determinants, our approach ranks the relative evolutionary importance of every residue in a protein sequence with the Evolutionary Trace [35], [55] (ET), and then selects the six most important and clustered surface residues to define a 3D template. The geometric matches of these evolutionary templates in other protein structures at sites that are themselves evolutionarily important then define Evolutionary Trace Annotation (ETA) annotations [31], [56]. So far, ETA annotations have been shown to be functionally specific (positive predictive values above 90%) in enzymes and non-enzymes alike [57], but their functional resolution and coverage are limited. ...
Article
Full-text available
High-throughput Structural Genomics yields many new protein structures without known molecular function. This study aims to uncover these missing annotations by globally comparing select functional residues across the structural proteome. First, Evolutionary Trace Annotation, or ETA, identifies which proteins have local evolutionary and structural features in common; next, these proteins are linked together into a proteomic network of ETA similarities; then, starting from proteins with known functions, competing functional labels diffuse link-by-link over the entire network. Every node is thus assigned a likelihood z-score for every function, and the most significant one at each node wins and defines its annotation. In high-throughput controls, this competitive diffusion process recovered enzyme activity annotations with 99% and 97% accuracy at half-coverage for the third and fourth Enzyme Commission (EC) levels, respectively. This corresponds to false positive rates 4-fold lower than nearest-neighbor and 5-fold lower than sequence-based annotations. In practice, experimental validation of the predicted carboxylesterase activity in a protein from Staphylococcus aureus illustrated the effectiveness of this approach in the context of an increasingly drug-resistant microbe. This study further links molecular function to a small number of evolutionarily important residues recognizable by Evolutionary Tracing and it points to the specificity and sensitivity of functional annotation by competitive global network diffusion. A web server is at http://mammoth.bcm.tmc.edu/networks.
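The competitive diffusion described in the abstract above can be illustrated with a deliberately simple stand-in. This is not the authors' formulation: the sketch assumes plain neighbor averaging with annotated nodes held fixed (a basic form of label propagation), computes a z-score for each function across all nodes, and assigns each unannotated node the function with the highest z-score. All names (`diffuse`, `call_annotations`) and the toy network are hypothetical.

```python
import statistics

def diffuse(graph, known, n_iter=50):
    """Spread each functional label over the network by neighbor averaging.

    graph: dict mapping each node to the set of its neighbors (undirected).
    known: dict mapping annotated nodes to their function label.
    Returns {label: {node: score}}.
    """
    labels = sorted(set(known.values()))
    scores = {
        lab: {n: 1.0 if known.get(n) == lab else 0.0 for n in graph}
        for lab in labels
    }
    for _ in range(n_iter):
        for lab in labels:
            old = scores[lab]
            new = {}
            for node, nbrs in graph.items():
                if node in known or not nbrs:
                    new[node] = old[node]  # annotated or isolated nodes stay fixed
                else:
                    new[node] = sum(old[m] for m in nbrs) / len(nbrs)
            scores[lab] = new
    return scores

def call_annotations(graph, known, n_iter=50):
    """Assign each unannotated node the label with the highest z-score."""
    scores = diffuse(graph, known, n_iter)
    z = {}
    for lab, s in scores.items():
        mu = statistics.mean(s.values())
        sd = statistics.pstdev(s.values()) or 1.0  # guard against zero spread
        z[lab] = {n: (v - mu) / sd for n, v in s.items()}
    return {n: max(z, key=lambda lab: z[lab][n]) for n in graph if n not in known}

# Toy network (made up): two nodes annotated X, one annotated Y, two unknown.
graph = {
    "a": {"u1"}, "a2": {"u1"},
    "u1": {"a", "a2", "u2"},
    "u2": {"u1", "b"},
    "b": {"u2"},
}
known = {"a": "X", "a2": "X", "b": "Y"}
calls = call_annotations(graph, known)
```

Here the unknown node adjacent to two X-annotated proteins is called X, while the one adjacent to the Y-annotated protein is called Y, which mirrors the idea that competing labels spread link by link and the most significant one wins at each node.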
... The template can then be used to search all PDB structures for similarities that suggest a common function. While geometric matches within 2 Å root mean square deviation are often random, the specificity rises to over 90% once these matches are also filtered for (i) the importance of the matched site [86], (ii) reciprocity, so the 3D-template of the match matches back to the query [87], and (iii) plurality, so that multiple matches point to the same function more than to any other [88]. This approach is scalable to the structural proteome and can annotate over 1200 structural genomics enzymes up to three Enzyme Commission digits with 92% accuracy [89], or non-enzymes using the Gene Ontology functional classification [29]. ...
Article
Protein interactions give rise to networks that control cell fate in health and disease; selective means to probe these interactions are therefore of wide interest. We discuss here Evolutionary Tracing (ET), a comparative method to identify protein functional sites and to guide experiments that selectively block, recode, or mimic their amino acid determinants. These studies suggest, in principle, a scalable approach to perturb individual links in protein networks.