
Non-sequential Structure-based Alignments Reveal Topology-independent Core Packing Arrangements in Proteins

May 2005
Bioinformatics 21(7):1010-9

DOI:10.1093/bioinformatics/bti128

Source
PubMed

Authors:

Christopher Bystroff

Rensselaer Polytechnic Institute



Non-sequential Structure-based Alignments Reveal Topology-independent Core Packing Arrangements in Proteins

Xin Yuan* and Christopher Bystroff*

Department of Biology, Rensselaer Polytechnic Institute, Troy, NY 12180, USA

Phone: 518-276-3185, Fax: 518-276-2162

Email: bystrc@rpi.edu, yuanx2@rpi.edu

*To whom correspondence should be addressed

Keywords: topology independent structure alignment, protein classification, structure

prediction, hidden Markov model, conserved core packing, contact map

Bioinformatics Advance Access published November 5, 2004

ABSTRACT

Motivation: Proteins of the same class often share a secondary structure packing arrangement

but differ in how the secondary structure units are ordered in the sequence. We find that proteins

that share a common core also share local sequence-structure similarities, and these can be

exploited to align structures with different topologies. In this study, segments from a library of

local sequence-structure alignments were assembled hierarchically, enforcing the compactness

and conserved inter-residue contacts but not sequential ordering. Previous structure-based

alignment methods often ignore sequence similarity, local structural equivalence, and

compactness.

Results: The new program, SCALI (Structural Core ALIgnment), can efficiently find conserved

packing arrangements, even if they are non-sequentially ordered in space. SCALI alignments

conserve remote sequence similarity and contain fewer alignment errors. Clustering of our

pairwise non-sequential alignments shows that recurrent packing arrangements exist in

topologically different structures. For example, the 3-layer sandwich domain architecture may be

divided into four structural subclasses based on internal packing arrangements. These subclasses

represent an intermediate level of structure classification, more general than topology but more

specific than architecture as defined in CATH. A strategy is presented for developing a set of

predictive hidden Markov models based on multiple SCALI alignments.

Availability: An online topology independent SCALI structure comparison server is available at

http://www.bioinfo.rpi.edu/~bystrc/scali.html.

Contact: bystrc@rpi.edu; yuanx2@rpi.edu

INTRODUCTION

Recurrent structural motifs in proteins can be found by structure-based alignment

methods. Generally it is assumed that similar protein structures will align to each other in a

sequential manner, conserving the direction of the chain and the order of the structural units. But

many examples exist of structural similarity that is non-sequential, produced possibly by

sequence rearrangements (Janowski, et al., 2001;Bennett, et al., 1994;Schiering, et al.,

2000;Jeltsch, 1999;Gong, et al., 1997;Iwakura, et al., 2000;Viguera, et al., 1995;Smith, et al.,

2001;Jung, et al., 2001) or by convergent evolution (Rost, 1997;Milik, et al., 2003). Circular

permutants and other rearrangements represent the topologically possible and energetically

favorable ways of arranging secondary structure units along the chain.

Structural similarities that have permuted orders are interesting because they reveal

recurrent structural packing themes in proteins (Efimov, 1995;Abagyan, et al., 1989;Alexandrov,

1996). Examples are presented later in this paper. These recurrent themes may be used to build

predictive models. But so far, there are no sequence models for the well-known structural

paradigms at the level of protein architecture. Instead, the focus has been on predicting structure

at the family, superfamily or fold level (Eddy, 1998;Karplus, et al., 1998;Gough, et al., 2002).

The most successful of these are hidden Markov models (HMM), which generally do not allow

for non-sequential alignments. The lack of non-sequential HMMs may be due in part to the

difficulty in obtaining good structural alignments without sequential constraints. Because of this,

a recurrent structural motif in a new protein may not be recognized as such if it is sequentially

permuted.

Alignments of topologically different structures may be found by inspection using an

interactive graphics program such as Rasmol (Bernstein, 2000;Sayle, et al., 1995). The spatial

alignment of permuted segments is often remarkably good, yet most structure-based alignment

programs, such as DALI (Holm, et al., 1993), CE (Shindyalov, et al., 1998), VAST (Gibrat, et

al., 1996), PrISM (Yang, et al., 2000) and MAMMOTH (Ortiz, et al., 2002), cannot find these

superpositions because they assume the aligned segments are sequentially ordered. Two

exceptions to this rule are SARF (Alexandrov, 1996;Alexandrov, et al., 1996) and K2

(Szustakowski, et al., 2002;Szustakowski, et al., 2000), which consider non-topological

alignments using secondary structure element information. However, neither of these programs

considers the sequence similarity. Our goal was to build sequence models from structure-based

alignments; therefore we developed a new program that optimizes both the structure and

sequence similarity.

The new program, SCALI (Structural Core ALIgnment), was conceived based on the

following five criteria defining a biologically relevant structure-based alignment (Koehl,

2001;Flores, et al., 1993;Taylor, et al., 1989). (1) Aligned residues should conserve structure

locally (i.e. backbone angles). (2) Contacts between pairs of aligned residues should be

conserved. (3) The alignment as a whole should be spatially compact, rather than dispersed. (4)

Aligned segments should have some degree of sequence similarity. (5) The sequence order of

aligned segments should be minimally permuted.

In preliminary studies, we constructed non-sequential alignments manually for two cases

of topologically different proteins with similar 3D core packing arrangements, one of which is

illustrated in Figure 1. We then attempted to reproduce the manually constructed alignments automatically, using a fragment assembly strategy. The new program was compared with two of the most commonly used structure alignment programs, DALI and CE, and with two non-sequential alignment programs, SARF and K2.

Pairwise SCALI alignments of representative protein structures were clustered to produce

multiple structure alignments. Within these clusters we found recurrent core packing

arrangements that could be used as models for structure prediction. Hidden Markov models

(HMM) based on these “cores” are presented here in a diagrammatic form. These models

represent a level of structural classification that is more general than “fold” or “topology” but

more specific than “architecture” or “class”. Applications of the new non-topological HMMs for

structure prediction and design are discussed. Recurrent core packing geometries may also tell us

something about the folding process.

METHODS

SCALI: non-sequential sequence-structure alignment

SCALI aligns structures in a three-step process. First we generate a library of gapless

local sequence-structure alignments (“fragments”) using HMMSTR (Bystroff, et al., 2000). The

second step is a tree search in alignment space, where each branch point is the addition of a new

fragment to the alignment. Finally, the best alignments are pruned and extended.

HMMSTR (Hidden Markov Model for protein STRucture) is an almost comprehensive

model for local sequence/structure correlations in proteins (Bystroff, et al., 2000). In HMMSTR,

each Markov state represents a single position in an I-sites motif (Bystroff, et al., 1998). Each

state contains information about the amino acid preference and the preferred backbone angles.

The transitions between the states model the adjacencies of motifs in protein sequences.

HMMSTR has been used for secondary structure prediction, remote homolog detection (Hou, et

al., 2003) and for developing knowledge-based contact potentials (Shao, et al., 2003). The

algorithms for using this and other HMMs are described in Rabiner’s classic tutorial (Rabiner,

1989).

To align two structures using SCALI, we first computed the position specific HMMSTR

state probabilities, denoted γ, using the Forward/Backward algorithm (Rabiner, 1989). The input

to this program was a sequence profile derived from PSI-BLAST (Altschul, et al., 1997) as

described previously (Bystroff, et al., 1998).

Next, we made an exhaustive list of all aligned fragments. To obtain this list, we first

calculated the alignment matrix A as the dot-product of the state probabilities:

A_{ij} = \sum_{q} \gamma_{iq}^{target} \, \gamma_{jq}^{template}    (1)

where q represents a Markov state in the HMMSTR model, and \gamma_{iq} is the probability of state q

at position i. The score S(i, j, L) for a fragment of length L, starting at position i in the target and

j in the template is simply the sum over a diagonal segment of the alignment matrix A:

S(i, j, L) = \sum_{k=0}^{L-1} A_{(i+k)(j+k)}    (2)
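Equations 1 and 2 can be illustrated with a short sketch. It assumes the state probabilities are available as plain lists of lists (one row per sequence position, one column per Markov state) and uses 0-based indices. The function names and the three-state toy values are hypothetical and are not HMMSTR output.

```python
def alignment_matrix(gamma_target, gamma_template):
    """A[i][j] = sum over states q of gamma_target[i][q] * gamma_template[j][q]."""
    return [
        [sum(a * b for a, b in zip(row_t, row_p)) for row_p in gamma_template]
        for row_t in gamma_target
    ]


def fragment_score(A, i, j, L):
    """S(i, j, L): sum of A along the diagonal segment starting at (i, j)."""
    return sum(A[i + k][j + k] for k in range(L))


# Toy state probabilities for three positions and three hypothetical states.
gamma_target = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 0.7]]
gamma_template = [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

A = alignment_matrix(gamma_target, gamma_template)
score = fragment_score(A, 0, 0, 3)  # score of the full main diagonal
```

Because each row of state probabilities sums to one, every entry of A lies between 0 and 1, so a fragment score can never exceed the fragment length.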

All possible fragments, defined by the positions i, j, and the length L, were compiled into a list, subject to the following constraints. A fragment

(1) must have no gaps or insertions,

(2) must have no backbone angle difference greater than 90˚,

(3) must be at least 5 residues in length, and

(4) must not be contained within a longer fragment that has a higher score.

Fragments were sorted by their alignment score,

S (Eq. 2). In every example of two

aligned segments that have no backbone angle differences greater than 90°, the two segments are

superimposable with a low root-mean-square deviation (RMSD). There is no upper limit on the

length of a fragment.
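The fragment constraints can be sketched as a scan along each diagonal of the alignment matrix. This sketch assumes a precomputed matrix `angle_diff[i][j]` holding the backbone angle difference in degrees between target position i and template position j. That name, and the way the difference is measured, are assumptions made for illustration. Only maximal runs are kept, which is a simplification of constraint (4).

```python
def maximal_fragments(angle_diff, max_diff=90.0, min_len=5):
    """Return (i, j, L) for every maximal gapless diagonal run in which all
    angle_diff[i + k][j + k] <= max_diff and the run length L >= min_len."""
    n, m = len(angle_diff), len(angle_diff[0])
    fragments = []
    # Each diagonal is identified by its first cell on the top row or left column.
    starts = [(0, j) for j in range(m)] + [(i, 0) for i in range(1, n)]
    for i0, j0 in starts:
        run_start = None
        k = 0
        while True:
            i, j = i0 + k, j0 + k
            inside = i < n and j < m
            ok = inside and angle_diff[i][j] <= max_diff
            if ok and run_start is None:
                run_start = k  # a new run begins here
            if not ok and run_start is not None:
                length = k - run_start  # the run ended just before offset k
                if length >= min_len:
                    fragments.append((i0 + run_start, j0 + run_start, length))
                run_start = None
            if not inside:
                break
            k += 1
    return fragments
```

Each returned triple could then be scored with S(i, j, L) from Eq. 2 and the list sorted by that score.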

A breadth-first tree search in alignment space was conducted using a contact map scoring

function. A contact map, C, is an N × N matrix where C_{ij} = 1 if the β-carbons (Cα for glycine) of residues i and j are separated by less than 8 Å, and 0 otherwise. The n (where n=200) fragments

with the highest scores, S (Eq. 2), were used as seed alignments for the tree search. At each

branch point, the parent alignment

y was extended using fragment x if and only if:

(1) no residue in

x is already aligned,

(2) there is at least one conserved contact between fragment x and a residue in y,

(3) distance geometric constraints are not violated, meaning Distance(i, j) < 3.8 × | l–k |,

and Distance(k, l) < 3.8 × | j–i |, for all positions i aligned to l, and j aligned to k,

(4) the resulting alignment has one of the top n scores (NS, as defined in Eq. 6).

The top n scoring alignments (parents and children) became the parent alignments of a new search, and the process was repeated until no new fragments could be added.
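Two ingredients of the tree search, the contact map and the distance geometric constraint in criterion (3), can be sketched as follows. The sketch assumes coordinates are given as (x, y, z) tuples in Å, one per residue, and that an alignment is a list of (target index, template index) pairs. Both function names are hypothetical.

```python
import math


def contact_map(coords, cutoff=8.0):
    """C[i][j] = 1 if residues i and j are closer than `cutoff` angstroms, else 0."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


def satisfies_distance_geometry(pairs, target_xyz, template_xyz, bond=3.8):
    """Check criterion (3) for every two aligned pairs (i, l) and (j, k):
    the distance in one structure must be less than 3.8 times the sequence
    separation of the equivalent residues in the other structure."""
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            (i, l), (j, k) = pairs[a], pairs[b]
            if math.dist(target_xyz[i], target_xyz[j]) >= bond * abs(l - k):
                return False
            if math.dist(template_xyz[k], template_xyz[l]) >= bond * abs(j - i):
                return False
    return True
```

The factor 3.8 corresponds to the approximate distance in Å between consecutive Cα atoms, so the test rejects pairings that would require a chain segment to span more space than its length allows.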

The similarity between two contact maps is more sensitive than the global RMSD when

comparing distantly related proteins (Yang, et al., 1999), since conformational plasticity can

result in a high overall RMSD even when most of the pairwise contacts are conserved. The

contact score, CS, is the sum over the dot products of the contact maps for all aligned segments:

T = \sum (C_{ij} \times C_{kl})    (3)

F = \sum \left( (1 - C_{ij}) \times C_{kl} + C_{ij} \times (1 - C_{kl}) \right)    (4)

CS = T – NCpenalty × F    (5)

Here, C_{ij} is the contact property at position i, j in the target, and C_{kl} is the contact property at position k, l in the template, where i is aligned to l, and j is aligned to k. T counts the conserved contacts. F, the number of false positive and false negative contacts, is penalized by the non-contact penalty (NCpenalty).
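The contact score of Eqs. 3 to 5 can be sketched as below. The text does not spell out the exact summation range, so this sketch assumes the sums run once over every unordered pair of aligned positions. That choice, like the function name, is an assumption. The default penalty is the value reported later in the Methods.

```python
def contact_score(pairs, c_target, c_template, nc_penalty=0.15):
    """CS = T - NCpenalty * F over all unordered pairs of aligned positions.

    pairs      : list of (target index, template index) aligned positions
    c_target   : contact map of the target (0/1 matrix)
    c_template : contact map of the template (0/1 matrix)
    """
    t = 0  # contacts present in both maps (Eq. 3)
    f = 0  # contacts present in exactly one map (Eq. 4)
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            (i, l), (j, k) = pairs[a], pairs[b]
            ct = c_target[i][j]
            cp = c_template[k][l]
            t += ct * cp
            f += ct * (1 - cp) + (1 - ct) * cp
    return t - nc_penalty * f
```

With the small penalty of 0.15, roughly seven mismatched contacts are needed to cancel one conserved contact.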

The score NS is calculated as follows:

NS = CS – NSpenalty × Nb , (6)

where NSpenalty is a constant for non-sequential penalty, and Nb is defined as the number of

breaks needed to convert the alignment into sequential order. For example, if we have the

alignment with the blocks labeled as ACBD, where each letter represents a sequentially aligned

segment, to reorder the blocks sequentially to ABCD, three breaks are required, therefore Nb=3.

For the arrangement CDAB, a circular permutation, Nb=1. The optimal settings for NSpenalty

and NCpenalty were determined empirically by reproducing manually constructed alignments.
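The break count Nb is defined above only through its two examples. One reading consistent with both is to count the adjacent blocks that do not follow each other in sequential order. The sketch below uses that reading, which is an assumption, and represents each block by its rank in the sequential order (A = 0, B = 1, and so on).

```python
def count_breaks(block_order):
    """Count adjacent block pairs (a, b) in which b does not directly follow a."""
    return sum(1 for a, b in zip(block_order, block_order[1:]) if b != a + 1)


def ns_score(cs, block_order, ns_penalty=6.0):
    """NS = CS - NSpenalty * Nb (Eq. 6), with the penalty reported in the Methods."""
    return cs - ns_penalty * count_breaks(block_order)


acbd = [0, 2, 1, 3]  # blocks in the order A, C, B, D
cdab = [2, 3, 0, 1]  # blocks in the order C, D, A, B (circular permutation)
```

Under this reading `count_breaks(acbd)` is 3 and `count_breaks(cdab)` is 1, matching the two examples in the text, and a fully sequential alignment incurs no penalty.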

The final step in generating the alignment is pruning and extension, based on global

RMSD. Occasionally SCALI aligned fragments that conserved a pattern of contacts but the two

structures were mirror images of each other. 3D superposition followed by pruning eliminated

these types of errors. After the superposition of structures, an aligned block was removed if it:

(1) had any distance difference greater than 9Å, or

(2) had any backbone angle difference greater than 100˚.

Similarly, an aligned block was extended on either end if the extension:

(1) had no backbone angle difference greater than 100˚, and

(2) had no distance difference greater than 9Å.

The larger cutoff in distance allowed for distorted packing arrangements. Pruning and extension

were applied iteratively as long as the RMSD and aligned length continued to improve.

Empirical values for the permutation penalty (NSpenalty) and non-contact penalty

(NCpenalty) were determined by attempting to reproduce the manual non-sequential alignments

for two study cases: 1fsfA versus 1ig0A, and 1jx6A versus 1qo7A. The two manual alignments

were made by inspection using the molecular modeling program InsightII (Accelrys, Inc.).

Different parameter settings were assessed by inspecting the automated alignments and

comparing them with the manual ones. The SCALI result for a difficult sequential alignment

between remote homologs (sequence identity of 4.5%), 1rec_ versus 1eg4A, was also inspected.

The final settings were NCpenalty = 0.15 and NSpenalty = 6.0.

Many of the automatically generated alignments (available from the website) were

inspected in order to validate the method. In particular, we looked for the types of errors as

described in the Results and

in Table 1. No further changes were made to the algorithm once the

validation was undertaken.

DALI, CE, SARF alignments

CE and SARF programs were downloaded and the alignments were generated locally.

DALI alignments were obtained from the server (www.ebi.ac.uk/dali/Interactive.html). For each

program, the default settings were used. We were unable to run the K2 program in-house for

technical reasons. Programs were written to test each alignment for specific types of errors, as

described in the Results.

Comparison and evaluation of the alignments from different methods

A comparison of various structural alignment methods with SCALI was performed using

a reference set of 111 alignments derived from CATH. The alignments of SCALI were compared

to those from DALI, CE and SARF, which were generated as described above. The alignment

results are summarized in

Table 1. While the alignments were judged by visual inspections, an

evaluation was also undertaken using a figure-of-merit (FOM) scoring function denoted as:

FOM = w_1(Len) + w_2(RMSD − 3.5) + w_3(Nonlocal) + w_4(Disj) + w_5(Berr)    (7)

Here, FOM is computed as the scaled sum of the five criteria, which are represented as Len, RMSD, Nonlocal, Disj and Berr in Eq. 7. Each criterion is assigned a weight (w_1 to w_5). Len is the number of locally correct aligned residues in the alignment, where the positions having backbone angle deviations greater than 120° were ignored. RMSD is the root mean square deviation of the aligned Cα coordinates. Nonlocal is the percentage of the aligned residues that were locally non-equivalent, having backbone angle deviations greater than 120°. Disj is the number of disjoint segments in the alignment, and Berr is the number of misaligned beta strands (as described in Table 1-(2)). The weights were chosen so as to roughly equalize the contribution of each factor: w_1 = -0.2, w_2 = 2.0, w_3 = 15, w_4 = 4, and w_5 = 8. A smaller FOM is better.
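A minimal sketch of the figure of merit follows, using the weights listed above. It assumes that the RMSD term enters as (RMSD − 3.5) and that Nonlocal is expressed as a fraction between 0 and 1 rather than a number out of 100. Both are readings of the text rather than confirmed details, and the function and argument names are hypothetical.

```python
WEIGHTS = {"len": -0.2, "rmsd": 2.0, "nonlocal": 15.0, "disj": 4.0, "berr": 8.0}


def figure_of_merit(length, rmsd, nonlocal_frac, disjoint, beta_errors, w=WEIGHTS):
    """Weighted sum of the five alignment criteria; a smaller value is better.

    length        : number of locally correct aligned residues (Len)
    rmsd          : RMSD of aligned C-alpha atoms in angstroms
    nonlocal_frac : fraction of aligned residues that are locally non-equivalent
    disjoint      : number of disjoint segments (Disj)
    beta_errors   : number of misaligned beta strands (Berr)
    """
    return (
        w["len"] * length
        + w["rmsd"] * (rmsd - 3.5)
        + w["nonlocal"] * nonlocal_frac
        + w["disj"] * disjoint
        + w["berr"] * beta_errors
    )
```

The negative weight on Len means that longer correct alignments lower the score, while every error term raises it.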

The one-tailed paired Wilcoxon's sum of signed ranks test (Sokal, et al., 1973) was calculated to measure the significance of the difference in FOM between the four alignment methods (CE, DALI, SARF and SCALI). The differences in FOM scores for the two methods being compared were ranked based on their absolute values, and the positive and negative ranks were summed separately. The smaller sum (T_s) was used to calculate a Z-score (Eq. 8, n=111).

Z = \frac{\left| T_s - n(n+1)/4 \right|}{\sqrt{n(2n+1)(n+1)/24}}    (8)

The calculated Z was compared to tabulated critical values in (Sokal, et al., 1973) to

obtain a p-value, or significance level. Lower is more significant. The DALI method aligned

only 76 cases, so for comparisons to DALI, n=76 and only those alignments were used.
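The signed-rank Z-score can be sketched as below using the standard normal approximation, with mean n(n+1)/4 and variance n(n+1)(2n+1)/24. Dropping zero differences and giving tied absolute values their average rank are common conventions assumed here. The text does not say how such cases were handled.

```python
import math


def wilcoxon_z(diffs):
    """Z-score of the Wilcoxon signed-rank test under the normal approximation."""
    d = [x for x in diffs if x != 0]  # assumption: zero differences are dropped
    n = len(d)
    order = sorted(range(n), key=lambda idx: abs(d[idx]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Extend over a group of tied absolute values.
        while end + 1 < n and abs(d[order[end + 1]]) == abs(d[order[pos]]):
            end += 1
        average_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for t in range(pos, end + 1):
            ranks[order[t]] = average_rank
        pos = end + 1
    t_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    t_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    t_s = min(t_plus, t_minus)  # the smaller rank sum
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return abs(t_s - mean) / sd
```

Here `diffs` would hold one FOM difference per structure pair for the two methods being compared.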

Clustering multiple SCALI structural alignments

Pairwise SCALI non-sequential alignments were clustered using a simple greedy

algorithm. The set of all pairwise alignments defines a graph where each vertex represents one

protein, and an edge exists if the alignment between the two structures had RMSD ≤4.0Å and at

least 50 residues aligned. The first cluster was the vertex with the most edges and all of its

connected vertices. The second cluster was the vertex with the most edges and all of the

connected vertices after removing the first cluster, and so on.
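The greedy clustering can be sketched as follows. Alignments are assumed to be given as (protein a, protein b, RMSD, aligned length) tuples, and ties between equally connected vertices are broken alphabetically. Both are choices made for this illustration, and the function names are hypothetical.

```python
def build_edges(alignments, max_rmsd=4.0, min_aligned=50):
    """Keep pairs whose alignment has RMSD <= max_rmsd and enough aligned residues."""
    return [
        (a, b)
        for a, b, rmsd, n_aligned in alignments
        if rmsd <= max_rmsd and n_aligned >= min_aligned
    ]


def greedy_clusters(vertices, edges):
    """Repeatedly take the most connected unassigned vertex and its neighbours."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    remaining = set(vertices)
    clusters = []
    while remaining:
        # Degree is counted only among vertices that are still unassigned.
        center = max(sorted(remaining), key=lambda v: len(neighbours[v] & remaining))
        members = {center} | (neighbours[center] & remaining)
        clusters.append(sorted(members))
        remaining -= members
    return clusters
```

Vertices left without any edge end up as single-member clusters, which corresponds to proteins with a unique core packing arrangement.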

Theoretically, this simple clustering method could group together different structures by

transitive association (Structure A is similar to B, and B is similar to C, but A is not similar to C).

Surprisingly this did not happen. Instead, alignments within a cluster conserved the same spatial

location. We should note again here that the structures in the CATH database are single domains,

and therefore we did not expect to find multiple cores within one structure. Also, domain folding

is usually an all-or-none, two-state phenomenon, and this may explain why we did not observe

transitive association.

Hidden Markov models based on SCALI multiple structure alignments

A hidden Markov state was defined for each column in the multiple structure alignment

after clustering of our alignments. The state sequence profiles were initialized by summing the

profiles of the aligned positions. The aligned proteins were representatives of the CATH

topologies. Each protein was given equal weight in the summation. Any non-aligned residues

were condensed to a single “Loop” state that connects the aligned states. The Loop states emit

sequences whose length was drawn from a probability distribution. The probability distribution

may be flat, allowing any size loop with equal probability.

State-state transitions were defined according to the sequential ordering of the states in

each member protein. In many cases, since the alignments were non-sequential, cyclic state paths

were possible. These paths are not physically meaningful, since they would imply that two

residues can occupy the same position in space. Therefore, “self-avoiding” states were defined as

Markov states that could be visited at most once in any state pathway. A modified Forward-Backward algorithm (Rabiner, 1989) that handles self-avoiding states, which is needed to obtain correct results on our newly defined HMMs, is under development.
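The self-avoiding requirement can be illustrated by enumerating state paths that visit each state at most once. This sketch is not the modified Forward-Backward algorithm mentioned above. It only shows which pathways remain allowed, on a hypothetical four-state transition graph, and its running time can grow exponentially with the number of states.

```python
def self_avoiding_paths(transitions, start, end):
    """Return every path from `start` to `end` that never revisits a state."""
    paths = []

    def extend(path, visited):
        state = path[-1]
        if state == end:
            paths.append(list(path))
            return
        for nxt in transitions.get(state, ()):
            if nxt not in visited:  # self-avoiding: skip states already used
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    extend([start], {start})
    return paths


# Hypothetical model: two core states "a" and "b" that may occur in either order.
toy_transitions = {"N": ["a", "b"], "a": ["b", "C"], "b": ["a", "C"]}
toy_paths = self_avoiding_paths(toy_transitions, "N", "C")
```

In the toy graph the cycle a → b → a is never followed, so only four pathways from N to C are produced.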

In the figures, the Markov states for aligned positions are grouped into single icons

representing secondary structures, according to the TOPS convention (Westhead, et al., 1999).

Loop states are not drawn, but would occur on each of the arrows.

Information content of HMM states

The information content is defined as the likelihood of obtaining a similar distribution of

polar and non-polar residues by chance given the number of observations. To estimate this

likelihood, we ran 5000 simulations for each Markov state. We randomly chose amino acids

from the background distribution N times, where N was the number of observations. The p-value

for non-polar was calculated as the fraction of the 5000 randomly-generated profiles where the

percent non-polar matched or exceeded the percent non-polar of the observed data. If the Markov

state represented a polar position, then a polar p-value was calculated using a similar method.
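The simulation behind the non-polar p-value can be sketched as below. As a simplification, each random draw is reduced to a single polar or non-polar outcome with the background non-polar frequency, instead of sampling one of the twenty amino acids. The function name, argument names and fixed seed are assumptions for illustration.

```python
import random


def nonpolar_p_value(observed_fraction, n_obs, background_nonpolar, trials=5000, seed=0):
    """Fraction of random profiles whose non-polar fraction matches or exceeds
    the observed one, with n_obs residues drawn from the background each time."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(trials):
        count = sum(1 for _ in range(n_obs) if rng.random() < background_nonpolar)
        if count / n_obs >= observed_fraction:
            hits += 1
    return hits / trials
```

A position observed as almost entirely non-polar in many aligned proteins gives a small p-value, which the text interprets as high information content.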

Run-time complexity, implementation and availability

The alignment algorithm was implemented in Fortran90. The run time complexity for the

main alignment program is O(min(L1, L2) × L1 × L2), where L1 and L2 are the lengths of the

target and template protein, respectively. The typical run time for proteins of length 250 on one

700MHz Pentium3 CPU was about 15 minutes. A searchable database of pre-calculated

alignments may be found at http://www.bioinfo.rpi.edu/~bystrc/scali.html. Development of an

installation package is in progress and will appear at the same site.

RESULTS

Validation of structure-based alignments

To assess its ability to reproduce state-of-the-art sequential structure-based alignments,

SCALI was tested on a set of 120 pairs of distant structural homologs, defined as members of the

same topology class in the CATH database but having less than 25% sequence identity. The

alignments were compared with those from CE and DALI programs. All three methods produced

similar aligned sub-structures. However, in CE and DALI alignments there were segments that

should not have been aligned by the intuitive criteria defined above. Specifically, aligned

residues sometimes lacked local structural similarity and/or the aligned region was not compact.

To assess its ability to find non-sequential alignments, SCALI was compared with DALI,

CE and SARF on a set of topologically different structures. One example is the alignment of

structures 1fsfA and 1ig0A, shown in

Figure 1. These two proteins were aligned manually first,

and the manual alignment was used to develop the method. Both proteins contain parallel six-

stranded beta sheets with five alpha helices arranged anti-parallel to the strands, two on one side

and three on the other. The sheet continues in both cases, but in 1ig0A the seventh strand is anti-

parallel. The seven strands appear in order 1765234 in 1fsfA and 4321567 in 1ig0A (i.e. Strand 1

in 1fsfA is the structural equivalent of strand 4 in 1ig0A, and so on). In our manual alignment,

the six parallel strands and the five helices could be aligned with one circular permutation.

SCALI reproduced this alignment, superimposing the eleven secondary structure units with RMSD of 5.4Å (Figure 1a).

Both CE and DALI produced sequential alignments with the beta sheets in the flipped orientation. CE aligned strands 4325 in 1fsfA with strands 4325 in 1ig0A, aligning unpaired strands to paired strands (Figure 1b). Strand 1, which is between strands 2 and 5 in 1ig0A, was left un-aligned (green-colored in Figure 1b). DALI aligned strands 43257 in 1fsfA to strands 32157 in 1ig0A, but one strand is skipped and the two strand 7’s point in opposite directions (Figure 1c). Both CE and DALI alignments contained additional alignments of non-equivalent secondary structures. SARF was able to find the correct six stranded sheet alignment but did not align three of the helices (Figure 1d). The alignment is disjoint and several aligned segments are locally different. Two segments are aligned in reverse. Some of the beta strand alignments are offset by one residue (see Figure 1 caption).

To further examine the ability of the automated SCALI method to align different topologies, and the quality of the resulting alignments, a set of 111 pairs of proteins with different topologies but similar core structures was used. This list includes randomly selected proteins that share the same

architecture but differ in topology, according to CATH classification (Orengo, et al.,

1997;Orengo, 1994;Pearl, et al., 2003;Pearl, et al., 2000). All of the pairwise alignments returned

from CE, DALI, SARF and SCALI were compared to each other (

Table 1) without manual

curation. CE returned the alignments for all of the structure pairs, whereas DALI returned only

76 alignments, with 35 pairs rejected as non-superimposable. As expected, neither CE nor DALI

returned non-sequential alignments, since they were not designed to do so. SARF returned

alignments, including non-sequential and reverse alignments, for all of the test cases. In the

alignments, three specific types of errors were observed. These are defined and tabulated in

Table 1.

Among the returned alignments, none of the CE, DALI and SARF methods returned error-free alignments. Both types of strand pairing misalignment (“unpaired strand” and “cross-aligned strand” in Table 1) occurred in CE and DALI, and are similar to the strand pairing error shown in Figure 1b. SARF also made strand pairing errors, and one example is shown in Figure 2. In this alignment, one beta strand in 1cbf is aligned to two beta strands (in green and pink) and one beta strand in 2ts1 is aligned to two strands (in blue and pink), which results in non-equivalent paired hydrogen bonding among the aligned strands. The alignments returned from

SCALI did not contain strand pairing errors due to its algorithmic design, which requires

conserved contacts among the aligned segments. Compared with CE, the better performance of

DALI on these difficult cases seems to be due to its ability to decide when to align and when not

to. It will fail to return the alignment if a sequential structural comparison is too difficult, while

CE will return the alignment anyway, even though it may contain many errors. Both types of

strand misalignment result in a parallel displacement of one strand and cause only a minor

increase in the RMSD.

SARF often produced a subset or superset of the SCALI alignment, but SARF aligned

segments in reverse and allowed non-equivalent local structures to align. SCALI does not allow

segments to be aligned in reverse. SARF alignments usually had disjoint pieces (41 out of 111).

SARF alignments were often displaced by one residue from SCALI alignments. This would

change the direction of the side chain in beta strands. 91 out of 111 of our alignments were

compact and contained no obvious errors.

To further evaluate the quality of the 111 alignments generated from various structural

alignment methods, the figure of merit (FOM) was defined and computed (Eq. 7 in Methods). In

principle, the FOM should capture the quality of the structural alignment by rewarding correct

spatial alignments and penalizing the errors (defined in the

Table 1 caption). The FOM scoring

function gives approximately equal penalties to the local non-equivalence errors and beta strand

misalignment errors, and less penalty to the disjoint alignment errors. Using the FOM scores to

evaluate the quality of each alignment method, SCALI performed the best on 81 out of 111 non-

sequential alignments. If we ignore the 35 cases where DALI failed to align the structures, the

best among the four methods is SCALI, followed by SARF, DALI and CE. The Wilcoxon’s sum

of signed ranks test was performed for all possible six paired comparisons and the results show

that the differences between methods are statistically significant at the level of 0.1%. We should

note that the choice of topologically different protein pairs favored SARF and SCALI over the

other two methods.

Cluster analysis

By clustering pairwise SCALI alignments, we obtained non-sequential multiple structure

alignments for several CATH architectures, including the “up-down α bundle”, “β sandwich”, “β

roll”, “3-layer αβα sandwich” proteins, and others. As an example, we chose the 3-layer αβα sandwich (CATH code 3.40) for the all-against-all comparisons. This protein architecture is

the most common and diverse, comprising 61 different topologies (Orengo, et al., 1997;Orengo,

1994;Pearl, et al., 2003;Pearl, et al., 2000). After clustering all 1830 alignments, 56 out of the 61

structures were divided into four subclasses (

Figure 3), each with a conserved core packing

arrangement. The other five proteins (1b0pA, 1adn, 1div, 1inp, 1qhkA) had unique core packing

arrangements. Proteins within a cluster conserved at least 50 residues in a compact region that

aligned with RMSD less than 4.0Å. This cutoff produced some false negatives but very few false

positives. Proteins within a cluster that fell below this significance test were still found to

conserve the recurrent core, albeit more distorted.

The clustered alignments may be modeled as HMMs, where each aligned segment is a

state and the variable sequential connections between the segments define the state-state

transitions.

Figure 4 shows diagrammatic HMMs for each of the αβα clusters. In each model,

some topological connections between the sub-structures are observed and others are not,

probably reflecting the physical constraints on secondary structure packing (Honig, 1999). There

are often compact subsets of connections that dominate, consistent with the previous argument

that certain motifs, described as “attractors”, occur as the core of a protein’s structure more

frequently than others (Holm, et al., 1996). An example is the right-handed parallel βαβ motif.

In each cluster, all observed topologies are represented as pathways through the HMM.

Based on these models, certain pathways exist that might represent proteins that have not been

observed in crystal structures. For example, we may predict that the topology shown in

Figure 5

is possible, based on the HMM for subclass-A in Figure 4a. However, this topology has not yet

been observed and would be considered as a novel fold if found.

When we analyzed the sequence information per position from the multiple structure

alignments, we could clearly see a concentration of sequence information in the core positions. If

we define the p-value as the likelihood of obtaining a similar distribution of polar and non-polar

residues by chance given the number of observations (as described in Methods), all the high-information content positions (low p-value) tend to be deeply buried in the core. One example of such a result is shown in Figure 6, which shows the sequence information content for CATH architecture 3.40 subclass A (as in Figure 4a).

Multiple SCALI alignments have been carried out for up-down bundle α proteins, sandwich β proteins, and 3-layer (αβα) and roll proteins in the CATH database. The resulting models are shown in Figures 7 and 8. A full analysis of these conserved core packing arrangements is ongoing.

Discussion

SCALI alignments are comparable to CE and DALI methods for comparing proteins that

share the same topology, better if we agree that structure-based alignments should be compact

and that aligned pieces should be locally similar in their backbone angles. In the cases where

proteins share only a core packing arrangement but with different topologies, SCALI is able to

find the proper structural equivalences, while previous methods fail, either because they assume

a sequential ordering (DALI, CE) or because they do not enforce compactness, local equivalence,

and sequence similarity (SARF).

Multiple non-sequential alignments from SCALI have been used to construct non-linear

profile HMMs, similar to the way profile hidden Markov models have been constructed to model

protein families and superfamilies (Eddy, 1998; Karplus, et al., 1998; Gough and Chothia, 2002), and

these may be useful in predicting structures that have a recurrent core packing arrangement.

Cluster analysis of our non-sequential alignments shows that some core packing arrangements

have occurred dozens of times, each with a different topology. It is therefore reasonable to

assume that there are many more permutants among the unsolved proteins.

Classification of cores

There are two widely cited classification schemes for protein domain structures, CATH

(Orengo, 1994; Orengo, et al., 1997; Pearl, et al., 2000; Pearl, et al., 2003) and SCOP (Murzin, et

al., 1995). Both have a top-down hierarchy, starting from classes based on secondary structure

content, then gross arrangement of secondary structure units, and then classes based on the

topological connections between those units. At this level (“topology” in CATH or “fold” in

SCOP) we expect structures to superimpose sequentially. The recurrent packing motifs discussed

above represent a structure classification scheme that is more specific than "architecture" but not

as specific as "topology".

A new, intermediate classification level based on non-sequential multiple alignments may

help us understand the universe of protein folds. We may call these “core” types, and apply

codified names to each. For example, the model described in Figure 4a may be termed

unambiguously as a “3-layer 2α(all down)-5β(all up)-2α(all down)” core. Figure 4b may be

termed unambiguously as a “3-layer 1α(up)-4β(2 up, 2 down)-2α(all down).” A numerical

representation may be substituted for easy searching, such as “a.00/b.11111/a.00” for Figure 4a,

with /’s separating the structural layers and binary digits indicating the number and orientation of

the secondary structures. However, some domains may not lend themselves easily to the

“layered” notation.
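A short sketch shows how such a numerical code could be read back into a descriptive name. The conventions used here are inferred from the single example given for Figure 4a and should be treated as assumptions: "a" denotes an α-helix layer, "b" a β-strand layer, each binary digit is one secondary structure element, 1 means up and 0 means down. The function names are hypothetical.

```python
def parse_core_code(code):
    """Parse a layered core code such as 'a.00/b.11111/a.00'.

    Assumed convention: 'a' = alpha-helix layer, 'b' = beta-strand layer;
    each binary digit is one element, 1 = up and 0 = down.
    """
    kinds = {"a": "α", "b": "β"}
    layers = []
    for layer in code.split("/"):
        kind, _, bits = layer.partition(".")
        if kind not in kinds or not bits or set(bits) - {"0", "1"}:
            raise ValueError(f"malformed layer: {layer!r}")
        layers.append(
            {"type": kinds[kind], "up": bits.count("1"), "down": bits.count("0")}
        )
    return layers

def describe(code):
    """Render a core code as a human-readable layered name."""
    layers = parse_core_code(code)
    parts = [
        f"{l['up'] + l['down']}{l['type']}({l['up']} up, {l['down']} down)"
        for l in layers
    ]
    return f"{len(layers)}-layer " + "-".join(parts)

print(describe("a.00/b.11111/a.00"))
```

Because the code is a plain string with a fixed grammar, cores could be indexed and searched with ordinary string matching, which is the practical appeal of the numerical form.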

Can we make predictive models from non-sequential alignments?

The highest degree of sequence identity that we found in the SCALI non-sequential

alignments at the architecture level in the CATH database was only 12%. We cannot completely exclude

the possibility of a common ancestor, but it is more likely that these core similarities are the

result of convergent evolution, where energetic stability was the selection pressure. A conserved

packing arrangement of secondary structures should energetically favor some sequence patterns,

and this idea is supported by the results shown in Figure 6. Conserved secondary structure and

3D packing environment do appear to define conserved sequence patterns, at least binary

(polar/non-polar) patterns.

We have observed certain core packing arrangements multiple times with different

topologies. If these cores are recurrent themes in nature, then we might expect to see some future

“new folds” fall into these same classes. That is, “new folds” may be permuted “old folds”. For

example, the hypothetical protein yjiA from E. coli (PDB code 1NIJ) solved in 2002 (Khil et al.)

was found to be a “new fold” according to CASP5 (Moult, et al., 2003). It was a new type of

alpha-beta protein consisting of a single mixed β-sheet with strand order 15234, where strands 3

and 5 are anti-parallel to the others (Aloy, et al., 2003). We have found a cluster of proteins that

have the same core packing arrangement, all solved before 2002, and among the possible

topologies given the HMM (Figure 8a) was the topology of the new protein 1NIJ (Figure 9).

Since core alignments conserve sequence information, and cores are often recurrent,

self-avoiding HMMs based on SCALI alignments have the potential for predicting the core

structure of topologically novel proteins based on the sequence alone.

ACKNOWLEDGEMENTS

This research was supported by NSF grants EIA-0229454 and 0343206.

REFERENCES

Abagyan, R.A. and Maiorov, V.N. (1989) An automatic search for similar spatial arrangements of alpha-helices and beta-strands in globular proteins, J Biomol Struct Dyn, 6, 1045-60.

Alexandrov, N.N. (1996) SARFing the PDB, Protein Eng, 9, 727-32.

Alexandrov, N.N. and Fischer, D. (1996) Analysis of topological and nontopological structural similarities in the PDB: new examples with old structures, Proteins, 25, 354-65.

Aloy, P., Stark, A., Hadley, C. and Russell, R.B. (2003) Predictions without templates: new folds, secondary structure, and contacts in CASP5, Proteins, 53 Suppl 6, 436-56.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res, 25, 3389-402.

Bennett, M.J., Choe, S. and Eisenberg, D. (1994) Domain swapping: entangling alliances between proteins, Proc Natl Acad Sci U S A, 91, 3127-31.

Bernstein, H.J. (2000) Recent changes to RasMol, recombining the variants, Trends Biochem Sci, 25, 453-5.

Bystroff, C. and Baker, D. (1998) Prediction of local structure in proteins using a library of sequence-structure motifs, J Mol Biol, 281, 565-77.

Bystroff, C., Thorsson, V. and Baker, D. (2000) HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins, J Mol Biol, 301, 173-90.

Eddy, S.R. (1998) Profile hidden Markov models, Bioinformatics, 14, 755-63.

Efimov, A.V. (1995) Structural similarity between two-layer alpha/beta and beta-proteins, J Mol Biol, 245, 402-15.

Flores, T.P., Orengo, C.A., Moss, D.S. and Thornton, J.M. (1993) Comparison of conformational characteristics in structurally similar protein pairs, Protein Sci, 2, 1811-26.

Gibrat, J.F., Madej, T. and Bryant, S.H. (1996) Surprising similarities in structure comparison, Curr Opin Struct Biol, 6, 377-85.

Gong, W., O'Gara, M., Blumenthal, R.M. and Cheng, X. (1997) Structure of PvuII DNA-(cytosine N4) methyltransferase, an example of domain permutation and protein fold assignment, Nucleic Acids Res, 25, 2702-15.

Gough, J. and Chothia, C. (2002) SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments, Nucleic Acids Res, 30, 268-72.

Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices, J Mol Biol, 233, 123-38.

Holm, L. and Sander, C. (1996) Mapping the protein universe, Science, 273, 595-603.

Honig, B. (1999) Protein folding: from the Levinthal paradox to structure prediction, J Mol Biol, 293, 283-93.

Hou, Y., Hsu, W., Lee, M.L. and Bystroff, C. (2003) Efficient remote homology detection using local structure, Bioinformatics, 19, 2294-301.

Iwakura, M., Nakamura, T., Yamane, C. and Maki, K. (2000) Systematic circular permutation of an entire protein reveals essential folding elements, Nat Struct Biol, 7, 580-5.

Janowski, R., Kozak, M., Jankowska, E., Grzonka, Z., Grubb, A., Abrahamson, M. and Jaskolski, M. (2001) Human cystatin C, an amyloidogenic protein, dimerizes through three-dimensional domain swapping, Nat Struct Biol, 8, 316-20.

Jeltsch, A. (1999) Circular permutations in the molecular evolution of DNA methyltransferases, J Mol Evol, 49, 161-4.

Jung, J. and Lee, B. (2001) Circularly permuted proteins in the protein structure database, Protein Sci, 10, 1881-6.

Karplus, K., Barrett, C. and Hughey, R. (1998) Hidden Markov models for detecting remote protein homologies, Bioinformatics, 14, 846-56.

Khil, P.P., Obmolova, G., Teplyakov, A., Howard, A.J., Gilliland, G.L. and Camerini-Otero, R.D. Crystal structure of the YjiA protein from E. coli, to be published.

Koehl, P. (2001) Protein structure similarities, Curr Opin Struct Biol, 11, 348-53.

Milik, M., Szalma, S. and Olszewski, K.A. (2003) Common Structural Cliques: a tool for protein structure and function analysis, Protein Eng, 16, 543-52.

Moult, J., Fidelis, K., Zemla, A. and Hubbard, T. (2003) Critical assessment of methods of protein structure prediction (CASP)-round V, Proteins, 53 Suppl 6, 334-9.

Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures, J Mol Biol, 247, 536-40.

Orengo, C.A. (1994) Classification of protein folds, Curr Opin Struct Biol, 4, 429-440.

Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B. and Thornton, J.M. (1997) CATH--a hierarchic classification of protein domain structures, Structure, 5, 1093-108.

Ortiz, A.R., Strauss, C.E. and Olmea, O. (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison, Protein Sci, 11, 2606-21.

Pearl, F.M., Lee, D., Bray, J.E., Sillitoe, I., Todd, A.E., Harrison, A.P., Thornton, J.M. and Orengo, C.A. (2000) Assigning genomic sequences to CATH, Nucleic Acids Res, 28, 277-82.

Pearl, F.M., Bennett, C.F., Bray, J.E., Harrison, A.P., Martin, N., Shepherd, A., Sillitoe, I., Thornton, J. and Orengo, C.A. (2003) The CATH database: an extended protein family resource for structural and functional genomics, Nucleic Acids Res, 31, 452-5.

Rabiner, L.R. (1989) A tutorial on Hidden Markov Models and selected applications in speech recognition, Proc. IEEE, 77, 257-286.

Rost, B. (1997) Protein structures sustain evolutionary drift, Fold Des, 2, S19-24.

Sayle, R.A. and Milner-White, E.J. (1995) RASMOL: biomolecular graphics for all, Trends Biochem Sci, 20, 374.

Schiering, N., Casale, E., Caccia, P., Giordano, P. and Battistini, C. (2000) Dimer formation through domain swapping in the crystal structure of the Grb2-SH2-Ac-pYVNV complex, Biochemistry, 39, 13376-82.

Shao, Y. and Bystroff, C. (2003) Predicting interresidue contacts using templates and pathways, Proteins, 53 Suppl 6, 497-502.

Shindyalov, I.N. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path, Protein Eng, 11, 739-47.

Smith, V.F. and Matthews, C.R. (2001) Testing the role of chain connectivity on the stability and structure of dihydrofolate reductase from E. coli: fragment complementation and circular permutation reveal stable, alternatively folded forms, Protein Sci, 10, 116-28.

Sokal, R.R. and Rohlf, F.J. (1973) Introduction to Biostatistics, W.H. Freeman and Company, San Francisco, pp. 220-222.

Szustakowski, J.D. and Weng, Z. (2000) Protein structure alignment using a genetic algorithm, Proteins, 38, 428-40.

Taylor, W.R. and Orengo, C.A. (1989) Protein structure alignment, J Mol Biol, 208, 1-22.

Viguera, A.R., Blanco, F.J. and Serrano, L. (1995) The order of secondary structure elements does not determine the structure of a protein but does affect its folding kinetics, J Mol Biol, 247, 670-81.

Westhead, D.R., Slidel, T.W., Flores, T.P. and Thornton, J.M. (1999) Protein structural topology: Automated analysis and diagrammatic representation, Protein Sci, 8, 897-904.

Yang, A.S. and Honig, B. (1999) Sequence to structure alignment in comparative modeling using PrISM, Proteins, Suppl 3, 66-72.

Yang, A.S. and Honig, B. (2000) An integrated approach to the analysis and modeling of protein sequences and structures. I. Protein structural alignment and a quantitative measure for protein structural distance, J Mol Biol, 301, 665-78.

Figure and Table Captions

Figure 1. Structural comparison between Escherichia coli glucosamine-6-phosphate deaminase (PDB code 1fsfA, 266 residues) and yeast thiamin pyrophosphokinase (PDB code 1ig0A, 319 residues), using four methods. Each figure shows only the aligned residues, red for 1fsfA, yellow for 1ig0A. Below each figure is a topology (TOPS) cartoon, with strands as triangles and helices as circles. The diagrams are oriented roughly as the proteins are aligned, with the aligned segments shaded. For simplicity, only the secondary structure units that are in common between the two proteins are shown. The alignments may contain additional small aligned fragments that are not included in the TOPS diagrams.

(a) SCALI alignment, 104 aligned residues, RMSD = 5.4 Å, one permutation. Aligned segments (1fsfA/1ig0A): 1-9/114-122, 10-24/125-139, 33-40/182-189, 43-58/197-212, 63-70/213-220, 133-140/38-45, 190-195/53-58, 200-218/64-82, 237-241/93-97, 247-256/101-110.

(b) CE alignment, 111 aligned residues, RMSD = 5.1 Å. Aligned segments (1fsfA/1ig0A): 14-28/49-63, 35-43/64-72, 46-64/73-91, 67-74/92-99, 85-86/100-101, 89-102/102-115, 104-110/116-122, 115-116/123-124, 119/125, 120-130/129-139, 131-146/185-200, 150-156/201-207.

(c) DALI alignment. Aligned segments (1fsfA/1ig0A): 23-27/32-36, 33-42/37-46, 45-53/48-56, 66-72/63-69, 85-93/71-79, 95-100/85-90, 104-110/92-98, 111-116/119-124, 119-129/128-138, 132-139/186-193, 145-148/194-197, 189-192/202-205, 196-203/217-224, 220-223/225-228, 230-233/303-306, 237-240/308-311.

(d) SARF alignment, 105 aligned residues, RMSD = 2.9 Å. Aligned segments (1fsfA/1ig0A): 1-7/114-120, 10/125, 12-23/126-137, 34-42/183-191, 62-74/212-224, 91-98/203-210, 132-137/38-43, 188-204/54-70, 210-217/104-111, 221-225/83-78*, 235-246/92-103, 257-263/154-160. A * denotes reversed segments.

Figure 2. Structural comparison between PDB:1cbf and PDB:2ts1 using the SARF method. The alignment has an RMSD of 2.76 Å with 76 residues aligned in space. The figure shows the C-alpha trace for the aligned beta strands only, with a thicker line for 1cbf and a thinner line for 2ts1. The matched positions in the alignment are shown in the same color for the two structures. The alignment contains the “beta-strand pairing errors” defined in Table 1 (2), which appear where one strand has two different colors, each of which is aligned to a segment from a different strand. Aligned segments (1cbf/2ts1): 22-23/187-188, 10/125, 24-30/216-222, 37-41/7-3*, 48-50/63-65, 57-66/50-59, 68-72/116-120, 78/172, 79-90/174-185, 95-97/31-33, 98-102/189-193, 109-121/198-210, 125-129/18-14*, 208/228. A * denotes reversed segments.

Figure 3. All-against-all structure comparison and clustering for 3-layer (αβα) proteins in

CATH 3.40. A dot indicates that the paired structures have a significant SCALI alignment.

Bordered regions are four subclasses, A, B, C, and D, listed here using the PDB code, chain and

domain identifier (26).

Subclass A: 1cbf01, 1aba00, 1ag8A1, 1ag8A2, 1alkA0, 1ami02, 1aua01,

1bg200, 1c8kA2, 1chmA1, 1dhs00, 1di6A0, 1dioB0, 1ekjA0, 1fuiA2, 1glaG2, 1iso00, 1lam01,

1lba00, 1poiB0, 1ra900, 1svq00, 1tplA2, 1udg00, 1vpt00, 1vsrA0, 2cevA0, 2ctc00, 2minB2,

2ts101, 3pmgA1, 3pmgA3, 1avpA0, 1rhs01, 1hfc00, 1ble00, 1cfr00.

Subclass B: 1ctt02, 1a3aA0,

1cby00, 1cl8A0, 1eq6A0, 1fua00, 1g8tB0, 1uch00, 2bltA0.

Subclass C: 1b94A0, 1b4uB0,

1e8gA3, 1eovA2, 1nox00, 1pvuA0.

Subclass D: 1tdj03, 1br6A1, 1cfe00, 1pinA0.

Figure 4. Diagrammatic hidden Markov models for the four sub-classes of 3-layer (αβα)

proteins, A, B, C and D as defined in Figure 3. In each subclass, the upper panel shows the

topology diagram without connectivities for that core structure. Strands are shown as arrows, and

helices as circles. Shaded helices are pointing down (or into the page). Dotted lines indicate

secondary structures that are sometimes present. The lower panel is the hidden Markov model

drawn for that core. Strands are shown as triangles, and helices are shown as circles. The

connectivities between the sub-structures are shown as arrows. Thicker lines indicate more

frequent connections.

(a) Subclass A: 37 proteins. (b) Subclass B: 9 proteins. (c) Subclass C: 6 proteins. (d) Subclass D: 4 proteins.

Figure 5. A possible new fold topology. This fold has never been observed (according to the

CATH released in Jan 2004) and yet is consistent with the model for subclass A of CATH

architecture 3.40 (Figure 4a).

Figure 6. The sequence information per position for subclass-A in CATH3.40. The stereo image

shows the core region of 1cbf (with C-alpha backbone trace only), a representative from

CATH3.40 subclass-A which consists of 34 topologies non-sequentially superimposable by

SCALI (Figure 3a). Colors represent the information content of the combined sequence profiles

at each aligned position, which is calculated as the p-value for obtaining the observed

distribution of polar and non-polar amino acids by chance (as described in Methods). Blue

represents p=0.00, red is p=0.30 and higher. The p-value goes up in the hue scale from blue,

through green, to red. The high-information content positions tend to be deeply buried in the core

of the structure.

Figure 7. Diagrammatic hidden Markov models for the subclasses of 19 representative proteins

in CATH 2.60 (β sandwich) based on SCALI multiple alignments. Drawn as in Figure 4. (a) 12

proteins. (b) 3 proteins.

Figure 8. Diagrammatic hidden Markov models for the sub-classes of 29 αβ proteins in CATH

architecture 3.10 (β roll) based on SCALI alignments. Drawn as in Figure 4. (a) 6 proteins. (b) 5

proteins. (c) 2 proteins.

Figure 9. Topology of 1NIJ, which was a new fold in 2002. (a). Structure of 1NIJ. (b). Topology

of 1NIJ, which belongs to the first subclass of CATH 3.10 (Figure 8a).

Table 1. Systematic comparison of 111 SCALI alignments with CE, DALI and SARF. In this table, the averaged alignment length, RMSD, and FOM are derived from all 111 cases for the CE, SARF and SCALI methods. Only 76 alignments are used for the evaluation of the DALI method, since 35 out of 111 are not alignable. The three specific types of errors were defined as follows:

(1) Local non-equivalent error: Non-equivalent secondary structures are aligned in 3D space.

(2) Beta-strands misalignment error: The alignment contains either cross-aligned strands or unpaired strands. A cross-aligned strand error is where the paired beta-strands are aligned in the opposite order (e.g. a 4-stranded 1234 beta-sheet is aligned to a 1324 beta-sheet, with strand 2 aligned to 2 and 3 aligned to 3). An unpaired strand error is where paired strands (strands that are making hydrogen bonds) are aligned to unpaired strands.

(3) Disjoint alignment error: The alignment contains two or more segments that are spatially separate.

To evaluate the alignments from different methods, a figure of merit (FOM) was defined. FOM rewards aligned residues sharing the same secondary structure and having a Cα-Cα distance of less than roughly 3.5 Å in the final 3D alignment, and penalizes the three types of errors defined above. A lower FOM is better. Wilcoxon's sum of ranks test was performed to evaluate the statistical significance of the differences between the different alignment methods (see details in Methods).


Table 1.

Method   Average     RMSD   Local non-   Strand mis-   Disjoint   Not       Error-    Average
         alignment   (Å)    equivalent   alignment     error      aligned   free      FOM
         length             (cases)      error         (cases)    (cases)   (cases)
                                         (cases)

CE       83.3        5.9    111          28            1          0         0         1.6

DALI     81.5        5.7    76           10            18         35        0         0.1

SARF     79.7        2.7    111          22            41         0         0         -6.8

SCALI    64.7        4.3    10           7             7          0         91        -10.5

NON-SEQUENTIAL AND FLEXIBLE PROTEIN STRUCTURE ALIGNMENT

Article

How a Spatial Arrangement of Secondary Structure Elements Is Dispersed in the Universe of Protein Folds

Article

Full-text available

Sep 2014
PLOS ONE

It has been known that topologically different proteins of the same class sometimes share the same spatial arrangement of secondary structure elements (SSEs). However, the frequency by which topologically different structures share the same spatial arrangement of SSEs is unclear. It is important to estimate this frequency because it provides both a deeper understanding of the geometry of protein folds and a valuable suggestion for predicting protein structures with novel folds. Here we clarified the frequency with which protein folds share the same SSE packing arrangement with other folds, the types of spatial arrangement of SSEs that are frequently observed across different folds, and the diversity of protein folds that share the same spatial arrangement of SSEs with a given fold, using a protein structure alignment program MICAN, which we have been developing. By performing comprehensive structural comparison of SCOP fold representatives, we found that approximately 80% of protein folds share the same spatial arrangement of SSEs with other folds. We also observed that many protein pairs that share the same spatial arrangement of SSEs belong to the different classes, often with an opposing N- to C-terminal direction of the polypeptide chain. The most frequently observed spatial arrangement of SSEs was the 2-layer α/β packing arrangement and it was dispersed among as many as 27% of SCOP fold representatives. These results suggest that the same spatial arrangements of SSEs are adopted by a wide variety of different folds and that the spatial arrangement of SSEs is highly robust against the N- to C-terminal direction of the polypeptide chain.

Non-sequential protein structure alignment based on variable length AFPs using the maximal clique

Conference Paper

Dec 2016

DISCO: A New Algorithm for Detecting 3D Protein Structure Similarity

Conference Paper

Full-text available

Sep 2012

Protein structure similarity is one of the most important aims pursued by bioinformatics and structural biology, nowadays. Although quite a few similarity methods have been proposed lately, yet fresh algorithms that fulfill new preconditions are needed to serve this purpose. In this paper, we provide a new similarity measure for 3D protein structures that detects not only similar structures but also similar substructures to a query protein, supporting both multiple and pairwise comparison procedures and combining many comparison characteristics. In order to handle similarity queries we utilize efficient and effective indexing techniques such as M-trees and we provide interesting results using real, previously tested protein data sets. © 2012 IFIP International Federation for Information Processing.

Finding optimal interaction interface alignments between biological complexes

Article

Full-text available

Jun 2015
BIOINFORMATICS

Biological molecules perform their functions through interactions with other molecules. Structure alignment of interaction interfaces between biological complexes is an indispensable step in detecting their structural similarities, which are key S: to understanding their evolutionary histories and functions. Although various structure alignment methods have been developed to successfully access the similarities of protein structures or certain types of interaction interfaces, existing alignment tools cannot directly align arbitrary types of interfaces formed by protein, DNA or RNA molecules. Specifically, they require a ': blackbox preprocessing ': to standardize interface types and chain identifiers. Yet their performance is limited and sometimes unsatisfactory. Here we introduce a novel method, PROSTA-inter, that automatically determines and aligns interaction interfaces between two arbitrary types of complex structures. Our method uses sequentially remote fragments to search for the optimal superimposition. The optimal residue matching problem is then formulated as a maximum weighted bipartite matching problem to detect the optimal sequence order-independent alignment. Benchmark evaluation on all non-redundant protein -: DNA complexes in PDB shows significant performance improvement of our method over TM-align and iAlign (with the ': blackbox preprocessing ': ). Two case studies where our method discovers, for the first time, structural similarities between two pairs of functionally related protein -: DNA complexes are presented. We further demonstrate the power of our method on detecting structural similarities between a protein -: protein complex and a protein -: RNA complex, which is biologically known as a protein -: RNA mimicry case. The PROSTA-inter web-server is publicly available at http://www.cbrc.kaust.edu.sa/prosta/. xin.gao@kaust.edu.sa. © The Author 2015. Published by Oxford University Press.

Probabilistic Graphical Models and Algorithms for Protein Problems

Article

Feng Jiao

INDEXING METHODS FOR PROTEIN TERTIARY AND PREDICTED STRUCTURES

Article

bi100975z

Data

Full-text available

Jan 2014

Protein Cutoff Scanning: Aplicação da Varredura Exaustiva de Distâncias Inter-resíduos na Análise de Contatos Intracadeia em Proteínas Globulares

Thesis

Full-text available

Feb 2008

Carlos Henrique Da Silveira

Neste trabalho foi feita uma análise comparativa entre duas metodologias clássicas no estudo de contatos em proteínas: a dependente de um delimitador de distância (CD - Cutoff Dependent) e outra que não é dependente de um delimitador, a decomposição de Delaunay (DT – Delaunay Tessellation). Essas técnicas foram avaliadas usando-se duas formas diferentes de representação de resíduos (centróides): pelo carbono alfa (CA) e pelo centro geométrico da cadeia lateral (GC). Um banco de dados foi montado, compreendendo dois conjuntos chamados ALPHA e BETA contendo cadeias das duas principais classes do sistema de classificação CATH: all-alpha e all beta, respectivamente. Um delimitador em 7.0 Å emergiu como um importante parâmetro de distância na análise dos contatos inter-resíduos em proteínas. Este valor marca o ponto de bifurcação no comportamento das curvas de contatos entre as técnicas CD e DT. Até 7,0 Å, as propriedades CD e DT são unificadas numa mais abrangente: nesta distância, todos os contatos (arestas) são totais e verdadeiro-positivos (completos e não-oclusos). A distância de 7,0 Å é o ponto também em que a primeira camada de vizinhos encontra-se otimamente separada das demais, constituindo-se principalmente de contatos de primeira-ordem. É demonstrado que 7,0 Å é um ponto de transição entre os comportamentos lineares e quadráticos da curva do número total de vizinhos por resíduo. Também é mostrado que a técnica DT tem uma conhecida anomalia em sua contagem de arestas que, em proteínas, pode produzir omissões indesejáveis e sistemáticas afetando principalmente a rede de contatos de proteínas betas com centróides em CA. Uma técnica auxiliar reconhecida por tratar essa anomalia é o quase-Delaunay (AD – Almost Delaunay). É observado que mesmo AD não se mostra uma técnica proveitosa em proteínas. É empiricamente demonstrado que DT+AD convergem para CD, na medida que o parâmetro de perturbação em AD cresce. 
Isto alerta que DT e técnicas correlatas devem ser usadas com precaução em proteínas. Como conseqüência, no estrito intervalo de 0,0 Å a 7,0 Å, CD revela-se uma metodologia mais simples, completa e confiável. Por fim, é evidenciado também que a redução na representação dos resíduos aos centróides CA e GC pode introduzir tendências estatísticas na análise de vizinhos em delimitadores até 6,8 Å, com CA em favor ALPHA e GC em favor de BETA. Para valores acima de 6,8 Å, este viés parece ser eliminado. Isto provê um argumento a mais em benefício do limite em 7,0 Å, como um parâmetro de referência, robusto e de carácter geral, a ser usado de forma segura como um confiável delimitador de distância nos estudos em massa de contatos de proteínas.

Computational Methods for Protein Structure Prediction and Modeling: Volume 1: Basic Characterization

Book

Jan 2007

Gapped BLAST and PSI-BLAST: A new generation of protein database search programs

Article

Full-text available

Sep 1997

Introduction to Biostatistics.

Article

Jan 1974

Protein structure alignment using a genetic algorithm

Article

Mar 2000
PROTEINS

We have developed a novel, fully automatic method for aligning the three-dimensional structures of two proteins. The basic approach is to first align the proteins' secondary structure elements and then extend the alignment to include any equivalent residues found in loops or turns. The initial secondary structure element alignment is determined by a genetic algorithm. After refinement of the secondary structure element alignment, the protein backbones are superposed and a search is performed to identify any additional equivalent residues in a convergent process. Alignments are evaluated using intramolecular distance matrices. Alignments can be performed with or without sequential connectivity constraints. We have applied the method to proteins from several well-studied families: globins, immunoglobulins, serine proteases, dihydrofolate reductases, and DNA methyltransferases. Agreement with manually curated alignments is excellent. A web-based server and additional supporting information are available at http://engpub1.bu.edu/∼josephs. Proteins 2000;38:428–440. © 2000 Wiley-Liss, Inc.

Protein Structure Alignment Using Evolutionary Computation

Chapter

Dec 2003

Considerable amount of structural data on 3D protein structure has established structure comparison as an essential technique for understanding protein sequence, structure, function, and evolution. The goal is to predict 3D protein structures from amino acid sequence information alone. A major step toward this goal is to determine a method for discovering common protein structures in databases such as the Protein Data Bank, so that a better understanding of protein structure and function can be pieced together. Structure comparison algorithms are used to identify a set of residue equivalencies between two proteins based on their 3D coordinates. This set of equivalencies is called a structure alignment, and it allows the superposition of one protein structure onto the other after rigid rotation and/or translation. Structure alignments can indicate if two proteins share the same fold, or structural unit. Structure alignment is also used as the gold standard for evaluating protein structure prediction methods. This chapter focuses on the application of evolutionary computation to protein structure similarity problems and provides an example of a hybridization of evolutionary algorithms and other optimization techniques. The combination of these approaches offers a new and exciting method for protein structure comparison with increased specificity and sensitivity compared with previous methods.

Structure of PvuII DNA-(cytosine N4) methyltransferase, an example of domain permutation and protein fold assignment

Article

Jul 1997
NUCLEIC ACIDS RES

Weimin Gong

We have determined the structure of PvuII methyltransferase (M.PvuII) complexed with S-adenosyl-l-methionine (AdoMet) by multiwavelength anomalous diffraction, using a crystal of the selenomethionine-substituted protein. M.PvuII catalyzes transfer of the methyl group from AdoMet to the exocyclic amino (N4) nitrogen of the central cytosine in its recognition sequence 5′-CAGCTG-3′. The protein is dominated by an open α/β-sheet structure with a prominent V-shaped cleft: AdoMet and catalytic amino acids are located at the bottom of this cleft. The size and the basic nature of the cleft are consistent with duplex DNA binding. The target (methylatable) cytosine, if flipped out of the double helical DNA as seen for DNA methyltransferases that generate 5-methylcytosine, would fit into the concave active site next to the AdoMet. This M.PvuII α/β-sheet structure is very similar to those of M.HhaI (a cytosine C5 methyltransferase) and M.TaqI (an adenine N6 methyltransferase), consistent with a model predicting that DNA methyltransferases share a common structural fold while having the major functional regions permuted into three distinct linear orders. The main feature of the common fold is a seven-stranded β-sheet (6↓ 7↑ 5↓ 4↓ 1↓ 2↓ 3↓) formed by five parallel β-strands and an antiparallel β-hairpin. The β-sheet is flanked by six parallel α-helices, three on each side. The AdoMet binding site is located at the C-terminal ends of strands β1 and β2 and the active site is at the C-terminal ends of strands β4 and β5 and the N-terminal end of strand β7. The AdoMet-protein interactions are almost identical among M.PvuII, M.HhaI and M.TaqI, as well as in an RNA methyltransferase and at least one small molecule methyltransferase. The structural similarity among the active sites of M.PvuII, M.TaqI and M.HhaI reveals that catalytic amino acids essential for cytosine N4 and adenine N6 methylation coincide spatially with those for cytosine C5 methylation, suggesting a mechanism for amino methylation.

Classification of protein folds

Article

Jun 1994

Christine Orengo

Recent developments in automatic structure comparison have yielded several fast and flexible methods that allow extensive explorations of the structure databank. As a result, proteins have been clustered into a few hundred structural families. Many interesting and unexpected structural similarities have been revealed, and some folds have been shown to support diverse sequences and functions.

An Automatic Search for Similar Spatial Arrangements of α-Helices and β-Strands in Globular Proteins

Article

Jun 1989

A fast search algorithm to reveal similar polypeptide backbone structural motifs in proteins is proposed. It is based on the vector representation of a polypeptide chain fold in which the elements of regular secondary structures are approximated by linear segments (Abagyan and Maiorov, J. Biomol. Struct. Dyn. 5, 1267–1279 (1988)). The algorithm permits insertions and deletions in the polypeptide chain fragments to be compared. The fast search algorithm, implemented in the FASEAR program, is used for collecting βαβ supersecondary structure units in a number of α/β proteins from the Brookhaven Data Bank. Variation of the geometrical parameters specifying the backbone chain fold is estimated. It appears that the conformation of the majority of the fragments, although almost all of them are right-handed, is quite different from that of standard βαβ units. Apart from searching for a specific type of secondary structure motif, the algorithm allows new recurrent folding patterns in proteins to be identified automatically. It may be of particular interest for the development of the tertiary template approach to the prediction of protein three-dimensional structure, as well as for constructing artificial polypeptides with goal-oriented conformation.

Dimer Formation through Domain Swapping in the Crystal Structure of the Grb2-SH2−Ac-pYVNV Complex

Article

Nov 2000

Src homology 2 (SH2) domains are key modules in intracellular signal transduction. They link activated cell surface receptors to downstream targets by binding to phosphotyrosine-containing sequence motifs. The crystal structure of a Grb2-SH2 domain-phosphopeptide complex was determined at 2.4 Å resolution. The asymmetric unit contains four polypeptide chains. There is an unexpected domain swap so that individual chains do not adopt a closed SH2 fold. Instead, reorganization of the EF loop leads to an open, nonglobular fold, which associates with an equivalent partner to generate an intertwined dimer. As in previously reported crystal structures of canonical Grb2-SH2 domain-peptide complexes, each of the four hybrid SH2 domains in the two domain-swapped dimers binds the phosphopeptide in a type 1 β-turn conformation. This report is the first to describe domain swapping for an SH2 domain. While in vivo evidence of dimerization of Grb2 exists, our SH2 dimer is metastable and a physiological role of this new form of dimer formation remains to be demonstrated.

A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition

Article

Jan 1993

Lawrence Rabiner

Circularly Permuted Proteins in the Protein Structure Database

Article

Apr 2009
PROTEIN SCI

Some proteins are homologous to others after their sequence is circularly permuted. A few such proteins have been recognized, mainly by sequence comparison, but also by comparing their three-dimensional structures. Here we report the result of a systematic search for all protein pairs in the SCOP 90% id domain database that become structurally superimposable when the sequence of one member of the pair is circularly permuted. Using a reasonable set of criteria, we find that 47% of all protein domains are superimposable on at least one other protein domain in the database after their sequence is circularly permuted. Many of these are symmetric proteins, which superimpose on another protein both with and without a circular permutation of the sequence. However, 412 of the total 3035 domains are nonsymmetric, and these become structurally superimposable on another protein only after a circular permutation of the sequence. These include most known and many previously undetected circularly permuted proteins with remote homology.
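
To make the notion of circular permutation concrete, the following is a minimal sketch, not the search procedure used in the study above. It assumes each domain has been reduced to a string of per-residue secondary-structure labels, and it tests only for an exact match after rotation, whereas a real search compares 3D coordinates and tolerates insertions and deletions. The function name `circular_permutation_offset` and the example strings are hypothetical.

```python
from typing import Optional


def circular_permutation_offset(a: str, b: str) -> Optional[int]:
    """Return the smallest k such that b == a[k:] + a[:k], or None.

    Illustrative assumption: a and b are strings of per-residue labels
    (for example E = strand, H = helix, L = loop), not 3D coordinates.
    """
    if len(a) != len(b):
        return None
    if not a:
        return 0
    # Every rotation of a occurs as a substring of a + a, so a single
    # substring search covers all possible cut points.
    k = (a + a).find(b)
    return k if k != -1 else None


# Hypothetical secondary-structure strings for two domains.
native = "EEELHHHHLEEELHHHH"
permuted = native[9:] + native[:9]  # move the chain termini to a new cut point

print(circular_permutation_offset(native, permuted))             # offset of the cut
print(circular_permutation_offset(native, "HHHHHHHHHHHHHHHHH"))  # no rotation matches
```

An offset of 0 means the two strings match without any permutation. A symmetric domain, in the sense used above, would match at offset 0 and at one or more non-zero offsets as well; this sketch reports only the smallest one.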
