
Non-sequential Structure-based Alignments Reveal Topology-independent Core Packing Arrangements in Proteins

Xin Yuan* and Christopher Bystroff*
Department of Biology, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
Phone: 518-276-3185, Fax: 518-276-2162
Email: bystrc@rpi.edu, yuanx2@rpi.edu
*To whom correspondence should be addressed
Keywords: topology independent structure alignment, protein classification, structure
prediction, hidden Markov model, conserved core packing, contact map
Bioinformatics © Oxford University Press 2004; all rights reserved.
Bioinformatics Advance Access published November 5, 2004
ABSTRACT
Motivation: Proteins of the same class often share a secondary structure packing arrangement
but differ in how the secondary structure units are ordered in the sequence. We find that proteins
that share a common core also share local sequence-structure similarities, and these can be
exploited to align structures with different topologies. In this study, segments from a library of
local sequence-structure alignments were assembled hierarchically, enforcing the compactness
and conserved inter-residue contacts but not sequential ordering. Previous structure-based
alignment methods often ignore sequence similarity, local structural equivalence, and
compactness.
Results: The new program, SCALI (Structural Core ALIgnment), can efficiently find conserved
packing arrangements, even if they are non-sequentially ordered in space. SCALI alignments
conserve remote sequence similarity and contain fewer alignment errors. Clustering of our
pairwise non-sequential alignments shows that recurrent packing arrangements exist in
topologically different structures. For example, the 3-layer sandwich domain architecture may be
divided into four structural subclasses based on internal packing arrangements. These subclasses
represent an intermediate level of structure classification, more general than topology but more
specific than architecture as defined in CATH. A strategy is presented for developing a set of
predictive hidden Markov models based on multiple SCALI alignments.
Availability: An online topology independent SCALI structure comparison server is available at
http://www.bioinfo.rpi.edu/~bystrc/scali.html.
Contact: bystrc@rpi.edu; yuanx2@rpi.edu
INTRODUCTION
Recurrent structural motifs in proteins can be found by structure-based alignment
methods. Generally it is assumed that similar protein structures will align to each other in a
sequential manner, conserving the direction of the chain and the order of the structural units. But
many examples exist of structural similarity that is non-sequential, produced possibly by
sequence rearrangements (Janowski, et al., 2001;Bennett, et al., 1994;Schiering, et al.,
2000;Jeltsch, 1999;Gong, et al., 1997;Iwakura, et al., 2000;Viguera, et al., 1995;Smith, et al.,
2001;Jung, et al., 2001) or by convergent evolution (Rost, 1997;Milik, et al., 2003). Circular
permutants and other rearrangements represent the topologically possible and energetically
favorable ways of arranging secondary structure units along the chain.
Structural similarities that have permuted orders are interesting because they reveal
recurrent structural packing themes in proteins (Efimov, 1995;Abagyan, et al., 1989;Alexandrov,
1996). Examples are presented later in this paper. These recurrent themes may be used to build
predictive models. But so far, there are no sequence models for the well-known structural
paradigms at the level of protein architecture. Instead, the focus has been on predicting structure
at the family, superfamily or fold level (Eddy, 1998;Karplus, et al., 1998;Gough, et al., 2002).
The most successful of these are hidden Markov models (HMM), which generally do not allow
for non-sequential alignments. The lack of non-sequential HMMs may be due in part to the
difficulty in obtaining good structural alignments without sequential constraints. Because of this,
a recurrent structural motif in a new protein may not be recognized as such if it is sequentially
permuted.
Alignments of topologically different structures may be found by inspection using an
interactive graphics program such as Rasmol (Bernstein, 2000;Sayle, et al., 1995). The spatial
alignment of permuted segments is often remarkably good, yet most structure-based alignment
programs, such as DALI (Holm, et al., 1993), CE (Shindyalov, et al., 1998), VAST (Gibrat, et
al., 1996), PrISM (Yang, et al., 2000) and MAMMOTH (Ortiz, et al., 2002), cannot find these
superpositions because they assume the aligned segments are sequentially ordered. Two
exceptions to this rule are SARF (Alexandrov, 1996;Alexandrov, et al., 1996) and K2
(Szustakowski, et al., 2002;Szustakowski, et al., 2000), which consider non-topological
alignments using secondary structure element information. However, neither of these programs
considers the sequence similarity. Our goal was to build sequence models from structure-based
alignments; therefore we developed a new program that optimizes both the structure and
sequence similarity.
The new program, SCALI (Structural Core ALIgnment), was conceived based on the
following five criteria defining a biologically relevant structure-based alignment (Koehl,
2001;Flores, et al., 1993;Taylor, et al., 1989). (1) Aligned residues should conserve structure
locally (i.e. backbone angles). (2) Contacts between pairs of aligned residues should be
conserved. (3) The alignment as a whole should be spatially compact, rather than disperse. (4)
Aligned segments should have some degree of sequence similarity. (5) The sequence order of
aligned segments should be minimally permuted.
In preliminary studies, we constructed non-sequential alignments manually for two cases
of topologically different proteins with similar 3D core packing arrangements, one of which is
illustrated in Figure 1. We then attempted to reproduce the manually constructed alignments
automatically, using a fragment assembly strategy. The new program was compared with two
of the most commonly used structure alignment programs, DALI and CE, and with two
non-sequential alignment programs, SARF and K2.
Pairwise SCALI alignments of representative protein structures were clustered to produce
multiple structure alignments. Within these clusters we found recurrent core packing
arrangements that could be used as models for structure prediction. Hidden Markov models
(HMM) based on these “cores” are presented here in a diagrammatic form. These models
represent a level of structural classification that is more general than “fold” or “topology” but
more specific than “architecture” or “class”. Applications of the new non-topological HMMs for
structure prediction and design are discussed. Recurrent core packing geometries may also tell us
something about the folding process.
METHODS
SCALI: non-sequential sequence-structure alignment
SCALI aligns structures in a three-step process. First we generate a library of gapless
local sequence-structure alignments (“fragments”) using HMMSTR (Bystroff, et al., 2000). The
second step is a tree search in alignment space, where each branch point is the addition of a new
fragment to the alignment. Finally, the best alignments are pruned and extended.
HMMSTR (Hidden Markov Model for protein STRucture) is an almost comprehensive
model for local sequence/structure correlations in proteins (Bystroff, et al., 2000). In HMMSTR,
each Markov state represents a single position in an I-sites motif (Bystroff, et al., 1998). Each
state contains information about the amino acid preference and the preferred backbone angles.
The transitions between the states model the adjacencies of motifs in protein sequences.
HMMSTR has been used for secondary structure prediction, remote homolog detection (Hou, et
al., 2003) and for developing knowledge-based contact potentials (Shao, et al., 2003). The
algorithms for using this and other HMMs are described in Rabiner’s classic tutorial (Rabiner,
1989).
To align two structures using SCALI, we first computed the position specific HMMSTR
state probabilities, denoted γ, using the Forward/Backward algorithm (Rabiner, 1989). The input
to this program was a sequence profile derived from PSI-BLAST (Altschul, et al., 1997) as
described previously (Bystroff, et al., 1998).
Next, we made an exhaustive list of all aligned fragments. To obtain this list, we first
calculated the alignment matrix A as the dot-product of the state probabilities:
$$A_{ij} = \sum_{q} \gamma_{iq}^{\mathrm{target}} \, \gamma_{jq}^{\mathrm{template}} \qquad (1)$$
where q represents a Markov state in the HMMSTR model, and $\gamma_{iq}$ is the probability of state q
at position i. The score S(i, j, L) for a fragment of length L, starting at position i in the target and
j in the template is simply the sum over a diagonal segment of the alignment matrix A:
$$S(i, j, L) = \sum_{k=0}^{L-1} A_{(i+k)(j+k)} \qquad (2)$$
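As an illustration of Eqs. 1 and 2, the following minimal Python sketch computes the alignment matrix as a dot product of per-position state probabilities and then scores one gapless fragment. The function names and the three-state toy probabilities are assumptions made for this example; they are not HMMSTR output and not part of the SCALI implementation.

```python
def alignment_matrix(gamma_target, gamma_template):
    """Eq. 1: A[i][j] = sum over states q of gamma_target[i][q] * gamma_template[j][q]."""
    return [
        [sum(t * p for t, p in zip(row_t, row_p)) for row_p in gamma_template]
        for row_t in gamma_target
    ]


def fragment_score(A, i, j, length):
    """Eq. 2: sum of A along the diagonal starting at (i, j) for `length` positions."""
    return sum(A[i + k][j + k] for k in range(length))


# Toy state probabilities for a hypothetical 3-state model (illustration only).
gamma_target = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
gamma_template = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]

A = alignment_matrix(gamma_target, gamma_template)
score = fragment_score(A, 0, 0, 2)  # A[0][0] + A[1][1]
```

Because every entry of A is a sum of products of probabilities, it is non-negative, so extending a fragment along its diagonal can never lower its score.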
All possible fragments, defined by the positions i, j, and the length L, were compiled into a
list, subject to the following constraints. A fragment
(1) must have no gaps or insertions,
(2) must have no backbone angle difference greater than 90˚,
(3) must be at least 5 residues in length, and
(4) must not be contained within a longer fragment that has a higher score.
Fragments were sorted by their alignment score,
S (Eq. 2). In every example of two
aligned segments that have no backbone angle differences greater than 90°, the two segments are
superimposable with a low root-mean-square deviation (RMSD). There is no upper limit on the
length of a fragment.
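The fragment list can be sketched as a scan along each diagonal of the alignment matrix. The version below is a simplified illustration, assuming backbone angles are supplied as (phi, psi) tuples in degrees; it keeps only the maximal run on each diagonal, which is one plausible way to satisfy constraint (4) when all matrix entries are non-negative. All names here are hypothetical.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def enumerate_fragments(A, angles_target, angles_template, min_len=5, max_dev=90.0):
    """Return (score, i, j, length) for maximal gapless runs, best score first."""
    n, m = len(angles_target), len(angles_template)
    fragments = []
    for offset in range(-(m - 1), n):  # one diagonal per offset i - j
        i, j = max(offset, 0), max(-offset, 0)
        start = None  # start of the current compatible run, if any
        while i <= n and j <= m:
            inside = i < n and j < m
            ok = inside and all(
                angle_diff(a, b) <= max_dev
                for a, b in zip(angles_target[i], angles_template[j])
            )
            if ok and start is None:
                start = (i, j)
            elif not ok and start is not None:
                length = i - start[0]
                if length >= min_len:  # constraint (3): at least 5 residues
                    score = sum(A[start[0] + k][start[1] + k] for k in range(length))
                    fragments.append((score, start[0], start[1], length))
                start = None
            i += 1
            j += 1
    return sorted(fragments, reverse=True)


# Toy input: six helical residues versus five helical residues plus one outlier.
toy_target = [(-60.0, -45.0)] * 6
toy_template = [(-60.0, -45.0)] * 5 + [(60.0, 45.0)]
toy_A = [[1.0] * 6 for _ in range(6)]
toy_fragments = enumerate_fragments(toy_A, toy_target, toy_template)
```

In the toy input, the outlier at the last template position breaks the main diagonal after five residues, so only two fragments of length 5 survive the length filter.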
A breadth-first tree search in alignment space was conducted using a contact map scoring
function. A contact map, C, is an N × N matrix where $C_{ij} = 1$ if the β-carbons (Cα for glycine) of
residues i and j are separated by less than 8Å, and 0 otherwise. The n (where n=200) fragments
with the highest scores, S (Eq. 2), were used as seed alignments for the tree search. At each
branch point, the parent alignment y was extended using fragment x if and only if:
(1) no residue in x is already aligned,
(2) there is at least one conserved contact between fragment x and a residue in y,
(3) distance geometric constraints are not violated, meaning Distance(i, j) < 3.8 × | l–k |,
and Distance(k, l) < 3.8 × | j–i |, for all positions i aligned to l, and j aligned to k,
(4) the resulting alignment has one of the top n scores (NS, as defined in Eq. 6).
The top n scoring alignments (parents and children) became the parent alignments of a new
search, and the process was repeated until no new fragments could be added.
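Two ingredients of the tree search, the contact map and the distance geometric constraint (3), can be sketched as follows. This is an illustrative reading of the text rather than the SCALI code: coordinates are assumed to be one (x, y, z) tuple per residue, aligned residues are given as (target index, template index) pairs, and the function names are hypothetical.

```python
import math


def contact_map(coords, cutoff=8.0):
    """C[i][j] = 1 if residues i and j are closer than `cutoff` angstroms, else 0."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


def violates_distance_geometry(target_xyz, template_xyz, pairs, step=3.8):
    """Check constraint (3) for every pair of aligned positions.

    For i aligned to l and j aligned to k, the text requires
    Distance(i, j) < step * |l - k| and Distance(k, l) < step * |j - i|.
    """
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            (i, l), (j, k) = pairs[a], pairs[b]
            if math.dist(target_xyz[i], target_xyz[j]) >= step * abs(l - k):
                return True
            if math.dist(template_xyz[l], template_xyz[k]) >= step * abs(j - i):
                return True
    return False


# Toy chain: four residues on a straight line, 3 angstroms apart.
line = [(3.0 * r, 0.0, 0.0) for r in range(4)]
C = contact_map(line)
```

The constraint rejects alignments in which two residues are farther apart in one structure than the intervening chain in the other structure could possibly span.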
The similarity between two contact maps is more sensitive than the global RMSD when
comparing distantly related proteins (Yang, et al., 1999), since conformational plasticity can
result in a high overall RMSD even when most of the pairwise contacts are conserved. The
contact score, CS, is the sum over the dot products of the contact maps for all aligned segments:
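$$T = \sum_{ij} \left( C_{ij} \times C_{kl} \right) \qquad (3)$$
$$F = \sum_{ij} \left( (1 - C_{ij}) \times C_{kl} + C_{ij} \times (1 - C_{kl}) \right) \qquad (4)$$
$$CS = T - \mathrm{NCpenalty} \times F \qquad (5)$$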
Here, $C_{ij}$ is the contact property at position i, j in the target, and $C_{kl}$ is the contact property at
position k, l in the template, where i is aligned to l, and j is aligned to k. T counts the conserved
contacts. F, the number of false positive and false negative contacts (contacts present in only one
of the two structures), is weighted by the non-contact penalty (NCpenalty).
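A small sketch of the contact score may help make Eqs. 3 to 5 concrete. It reads Eq. 5 as CS = T minus NCpenalty times F, sums over unordered pairs of aligned positions (an assumption, since the text does not say whether each pair is counted once or twice), and uses the NCpenalty value of 0.15 reported later in the Methods. The function name and input layout are hypothetical.

```python
def contact_score(C_target, C_template, pairs, nc_penalty=0.15):
    """CS = T - NCpenalty * F over all pairs of aligned positions.

    `pairs` lists aligned residues as (target index, template index).
    T counts contacts present in both maps; F counts contacts present in only one.
    """
    T = 0
    F = 0
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            (i, l), (j, k) = pairs[a], pairs[b]
            c_t = C_target[i][j]
            c_p = C_template[l][k]
            T += c_t * c_p
            F += (1 - c_t) * c_p + c_t * (1 - c_p)
    return T - nc_penalty * F


# Toy maps: the target lacks the 0-2 contact that the template has.
toy_C_target = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
toy_C_template = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
toy_cs = contact_score(toy_C_target, toy_C_template, [(0, 0), (1, 1), (2, 2)])
```

In the toy case, two contacts are conserved and one is present only in the template, so the score is 2 minus one penalty unit.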
The score NS is calculated as follows:
$$NS = CS - \mathrm{NSpenalty} \times \mathrm{Nb} \qquad (6)$$
where NSpenalty is a constant for non-sequential penalty, and Nb is defined as the number of
breaks needed to convert the alignment into sequential order. For example, if we have the
alignment with the blocks labeled as ACBD, where each letter represents a sequentially aligned
segment, to reorder the blocks sequentially to ABCD, three breaks are required, therefore Nb=3.
For the arrangement CDAB, a circular permutation, Nb=1. The optimal settings for NSpenalty
and NCpenalty were determined empirically by reproducing manually constructed alignments.
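One way to count breaks that reproduces both worked examples (ACBD gives 3, CDAB gives 1) is to count adjacent blocks that are not consecutive in sequence. The sketch below uses that counting rule, which is an interpretation of the text rather than a confirmed description of the SCALI code, together with the NSpenalty value of 6.0 reported in the Methods. Function names are hypothetical.

```python
def count_breaks(block_order):
    """Count adjacent block pairs that are not consecutive in sequential order.

    `block_order` lists block labels in the order they appear in the alignment,
    with labels chosen so that sorting them gives the sequential order.
    """
    labels = sorted(block_order)
    ranks = [labels.index(b) for b in block_order]
    return sum(1 for a, b in zip(ranks, ranks[1:]) if b != a + 1)


def non_sequential_score(cs, block_order, ns_penalty=6.0):
    """NS = CS - NSpenalty * Nb."""
    return cs - ns_penalty * count_breaks(block_order)


nb_acbd = count_breaks("ACBD")
nb_cdab = count_breaks("CDAB")
ns_cdab = non_sequential_score(20.0, "CDAB")
```

Under this rule a circular permutation always costs a single break, regardless of how many blocks it moves.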
The final step in generating the alignment is pruning and extension, based on global
RMSD. Occasionally SCALI aligned fragments that conserved a pattern of contacts but the two
structures were mirror images of each other. 3D superposition followed by pruning eliminated
these types of errors. After the superposition of structures, an aligned block was removed if it:
(1) had any distance difference greater than 9Å, or
(2) had any backbone angle difference greater than 100˚.
Similarly, an aligned block was extended on either end if the extension:
(1) had no backbone angle difference greater than 100˚, and
(2) had no distance difference greater than 9Å.
The larger cutoff in distance allowed for distorted packing arrangements. Pruning and extension
were applied iteratively as long as the RMSD and aligned length continued to improve.
Empirical values for the permutation penalty (NSpenalty) and non-contact penalty
(NCpenalty) were determined by attempting to reproduce the manual non-sequential alignments
for two study cases: 1fsfA versus 1ig0A, and 1jx6A versus 1qo7A. The two manual alignments
were made by inspection using the molecular modeling program InsightII (Accelrys, Inc.).
Different parameter settings were assessed by inspecting the automated alignments and
comparing them with the manual ones. The SCALI result for a difficult sequential alignment
between remote homologs (sequence identity of 4.5%), 1rec_ versus 1eg4A, was also inspected.
The final settings were NCpenalty = 0.15 and NSpenalty = 6.0.
Many of the automatically generated alignments (available from the website) were
inspected in order to validate the method. In particular, we looked for the types of errors as
described in the Results and in Table 1. No further changes were made to the algorithm once the
validation was undertaken.
DALI, CE, SARF alignments
CE and SARF programs were downloaded and the alignments were generated locally.
DALI alignments were obtained from the server (www.ebi.ac.uk/dali/Interactive.html). For each
program, the default settings were used. We were unable to run the K2 program in-house for
technical reasons. Programs were written to test each alignment for specific types of errors, as
described in the Results.
Comparison and evaluation of the alignments from different methods
A comparison of various structural alignment methods with SCALI was performed using
a reference set of 111 alignments derived from CATH. The alignments of SCALI were compared
to those from DALI, CE and SARF, which were generated as described above. The alignment
results are summarized in Table 1. While the alignments were judged by visual inspection, an
evaluation was also undertaken using a figure-of-merit (FOM) scoring function denoted as:
$$FOM = w_1(Len) + w_2(RMSD - 3.5) + w_3(Nonlocal) + w_4(Disj) + w_5(Berr) \qquad (7)$$
Here, FOM is computed as the scaled sum of the five criteria, which are represented as Len,
RMSD, Nonlocal, Disj, and Berr in Eq. 7. Each criterion is assigned a weight ($w_1$ to $w_5$). Len is
the number of locally correct aligned residues in the alignment, where the positions having
backbone angle deviations greater than 120° were ignored. RMSD is the root mean square
deviation of the aligned Cα coordinates. Nonlocal is the percentage of the aligned residues that
were locally non-equivalent, having backbone angle deviations greater than 120°. Disj is the
number of disjoint segments in the alignment, and Berr is the number of misaligned beta strands
(as described in Table 1-(2)). The weights were chosen so as to roughly equalize the
contribution of each factor: $w_1 = -0.2$, $w_2 = 2.0$, $w_3 = 15$, $w_4 = 4$, and $w_5 = 8$. A smaller FOM is
better.
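The figure of merit can be written as a short function. The sketch below reads the RMSD term of Eq. 7 as (RMSD - 3.5) and treats Nonlocal as a fraction between 0 and 1; the text calls it a percentage, so the scale used here is an assumption. The function name and argument names are hypothetical.

```python
# Weights as reported in the text (w1 to w5).
WEIGHTS = {"len": -0.2, "rmsd": 2.0, "nonlocal": 15.0, "disj": 4.0, "berr": 8.0}


def figure_of_merit(length, rmsd, nonlocal_frac, disjoint, beta_errors, w=WEIGHTS):
    """FOM: weighted sum of the five criteria; smaller is better.

    `nonlocal_frac` is taken here as a fraction in [0, 1] (an assumption).
    """
    return (
        w["len"] * length
        + w["rmsd"] * (rmsd - 3.5)
        + w["nonlocal"] * nonlocal_frac
        + w["disj"] * disjoint
        + w["berr"] * beta_errors
    )


clean = figure_of_merit(100, 3.5, 0.0, 0, 0)
flawed = figure_of_merit(100, 4.5, 0.1, 1, 1)
```

With these weights, one misaligned beta strand costs as much as 40 correctly aligned residues gain, which reflects how heavily strand pairing errors are penalized.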
The one-tailed paired Wilcoxon’s sum of signed ranks test (Sokal, et al., 1973) was
calculated to measure the significance of the difference in FOM between the four alignment
methods (CE, DALI, SARF and SCALI). The differences in FOM scores for the two methods
being compared were ranked based on their absolute values, and the positive and negative ranks
were summed separately. The smaller sum ($T_s$) was used to calculate a Z-score (Eq. 8, n=111).
$$Z = \frac{\left| T_s - \frac{n(n+1)}{4} \right|}{\sqrt{\frac{n(2n+1)(n+1)}{24}}} \qquad (8)$$
The calculated Z was compared to tabulated critical values in (Sokal, et al., 1973) to
obtain a p-value, or significance level. Lower is more significant. The DALI method aligned
only 76 cases, so for comparisons to DALI, n=76 and only those alignments were used.
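The ranking procedure and the Z-score can be sketched in a few lines of Python. This version uses the standard normal approximation with a square root in the denominator, drops zero differences, and gives tied absolute differences their average rank; the handling of zeros and ties is an assumption, since the text does not describe it. The function name is hypothetical.

```python
import math


def wilcoxon_z(fom_a, fom_b):
    """Return (T_s, Z) for paired FOM scores from two alignment methods."""
    diffs = [a - b for a, b in zip(fom_a, fom_b) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ordered = sorted(diffs, key=abs)

    # Rank by absolute value, giving tied values their average rank.
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and abs(ordered[end + 1]) == abs(ordered[pos]):
            end += 1
        average_rank = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            ranks[t] = average_rank
        pos = end + 1

    positive = sum(r for r, d in zip(ranks, ordered) if d > 0)
    negative = sum(r for r, d in zip(ranks, ordered) if d < 0)
    t_s = min(positive, negative)
    z = abs(t_s - n * (n + 1) / 4) / math.sqrt(n * (2 * n + 1) * (n + 1) / 24)
    return t_s, z


# Toy data: method A has the lower (better) FOM in every pair.
toy_t, toy_z = wilcoxon_z([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12])
```

When one method wins every pair, the smaller rank sum is zero and Z reaches its maximum for that sample size.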
Clustering multiple SCALI structural alignments
Pairwise SCALI non-sequential alignments were clustered using a simple greedy
algorithm. The set of all pairwise alignments defines a graph where each vertex represents one
protein, and an edge exists if the alignment between the two structures had RMSD less than 4.0Å and at
least 50 residues aligned. The first cluster was the vertex with the most edges and all of its
connected vertices. The second cluster was the vertex with the most edges and all of the
connected vertices after removing the first cluster, and so on.
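The greedy procedure can be sketched as follows, assuming each pairwise alignment is summarized as a tuple (protein a, protein b, RMSD, aligned length) and reading the edge criterion as RMSD below 4.0 Å with at least 50 aligned residues. The function names, the alphabetical tie-breaking, and the toy records are illustrative assumptions.

```python
def build_edges(alignments, max_rmsd=4.0, min_aligned=50):
    """Keep protein pairs whose alignment passes the RMSD and length cutoffs."""
    return [(a, b) for a, b, rmsd, n in alignments if rmsd < max_rmsd and n >= min_aligned]


def greedy_clusters(proteins, edges):
    """Repeatedly take the vertex with the most remaining neighbors as a cluster center."""
    neighbors = {p: set() for p in proteins}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    remaining = set(proteins)
    clusters = []
    while remaining:
        # Ties are broken alphabetically so the result is deterministic.
        center = max(sorted(remaining), key=lambda p: len(neighbors[p] & remaining))
        members = {center} | (neighbors[center] & remaining)
        clusters.append((center, sorted(members)))
        remaining -= members
    return clusters


# Hypothetical alignment records: (protein a, protein b, RMSD in angstroms, aligned residues).
toy_alignments = [
    ("a", "b", 3.1, 72),
    ("a", "c", 3.8, 55),
    ("d", "e", 2.9, 60),
    ("b", "d", 5.2, 80),  # rejected: RMSD too high
    ("c", "e", 3.0, 40),  # rejected: too few aligned residues
]
toy_clusters = greedy_clusters("abcde", build_edges(toy_alignments))
```

In this sketch, proteins with no remaining neighbors end up as single-member clusters, which corresponds to structures with a unique core.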
Theoretically, this simple clustering method could group together different structures by
transitive association (Structure A is similar to B, and B is similar to C, but A is not similar to C).
Surprisingly this did not happen. Instead, alignments within a cluster conserved the same spatial
location. We should note again here that the structures in the CATH database are single domains,
and therefore we did not expect to find multiple cores within one structure. Also, domain folding
is usually an all-or-none, two-state phenomenon, and this may explain why we did not observe
transitive association.
Hidden Markov models based on SCALI multiple structure alignments
A hidden Markov state was defined for each column in the multiple structure alignment
after clustering of our alignments. The state sequence profiles were initialized by summing the
profiles of the aligned positions. The aligned proteins were representatives of the CATH
topologies. Each protein was given equal weight in the summation. Any non-aligned residues
were condensed to a single “Loop” state that connects the aligned states. The Loop states emit
sequences whose length was drawn from a probability distribution. The probability distribution
may be flat, allowing any size loop with equal probability.
State-state transitions were defined according to the sequential ordering of the states in
each member protein. In many cases, since the alignments were non-sequential, cyclic state paths
were possible. These paths are not physically meaningful, since they would imply that two
residues can occupy the same position in space. Therefore, “self-avoiding” states were defined as
Markov states that could be visited at most once in any state pathway. The development of a
modified Forward-Backward algorithm (Rabiner, 1989) that handles self-avoiding states is
ongoing; such an algorithm is needed to obtain correct results on our newly defined HMMs.
In the figures, the Markov states for aligned positions are grouped into single icons
representing secondary structures, according to the TOPS convention (Westhead, et al., 1999).
Loop states are not drawn, but would occur on each of the arrows.
Information content of HMM states
The information content is defined as the likelihood of obtaining a similar distribution of
polar and non-polar residues by chance given the number of observations. To estimate this
likelihood, we ran 5000 simulations for each Markov state. We randomly chose amino acids
from the background distribution N times, where N was the number of observations. The p-value
for non-polar was calculated as the fraction of the 5000 randomly-generated profiles where the
percent non-polar matched or exceeded the percent non-polar of the observed data. If the Markov
state represented a polar position, then a polar p-value was calculated using a similar method.
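The simulation for a non-polar position can be sketched as below. The set of residues counted as non-polar and the background frequencies are assumptions chosen for illustration; the text does not list them. The function name and the fixed random seed are likewise hypothetical.

```python
import random

# Hypothetical grouping of non-polar residues (assumed for this example).
NONPOLAR = set("AVLIMFWC")


def nonpolar_p_value(observed, background, n_sim=5000, rng=None):
    """Fraction of simulated profiles at least as non-polar as the observed one.

    `observed` is a string of residues seen at one aligned position;
    `background` maps residue letters to background frequencies.
    """
    rng = rng or random.Random(0)
    n = len(observed)
    observed_count = sum(aa in NONPOLAR for aa in observed)
    letters = list(background)
    weights = [background[aa] for aa in letters]
    hits = 0
    for _ in range(n_sim):
        sample = rng.choices(letters, weights=weights, k=n)
        if sum(aa in NONPOLAR for aa in sample) >= observed_count:
            hits += 1
    return hits / n_sim


# Toy two-letter background: leucine (non-polar) and lysine (polar) equally likely.
toy_background = {"L": 0.5, "K": 0.5}
p_all_polar = nonpolar_p_value("K" * 10, toy_background)
p_all_nonpolar = nonpolar_p_value("L" * 20, toy_background)
```

A position observed as entirely polar gets a non-polar p-value of 1, since every random profile matches or exceeds zero non-polar residues, while a position that is entirely non-polar over 20 observations is very unlikely to arise by chance.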
Run-time complexity, implementation and availability
The alignment algorithm was implemented in Fortran90. The run time complexity for the
main alignment program is $O(\min(L1, L2)^2 \times L1 \times L2)$, where L1 and L2 are the lengths for the
target and template protein, respectively. The typical run time for proteins of length 250 on one
700MHz Pentium3 CPU was about 15 minutes. A searchable database of pre-calculated
alignments may be found at http://www.bioinfo.rpi.edu/~bystrc/scali.html. Development of an
installation package is in progress and will appear at the same site.
RESULTS
Validation of structure-based alignments
To assess its ability to reproduce state-of-the-art sequential structure-based alignments,
SCALI was tested on a set of 120 pairs of distant structural homologs, defined as members of the
same topology class in the CATH database but having less than 25% sequence identity. The
alignments were compared with those from CE and DALI programs. All three methods produced
similar aligned sub-structures. However, in CE and DALI alignments there were segments that
should not have been aligned by the intuitive criteria defined above. Specifically, aligned
residues sometimes lacked local structural similarity and/or the aligned region was not compact.
To assess its ability to find non-sequential alignments, SCALI was compared with DALI,
CE and SARF on a set of topologically different structures. One example is the alignment of
structures 1fsfA and 1ig0A, shown in
Figure 1. These two proteins were aligned manually first,
and the manual alignment was used to develop the method. Both proteins contain parallel six-
stranded beta sheets with five alpha helices arranged anti-parallel to the strands, two on one side
and three on the other. The sheet continues in both cases, but in 1ig0A the seventh strand is anti-
parallel. The seven strands appear in order 1765234 in 1fsfA and 4321567 in 1ig0A (i.e. Strand 1
in 1fsfA is the structural equivalent of strand 4 in 1ig0A, and so on). In our manual alignment,
the six parallel strands and the five helices could be aligned with one circular permutation.
SCALI reproduced this alignment, superimposing the eleven secondary structure units with
RMSD of 5.4Å (Figure 1a).
Both CE and DALI produced sequential alignments with the beta sheets in the flipped
orientation. CE aligned strands 4325 in 1fsfA with strands 4325 in 1ig0A, aligning unpaired
strands to paired strands (Figure 1b). Strand 1, which is between strands 2 and 5 in 1ig0A, was
left unaligned (colored green in Figure 1b). DALI aligned strands 43257 in 1fsfA to strands
32157 in 1ig0A, but one strand is skipped and the two strand 7’s point in opposite directions
(Figure 1c). Both CE and DALI alignments contained additional alignments of non-equivalent
secondary structures. SARF was able to find the correct six-stranded sheet alignment but did not
align three of the helices (Figure 1d). The alignment is disjoint and several aligned segments are
locally different. Two segments are aligned in reverse. Some of the beta strand alignments are
offset by one residue (see Figure 1 caption).
To further examine the ability and quality of the automatic SCALI method in aligning
different topologies, a set of 111 pairs of different topologies that had similar core
structures was used. This list includes randomly selected proteins that shared the same
architecture but differed in topology, according to the CATH classification (Orengo, et al.,
1997;Orengo, 1994;Pearl, et al., 2003;Pearl, et al., 2000). All of the pairwise alignments returned
from CE, DALI, SARF and SCALI were compared to each other (Table 1) without manual
curation. CE returned the alignments for all of the structure pairs, whereas DALI returned only
76 alignments, with 35 pairs rejected as non-superimposable. As expected, neither CE nor DALI
returned non-sequential alignments, since they were not designed to do so. SARF returned
alignments, including non-sequential and reverse alignments, for all of the test cases. In the
alignments, three specific types of errors were observed. These are defined and tabulated in
Table 1.
Among the returned alignments, none of the CE, DALI and SARF methods was free of
errors. Both types of strand pairing misalignment (“unpaired strand” and “cross-aligned
strand” in Table 1) occurred in CE and DALI, and are similar to the strand pairing
error shown in Figure 1b. SARF also made strand pairing errors, and one example is shown
in Figure 2. In this alignment, one beta strand in 1cbf is aligned to two beta strands (in green and
pink) and one beta strand in 2ts1 is aligned to two strands (in blue and pink), which results in
non-equivalent paired hydrogen bonding among the aligned strands. The alignments returned from
SCALI did not contain strand pairing errors due to its algorithmic design, which requires
conserved contacts among the aligned segments. Compared with CE, the better performance of
DALI on these difficult cases seems to be due to its ability to decide when to align and when not
to. It will fail to return the alignment if a sequential structural comparison is too difficult, while
CE will return the alignment anyway, even though it may contain many errors. Both types of
strand misalignment result in a parallel displacement of one strand and cause only a minor
increase in the RMSD.
SARF often produced a subset or superset of the SCALI alignment, but SARF aligned
segments in reverse and allowed non-equivalent local structures to align. SCALI does not allow
segments to be aligned in reverse. SARF alignments usually had disjoint pieces (41 out of 111).
SARF alignments were often displaced by one residue from SCALI alignments. This would
change the direction of the side chain in beta strands. 91 out of 111 of our alignments were
compact and contained no obvious errors.
To further evaluate the quality of the 111 alignments generated from various structural
alignment methods, the figure of merit (FOM) was defined and computed (Eq. 7 in Methods). In
principle, the FOM should capture the quality of the structural alignment by rewarding correct
spatial alignments and penalizing the errors (defined in the
Table 1 caption). The FOM scoring
function gives approximately equal penalties to the local non-equivalence errors and beta strand
misalignment errors, and less penalty to the disjoint alignment errors. Using the FOM scores to
evaluate the quality of each alignment method, SCALI performed the best on 81 out of 111 non-
sequential alignments. If we ignore the 35 cases where DALI failed to align the structures, the
best among the four methods is SCALI, followed by SARF, DALI and CE. Wilcoxon’s sum
of signed ranks test was performed for all six possible pairwise comparisons, and the results show
that the differences between methods are statistically significant at the 0.1% level. We should
note that the choice of topologically different protein pairs favored SARF and SCALI over the
other two methods.
Cluster analysis
By clustering pairwise SCALI alignments, we obtained non-sequential multiple structure
alignments for several CATH architectures, including the “up-down α bundle”, “β sandwich”, “β
roll”, “3-layer αβα sandwich” proteins, and others. As an example, we chose the 3-layer αβα
“sandwich” (CATH code 3.40) for the all-against-all comparisons. This protein architecture is
the most common and diverse, comprising 61 different topologies (Orengo, et al., 1997;Orengo,
1994;Pearl, et al., 2003;Pearl, et al., 2000). After clustering all 1830 alignments, 56 out of the 61
structures were divided into four subclasses (Figure 3), each with a conserved core packing
arrangement. The other five proteins (1b0pA, 1adn, 1div, 1inp, 1qhkA) had unique core packing
arrangements. Proteins within a cluster conserved at least 50 residues in a compact region that
aligned with RMSD less than 4.0Å. This cutoff produced some false negatives but very few false
positives. Proteins within a cluster that fell below this significance test were still found to
conserve the recurrent core, albeit more distorted.
The clustered alignments may be modeled as HMMs, where each aligned segment is a
state and the variable sequential connections between the segments define the state-state
transitions.
Figure 4 shows diagrammatic HMMs for each of the αβα clusters. In each model,
some topological connections between the sub-structures are observed and others are not,
probably reflecting the physical constraints on secondary structure packing (Honig, 1999). There
are often compact subsets of connections that dominate, consistent with the previous argument
that certain motifs, described as “attractors”, occur as the core of a protein’s structure more
frequently than others (Holm, et al., 1996). An example is the right-handed parallel βαβ motif.
In each cluster, all observed topologies are represented as pathways through the HMM.
Based on these models, certain pathways exist that might represent proteins that have not been
observed in crystal structures. For example, we may predict that the topology shown in
Figure 5
is possible, based on the HMM for subclass-A in Figure 4a. However, this topology has not yet
been observed and would be considered as a novel fold if found.
When we analyzed the sequence information per position from the multiple structure
alignments, we could clearly see a concentration of sequence information in the core positions. If
we define the p-value as the likelihood of obtaining a similar distribution of polar and non-polar
residues by chance given the number of observations (as described in Methods), all the high-
information content positions (low p-value) tend to be deeply buried in the core. One example of
such a result is shown in Figure 6, which shows the sequence information content for CATH
architecture 3.40 subclass A (as in Figure 4a).
Multiple SCALI alignments have been carried out for up-down bundle α proteins,
sandwich β proteins, 3-layer (αβα) and roll proteins in the CATH database. The resulting models
are shown in Figures 7 and 8. A full analysis of these conserved core packing arrangements is
ongoing.
DISCUSSION
SCALI alignments are comparable to CE and DALI methods for comparing proteins that
share the same topology, better if we agree that structure-based alignments should be compact
and that aligned pieces should be locally similar in their backbone angles. In the cases where
proteins share only a core packing arrangement but with different topologies, SCALI is able to
find the proper structural equivalences, while previous methods fail, either because they assume
a sequential ordering (DALI, CE) or because they do not enforce compactness, local equivalence,
and sequence similarity (SARF).
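The difference between sequential and non-sequential alignments can be made concrete with the aligned segments listed in the Figure 1 caption. The sketch below parses the SCALI and CE segment lists for 1fsfA/1ig0A, counts the aligned residues, and counts the places where the order in the second protein runs backwards relative to the first. The function names are illustrative, and reversed segments (marked with * in the SARF list) are not handled.

```python
def parse_span(text):
    """Parse '33-40' or a single residue such as '119' into (first, last)."""
    first, _, last = text.strip().partition("-")
    return int(first), int(last or first)


def parse_segments(text):
    """Parse 'a-b/c-d, ...' into a list of ((a, b), (c, d)) segment pairs."""
    return [tuple(parse_span(part) for part in item.split("/"))
            for item in text.split(",")]


def aligned_residues(segments):
    """Count aligned residues using the spans in the first protein."""
    return sum(last - first + 1 for (first, last), _ in segments)


def order_breaks(segments):
    """Count positions where the second protein's order runs backwards."""
    ordered = sorted(segments)  # sort by position in the first protein
    starts = [second[0] for _, second in ordered]
    return sum(1 for a, b in zip(starts, starts[1:]) if b < a)


# Segment lists copied from the Figure 1 caption (1fsfA/1ig0A).
scali = parse_segments(
    "1-9/114-122, 10-24/125-139, 33-40/182-189, 43-58/197-212, 63-70/213-220, "
    "133-140/38-45, 190-195/53-58, 200-218/64-82, 237-241/93-97, 247-256/101-110"
)
ce = parse_segments(
    "14-28/49-63, 35-43/64-72, 46-64/73-91, 67-74/92-99, 85-86/100-101, "
    "89-102/102-115, 104-110/116-122, 115-116/123-124, 119/125, 120-130/129-139, "
    "131-146/185-200, 150-156/201-207"
)
print(aligned_residues(scali), order_breaks(scali))  # 104 1
print(aligned_residues(ce), order_breaks(ce))        # 111 0
```

The counts agree with the caption: the SCALI alignment has 104 aligned residues and a single break in sequential order, consistent with the one permutation reported, whereas the CE alignment has 111 aligned residues and is strictly sequential.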
Multiple non-sequential alignments from SCALI have been used to construct non-linear
profile HMMs, similar to the way profile hidden Markov models have been constructed to model
protein families and superfamilies (Eddy, 1998; Karplus, et al., 1998; Gough, et al., 2002), and
these may be useful in predicting structures that have a recurrent core packing arrangement.
Cluster analysis of our non-sequential alignments shows that some core packing arrangements
have occurred dozens of times, each with a different topology. It is therefore reasonable to
assume that there are many more permutants among the unsolved proteins.
Classification of cores
There are two widely cited classification schemes for protein domain structures, CATH
(Orengo, 1994;Orengo, et al., 1997;Pearl, et al., 2000;Pearl, et al., 2003) and SCOP (Murzin, et
al., 1995). Both have a top-down hierarchy, starting from classes based on secondary structure
content, then gross arrangement of secondary structure units, and then classes based on the
topological connections between those units. At this level (“topology” in CATH or “fold” in
SCOP) we expect structures to superimpose sequentially. The recurrent packing motifs discussed
above represent a structure classification scheme that is more specific than "architecture" but not
as specific as "topology".
A new, intermediate classification level based on non-sequential multiple alignments may
help us understand the universe of protein folds. We may call these “core” types, and apply
codified names to each. For example, the model described in Figure 4a may be termed
unambiguously as a “3-layer 2α(all down)-5β(all up)-2α(all down)” core. Figure 4b may be
termed unambiguously as a “3-layer 1α(up)-4β(2 up, 2 down)-2α(all down).” A numerical
representation may be substituted for easy searching, such as “a.00/b.11111/a.00” for Figure 4a,
with /’s separating the structural layers and binary digits indicating the number and orientation of
the secondary structures. However, some domains may not lend themselves easily to the
“layered” notation.
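A minimal sketch of how such a numerical code could be written and read back is given below. It assumes the convention implied by the Figure 4a example, namely one field per layer, `a` for a helix layer and `b` for a strand layer, and one binary digit per secondary structure with 1 for "up" and 0 for "down". The function names are illustrative and not part of SCALI.

```python
def encode_core(layers):
    """Build a code such as 'a.00/b.11111/a.00' from (kind, [is_up, ...]) layers."""
    fields = []
    for kind, orientations in layers:
        if kind not in ("a", "b") or not orientations:
            raise ValueError(f"bad layer: {kind!r}, {orientations!r}")
        bits = "".join("1" if is_up else "0" for is_up in orientations)
        fields.append(f"{kind}.{bits}")
    return "/".join(fields)


def decode_core(code):
    """Parse a code back into a list of (kind, [is_up, ...]) layers."""
    layers = []
    for field in code.split("/"):
        kind, _, bits = field.partition(".")
        if kind not in ("a", "b") or not bits or set(bits) - {"0", "1"}:
            raise ValueError(f"bad field: {field!r}")
        layers.append((kind, [bit == "1" for bit in bits]))
    return layers


def summarize_core(code):
    """Return (kind, n_up, n_down) per layer, convenient for searching by composition."""
    return [(kind, sum(ups), len(ups) - sum(ups)) for kind, ups in decode_core(code)]


# The Figure 4a core: 2 helices (all down), 5 strands (all up), 2 helices (all down).
figure_4a = encode_core([("a", [False, False]), ("b", [True] * 5), ("a", [False, False])])
print(figure_4a)                  # a.00/b.11111/a.00
print(summarize_core(figure_4a))  # [('a', 0, 2), ('b', 5, 0), ('a', 0, 2)]
```

The Figure 4b core is not encoded here because the text gives only the number of up and down strands in its β layer, not their order.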
Can we make predictive models from non-sequential alignments?
The highest degree of sequence identity that we found in the SCALI non-sequential
alignments at the architecture level in the CATH database was only 12%. We cannot completely exclude
the possibility of a common ancestor, but it is more likely that these core similarities are the
result of convergent evolution, where energetic stability was the selection pressure. A conserved
packing arrangement of secondary structures should energetically favor some sequence patterns,
and this idea is supported by the results shown in Figure 6. Conserved secondary structure and
3D packing environment do appear to define conserved sequence patterns, at least binary
(polar/non-polar) patterns.
We have observed certain core packing arrangements multiple times with different
topologies. If these cores are recurrent themes in nature, then we might expect to see some future
“new folds” fall into these same classes. That is, “new folds” may be permuted “old folds”. For
example, the hypothetical protein yjiA from E. coli (PDB code 1NIJ), solved in 2002 (Khil et al.),
was found to be a “new fold” according to CASP5 (Moult, et al., 2003). It was a new type of
alpha-beta protein consisting of a single mixed β-sheet with strand order 15234, where strands 3
and 5 are anti-parallel to the others (Aloy, et al., 2003). We have found a cluster of proteins that
have the same core packing arrangement, all solved before 2002, and among the possible
topologies given the HMM (Figure 8a) was the topology of the new protein 1NIJ (Figure 9).
Since core alignments conserve sequence information, and cores are often recurrent,
self-avoiding HMMs based on SCALI alignments have the potential for predicting the core
structure of topologically novel proteins based on the sequence alone.
ACKNOWLEDGEMENTS
This research was supported by NSF grants EIA-0229454 and 0343206.
REFERENCES
Abagyan, R.A. and Maiorov, V.N.(1989)An automatic search for similar spatial arrangements of
alpha-helices and beta-strands in globular proteins,J Biomol Struct Dyn,
6,1045-60.
Alexandrov, N.N.(1996)SARFing the PDB,Protein Eng,
9,727-32.
Alexandrov, N.N. and Fischer, D.(1996)Analysis of topological and nontopological structural
similarities in the PDB: new examples with old structures,Proteins,
25,354-65.
Aloy, P., Stark, A., Hadley, C. and Russell, R.B.(2003)Predictions without templates: new folds,
secondary structure, and contacts in CASP5,Proteins,
53 Suppl 6,436-56.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman,
D.J.(1997)Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs,Nucleic Acids Res,
25,3389-402.
Bennett, M.J., Choe, S. and Eisenberg, D.(1994)Domain swapping: entangling alliances between
proteins,Proc Natl Acad Sci U S A,
91,3127-31.
Bernstein, H.J.(2000)Recent changes to RasMol, recombining the variants,Trends Biochem
Sci,
25,453-5.
Bystroff, C. and Baker, D.(1998)Prediction of local structure in proteins using a library of
sequence-structure motifs,J Mol Biol,
281,565-77.
Bystroff, C., Thorsson, V. and Baker, D.(2000)HMMSTR: a hidden Markov model for local
sequence-structure correlations in proteins,J Mol Biol,
301,173-90.
Eddy, S.R.(1998)Profile hidden Markov models,Bioinformatics,
14,755-63.
Efimov, A.V.(1995)Structural similarity between two-layer alpha/beta and beta-proteins,J Mol
Biol,
245,402-15.
Flores, T.P., Orengo, C.A., Moss, D.S. and Thornton, J.M.(1993)Comparison of conformational
characteristics in structurally similar protein pairs,Protein Sci,
2,1811-26.
Gibrat, J.F., Madej, T. and Bryant, S.H.(1996)Surprising similarities in structure
comparison,Curr Opin Struct Biol,
6,377-85.
Gong, W., O'Gara, M., Blumenthal, R.M. and Cheng, X.(1997)Structure of pvu II DNA-
(cytosine N4) methyltransferase, an example of domain permutation and protein fold
assignment,Nucleic Acids Res,
25,2702-15.
Gough, J. and Chothia, C.(2002)SUPERFAMILY: HMMs representing all proteins of known
structure. SCOP sequence searches, alignments and genome assignments,Nucleic Acids
Res,
30,268-72.
Holm, L. and Sander, C.(1993)Protein structure comparison by alignment of distance matrices,J
Mol Biol,
233,123-38.
Holm, L. and Sander, C.(1996)Mapping the protein universe,Science,
273,595-603.
Honig, B.(1999)Protein folding: from the levinthal paradox to structure prediction,J Mol
Biol,
293,283-93.
Hou, Y., Hsu, W., Lee, M.L. and Bystroff, C.(2003)Efficient remote homology detection using
local structure,Bioinformatics,
19,2294-301.
Iwakura, M., Nakamura, T., Yamane, C. and Maki, K.(2000)Systematic circular permutation of
an entire protein reveals essential folding elements,Nat Struct Biol,
7,580-5.
Janowski, R., Kozak, M., Jankowska, E., Grzonka, Z., Grubb, A., Abrahamson, M. and Jaskolski,
M.(2001)Human cystatin C, an amyloidogenic protein, dimerizes through three-dimensional
domain swapping,Nat Struct Biol,
8,316-20.
Jeltsch, A.(1999)Circular permutations in the molecular evolution of DNA methyltransferases,J
Mol Evol,
49,161-4.
Jung, J. and Lee, B.(2001)Circularly permuted proteins in the protein structure database,Protein
Sci,
10,1881-6.
Karplus, K., Barrett, C. and Hughey, R.(1998)Hidden Markov models for detecting remote
protein homologies,Bioinformatics,
14,846-56.
Khil, P.P., Obmolova, G., Teplyakov, A., Howard, A.J., Gilliland, G.L. and Camerini-Otero,
R.D. Crystal structure of the YjiA protein from E. coli, To be published.
Koehl, P.(2001)Protein structure similarities,Curr Opin Struct Biol,
11,348-53.
Milik, M., Szalma, S. and Olszewski, K.A.(2003)Common Structural Cliques: a tool for protein
structure and function analysis,Protein Eng,
16,543-52.
Moult, J., Fidelis, K., Zemla, A. and Hubbard, T.(2003)Critical assessment of methods of protein
structure prediction (CASP)-round V,Proteins,
53 Suppl 6,334-9.
Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C.(1995)SCOP: a structural classification
of proteins database for the investigation of sequences and structures,J Mol Biol,
247,536-40.
Orengo, C.A.(1994)Classification of protein folds,Curr Opin Struct Biol,
4,429-440.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B. and Thornton,
J.M.(1997)CATH--a hierarchic classification of protein domain structures,Structure,
5,1093-108.
Ortiz, A.R., Strauss, C.E. and Olmea, O.(2002)MAMMOTH (matching molecular models
obtained from theory): an automated method for model comparison,Protein Sci,
11,2606-21.
Pearl, F.M., Lee, D., Bray, J.E., Sillitoe, I., Todd, A.E., Harrison, A.P., Thornton, J.M. and
Orengo, C.A.(2000)Assigning genomic sequences to CATH,Nucleic Acids Res,
28,277-82.
Pearl, F.M., Bennett, C.F., Bray, J.E., Harrison, A.P., Martin, N., Shepherd, A., Sillitoe, I.,
Thornton, J. and Orengo, C.A.(2003)The CATH database: an extended protein family
resource for structural and functional genomics,Nucleic Acids Res,
31,452-5.
Rabiner, L.R.(1989)A tutorial on Hidden Markov Models and selected applications in speech
recognition,Proc. IEEE,
77,257-286.
Rost, B.(1997)Protein structures sustain evolutionary drift,Fold Des,
2,S19-24.
Sayle, R.A. and Milner-White, E.J.(1995)RASMOL: biomolecular graphics for all,Trends
Biochem Sci,
20,374.
Schiering, N., Casale, E., Caccia, P., Giordano, P. and Battistini, C.(2000)Dimer formation
through domain swapping in the crystal structure of the Grb2-SH2-Ac-pYVNV
complex,Biochemistry,
39,13376-82.
Shao, Y. and Bystroff, C.(2003)Predicting interresidue contacts using templates and
pathways,Proteins,
53 Suppl 6,497-502.
Shindyalov, I.N. and Bourne, P.E.(1998)Protein structure alignment by incremental
combinatorial extension (CE) of the optimal path,Protein Eng,
11,739-47.
Smith, V.F. and Matthews, C.R.(2001)Testing the role of chain connectivity on the stability and
structure of dihydrofolate reductase from E. coli: fragment complementation and circular
permutation reveal stable, alternatively folded forms,Protein Sci,
10,116-28.
Sokal, R.R. and Rohlf, F.J.(1973)Introduction to Biostatistics,(eds),W.H Freeman and company,
San Francisco,pp. 220-222.
Szustakowski, J.D. and Weng, Z.(2000)Protein structure alignment using a genetic
algorithm,Proteins,
38,428-40.
Taylor, W.R. and Orengo, C.A.(1989)Protein structure alignment,J Mol Biol,
208,1-22.
Viguera, A.R., Blanco, F.J. and Serrano, L.(1995)The order of secondary structure elements does
not determine the structure of a protein but does affect its folding kinetics,J Mol
Biol,
247,670-81.
Westhead, D.R., Slidel, T.W., Flores, T.P. and Thornton, J.M.(1999)Protein structural topology:
Automated analysis and diagrammatic representation,Protein Sci,
8,897-904.
Yang, A.S. and Honig, B.(1999)Sequence to structure alignment in comparative modeling using
PrISM,Proteins,
Suppl 3,66-72.
Yang, A.S. and Honig, B.(2000)An integrated approach to the analysis and modeling of protein
sequences and structures. I. Protein structural alignment and a quantitative measure for
protein structural distance,J Mol Biol,
301,665-78.
Figure and Table Captions
Figure 1. Structural comparison between Escherichia coli Glucosamine-6-Phosphate deaminase
(PDB code 1fsfA, 266 residues) and yeast thiamin pyrophospho-kinase (PDB code 1ig0A, 319
residues), using four methods. Each figure shows only the aligned residues, red for 1fsfA, yellow
for 1ig0A. Below each figure is a topology (TOPS) cartoon, with strands as triangles, helices as
circles. The diagrams are oriented roughly as the proteins are aligned, with the aligned segments
shaded. For simplicity, only the secondary structure units that are in common between the
two proteins are shown. The alignments may contain additional small aligned fragments that
are not included in the TOPS diagrams.
(a) SCALI alignment, 104 aligned residues, RMSD =
5.4Å, one permutation. Aligned segments (1fsfA/1ig0A): 1-9/114-122, 10-24/125-139, 33-
40/182-189, 43-58/197-212, 63-70/213-220, 133-140/38-45, 190-195/53-58, 200-218/64-82,
237-241/93-97, 247-256/101-110.
(b) CE alignment, 111 aligned residues, RMSD = 5.1Å.
Aligned segments (1fsfA/1ig0A): 14-28/49-63, 35-43/64-72, 46-64/73-91, 67-74/92-99, 85-
86/100-101, 89-102/102-115, 104-110/116-122, 115-116/123-124, 119/125, 120-130/129-139,
131-146/185-200, 150-156/201-207.
(c) DALI alignment, 106 aligned residues, RMSD = 4.9Å.
Aligned segments (1fsfA/1ig0A): 23-27/32-36, 33-42/37-46, 45-53/48-56, 66-72/63-69, 85-
93/71-79, 95-100/85-90, 104-110/92-98, 111-116/119-124, 119-129/128-138, 132-139/186-193,
145-148/194-197, 189-192/202-205, 196-203/217-224, 220-223/225-228, 230-233/303-306,
237-240/308-311.
(d) SARF alignment, 105 aligned residues. RMSD=2.9Å. Aligned segments
(1fsfA/1ig0A): 1-7/114-120, 10/125, 12-23/126-137, 34-42/183-191, 62-74/212-224, 91-98/203-
210, 132-137/38-43, 188-204/54-70, 210-217/104-111, 221-225/83-78*, 235-246/92-103, 257-
263/154-160. A * denotes reversed segments.
Figure 2. Structural comparison between PDB:1cbf and PDB:2ts1 using the SARF method. The
alignment has an RMSD of 2.76 Å with 76 residues aligned in space. The figure shows the C-alpha
trace for the aligned beta strands only, with a thicker line for 1cbf and a thinner line for 2ts1. The
match positions in the alignment are shown in the same color for the two structures. The
alignment contains the “beta-strands pairing errors” as defined in Table 1-(2), which are
illustrated where one strand has two different colors, each of which is aligned to a segment from
another strand. Aligned segments (1cbf/2ts1): 22-23/187-188, 10/125, 24-30/216-222, 37-41/7-
3*, 48-50/63-65, 57-66/50-59, 68-72/116-120, 78/172, 79-90/174-185, 95-97/31-33, 98-102/189-
193, 109-121/198-210, 125-129/18-14*, 208/228. A * denotes reversed segments.
Figure 3. All-against-all structure comparison and clustering for 3-layer (αβα) proteins in
CATH 3.40. A dot indicates that the paired structures have a significant SCALI alignment.
Bordered regions are four subclasses, A, B, C, and D, listed here using the PDB code, chain and
domain identifier (26).
Subclass A: 1cbf01, 1aba00, 1ag8A1, 1ag8A2, 1alkA0, 1ami02, 1aua01,
1bg200, 1c8kA2, 1chmA1, 1dhs00, 1di6A0, 1dioB0, 1ekjA0, 1fuiA2, 1glaG2, 1iso00, 1lam01,
1lba00, 1poiB0, 1ra900, 1svq00, 1tplA2, 1udg00, 1vpt00, 1vsrA0, 2cevA0, 2ctc00, 2minB2,
2ts101, 3pmgA1, 3pmgA3, 1avpA0, 1rhs01, 1hfc00, 1ble00, 1cfr00.
Subclass B: 1ctt02, 1a3aA0,
1cby00, 1cl8A0, 1eq6A0, 1fua00, 1g8tB0, 1uch00, 2bltA0.
Subclass C: 1b94A0, 1b4uB0,
1e8gA3, 1eovA2, 1nox00, 1pvuA0.
Subclass D: 1tdj03, 1br6A1, 1cfe00, 1pinA0.
Figure 4. Diagrammatic hidden Markov models for the four sub-classes of 3-layer (αβα)
proteins, A, B, C and D as defined in Figure 3. In each subclass, the upper panel shows the
topology diagram without connectivities for that core structure. Strands are shown as arrows, and
helices as circles. Shaded helices are pointing down (or into the page). Dotted lines indicate
secondary structures that are sometimes present. The lower panel is the hidden Markov model
drawn for that core. Strands are shown as triangles, and helices are shown as circles. The
connectivities between the sub-structures are shown as arrows. Thicker lines indicate more
frequent connections.
(a) Subclass A: 37 proteins. (b) Subclass B: 9 proteins. (c) Subclass C: 6 proteins.
(d) Subclass D: 4 proteins.
Figure 5. A possible new fold topology. This fold has never been observed (according to the
CATH release of January 2004) and yet is consistent with the model for subclass A of CATH
architecture 3.40 (Figure 4a).
Figure 6. The sequence information per position for subclass-A in CATH3.40. The stereo image
shows the core region of 1cbf (with C-alpha backbone trace only), a representative from
CATH3.40 subclass-A which consists of 34 topologies non-sequentially superimposable by
SCALI (Figure 3a). Colors represent the information content of the combined sequence profiles
at each aligned position, which is calculated as the p-value for obtaining the observed
distribution of polar and non-polar amino acids by chance (as described in Methods). Blue
represents p=0.00; red is p=0.30 and higher. The p-value goes up in the hue scale from blue,
through green, to red. The high-information content positions tend to be deeply buried in the core
of the structure.
Figure 7. Diagrammatic hidden Markov models for the subclasses of 19 representative proteins
in CATH 2.60 (β sandwich) based on SCALI multiple alignments. Drawn as in Figure 4. (a) 12
proteins. (b) 3 proteins.
Figure 8. Diagrammatic hidden Markov models for the sub-classes of 29 αβ proteins in CATH
architecture 3.10 (β roll) based on SCALI alignments. Drawn as in Figure 4. (a) 6 proteins. (b) 5
proteins. (c) 2 proteins.
Figure 9. Topology of 1NIJ, which was a new fold in 2002. (a) Structure of 1NIJ. (b) Topology
of 1NIJ, which belongs to the first subclass of CATH 3.10 (Figure 8a).
Table 1. Systematic comparison of 111 SCALI alignments with CE, DALI and SARF. In this
table, the average alignment length, RMSD, and FOM are derived from all
111 cases for CE, SARF and SCALI. Only 76 alignments are used for the
evaluation of DALI, since 35 of the 111 pairs were not alignable. The three specific types of
errors were defined as follows:
(1) Local non-equivalent error: Non-equivalent secondary structures are aligned in 3D space.
(2) Beta-strands misalignment error: The alignment contains either cross-aligned strands
or unpaired strands. A cross-aligned strand error is where the paired beta-strands are aligned in
the opposite order (e.g. a 4-stranded 1234 beta-sheet is aligned to a 1324 beta-sheet, with
strand 2 aligned to 2 and 3 aligned to 3). An unpaired strand error is where paired strands
(strands that are making hydrogen bonds) are aligned to unpaired strands.
(3) Disjoint alignment error: The alignment contains two or more segments that are spatially
separate.
To evaluate the alignments from different methods, a figure of merit (FOM) was defined. FOM
rewards aligned residues sharing the same secondary structure and having a Cα-Cα distance of less
than ~3.5Å in the final 3D alignment, and penalizes the three types of errors defined
above. A lower FOM is better. Wilcoxon's rank-sum test was performed to evaluate the
statistical significance of the differences between the different alignment methods (see details in
Methods).
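For readers unfamiliar with the rank-sum test, the sketch below shows one common form of it: pooled ranks with ties averaged, and a two-sided p-value from the normal approximation. It is an illustration under stated assumptions rather than the procedure in Methods, which is not reproduced here. No tie correction is applied to the variance, and the FOM values are hypothetical numbers invented for the example, not data from Table 1.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (W, z, two-sided p) for the rank sum of x, using a normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mean = n1 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction in this sketch
    z = (w - mean) / sd
    return w, z, erfc(abs(z) / sqrt(2))


# Hypothetical per-alignment FOM values for two methods (lower is better).
fom_method_a = [-12.0, -11.0, -10.0, -9.0, -8.0]
fom_method_b = [1.0, 2.0, 3.0, 0.0, -1.0]
w, z, p = rank_sum_test(fom_method_a, fom_method_b)
print(w, round(z, 2), p < 0.05)
```

The normal approximation is rough for samples this small; it is better suited to sample sizes like the 111 alignment pairs compared in Table 1.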
Table 1.
Method  Average    RMSD  Local non-  Strand mis-  Disjoint  Not      Error-   Average
        alignment  (Å)   equivalent  alignment    error     aligned  free     FOM
        length           (cases)     error        (cases)   (cases)  (cases)
                                     (cases)
CE      83.3       5.9   111         28           1         0        0        1.6
DALI    81.5       5.7   76          10           18        35       0        0.1
SARF    79.7       2.7   111         22           41        0        0        -6.8
SCALI   64.7       4.3   10          7            7         0        91       -10.5
... In non-sequential alignments, the structural alignment does not obey the sequence constraints and we could observe a fragment on the side of the C-terminal of one protein aligned with a fragment on the side of N-terminal of the other protein. Only few alignment algorithms report non-sequential alignments, e.g., SARF [19], MultiProt [20], and SCALI [21]. ...
... Clustering-based methods [34,19,38,53,21,40] seek to assemble the alignment from smaller compatible (similar) element pairs such that the score of the alignment is as high as possible [54]. Generally speaking most of these methods have two steps: (1) Generating similar small substructures and (2) assembling them. ...
... We used the HMMSTR [56] method to extract the set of alignment seeds following the same approach used in SCALI [21]. This allows us to compare our proposed method for alignment propagation directly with the fragment assembly method proposed in SCALI which uses the same set of HMMSTR fragments. ...
... However, the frequency by which different folds share the same spatial arrangement of SSEs is unclear. It is important to estimate this frequency because it provides both an insight into the completeness of the secondary structure packing pattern and an estimation of the upper limit of the prediction success of rewiring or multiple loop permutation for predicting protein structures with novel folds [16][17][18]. The second question is ''What types of SSE spatial arrangements are frequently observed across different folds?'' ...
... For these folds, the particular connectivity of SSEs, which is closely related to the local interactions along the chain in the loop regions, may be essential for adopting such a particular fold. Conversely, it has been demonstrated that some spatial arrangements of SSEs are observed in many different folds with different SSE connectivities [16,[19][20][21][22]. For these folds, non-local interactions may play a dominant role in maintaining the fold structure. ...
... The third question is ''How diverse are the protein folds that share the same spatial arrangement of SSEs with a given fold?'' It is well known that different folds of the same SCOP class often share the same spatial arrangement of SSEs [16,25]. However, it remains unclear how often protein folds belonging to different SCOP classes share the same spatial arrangement of SSEs. ...
Article
Full-text available
It has been known that topologically different proteins of the same class sometimes share the same spatial arrangement of secondary structure elements (SSEs). However, the frequency by which topologically different structures share the same spatial arrangement of SSEs is unclear. It is important to estimate this frequency because it provides both a deeper understanding of the geometry of protein folds and a valuable suggestion for predicting protein structures with novel folds. Here we clarified the frequency with which protein folds share the same SSE packing arrangement with other folds, the types of spatial arrangement of SSEs that are frequently observed across different folds, and the diversity of protein folds that share the same spatial arrangement of SSEs with a given fold, using a protein structure alignment program MICAN, which we have been developing. By performing comprehensive structural comparison of SCOP fold representatives, we found that approximately 80% of protein folds share the same spatial arrangement of SSEs with other folds. We also observed that many protein pairs that share the same spatial arrangement of SSEs belong to the different classes, often with an opposing N- to C-terminal direction of the polypeptide chain. The most frequently observed spatial arrangement of SSEs was the 2-layer α/β packing arrangement and it was dispersed among as many as 27% of SCOP fold representatives. These results suggest that the same spatial arrangements of SSEs are adopted by a wide variety of different folds and that the spatial arrangement of SSEs is highly robust against the N- to C-terminal direction of the polypeptide chain.
... Representative examples of non-sequential alignment methods in earlier studies include: MASS [5], SCALI [6], SAMO [7], GANGSTA+ [8], SNAP [9] [13]. Many different problems from bioinformatics have been solved using cliques [14,15]. ...
... A parameter that is also taken into account during the study of 3D protein structures is sequentiality, which means subsequent amino acids in one protein must correspond to subsequent amino acids in the partner protein. The majority of methods follow this restriction while the number of methods that are nonsequential is still limited ( [6], [7]). The drawback of sequential approaches is that they can decrease the possibility to discover evolutionary relationships ( [8]) as new protein structures can arise from the combination and permutation of substructures of a protein ( [9]). ...
Conference Paper
Full-text available
Protein structure similarity is one of the most important aims pursued by bioinformatics and structural biology, nowadays. Although quite a few similarity methods have been proposed lately, yet fresh algorithms that fulfill new preconditions are needed to serve this purpose. In this paper, we provide a new similarity measure for 3D protein structures that detects not only similar structures but also similar substructures to a query protein, supporting both multiple and pairwise comparison procedures and combining many comparison characteristics. In order to handle similarity queries we utilize efficient and effective indexing techniques such as M-trees and we provide interesting results using real, previously tested protein data sets. © 2012 IFIP International Federation for Information Processing.
... interfaces can contain amino acids/nucleotides that are sequentially remote (possibly from different chains) but structurally close to each other in the interaction interface. Although sequence orderindependent pairwise protein structure alignment methods have also been developed and successfully applied in protein structure studies (Dundas et al., 2007;Xie and Bourne, 2008;Yuan and Bystroff, 2005), they are designed for optimizing the global alignment, instead of focusing on the interfaces of interests. Therefore, existing pairwise protein structure alignment methods are not suitable for studying structural similarities between interaction interfaces of complex structures. ...
Article
Full-text available
Biological molecules perform their functions through interactions with other molecules. Structure alignment of interaction interfaces between biological complexes is an indispensable step in detecting their structural similarities, which are key S: to understanding their evolutionary histories and functions. Although various structure alignment methods have been developed to successfully access the similarities of protein structures or certain types of interaction interfaces, existing alignment tools cannot directly align arbitrary types of interfaces formed by protein, DNA or RNA molecules. Specifically, they require a ': blackbox preprocessing ': to standardize interface types and chain identifiers. Yet their performance is limited and sometimes unsatisfactory. Here we introduce a novel method, PROSTA-inter, that automatically determines and aligns interaction interfaces between two arbitrary types of complex structures. Our method uses sequentially remote fragments to search for the optimal superimposition. The optimal residue matching problem is then formulated as a maximum weighted bipartite matching problem to detect the optimal sequence order-independent alignment. Benchmark evaluation on all non-redundant protein -: DNA complexes in PDB shows significant performance improvement of our method over TM-align and iAlign (with the ': blackbox preprocessing ': ). Two case studies where our method discovers, for the first time, structural similarities between two pairs of functionally related protein -: DNA complexes are presented. We further demonstrate the power of our method on detecting structural similarities between a protein -: protein complex and a protein -: RNA complex, which is biologically known as a protein -: RNA mimicry case. The PROSTA-inter web-server is publicly available at http://www.cbrc.kaust.edu.sa/prosta/. xin.gao@kaust.edu.sa. © The Author 2015. Published by Oxford University Press.
... A non-sequential alignment refers to one in which the sequential order of residues in a protein is ignored, and only the spatial proximity between two residues is taken into consideration. Many structure alignment tools support both sequential or non-sequential structure alignment [47,137,34,139]. ...
... These methods are also based on geometric hashing [41], or SSE information [15]. Multiprot [58] aims to solve the multiple structural alignment problem with detection of partial solutions; it computes the best scoring structural alignments, which can be either sequential or sequenceorder independent [72], if one seeks geometric patterns which do not follow the sequence order. ...
... Other protein engineering and design techniques include rational single-and multiple-site mutations (30)(31)(32)(33)(34), a combinatorial method (35,36), in vitro evolution (37)(38)(39)(40), and computational optimization of side chain packing (41), but none of these methods has the capacity to permute the protein sequence. We expect our method to be widely applicable, because sequence rearrangements are observed in known protein structures (2,3,5,42). Potentially, rewiring may be used to engineer the stability and folding kinetics of proteins, because topological properties are known to influence the rates of folding (43) and possibly unfolding (44). ...
Thesis
Full-text available
Neste trabalho foi feita uma análise comparativa entre duas metodologias clássicas no estudo de contatos em proteínas: a dependente de um delimitador de distância (CD - Cutoff Dependent) e outra que não é dependente de um delimitador, a decomposição de Delaunay (DT – Delaunay Tessellation). Essas técnicas foram avaliadas usando-se duas formas diferentes de representação de resíduos (centróides): pelo carbono alfa (CA) e pelo centro geométrico da cadeia lateral (GC). Um banco de dados foi montado, compreendendo dois conjuntos chamados ALPHA e BETA contendo cadeias das duas principais classes do sistema de classificação CATH: all-alpha e all beta, respectivamente. Um delimitador em 7.0 Å emergiu como um importante parâmetro de distância na análise dos contatos inter-resíduos em proteínas. Este valor marca o ponto de bifurcação no comportamento das curvas de contatos entre as técnicas CD e DT. Até 7,0 Å, as propriedades CD e DT são unificadas numa mais abrangente: nesta distância, todos os contatos (arestas) são totais e verdadeiro-positivos (completos e não-oclusos). A distância de 7,0 Å é o ponto também em que a primeira camada de vizinhos encontra-se otimamente separada das demais, constituindo-se principalmente de contatos de primeira-ordem. É demonstrado que 7,0 Å é um ponto de transição entre os comportamentos lineares e quadráticos da curva do número total de vizinhos por resíduo. Também é mostrado que a técnica DT tem uma conhecida anomalia em sua contagem de arestas que, em proteínas, pode produzir omissões indesejáveis e sistemáticas afetando principalmente a rede de contatos de proteínas betas com centróides em CA. Uma técnica auxiliar reconhecida por tratar essa anomalia é o quase-Delaunay (AD – Almost Delaunay). É observado que mesmo AD não se mostra uma técnica proveitosa em proteínas. É empiricamente demonstrado que DT+AD convergem para CD, na medida que o parâmetro de perturbação em AD cresce. 
Isto alerta que DT e técnicas correlatas devem ser usadas com precaução em proteínas. Como conseqüência, no estrito intervalo de 0,0 Å a 7,0 Å, CD revela-se uma metodologia mais simples, completa e confiável. Por fim, é evidenciado também que a redução na representação dos resíduos aos centróides CA e GC pode introduzir tendências estatísticas na análise de vizinhos em delimitadores até 6,8 Å, com CA em favor ALPHA e GC em favor de BETA. Para valores acima de 6,8 Å, este viés parece ser eliminado. Isto provê um argumento a mais em benefício do limite em 7,0 Å, como um parâmetro de referência, robusto e de carácter geral, a ser usado de forma segura como um confiável delimitador de distância nos estudos em massa de contatos de proteínas.
Article
We have developed a novel, fully automatic method for aligning the three-dimensional structures of two proteins. The basic approach is to first align the proteins' secondary structure elements and then extend the alignment to include any equivalent residues found in loops or turns. The initial secondary structure element alignment is determined by a genetic algorithm. After refinement of the secondary structure element alignment, the protein backbones are superposed and a search is performed to identify any additional equivalent residues in a convergent process. Alignments are evaluated using intramolecular distance matrices. Alignments can be performed with or without sequential connectivity constraints. We have applied the method to proteins from several well-studied families: globins, immunoglobulins, serine proteases, dihydrofolate reductases, and DNA methyltransferases. Agreement with manually curated alignments is excellent. A web-based server and additional supporting information are available at http://engpub1.bu.edu/∼josephs. Proteins 2000;38:428–440. © 2000 Wiley-Liss, Inc.
Chapter
The considerable amount of structural data on 3D protein structures has established structure comparison as an essential technique for understanding protein sequence, structure, function, and evolution. The goal is to predict 3D protein structures from amino acid sequence information alone. A major step toward this goal is to determine a method for discovering common protein structures in databases such as the Protein Data Bank, so that a better understanding of protein structure and function can be pieced together. Structure comparison algorithms are used to identify a set of residue equivalencies between two proteins based on their 3D coordinates. This set of equivalencies is called a structure alignment, and it allows the superposition of one protein structure onto the other after rigid rotation and/or translation. Structure alignments can indicate if two proteins share the same fold, or structural unit. Structure alignment is also used as the gold standard for evaluating protein structure prediction methods. This chapter focuses on the application of evolutionary computation to protein structure similarity problems and provides an example of a hybridization of evolutionary algorithms and other optimization techniques. The combination of these approaches offers a new and exciting method for protein structure comparison with increased specificity and sensitivity compared with previous methods.
Article
We have determined the structure of PvuII methyltransferase (M.PvuII) complexed with S-adenosyl-L-methionine (AdoMet) by multiwavelength anomalous diffraction, using a crystal of the selenomethionine-substituted protein. M.PvuII catalyzes transfer of the methyl group from AdoMet to the exocyclic amino (N4) nitrogen of the central cytosine in its recognition sequence 5′-CAGCTG-3′. The protein is dominated by an open α/β-sheet structure with a prominent V-shaped cleft: AdoMet and catalytic amino acids are located at the bottom of this cleft. The size and the basic nature of the cleft are consistent with duplex DNA binding. The target (methylatable) cytosine, if flipped out of the double helical DNA as seen for DNA methyltransferases that generate 5-methylcytosine, would fit into the concave active site next to the AdoMet. This M.PvuII α/β-sheet structure is very similar to those of M.HhaI (a cytosine C5 methyltransferase) and M.TaqI (an adenine N6 methyltransferase), consistent with a model predicting that DNA methyltransferases share a common structural fold while having the major functional regions permuted into three distinct linear orders. The main feature of the common fold is a seven-stranded β-sheet (6↓ 7↑ 5↓ 4↓ 1↓ 2↓ 3↓) formed by five parallel β-strands and an antiparallel β-hairpin. The β-sheet is flanked by six parallel α-helices, three on each side. The AdoMet binding site is located at the C-terminal ends of strands β1 and β2 and the active site is at the C-terminal ends of strands β4 and β5 and the N-terminal end of strand β7. The AdoMet-protein interactions are almost identical among M.PvuII, M.HhaI and M.TaqI, as well as in an RNA methyltransferase and at least one small molecule methyltransferase. 
The structural similarity among the active sites of M.PvuII, M.TaqI and M.HhaI reveals that catalytic amino acids essential for cytosine N4 and adenine N6 methylation coincide spatially with those for cytosine C5 methylation, suggesting a mechanism for amino methylation.
Article
Recent developments in automatic structure comparison have yielded several fast and flexible methods that allow extensive explorations of the structure databank. As a result, proteins have been clustered into a few hundred structural families. Many interesting and unexpected structural similarities have been revealed, and some folds have been shown to support diverse sequences and functions.
Article
A fast search algorithm to reveal similar polypeptide backbone structural motifs in proteins is proposed. It is based on the vector representation of a polypeptide chain fold in which the elements of regular secondary structures are approximated by linear segments (Abagyan and Maiorov, J. Biomol. Struct. Dyn. 5, 1267–1279 (1988)). The algorithm permits insertions and deletions in the polypeptide chain fragments to be compared. The fast search algorithm, implemented in the FASEAR program, is used for collecting βαβ supersecondary structure units in a number of α/β proteins in the Brookhaven Data Bank. Variation of the geometrical parameters specifying the backbone chain fold is estimated. It appears that the conformation of the majority of the fragments, although almost all of them are right-handed, is quite different from that of standard βαβ units. Apart from searching for a specific type of secondary structure motif, the algorithm allows new recurrent folding patterns in proteins to be identified automatically. It may be of particular interest for the development of the tertiary template approach to the prediction of protein three-dimensional structure, as well as for constructing artificial polypeptides with goal-oriented conformation.
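The idea of approximating secondary structure elements by linear segments can be sketched in a few lines. The construction below (joining the centroid of the first few CA atoms to the centroid of the last few) is one simple, assumed way to obtain such a segment. It is not necessarily the procedure of the cited vector representation or of FASEAR, and the names `sse_segment` and `angle_between` and the `window` parameter are hypothetical.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def sse_segment(ca_coords, window=3):
    """Approximate a helix or strand by a straight segment (start, end).

    The end points are centroids of the first and last `window` CA atoms,
    so the element needs more than `window` residues for a non-zero length.
    """
    w = min(window, len(ca_coords))
    return centroid(ca_coords[:w]), centroid(ca_coords[-w:])


def angle_between(seg_a, seg_b):
    """Angle in degrees between the direction vectors of two segments."""
    u = [e - s for s, e in zip(*seg_a)]
    v = [e - s for s, e in zip(*seg_b)]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


# Idealized toy elements (not real coordinates): two strands running in
# opposite directions along x, and a third element running along z.
strand_1 = [(3.3 * k, 0.0, 0.0) for k in range(6)]
strand_2 = [(16.5 - 3.3 * k, 4.8, 0.0) for k in range(6)]
element_3 = [(8.0, 2.4, 1.5 * k) for k in range(8)]

s1, s2, s3 = (sse_segment(e) for e in (strand_1, strand_2, element_3))
print(angle_between(s1, s2))  # antiparallel pair
print(angle_between(s1, s3))  # perpendicular pair
```

Once each element is reduced to a segment, a motif such as a βαβ unit can be described by a handful of inter-segment angles and distances, which is what makes a fast search over many proteins feasible.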
Article
Src homology 2 (SH2) domains are key modules in intracellular signal transduction. They link activated cell surface receptors to downstream targets by binding to phosphotyrosine-containing sequence motifs. The crystal structure of a Grb2-SH2 domain-phosphopeptide complex was determined at 2.4 Å resolution. The asymmetric unit contains four polypeptide chains. There is an unexpected domain swap so that individual chains do not adopt a closed SH2 fold. Instead, reorganization of the EF loop leads to an open, nonglobular fold, which associates with an equivalent partner to generate an intertwined dimer. As in previously reported crystal structures of canonical Grb2-SH2 domain-peptide complexes, each of the four hybrid SH2 domains in the two domain-swapped dimers binds the phosphopeptide in a type 1 β-turn conformation. This report is the first to describe domain swapping for an SH2 domain. While in vivo evidence of dimerization of Grb2 exists, our SH2 dimer is metastable and a physiological role of this new form of dimer formation remains to be demonstrated.
Article
Some proteins are homologous to others after their sequence is circularly permuted. A few such proteins have been recognized, mainly by sequence comparison, but also by comparing their three-dimensional structures. Here we report the result of a systematic search for all protein pairs in the SCOP 90% id domain database that become structurally superimposable when the sequence of one member of the pair is circularly permuted. Using a reasonable set of criteria, we find that 47% of all protein domains are superimposable on at least one other protein domain in the database after their sequence is circularly permuted. Many of these are symmetric proteins, which superimpose on another protein both with and without a circular permutation of the sequence. However, 412 of the total 3035 domains are nonsymmetric, and these become structurally superimposable on another protein only after a circular permutation of the sequence. These include most known and many previously undetected circularly permuted proteins with remote homology.
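Whether a set of residue equivalences is sequential, circularly permuted, or more generally non-sequential can be checked from the index pairs alone. The sketch below is a minimal illustration under the assumption that an alignment is a list of (i, j) pairs with each residue used at most once. The function name `classify_alignment` is hypothetical, and the test is not the set of criteria used in the study summarized above.

```python
def classify_alignment(pairs):
    """Classify residue equivalences by the order of the matched indices.

    Returns (label, cut), where label is 'sequential', 'circular' or
    'non-sequential'. For 'circular', `cut` is the position in the list
    sorted by i at which the j indices wrap around; otherwise it is None.
    """
    js = [j for _, j in sorted(pairs)]
    # Positions where the second protein's index decreases.
    descents = [k for k in range(1, len(js)) if js[k] < js[k - 1]]
    if not descents:
        return "sequential", None
    # One wrap-around, with the tail ending before the head begins,
    # is the signature of a circular permutation.
    if len(descents) == 1 and js[-1] < js[0]:
        return "circular", descents[0]
    return "non-sequential", None


same_order = [(0, 0), (1, 1), (2, 2), (3, 3)]
permuted = [(0, 3), (1, 4), (2, 5), (3, 0), (4, 1), (5, 2)]
shuffled = [(0, 1), (1, 0), (2, 2)]

print(classify_alignment(same_order))
print(classify_alignment(permuted))
print(classify_alignment(shuffled))
```

A circular permutation is thus the simplest departure from sequential order: a single cut point. Alignments with more than one order reversal fall into the general non-sequential case.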