EMPSC: A New Method Based on Ellipsoidal Model for Protein Structure Comparison

Yhi Shiau1,2, Jia-Nan Wang1, Yu-Feng Huang1, Chien-Kang Huang3
yshiau@cht.com.tw, jnwang@mars.csie.ntu.edu.tw, yfhuang@csie.ntu.edu.tw, ckhuang@ntu.edu.tw
1Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
2Chunghwa Telecom Laboratories, Tauyuan, Taiwan
3Department of Engineering Science and Ocean Engineering, National Taiwan University, Taipei, Taiwan
Abstract
This paper proposes EMPSC, a new method for the well-known Protein Structure Comparison (PSC)
problem. EMPSC is a protein structural alignment algorithm based on an ellipsoidal model
abstraction. We segment the protein 3D structure into two kinds of structures: Secondary
Structure Elements (SSEs) recognized by DSSP [1], and the remaining coil/loop structures. The SSEs
serve as the initial alignment centers from which the transformation coordinate systems are obtained.
Several heuristic filters and a geometric-hashing-based global alignment estimation are used to find
good initial alignments quickly. In the refinement stage, a standard refinement algorithm is invoked
to fine-tune the alignments output by the first stage. Our experimental results show that EMPSC
generally achieves comparable accuracy and better performance than existing PSC algorithms. We also
analyze the factors that affect the performance of EMPSC and of SSE-based PSC algorithms. Multiple
protein structure comparison and local structure comparison are left for further investigation.
Keywords: Eigenvector, Dynamic Programming, Geometric Hashing, Secondary Structure Elements.
Availability: The EMPSC tool is accessible at our lab website http://ballerina.csie.ntu.edu.tw
Introduction
Ever since Beccari discovered proteins in 1747, the important role that proteins play in biochemical
reactions has drawn biochemical researchers to the study of protein function. Among these research
topics, Protein Structure Comparison (PSC) is one of the most basic and important, because it detects
evolutionary and functional relationships between proteins. The function of a protein is related to
its 3D structure [2]; that is, proteins with similar substructures may have similar functions.
Improving the methodology and tools of PSC has therefore been an important issue in molecular
biology and bioinformatics for many years [3-7].
Today biochemists need faster and more accurate PSC tools, because protein databases are growing
quickly with the help of the computing power available to recent biochemistry research. For example,
the Protein Data Bank [8] had grown to 35,813 proteins by 28-Mar-2006, Swiss-Prot to 163,000 and
TrEMBL to 1,450,000. Continuing to improve PSC tools so that they can handle this fast-growing mass
of protein data is clearly a great challenge.
To detect functional or evolutionary relationships between proteins, PSC algorithms compare the 1D
sequence information or the 3D structure information of the amino acid chains. The purpose of PSC is
to identify the maximal set of equivalent Cα atoms upon which the 3D structures of the compared
proteins can be optimally aligned. Previously proposed PSC algorithms exploit many different
computing approaches, including Monte Carlo (Dali [9]), dynamic programming [10-12] (VAST [13]), 3D
clustering [14], graph theory [15], spline approximation [16] and geometric hashing [17]. To speed up
PSC further, quick-n-dirty approaches are applied in today's PSC tools, such as CE [18], SAP [19],
ProSup [20,21], and FLASH [22]. Because PSC is an NP-hard problem, most approaches propose
different heuristics to approximate the optimal solution, and quick-n-dirty approaches have become
the mainstream of today's PSC algorithms. The following paragraphs review these quick-n-dirty
approaches.
The CE method starts from the concept of aligned fragment pairs (AFPs); that is, the initial
alignment is grown from AFPs. CE uses a combinatorial extension of an alignment path defined by AFPs
to obtain the extension paths that pass the similarity threshold defined by the inter-residue distance
constraint. The inter-residue distance constraint is an approximation for superimposing the two
proteins. In refinement, the top-20 alignment paths are evaluated based on r.m.s.d. and the best one
is selected. The best alignment path is further refined with dynamic programming to obtain the final
global alignment solution. The drawback of CE is that calculating the inter-residue distance
constraint is time-consuming. In addition, in our experiments we observed that CE tends to maximize
the number of corresponding residues rather than minimize the r.m.s.d.
The SAP method uses iterated double dynamic programming and applies it in both the initial alignment
finding and the refinement processes. The initial alignment finding begins from every residue pair,
one residue from each of the compared proteins. SAP finds potentially equivalent residue pairs based
on local structure and environment. Then, for each potentially equivalent residue pair, SAP calculates
the alignment score with dynamic programming. The scoring function includes a direction component,
an orientation component, a sequence term, and a spatial term. The initial alignment finding is done
by sorting all the elements of the bias matrix and taking a number (top-20) of the highest-scoring
elements. These top-20 results are then further refined with dynamic programming. To avoid an
exhaustive search of all possible residue pairs, SAP offers the option of taking a randomized
selection. The main issue of SAP is its high time complexity, because double dynamic programming is
very time-consuming for large matrices.
The ProSup method approximates the solution from seed pairs. The seed fragments are similar
fragments of the compared proteins. ProSup superimposes all possible seeds and evaluates the whole
protein by superimposing equivalences instead of using dynamic programming. In refinement, a
standard procedure that combines dynamic programming and a least-squares constraint [23,24]
further refines the initial alignments. ProSup can output multiple solutions. Its accuracy and
efficiency are affected by the length of the seed fragments.
The FLASH method greatly reduces the running time through an aggressive abstraction of protein
structures. FLASH identifies the Secondary Structure Elements (SSEs) of the compared proteins with
the DSSP tool and works on vector-represented SSEs as a data reduction of the protein's 3D structure.
Its experiments revealed that the SSEs in a protein provide better initial alignment performance.
FLASH considers only the information from α-helices and β-sheets and neglects the coil information.
It establishes the angle-distance map for all SSE pairs. After calculating the SSE matching
probabilities, FLASH uses a greedy procedure to select viable alignment solutions (at least 3).
Statistical significance is applied to filter out inappropriate initial alignments. In refinement,
FLASH applies the same standard refinement algorithm as ProSup. With SSE-based data reduction, FLASH
greatly speeds up initial alignment finding, and it can output multiple solutions. However, if two
proteins have similar local structure but their global similarity does not meet the criterion of
statistical significance, FLASH stops further comparison and gives no result. In addition, FLASH
does not apply to a protein that has no SSEs. The single-vector representation also raises problems.
Representing an α-helix by a single vector is appropriate, because an α-helix is hard to bend. A
single vector representing a β-sheet, however, may lose structural information: because β-sheets are
usually bent or curved, the length of the identified β-sheet seriously affects how well the derived
single vector represents it.
Based on these previous studies, we aim to develop a new method that is more efficient than
conventional PSC approaches without the limitations of FLASH. In this paper, we propose a new protein
structural alignment method based on an ellipsoidal model, named EMPSC, which applies a generic
heuristic mapping function in the initial alignment stage. EMPSC has four steps: preprocessing,
initial alignment, refinement and final evaluation. EMPSC identifies the SSEs of the target protein
with the DSSP tool. The remaining parts of the protein, mostly coil/loop structure, are also
considered. Rather than a single-vector representation as in FLASH, we use an ellipsoid model to
represent each sequence segment. The algorithm is described in detail in the Proposed Method section.
EMPSC shares many of the better characteristics of previous approaches. The initial alignment is
selected from segment pairs (mainly SSEs) instead of residue pairs. EMPSC can also output multiple
solutions. Like FLASH, we believe that SSEs are clearly the more important elements of a protein's
structural conformation, and that abstracting the structure information with segments (α-helix,
β-sheet and coil) could be better than CE's AFPs and ProSup's seeds. In addition, as in CE and
ProSup, the initial alignment finding in EMPSC starts from the viewpoint of local alignment.
Proposed Method
The workflow of EMPSC is depicted in Fig 1, and the algorithm is described in detail below.
The algorithm has four basic steps.
(1) Preprocessing: Segment the proteins into SSEs with the DSSP tool, and generate the
ellipsoidal representation for each segment (mainly α-helix and β-sheet).
(2) Initial alignment: Generate the potential aligned segment pairs and look for good initial
alignments via heuristic filtering.
(3) Refinement: Iteratively apply a dynamic programming algorithm to refine the initial
alignment, a well-known procedure used by most PSC algorithms.
(4) Final evaluation: Evaluate the refined alignments and report the number of
corresponding residues and the r.m.s.d. of the alignment solutions.
For each protein:
(1a) Identify the SSEs from the protein
(1b) Cluster the remaining residues of the protein
(1c) Generate the ellipsoidal representation for each segment
For each pair of compared proteins:
(2a) Generate the pairs of candidate aligned SSE segments from the compared proteins
(2b) Filter the pairs with the heuristic filtering function
(2c) Superimpose candidate pairs by center-eigenvector transformation
(2d) Rank the candidate pairs with a fast global alignment approach, and select the Top-N best pairs
(3) Iteratively tune the global alignment center by superimposing the compared proteins (dynamic programming approach)
(4) Final evaluation
Fig 1. The workflow diagram of the EMPSC algorithm.
Preprocessing: Generate Ellipsoidal Representation
Like most modern PSC algorithms, in the initial alignment step we do not estimate the alignment
quality by comparing the structures of two proteins atom by atom, but by comparing their abstract
models. In EMPSC, ellipsoid models, rather than all residues of the SSEs, represent the proteins'
overall 3D structures. Step (1) of the EMPSC algorithm generates the ellipsoidal representation for
each segment of the target protein. First, in step (1a), we use the DSSP tool to identify the SSEs
(α-helix, β-sheet) of the protein. In step (1b), the remaining residues of the protein (that is, the
coil/loop information) are clustered into a set of new segments (coil sub-segments) according to the
adjacency of residues: contiguous remaining residues form one segment. After that, in step (1c),
each segment is represented by a 3D ellipsoidal model. Fig 2 depicts the relationship between a
residue chain and its 3D ellipsoidal model. PCA (Principal Component Analysis) [23] is applied to
find the 3 orthogonal eigenvectors and the 3 respective eigenvalues of each segment. After step (1c),
the protein is decomposed into a set of residue segments and their ellipsoidal representations, as
shown in Fig 3.
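Steps (1a) and (1b) can be sketched as follows. This is a minimal illustration, not the EMPSC implementation: it assumes the DSSP assignment is already available as a per-residue string, it treats only the DSSP codes H (α-helix) and E (β-strand) as SSEs, and it lumps every other code into coil. The function name `segment_by_dssp` and the tuple layout are hypothetical.

```python
from itertools import groupby

# Assumption: only DSSP codes 'H' (alpha-helix) and 'E' (beta-strand) count as
# SSEs; every other code is treated as coil/loop.
SSE_CODES = {"H": "helix", "E": "strand"}


def segment_by_dssp(dssp):
    """Split a per-residue DSSP string into (kind, start, end) segments.

    start is inclusive and end is exclusive (0-based residue indices).
    Contiguous residues of the same kind form one segment, so the residues
    left over between SSEs become coil sub-segments.
    """
    segments = []
    position = 0
    for kind, run in groupby(dssp, key=lambda code: SSE_CODES.get(code, "coil")):
        length = sum(1 for _ in run)
        segments.append((kind, position, position + length))
        position += length
    return segments


if __name__ == "__main__":
    for segment in segment_by_dssp("--HHHHHH-TT-EEEE--"):
        print(segment)
```

Each returned segment can then be handed to the ellipsoid fitting of step (1c).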
Fig 2. A 3D ellipsoidal ball fitted to a residue chain, and its 3 eigenvectors.
Fig 3. The generating process of the ellipsoidal representation. Panel (a) displays the example
protein. Panel (b) shows the protein after processing by DSSP, with the SSEs identified. Panel (c)
shows the further processing of the remaining segments.
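The ellipsoid fitting of step (1c) can be sketched as a PCA of the Cα coordinates of one segment. The sketch below is an illustration under assumptions, not the authors' code: it uses the population covariance, and it finds the eigenvectors with a cyclic Jacobi iteration only so that the example needs nothing beyond the Python standard library. All function names are hypothetical.

```python
import math


def centroid(points):
    """Geometric center of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def covariance(points, center):
    """3x3 population covariance matrix of the points around center."""
    n = len(points)
    cov = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[k] - center[k] for k in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / n
    return cov


def symmetric_eigen(matrix, sweeps=50, tol=1e-18):
    """Cyclic Jacobi iteration for a symmetric 3x3 matrix.

    Returns (eigenvalues, eigenvectors) sorted by decreasing eigenvalue;
    each eigenvector is a unit-length 3-tuple.
    """
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(3)] for i in range(3)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(3) for j in range(3) if i != j)
        if off < tol:
            break
        for p in range(2):
            for q in range(p + 1, 3):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry.
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(3):  # columns: A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(3):  # rows: A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(3):  # accumulate eigenvectors: V <- V J
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    pairs = [(a[i][i], tuple(v[k][i] for k in range(3))) for i in range(3)]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return [value for value, _ in pairs], [vector for _, vector in pairs]


def fit_ellipsoid(points):
    """Return (center, eigenvalues, axes) describing one segment."""
    center = centroid(points)
    eigenvalues, axes = symmetric_eigen(covariance(points, center))
    return center, eigenvalues, axes
```

For a nearly straight run of residues, the first axis points along the chain and the first eigenvalue dominates; a bent or curved segment spreads its variance over the second and third axes, which is the extra information a single vector cannot carry.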
Initial Alignment: Find a Good Superimposed Transformation for the Initial Alignments
Finding good superimposing transformations is the main task in obtaining the initial alignments of
EMPSC. The whole process is described in step (2) of the EMPSC algorithm. In general, the initial
alignment search is a filtering process that eliminates impossible or dissimilar segment mappings
and finds the top-N best segment pairs, that is, the top-N superimposing transformations between
the two compared proteins.
First, in step (2a), we generate all possible initial alignments from the compared proteins. In this
step, every pair of SSE segments (α-helix and β-sheet only), one from each of the compared proteins,
can be the center of the new coordinates; the remaining coil sub-segments are used in the
biochemical filtering of step (2b).
Before estimating the quality of every initial alignment in step (2d), EMPSC first checks the local
structure of each initial alignment; that is, EMPSC calculates the similarity between the two
mapped SSEs (α-helix or β-sheet) and their at most four surrounding coil sub-segments. In step
(2b), EMPSC filters the initial alignments with a heuristic filtering function. Conceptually,
we could define one integrated filtering function that combines all factors that are effective for
judging a good initial alignment. Instead, we currently implement several successive filters that
remove unmatched or dissimilar pairs: a type filter, a mass filter, and a biochemical filter. The
type filter makes sure that the secondary structure types of the mapped segments are the same
(such as α-α or β-β). The mass filter makes sure that the numbers of residues in the two mapped
segments differ by less than four. The biochemical filter checks the similarity of biochemical
properties between two segments. We found that the biochemical properties of the coil segments
before or after the SSEs (α-helix and β-sheet) are more beneficial for finding initial alignments
than the biochemical properties of the SSEs themselves. Therefore, this filter makes sure that the
biochemical features of the surrounding coil sub-segments are similar. The biochemical feature is
currently defined as the ratios of hydrophobic residues, polar uncharged residues and polar
charged residues in the compared segments. The biochemical filter is defined as follows.
The biochemical filter:
Given two compared residue sequences A and B, the biochemical_diff score is defined as follows:
\[ \mathrm{biochemical\_diff}(A, B) = \sum_{i=1}^{3} \left| a_i - b_i \right| \]
where a_1, b_1 are the ratios of hydrophobic residues,
a_2, b_2 are the ratios of polar uncharged residues,
a_3, b_3 are the ratios of polar charged residues,
in sequences A and B, respectively.
If biochemical_diff is larger than the threshold (empirical value 0.7), the candidate pair is filtered out.
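The three filters of step (2b) can be sketched as below. Several points are assumptions rather than details from the paper: the grouping of amino acids into the three classes follows one common textbook convention, the biochemical_diff score is taken to be the sum of absolute differences of the three class ratios, and the surrounding coil sub-segments of each SSE are simplified to a single concatenated residue string. The names `class_ratios`, `biochemical_diff` and `passes_filters` are hypothetical.

```python
# Assumption: this three-way grouping of the 20 standard amino acids (one-letter
# codes) is one common convention; the paper does not list its own grouping.
HYDROPHOBIC = set("AVLIMFWPG")
POLAR_UNCHARGED = set("STCYNQ")
POLAR_CHARGED = set("DEKRH")
CLASSES = (HYDROPHOBIC, POLAR_UNCHARGED, POLAR_CHARGED)


def class_ratios(sequence):
    """Fractions of hydrophobic, polar uncharged and polar charged residues."""
    n = len(sequence)
    if n == 0:
        return (0.0, 0.0, 0.0)
    return tuple(sum(1 for r in sequence if r in group) / n for group in CLASSES)


def biochemical_diff(seq_a, seq_b):
    """Sum of absolute differences between the three class ratios (assumed form)."""
    return sum(abs(a - b) for a, b in zip(class_ratios(seq_a), class_ratios(seq_b)))


def passes_filters(seg_a, seg_b, threshold=0.7, max_length_gap=4):
    """Apply the type, mass and biochemical filters to one candidate pair.

    Each segment is a hypothetical (sse_type, residue_count, flanking_coil)
    tuple, where flanking_coil holds the residues of the surrounding coil
    sub-segments as one string.
    """
    type_a, count_a, coil_a = seg_a
    type_b, count_b, coil_b = seg_b
    if type_a != type_b:                          # type filter
        return False
    if abs(count_a - count_b) >= max_length_gap:  # mass filter
        return False
    return biochemical_diff(coil_a, coil_b) <= threshold  # biochemical filter
```

Running the cheap type and mass checks first means the residue counting of the biochemical filter is only done for pairs that survive them.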
The remaining candidate SSE segment pairs are those that pass all filtering criteria. In step (2c),
EMPSC aligns the geometric centers and the 3 principal eigenvectors of the candidate mapped
segments, which generates new coordinates for the two compared proteins, as shown in Fig 4.
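One way to read step (2c) is that each protein is re-expressed in the local frame of its own candidate segment, so that the two ellipsoids share a center and axis directions. The helper below is a sketch under that assumption; `to_local_frame` is a hypothetical name, and `axes` is assumed to hold three orthonormal eigenvectors ordered by decreasing eigenvalue.

```python
def to_local_frame(points, center, axes):
    """Express points in the frame defined by center and three orthonormal axes.

    The origin moves to center and the new x, y, z coordinates are the
    projections onto axes[0], axes[1] and axes[2].
    """
    local = []
    for p in points:
        d = [p[k] - center[k] for k in range(3)]
        local.append(tuple(sum(d[k] * axis[k] for k in range(3)) for axis in axes))
    return local


if __name__ == "__main__":
    identity = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    print(to_local_frame([(1, 2, 3)], (1, 1, 1), identity))
```

An eigenvector is only defined up to its sign, so a full implementation would also have to decide which of the possible axis flips to use; the paper does not say how EMPSC resolves this, and the sketch leaves it out.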
In step (2d), a fast global alignment estimation based on geometric hashing is used to estimate
the quality of the initial alignments. Geometric hashing is a fast way to compare 3D structures.
Taking the geometric center of the mapped SSEs as the origin, the position of each Cα atom of one
protein is transformed to the polar coordinate system. In the preprocessing of geometric hashing,
every Cα atom is put into the hash table according to the distance between the Cα atom and the
origin. We also transform the other protein into the new coordinates and calculate the estimated
alignment score. The scoring function of the global alignment estimation is described in Fig 5.
Finally, only the top-N candidates are selected for further refinement. That is, we keep the top-N
superimposing transformations as the good initial alignments.
Fig 4. Align the 3 eigenvectors of two ellipsoids according to their magnitude.
Global Alignment Estimation based on Geometric Hashing
Assume O_A and O_B are the origins of the new coordinates for proteins A and B, respectively.
R_A and R_B are two residues in proteins A and B, respectively.
The hashing function is
\[ \mathrm{hash}(R_i) = \left\lfloor \frac{\mathrm{dist}(O_i, R_i)}{1\,\text{Å}} \right\rfloor \bmod \mathrm{bin\_size} = \mathrm{bin\_index\_of\_}R_i, \]
where i can be A or B.
To avoid collisions, the bin size of the hash table is larger than the diameter of most proteins.
The scoring function of the global alignment estimation is defined as
\[ \mathrm{Score}(P_A, P_B) = \sum_{R_A \in P_A} \max_{R_B \in P_B} \left\{ d_c^2 - \mathrm{dist}(R_A, R_B)^2 \;\middle|\; \mathrm{hash}(R_A) = \mathrm{hash}(R_B),\ \mathrm{dist}(R_A, R_B) < d_c \right\}, \]
where d_c is the distance cutoff for alignment construction.
A higher score implies a lower r.m.s.d. and a higher number of corresponding residues.
Fig 5. The global alignment estimation based on geometric hashing
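The estimation in Fig 5 can be sketched as follows. This is one possible reading, not the authors' code: it assumes both coordinate sets are already expressed relative to their aligned origins, that bins are 1 Å wide shells of distance to the origin, and that the score sums, over the residues of protein A, the best value of d_c² minus the squared distance among residues of protein B in the same bin and closer than d_c. The names `bin_index` and `estimate_score` and the default of 512 bins are hypothetical.

```python
import math
from collections import defaultdict


def bin_index(point, bin_count, bin_width=1.0):
    """Hash a point by its distance to the origin (1 Å wide shells assumed)."""
    radius = math.sqrt(sum(c * c for c in point))
    return int(radius // bin_width) % bin_count


def estimate_score(coords_a, coords_b, dc=6.0, bin_count=512):
    """Estimate the global alignment quality of two superimposed C-alpha sets.

    For every residue of A, only residues of B in the same distance bin are
    inspected, and the best dc^2 - dist^2 among those closer than dc is added.
    """
    table = defaultdict(list)
    for rb in coords_b:
        table[bin_index(rb, bin_count)].append(rb)

    score = 0.0
    for ra in coords_a:
        best = 0.0
        for rb in table.get(bin_index(ra, bin_count), ()):
            d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
            if d2 < dc * dc:
                best = max(best, dc * dc - d2)
        score += best
    return score


if __name__ == "__main__":
    a = [(1.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
    print(estimate_score(a, a))
```

The candidate transformations from step (2c) would each be scored this way, and the N highest scores kept for refinement.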
Refinement and Final Evaluation
Step (3) applies the same refinement method as most quick-n-dirty PSC algorithms. The least-squares
method is applied in the refinement process, and step (3) repeatedly refines the initial alignment
solutions until the number of corresponding residues converges. In this step, the algorithm
iteratively tunes the global alignment center by superimposing the two proteins using the dynamic
programming approach. Finally, in step (4), EMPSC outputs the tuned alignments of the top-N
candidates from step (2) as N alternative solutions.
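One pass of the dynamic programming step, followed by the final evaluation, can be sketched as below. The paper does not give the scoring scheme, so the sketch assumes a gain of d_c² minus the squared distance for residue pairs closer than d_c and no gap penalty; the least-squares superposition that would be recomputed between passes is omitted. The names `align_by_dp` and `rmsd` are hypothetical.

```python
import math


def align_by_dp(coords_a, coords_b, dc=6.0):
    """Order-preserving residue pairs (i, j) that maximise the summed gain.

    The gain of a pair is dc^2 - dist^2 and only pairs closer than dc may be
    matched (assumed scoring; gaps are free).
    """
    n, m = len(coords_a), len(coords_b)

    def gain(i, j):
        d2 = sum((x - y) ** 2 for x, y in zip(coords_a[i], coords_b[j]))
        return dc * dc - d2 if d2 < dc * dc else None

    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1])
            g = gain(i - 1, j - 1)
            if g is not None:
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + g)

    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        g = gain(i - 1, j - 1)
        if g is not None and best[i][j] == best[i - 1][j - 1] + g:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def rmsd(coords_a, coords_b, pairs):
    """Root-mean-square deviation over the matched residue pairs."""
    if not pairs:
        return 0.0
    total = sum(
        sum((x - y) ** 2 for x, y in zip(coords_a[i], coords_b[j]))
        for i, j in pairs
    )
    return math.sqrt(total / len(pairs))
```

In a refinement loop of the kind described above, the pairs returned by `align_by_dp` would drive a new least-squares superposition, and the loop would stop once `len(pairs)` no longer changes; `len(pairs)` and `rmsd` are then the two numbers reported in the final evaluation.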
Complexity Analysis
In data preprocessing, EMPSC clusters the protein into a set of ellipsoids with the DSSP tool
and the proposed ellipsoid clustering method. The ellipsoid clustering is very fast (less than 1
second), and its time complexity is O(r), where r is the number of residues in the segments. In the
initial alignment finding stage, the time complexity of EMPSC is O(e log e + pn) with the scoring
function based on the fast O(n) hash function, where n is the number of residues in the protein, e
is the number of segments in the protein, and p is the number of candidate mapped SSE segment pairs;
p is much smaller than e. The complexity of the refinement stage is O(Cn^2), where C is the number
of iterations before the refinement process converges for each initial alignment. In our
observations, the number of iterations in EMPSC is usually less than 10. The Discussion section
examines how the refinement process affects the execution time of EMPSC.
Experiments and Results
Three experiments were conducted to test EMPSC under different conditions of the protein structure
comparison problem. In each experiment, EMPSC provides at most 10 alternative solutions; that is,
the Top-10 initial alignments in EMPSC are selected. The results show the efficiency and
effectiveness of EMPSC in comparison with Dali, CE, VAST, ProSup, and FLASH. The results of
Dali and ProSup are taken from the original papers. The results of CE and FLASH were obtained
in our experimental environment, and they are consistent with the original papers. The computing
environment for the experiments consists of dual Pentium-4 Xeon 3.06GHz CPUs and 2GB of DRAM.
None of the tested programs is parallelized.
One-against-all search for structural neighbors
Following previous research, we choose cAMP-dependent protein kinase for the
one-against-all search for structural neighbors. To compare with the existing results of Dali,
CE, VAST, ProSup, and FLASH, the parameter dc (the distance cutoff for alignment construction) is
set to 6Å. For every method, we list the maximal number of corresponding residues and the minimal
r.m.s.d. Because the CE and FLASH programs are available, we also list their execution times in our
computing environment. In terms of r.m.s.d. and number of corresponding residues, EMPSC performs as
well as the previous methods.
Table 1. An experimental set of structural neighbors of cAMP-dependent protein kinase (1atp:E, 336
residues) identified by different PSC methods. Each cell gives r.m.s.d. (Å) / number of
corresponding residues, followed by the execution time in seconds where available.
Protein(residues) Dali(rmsd/#res) CE(rmsd/#res/sec) VAST(rmsd/#res) ProSup(rmsd/#res) FLASH(rmsd/#res/sec) EMPSC(rmsd/#res/sec)
2cpk:E(336) 0.4 / 336 0.37/336/4.9 0.4 / 334 0.4 / 336 0.37/336/0.38 0.37/336/6.1
1apm:E(341) 0.3 / 336 0.33/336/4.94 0.3 / 334 0.3 / 336 0.33/336/0.51 0.32/336/6.19
1cdk:A(343) 0.4 / 336 0.38/336/4.95 0.4 / 334 0.4 / 336 0.38/336/0.39 0.38/336/6.36
1ydt:E(334) 0.5 / 334 0.45/336/4.5 0.5 / 334 0.5 / 334 0.45/336/0.31 0.45/334/6.03
1bkx:A(337) 0.8 / 334 0.76/336/4.61 0.7 / 314 0.8 / 336 0.76/336/0.09 0.75/334/6.16
1bx6:_(337) 1.0 / 334 1.01/336/4.67 1.0 / 314 1.0 / 336 1.01/336/0.43 1.01/334/6.09
1stc:E(334) 1.1 / 334 1.1/336/4.98 1.1 / 333 1.1 / 334 1.1/336/0.07 1.09/334/5.25
1cmk:E(350) 2.0 / 335 2/336/5.82 2.0 / 331 1.5 / 316 1.72/330/0.77 1.71/330/6.72
1daw:A(327) 3.1 / 267 2.77/266/10.99 2.8 / 259 2.0 / 239 1.87/250/0.53 1.92/252/5.91
1qmz:C(296) 2.5 / 259 2.07/252/6.94 2.3 / 233 1.9 / 239 1.9/251/0.67 1.96/253/5.38
1day:A(327) 2.7 / 263 2.61/262/9.96 2.9 / 262 2.0 / 239 1.96/252/0.49 1.98/253/5.99
1koa:_(447) 2.8 / 261 2.7/258/10.81 2.4 / 225 2.1 / 233 2.16/249/0.42 2.14/249/8.26
1jnk:_(346) 2.8 / 253 2.49/194/11.97 3.0 / 240 2.2 / 220 2.19/242/0.29 2.23/244/6.21
1gag:A(300) 2.8 / 265 2.87/267/7.78 2.7 / 247 2.3 / 232 2.36/251/0.67 2.46/254/5.37
1bl7:A(351) 3.5 / 254 3.14/246/8.79 3.1 / 223 2.3 / 220 2.39/235/0.54 2.4/236/6.33
1cja:B(327) 4.7 / 165 4.19/165/10.9 - 2.7 / 115 2.85/143/0.48 3.01/149/6.36
1e7v:A(850) 4.0 / 159 4.43/165/43.51 - 2.8 / 116 3/142/0.75 3.1/155/15.59
1bo1:B(318) 3.9 / 138 3.9/145/12.25 - 3.0 / 103 2.98/136/0.16 2.9/135/5.71
1b40:A(517) 3.4 / 45 5.68/83/21.4 - 2.9 / 57 3/107/0.67 3.36/105/9.44
1lar:B(533) 2.6 / 34 5.77/123/23.11 - 3.0 / 66 3.07/88/0.79 3.21/86/10.14
10 difficult cases
In this experiment, we use a well-known data set, the 10 difficult cases [25] reported by Fisher,
1996. Table 2 displays the structure alignment results for the 10 difficult cases. EMPSC performs
worse than the other methods in the case 1ten:_(89) vs. 3hhr:B(195).
Table 2. Comparison of different structure alignment results for the 10 difficult cases (Fisher
1996). Each cell gives r.m.s.d. (Å) / number of corresponding residues, followed by the execution
time in seconds where available.
Protein 1(residues) Protein 2(residues) Dali(rmsd/#res) CE(rmsd/#res/sec) VAST(rmsd/#res) ProSup(rmsd/#res) FLASH(rmsd/#res/sec) EMPSC(rmsd/#res/sec)
1bge:B(159) 2gmf:A(121) 3.3 / 94 4.02/102/2.59 2.3 / 71 2.4 / 87 -/-/- (a) 2.56/95/0.44
1cew:I(108) 1mol:A(94) 2.3 / 81 2.34/81/2.07 2.0 / 71 1.9 / 76 1.92/79/0.07 2.11/81/0.49
1cid:_(177) 2rhe:_(114) 3.1 / 96 2.97/98/2.4 2.0 / 78 2.3 / 84 2.24/94/0.24 2.23/94/1.19
1crl:_(534) 1ede:_(310) 3.6 / 212 3.91/220/16.29 3.7 / 186 2.6 / 161 2.49/191/0.79 2.7/199/9.3
1fxi:A(96) 1ubq:_(76) 2.5 / 52 2.79/64/1.79 2.1 / 48 2.6 / 54 2.47/62/0.03 2.56/63/0.47
1ten:_(89) 3hhr:B(195) 1.9 / 86 1.9/87/2.14 1.5 / 76 1.7 / 85 1.73/86/0.21 2.2/76/1.01
1tie:_(166) 4fgf:_(124) 3.1 / 114 2.86/115/2.23 1.6 / 76 2.4 / 104 2.28/108/0.29 2.44/113/1.15
2sim:_(381) 1nsb:A(390) 3.2 / 289 2.99/276/9.24 4.2 / 299 2.6 / 248 2.61/276/7.8 2.71/282/8.96
2aza:A(129) 1paz:_(120) 3.0 / 82 2.9/85/1.94 2.1 / 70 2.6 / 82 2.34/81/0.1 2.22/82/0.88
3hla:B(99) 2rhe:_(114) 3.0 / 74 3.46/85/2.49 2.3 / 58 2.7 / 71 2.94/79/0.09 2.75/77/0.65
a This result is available in the original FLASH paper, but we could not get any result while running the FLASH
program provided by the authors.
Special cases in global alignment – dissimilar protein comparisons
In this experiment, we compare dissimilar but comparable proteins. Even when two proteins in the
same family are dissimilar or quite different, EMPSC can still obtain better results than the ProSup
and FLASH methods. These proteins are selected from the first experiment, because they belong to the
same family. Table 3 displays the selected proteins and the results. According to these results,
EMPSC performs comparably in both the number of corresponding residues and the r.m.s.d.
The "-" notation in the FLASH column of Table 3 indicates that FLASH did not find any statistically
significant solution. Because FLASH uses a statistical method to increase its computing speed when
choosing candidates, it obtains very good solutions for similar proteins. For dissimilar proteins,
however, the statistical significance evaluation rejects further processing. Generally speaking,
only globally similar proteins pass the statistical significance measurement; even when there is
some significant local alignment information, FLASH does not process it. Deriving the global
structure comparison from local structure alignment is the advantage of the EMPSC algorithm. The
EMPSC algorithm performs worst in the following three cases: 1e7v:A(850) vs. 1jnk:_(346),
1e7v:A(850) vs. 1day:A(327), and 1bo1:B(318) vs. 1day:A(327).
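Table 3. Comparison of dissimilar cases of proteins with ProSup, FLASH, and EMPSC. Each cell gives
r.m.s.d. (Å) / number of corresponding residues, followed by the execution time in seconds where
available.
Protein 1(residues) Protein 2(residues) CE(rmsd/#res/sec) ProSup(rmsd/#res) FLASH(rmsd/#res/sec) EMPSC(rmsd/#res/sec)
1cja:B(327) 1daw:A(327) 3.73/157/8.15 3.0 / 131 -/-/- (a) 3.02/153/5.83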
1cja:B(327) 1qmz:C(296) 3.58/153/7.28 3.0 / 133 2.81/150/0.39 2.83/151/5.26
1cja:B(327) 1day:A(327) 3.79/157/8.14 3.1 / 131 -/-/- 3.04/152/5.71
1cja:B(327) 1koa:_(447) 4.6/173/15.81 3.0 / 130 2.92/150/0.8 3.14/160/7.9
1cja:B(327) 1jnk:_(346) 4.24/162/12.93 2.9 / 133 3.09/140/0.83 3.07/157/6.22
1e7v:A(850) 1daw:A(327) 3.59/135/28.94 3.1 / 110 3.19/147/1.93 2.99/144/15.02
1e7v:A(850) 1qmz:C(296) 4.34/156/45.17 2.8 / 119 2.78/139/1.5 3.15/147/13.61
1e7v:A(850) 1day:A(327) 4.25/148/42.39 3.0 / 110 -/-/- 3.48/97/14.92
1e7v:A(850) 1koa:_(447) 4.31/163/77.09 3.2 / 115 2.99/146/1.1 3.17/144/20.84
1e7v:A(850) 1jnk:_(346) 3.91/143/58.76 3.0 / 120 2.89/152/1.31 3.23/88/15.93
1bo1:B(318) 1daw:A(327) 3.75/144/9.01 3.1 / 121 3.13/129/0.39 2.86/136/5.58
1bo1:B(318) 1qmz:C(296) 3.6/145/8.55 2.8 / 123 2.72/133/0.28 2.71/134/4.99
1bo1:B(318) 1day:A(327) 4.04/151/10.05 2.9 / 123 2.85/139/0.39 3.4/100/5.52
1bo1:B(318) 1koa:_(447) 4.03/146/16.97 2.8 / 118 -/-/- 2.81/130/7.77
1bo1:B(318) 1jnk:_(346) 3.96/151/16.28 2.8 / 114 -/-/- 3.21/139/5.9
a represents that FLASH doesn’t provide any solution.
Discussion
Efficiency and Number of Alternative Solutions
Although the previous section lists the execution time of every comparison in Table 1, Table 2
and Table 3, it is hard to see the performance relationship between CE, FLASH and EMPSC from the
tables. Therefore, we add up the residue numbers of the two compared proteins and plot execution
time against total residues in Fig 6. The figure shows that EMPSC is generally faster than CE,
especially for comparisons of large protein structures. However, EMPSC is slower than FLASH.
Plot: x-axis, Total Residues of Compared Proteins; y-axis, Execution Time (seconds); series: CE, EMPSC and FLASH, each with a trend line.
Fig 6. The execution time of CE, FLASH and EMPSC, given different total residues of compared proteins.
The trend line for each method is a polynomial regression of order 2, provided by the MS Excel Trend
function.
To find out whether EMPSC can be sped up further, we ran more experiments with different numbers
of alternative solutions. We repeated the experiments of the previous section with Top-3 and Top-5
alternative solutions and compared them with the Top-10 results and FLASH. The detailed results are
listed in Table 4 in the appendix, and Fig 7 plots the execution time for the different numbers of
alternative solutions. The diagram shows that the execution time of EMPSC is proportional to the
number of alternative solutions. After profiling our EMPSC program, we found that EMPSC spends most
of its execution time in the alignment refinement process. We also found that, when FLASH provides
alternative solutions, it spends about the same execution time as EMPSC; two data points of FLASH in
Fig 7 show such cases. This conclusion applies to any PSC algorithm that claims to be fast but
provides only one solution (like FAST [15]), except those with a hash-based alignment refinement
algorithm.
According to our observations, EMPSC is a good choice for solving protein structure
comparison problems. In addition, we conclude that further enhancement of PSC algorithms
should focus on the alignment refinement process.
Plot: x-axis, Total Residues of Compared Proteins; y-axis, Execution Time (seconds); series: FLASH, EMPSC Top-3, EMPSC Top-5 and EMPSC Top-10, each with a trend line.
Fig 7. The execution time of FLASH and of EMPSC with Top-3, Top-5 and Top-10 alternative solutions.
The trend line for each method is a polynomial regression of order 2, provided by the MS Excel Trend
function.
Characteristic of EMPSC Algorithm
The proposed EMPSC algorithm has three major features that we believe make it a good choice
among PSC algorithms. First, the ellipsoidal representation provides a good summary of the 3D
information of a residue segment. In particular, two ellipsoidal models can easily be mapped onto
each other via transformations or rotations of their coordinate systems. As noted in the introduction,
representing an α-helix by a single vector is appropriate, because an α-helix is hard to bend.
Representing a β-sheet by a single vector, however, drops some structural information: because a
β-sheet is usually bent or curved, the length of the identified β-sheet seriously affects how well the
derived vector represents it. With the ellipsoidal model, EMPSC can abstract a curved β-sheet
effectively, because the three orthogonal eigenvectors of the ellipsoid retain more information about
the spatial distribution of the residues. This is an advantage of EMPSC over the vector representation
used in the FLASH algorithm. In addition, the ellipsoidal model supports not only the abstraction of
α-helix and β-sheet structures but can also represent loop or coil structures. Since the experiments
above show that EMPSC is a good PSC solution in comparison with previous algorithms, we are
convinced that the ellipsoidal representation abstracts 3D structure information at least as well as
the alternatives (SAP’s residue pair, CE’s AFP, ProSup’s seed, and even FLASH’s SSE).
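To make the ellipsoidal abstraction concrete, the sketch below summarizes a residue segment by a centroid, three orthogonal axes, and one scale per axis. It assumes the ellipsoid is obtained from an eigen-decomposition of the coordinate covariance matrix; this fitting procedure, the function name ellipsoid_from_segment, and the use of standard deviations as axis scales are illustrative assumptions, not details confirmed by the text. The test segment is a synthetic, idealized α-helix (1.5 Å rise and 100° turn per residue, 2.3 Å radius), not data from the experiments.

```python
import numpy as np


def ellipsoid_from_segment(coords):
    """Summarize a residue segment as (centroid, axes, radii).

    coords: (n, 3) array-like of atom positions (e.g. C-alpha atoms), n >= 2.
    axes[:, k] is the k-th unit axis, ordered from longest to shortest.
    radii[k] is the standard deviation of the coordinates along axes[:, k].
    """
    pts = np.asarray(coords, dtype=float)
    centroid = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)          # 3x3 covariance of x, y, z
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1]        # reorder: longest axis first
    radii = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return centroid, eigvecs[:, order], radii


# Synthetic 12-residue helix running along the z axis (idealized geometry).
turn = np.deg2rad(100.0) * np.arange(12)
helix = np.column_stack([2.3 * np.cos(turn), 2.3 * np.sin(turn), 1.5 * np.arange(12)])
centroid, axes, radii = ellipsoid_from_segment(helix)
print("longest axis:", axes[:, 0])
print("axis scales:", radii)
```

For the straight helix the longest axis lies close to the helix axis, which is the information a single vector would carry; the two shorter axes are what the ellipsoid adds for bent or curved segments.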
Second, EMPSC provides a platform into which different filters can be plugged for different purposes.
By combining filters, EMPSC can screen the candidate mapping segment pairs according to the
requirements of domain experts. In our current experimental results, the combination of the type filter,
the mass filter, and the biochemical filter achieves good accuracy and efficiency in most cases. We also
found that the biochemical filter is especially effective when comparing similar proteins of the same
family. Besides these three filters, we also tried an eigenvector filter, which ensures that the
eigenvectors26,27 of the mapped segments are similar. Unfortunately, the eigenvector filter did not
show any further improvement, so it is not used in the current EMPSC algorithm.
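The plug-in idea can be sketched as a list of predicates applied to every candidate segment pair. This is a minimal illustration only: the Segment fields, the thresholds, and the concrete tests inside mass_filter and biochemical_filter are hypothetical stand-ins, since the actual filter definitions are not restated in this section.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    sse_type: str                 # "H" helix, "E" strand, "C" coil/loop
    n_residues: int               # hypothetical stand-in for segment mass
    hydrophobic_fraction: float   # hypothetical biochemical descriptor


PairFilter = Callable[[Segment, Segment], bool]


def type_filter(a: Segment, b: Segment) -> bool:
    """Keep pairs whose segments have the same structure type."""
    return a.sse_type == b.sse_type


def mass_filter(max_ratio: float) -> PairFilter:
    """Keep pairs whose sizes differ by at most max_ratio (assumed criterion)."""
    def check(a: Segment, b: Segment) -> bool:
        big, small = max(a.n_residues, b.n_residues), min(a.n_residues, b.n_residues)
        return small > 0 and big / small <= max_ratio
    return check


def biochemical_filter(max_diff: float) -> PairFilter:
    """Keep pairs with similar hydrophobic fraction (assumed criterion)."""
    def check(a: Segment, b: Segment) -> bool:
        return abs(a.hydrophobic_fraction - b.hydrophobic_fraction) <= max_diff
    return check


def candidate_pairs(segs_a: Sequence[Segment], segs_b: Sequence[Segment],
                    filters: Sequence[PairFilter]) -> List[Tuple[Segment, Segment]]:
    """Return every (a, b) pair that passes all plugged-in filters."""
    return [(a, b) for a in segs_a for b in segs_b if all(f(a, b) for f in filters)]


segs_a = [Segment("H", 12, 0.5), Segment("E", 6, 0.4)]
segs_b = [Segment("H", 14, 0.45), Segment("H", 30, 0.5), Segment("E", 5, 0.9)]
all_filters = [type_filter, mass_filter(1.5), biochemical_filter(0.3)]
print(candidate_pairs(segs_a, segs_b, all_filters))
```

Because each filter is just a callable, dropping one (as was done with the eigenvector filter) or adding a new one only changes the list passed to candidate_pairs.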
Third, like traditional PSC algorithms, EMPSC is not only good at global structure comparison of
similar proteins, but also provides useful information in local structure comparison of dissimilar
proteins in the same family. That is, local alignment is viable with the EMPSC algorithm. Fig 8 and
Fig 9 show that even when the global alignment indicates dissimilarity, two proteins may still share
common local structures with biological significance. Detecting similar local structures in proteins
is as important as searching for similar global structures in traditional PSC problems.
To view the results of the EMPSC algorithm, we also developed a tool that outputs the
comparison results in molscript28 format, and we now provide it as a web service. Fig 8 and Fig 9,
mentioned above, are sample outputs. Instead of superimposing the structures of the two proteins, we
draw the results in vertically tiled windows. In these pictures, the yellow part indicates the SSE
chosen as the aligned center, and the red part indicates the corresponding residues in
each compared protein.
(a) 1daw:A (b) 1cja:B
Fig 8. Structure comparison results for protein 1daw:A and protein 1cja:B with the EMPSC algorithm
(distance between aligned residues: 6 Å). The yellow part marks the SSE chosen as the aligned center. The
red part marks the corresponding residues in each protein.
(a) 1qmz:C (b) 1bo1:B
Fig 9. Structure comparison results for protein 1qmz:C and protein 1bo1:B with the EMPSC algorithm
(distance between aligned residues: 6 Å). The yellow part marks the SSE chosen as the aligned center. The
red part marks the corresponding residues in each protein.
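The captions of Fig 8 and Fig 9 quote a distance of 6 Å between aligned residues, and Table 4 reports an rmsd and a residue count for each comparison. The sketch below shows one way such numbers could be derived from two coordinate lists that are already superposed and already paired index by index. It assumes that 6 Å acts as an upper bound on the pair distance, which the captions do not state explicitly; the function names are hypothetical.

```python
import math


def aligned_within(coords_a, coords_b, cutoff=6.0):
    """Indices of paired, already-superposed residues at most `cutoff` apart."""
    return [i for i, (p, q) in enumerate(zip(coords_a, coords_b))
            if math.dist(p, q) <= cutoff]


def rmsd(coords_a, coords_b, indices):
    """Root-mean-square distance over the selected residue pairs."""
    if not indices:
        raise ValueError("no aligned residue pairs to evaluate")
    squared = [math.dist(coords_a[i], coords_b[i]) ** 2 for i in indices]
    return math.sqrt(sum(squared) / len(squared))


# Toy coordinates: pair distances are 3, 4 and 10 (arbitrary units).
chain_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 3.0), (1.0, 0.0, 4.0), (2.0, 0.0, 10.0)]
kept = aligned_within(chain_a, chain_b, cutoff=6.0)
print(kept, rmsd(chain_a, chain_b, kept))
```

The third pair lies beyond the cutoff, so it is excluded from both the residue count and the rmsd.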
Further Development
In the future, we will revise the EMPSC algorithm in the following respects. First, we will
investigate whether the amino acid types of residues are helpful for EMPSC. Second, SSE-based
PSC algorithms, like FLASH and EMPSC, have to rely on SSE identification tools15. When the
compared proteins share similar global structures but have dramatically different SSE
identifications, as in Fig 10, it is hard for SSE-based algorithms to find a good alignment center. A
possible solution is to segment the longer SSEs or to connect some shorter SSEs. This can be done by
modifying the SSE identification tools, like DSSP, without modifying the original SSE-based PSC
algorithms. Third, EMPSC has potential for local structure comparison, but the optimization of
local structure alignment results should be investigated further. Building on the above, we will
further enhance the global and local alignment ability of EMPSC in order to develop multiple protein
structure alignment in the near future.
Fig 10. The proteins with similar global structures but dramatically different SSE identifications – protein 1apt
(left) and protein 1bxo (right).
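The proposed remedy of segmenting longer SSEs or connecting shorter ones can be sketched as a post-processing step on a list of identified elements. This is an illustrative sketch under stated assumptions: each SSE is a hypothetical (type, first residue, last residue) tuple with inclusive indices, and the thresholds max_gap and max_len are arbitrary example values, not parameters from EMPSC or DSSP.

```python
from typing import List, Tuple

Sse = Tuple[str, int, int]  # (type, first residue, last residue), inclusive


def merge_short_gaps(sses: List[Sse], max_gap: int = 2) -> List[Sse]:
    """Join consecutive SSEs of the same type separated by at most max_gap residues."""
    merged: List[Sse] = []
    for kind, start, end in sorted(sses, key=lambda s: s[1]):
        if merged and merged[-1][0] == kind and start - merged[-1][2] - 1 <= max_gap:
            prev_kind, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_kind, prev_start, max(prev_end, end))
        else:
            merged.append((kind, start, end))
    return merged


def split_long(sses: List[Sse], max_len: int = 12) -> List[Sse]:
    """Cut every SSE longer than max_len residues into consecutive pieces."""
    pieces: List[Sse] = []
    for kind, start, end in sses:
        pos = start
        while end - pos + 1 > max_len:
            pieces.append((kind, pos, pos + max_len - 1))
            pos += max_len
        pieces.append((kind, pos, end))
    return pieces


identified = [("E", 1, 5), ("E", 7, 10), ("H", 20, 49)]
normalized = split_long(merge_short_gaps(identified, max_gap=2), max_len=12)
print(normalized)
```

Applying the same normalization to both proteins before alignment is one way the SSE lists of structures such as those in Fig 10 might be made more comparable without touching the alignment algorithm itself.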
References
1. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of
hydrogen-bonded and geometrical features. Biopolymers 1983;22(12):2577-2637.
2. Brändén C-I, Tooze J. Introduction to protein structure. New York: Garland Pub.; 1999. xiv, 410
p.
3. Brown NP, Orengo CA, Taylor WR. A protein structure comparison methodology. Comput
Chem 1996;20(3):359-380.
4. Gibrat JF, Madej T, Bryant SH. Surprising similarities in structure comparison. Curr Opin
Struct Biol 1996;6(3):377-385.
5. Holm L, Sander C. Mapping the protein universe. Science 1996;273(5275):595-603.
6. Koehl P. Protein structure similarities. Curr Opin Struct Biol 2001;11(3):348-353.
7. Orengo C. Classification of protein folds. Curr Opin Struct Biol 1994;4(3):429-440.
8. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne
PE. The Protein Data Bank. Nucleic Acids Res 2000;28(1):235-242.
9. Holm L, Sander C. Dali: a network tool for protein structure comparison. Trends Biochem Sci
1995;20(11):478-480.
10. Gerstein M, Levitt M. Using iterative dynamic programming to obtain accurate pairwise and
multiple alignments of protein structures. Proc Int Conf Intell Syst Mol Biol 1996;4:59-67.
11. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the
amino acid sequence of two proteins. J Mol Biol 1970;48(3):443-453.
12. Subbiah S, Laurents DV, Levitt M. Structural similarity of DNA-binding domains of
bacteriophage repressors and the globin core. Curr Biol 1993;3(3):141-148.
13. Madej T, Gibrat JF, Bryant SH. Threading a database of protein cores. Proteins
1995;23(3):356-369.
14. Vriend G, Sander C. Detection of common three-dimensional substructures in proteins. Proteins
1991;11(1):52-58.
15. Zhu J, Weng Z. FAST: a novel protein structure alignment algorithm. Proteins
2005;58(3):618-627.
16. Can T, Wang YF. CTSS: A Robust and Efficient Method for Protein Structure Alignment Based
on Local Geometrical and Biological Features. Proc IEEE Comput Soc Bioinform Conf
2003;2:169-179.
17. Chang P-K, Chen C-C, Ouhyoung M. A Tool for Structure Alignment of Molecules. IEEE Sixth
International Symposium on Multimedia Software Engineering - Special Session on
Bioinformatics 2004:354-361.
18. Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension
(CE) of the optimal path. Protein Eng 1998;11(9):739-747.
19. Taylor WR. Protein structure comparison using iterated double dynamic programming. Protein
Sci 1999;8(3):654-665.
20. Lackner P, Koppensteiner WA, Domingues FS, Sippl MJ. Automated large scale evaluation of
protein structure predictions. Proteins 1999;Suppl 3:7-14.
21. Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS. ProSup: a refined tool for protein
structure alignment. Protein Eng 2000;13(11):745-752.
22. Shih ES, Hwang MJ. Protein structure comparison by probability-based matching of secondary
structure elements. Bioinformatics 2003;19(6):735-741.
23. Lesk AM. Protein architecture: a practical approach. Oxford, England; New York: IRL Press;
1991. xiv, 287 p.
24. Zhang Z. Iterative point matching for registration of free-form curves and surfaces. Int J
Comput Vision 1994;13(2):119-152.
25. Fischer D, Elofsson A, Rice D, Eisenberg D. Assessing the performance of fold recognition
methods by means of a comprehensive benchmark. Pac Symp Biocomput 1996:300-318.
26. Bezdek JC, Pal NR, Keller J, Krishnapuram R. Fuzzy Models and Algorithms for Pattern
Recognition and Image Processing: Kluwer Academic Publishers; 1999. 792 p.
27. Frigui H, Krishnapuram R. A Robust Competitive Clustering Algorithm With Applications in
Computer Vision. IEEE Trans Pattern Anal Mach Intell 1999;21(5):450-465.
Appendix
Table 4. The results of repeated experiments using EMPSC with Top-3, Top-5, Top-10 alternative solutions
respectively.
Protein 1 (residues)  Protein 2 (residues)  EMPSC Top-3 (rmsd/#res/sec)  EMPSC Top-5 (rmsd/#res/sec)  EMPSC Top-10 (rmsd/#res/sec)
Structural neighbors of cAMP-dependent protein kinase
1atp:E(336) 2cpk:E(336) 0.37/336/2.12 0.37/336/3.2 0.37/336/6.1
1atp:E(336) 1apm:E(341) 0.32/336/2.12 0.32/336/3.2 0.32/336/6.19
1atp:E(336) 1cdk:A(343) 0.38/336/2.15 0.38/336/3.3 0.38/336/6.36
1atp:E(336) 1ydt:E(334) 0.45/334/2.22 0.45/334/3.2 0.45/334/6.03
1atp:E(336) 1bkx:A(337) 0.75/334/2.31 0.75/334/3.4 0.75/334/6.16
1atp:E(336) 1bx6:_(337) 1.01/334/2.11 1.01/334/3.2 1.01/334/6.09
1atp:E(336) 1stc:E(334) 1.09/334/2.04 1.09/334/3.2 1.09/334/5.25
1atp:E(336) 1cmk:E(350) 1.71/330/2.13 1.71/330/3.5 1.71/330/6.72
1atp:E(336) 1daw:A(327) 1.92/252/2.08 1.92/252/3.1 1.92/252/5.91
1atp:E(336) 1qmz:C(296) 1.96/253/1.85 1.96/253/2.9 1.96/253/5.38
1atp:E(336) 1day:A(327) 1.98/253/2.06 1.98/253/3.2 1.98/253/5.99
1atp:E(336) 1koa:_(447) 2.14/249/2.92 2.14/249/4.3 2.14/249/8.26
1atp:E(336) 1jnk:_(346) 2.23/244/2.13 2.23/244/3.3 2.23/244/6.21
1atp:E(336) 1gag:A(300) 2.46/254/1.82 2.46/254/3 2.46/254/5.37
1atp:E(336) 1bl7:A(351) 2.4/236/2.16 2.4/236/3.4 2.4/236/6.33
1atp:E(336) 1cja:B(327) 3.01/149/2.01 3.01/149/3.1 3.01/149/6.36
1atp:E(336) 1e7v:A(850) 3.2/70/5.39 3.1/155/8.2 3.1/155/15.59
1atp:E(336) 1bo1:B(318) 3.4/69/2.05 3.6/92/3 2.9/135/5.71
1atp:E(336) 1b40:A(517) 3.36/105/3.29 3.36/105/5.3 3.36/105/9.44
1atp:E(336) 1lar:B(533) 3.21/86/2.91 3.21/86/5 3.21/86/10.14
10 difficult cases (Fischer 1996)
1bge:B(159) 2gmf:A(121) 2.56/95/0.33 2.56/95/0.3 2.56/95/0.44
1cew:I(108) 1mol:A(94) 2.11/81/0.19 2.11/81/0.3 2.11/81/0.49
1cid:_(177) 2rhe:_(114) 2.16/93/0.4 2.23/94/0.6 2.23/94/1.19
1crl:_(534) 1ede:_(310) 2.7/199/3.42 2.7/199/5 2.7/199/9.3
1fxi:A(96) 1ubq:_(76) 2.56/63/0.15 2.56/63/0.2 2.56/63/0.47
1ten:_(89) 3hhr:B(195) 2.2/76/0.35 2.2/76/0.5 2.2/76/1.01
1tie:_(166) 4fgf:_(124) 3.28/61/0.41 3.28/61/0.6 2.44/113/1.15
2sim:_(381) 1nsb:A(390) 2.67/280/3.68 2.67/280/5.2 2.71/282/8.96
2aza:A(129) 1paz:_(120) 2.25/82/0.3 2.25/82/0.5 2.22/82/0.88
3hla:B(99) 2rhe:_(114) 2.87/43/0.23 2.8/79/0.4 2.75/77/0.65
Protein family comparisons
1cja:B(327) 1daw:A(327) 3.2/72/1.92 2.9/152/3 3.02/153/5.83
1cja:B(327) 1qmz:C(296) 2.83/151/1.94 2.83/151/2.8 2.83/151/5.26
1cja:B(327) 1day:A(327) 2.92/145/1.97 3.04/152/3 3.04/152/5.71
1cja:B(327) 1koa:_(447) 3.14/160/2.7 3.14/160/4.2 3.14/160/7.9
1cja:B(327) 1jnk:_(346) 3.07/157/2.23 3.07/157/3.2 3.07/157/6.22
1e7v:A(850) 1daw:A(327) 3.4/77/5.27 3.52/83/8 2.99/144/15.02
1e7v:A(850) 1qmz:C(296) 3.2/93/4.73 3.2/93/7.2 3.15/147/13.61
1e7v:A(850) 1day:A(327) 3.5/79/5.27 3.5/79/8 3.48/97/14.92
1e7v:A(850) 1koa:_(447) 3.67/79/7.48 3.67/79/11.3 3.17/144/20.84
1e7v:A(850) 1jnk:_(346) 3.23/88/5.63 3.23/88/8.5 3.23/88/15.93
1bo1:B(318) 1daw:A(327) 3.6/97/1.99 3.6/97/3.4 2.86/136/5.58
1bo1:B(318) 1qmz:C(296) 2.71/134/1.85 2.71/134/2.6 2.71/134/4.99
1bo1:B(318) 1day:A(327) 3.4/100/1.99 3.4/100/2.9 3.4/100/5.52
1bo1:B(318) 1koa:_(447) 3.42/66/2.6 3.42/66/4.4 2.81/130/7.77
1bo1:B(318) 1jnk:_(346) 3.2/138/2.08 3.2/138/3.1 3.21/139/5.9