EMPSC: A New Method Based on Ellipsoidal Model for Protein Structure Comparison

January 2006

Authors:

Yhi Shiau

Telecommunication Laboratories, Chunghwa Telecom Co., Ltd.

Yu-Feng Huang

ACT Genomics Inc.

Chien-Kang Huang

National Taiwan University

The proteins with similar global structures but dramatically different SSE identifications – protein 1apt (left) and protein 1bxo (right).

EMPSC: A New Method Based on Ellipsoidal Model for Protein Structure Comparison

Yhi Shiau1,2, Jia-Nan Wang1, Yu-Feng Huang1, Chien-Kang Huang3

yshiau@cht.com.tw, jnwang@mars.csie.ntu.edu.tw, yfhuang@csie.ntu.edu.tw, ckhuang@ntu.edu.tw

1Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

2Chunghwa Telecom Laboratories, Tauyuan, Taiwan

3Department of Engineering Science and Ocean Engineering, National Taiwan University, Taipei, Taiwan

Abstract

This paper proposes EMPSC, a new method for the well-known Protein Structure Comparison (PSC) problem. EMPSC is a protein structural alignment algorithm based on an ellipsoidal model abstraction. We segment the protein 3D structure into two kinds of structures: Secondary Structure Elements (SSEs) recognized by DSSP [1], and the remaining coil/loop structures. The SSEs serve as the initial alignment centers from which the transformation coordinate systems are obtained. Several heuristic filters and a global alignment estimation based on geometric hashing are used to find good initial alignments quickly. In the refinement stage, a standard refinement algorithm fine-tunes the alignments produced by the first stage. Our experimental results show that EMPSC generally achieves comparable accuracy and better performance than the existing PSC algorithms. Moreover, we analyze the factors that affect the performance of EMPSC and of SSE-based PSC algorithms. Multiple protein structure comparison and local structure comparison are left for further investigation.

Keywords: Eigenvector, Dynamic Programming, Geometric Hashing, Secondary Structure Elements.

Availability: The EMPSC tool is accessible at our lab website http://ballerina.csie.ntu.edu.tw

Introduction

Since Beccari discovered proteins in 1747, the important role that proteins play in biochemical reactions has drawn biochemical researchers to the study of protein function. Among these research topics, Protein Structure Comparison (PSC) is one of the most basic and important, because it detects evolutionary and functional relationships between proteins. The function of a protein is related to its 3D structure [2]; that is, proteins with similar substructures may have similar functions. Improving the methodology and tools of PSC has therefore been an important issue in molecular biology and bioinformatics for many years [3-7].

Today biochemists need faster and more accurate PSC tools, because protein databases are growing quickly with the help of the computing power available to recent biochemistry research. For example, the Protein Data Bank [8] had grown to 35,813 proteins by 28-Mar-2006, Swiss-Prot to 163,000, and TrEMBL to 1,450,000. Continuing to improve PSC tools so that they can handle this fast-growing mass of protein data is a great challenge.

To detect functional or evolutionary relationships between proteins, PSC algorithms compare the 1D sequence information or the 3D structure information of amino acid sequences. The purpose of PSC is to identify a maximal set of equivalent Cα atoms upon which to optimally align the 3D structures of the compared proteins. Previously proposed PSC algorithms exploit many different computing approaches, including Monte Carlo (Dali [9]), dynamic programming [10-12] (VAST [13]), 3D clustering [14], graph theory [15], spline approximation [16] and geometric hashing [17]. To speed up PSC further, quick-n-dirty approaches are applied in today's PSC tools, such as CE [18], SAP [19], ProSup [20,21], and FLASH [22]. Because PSC is an NP-hard problem, most approaches propose different heuristics to approximate the optimal solution, and quick-n-dirty approaches are therefore the mainstream of today's PSC algorithms. The following paragraphs explore these quick-n-dirty approaches further.

The CE method starts from the concept of aligned fragment pairs (AFPs); that is, the initial alignment is grown from AFPs. CE uses a combinatorial extension of an alignment path defined by AFPs to obtain the extension paths that pass the similarity threshold defined by the inter-residue distance constraint. The inter-residue distance constraint is an approximation for superimposing the two proteins. In refinement, the top-20 alignment paths are evaluated based on r.m.s.d. and the best one is selected. The best alignment path is further refined with dynamic programming to obtain the final global alignment. The drawback of CE is that calculating the inter-residue distance constraint is time-consuming. In addition, in our experiments we observed that CE tends to maximize the number of corresponding residues rather than minimize the r.m.s.d.

The SAP method uses iterated double dynamic programming in both the initial alignment and the refinement processes. The initial alignment search begins from every residue pair, taking one residue from each of the compared proteins. SAP finds the potentially equivalent residue pairs based on local structure and environment. Then, for each potentially equivalent residue pair, SAP calculates the alignment score with dynamic programming. The scoring function includes a direction component, an orientation component, a sequence term, and a spatial term. The initial alignments are found by sorting all the elements of the bias matrix and taking a number (top-20) of the highest-scoring elements. These top-20 results are then refined with dynamic programming. To avoid an exhaustive search of all possible residue pairs, SAP offers the useful option of taking a randomized selection. The main issue of SAP is its high time complexity, because double dynamic programming is very time-consuming for large matrices.

The ProSup method approximates the solution from seed pairs. The seeds are similar fragments of the compared proteins. ProSup superimposes all possible seeds and evaluates the whole protein by superimposing equivalences instead of using dynamic programming. In refinement, a standard procedure that combines dynamic programming and a least-squares constraint [23,24] further refines the initial alignments. ProSup can output multiple solutions. Its accuracy and efficiency are affected by the length of the seed fragments.

The FLASH method greatly reduces the time complexity with an aggressive abstraction of protein structures. FLASH identifies the Secondary Structure Elements (SSEs) of the compared proteins with the DSSP tool, and works on vector-represented SSEs as a data reduction of the protein's 3D structure. The experiments revealed that the SSEs in a protein provide better initial alignment performance. FLASH considers only the information from α-helices and β-sheets, and neglects the coil information. It establishes an angle-distance map for all SSE pairs. After calculating the SSE matching probabilities, FLASH uses a greedy procedure to select viable alignment solutions (at least 3). Statistical significance is used to filter out inappropriate initial alignments. In refinement, FLASH applies the same standard refinement algorithm as ProSup. With SSE-based data reduction, FLASH greatly speeds up the initial alignment search, and it can output multiple solutions. However, if two proteins have similar local structures but their global similarity does not meet the criterion of statistical significance, FLASH stops the comparison and gives no result. In addition, FLASH does not apply to a protein that has no SSEs. The single-vector representation also raises problems. Representing an α-helix by a single vector is appropriate, because an α-helix is hard to bend. But the vector representing a β-sheet may lose some structural information: because a β-sheet is usually bent or curved, the length of the identified β-sheet seriously affects how well a single vector describes it.

Based on these previous studies, we aim to develop a new method that is more efficient than conventional PSC approaches and free of the limits of FLASH. In this paper, we propose a new protein structural alignment method based on an ellipsoidal model, named EMPSC, which applies a generic heuristic mapping function in the initial alignment stage. EMPSC has four steps: preprocessing, initial alignment, refinement and final evaluation. EMPSC identifies the SSEs of the target protein with the DSSP tool. The remaining parts of the protein, mostly coil/loop structures, are also considered. Rather than using a single-vector representation as FLASH does, we use an ellipsoid model to represent each sequence segment. The detailed algorithm is described in the Proposed Method section.

EMPSC shares many of the better characteristics of previous approaches. The initial alignments are selected from segment pairs (mainly SSEs) instead of residue pairs. EMPSC can also output multiple solutions. As in FLASH, we believe that SSEs are clearly more important for a protein's structural conformation, and that abstracting the structure information with SSEs (α-helix, β-sheet and coil) could be better than CE's AFPs and ProSup's seeds. In addition, as in CE and ProSup, the initial alignment search in EMPSC starts from the viewpoint of local alignment.

Proposed Method

The workflow of EMPSC is depicted in Fig 1, and the algorithm is described in detail below. The algorithm has four basic steps.

(1) Preprocessing: Segment the proteins into SSEs with the DSSP tool, and generate the ellipsoidal representation for each segment (mainly α-helix, β-sheet).

(2) Initial alignment: Generate the potential aligned segment pairs, and look for good initial alignments via heuristic filtering.

(3) Refinement: Iteratively apply a dynamic programming algorithm to refine the initial alignment, a well-known procedure used by most PSC algorithms.

(4) Final evaluation: Evaluate the refined alignments and report the number of corresponding residues and the r.m.s.d. of the alignment solutions.

[Figure: workflow diagram with the following boxes.]

For each protein:
(1a) Identify the SSEs from the protein
(1b) Cluster the remaining residues from the protein
(1c) Generate the ellipsoidal representation for each segment

For each pair of compared proteins:
(2a) Generate the pairs of candidate aligned SSE segments from the compared proteins
(2b) Filter the pairs with the heuristic filtering function
(2c) Superimpose candidate pairs by center-eigenvector transformation
(2d) Rank the candidate pairs with a fast global alignment approach, and select the Top-N best pairs
(3) Iteratively tune the global alignment center by superimposing the compared proteins (dynamic programming approach)
(4) Final evaluation

Fig 1. The workflow diagram of the EMPSC algorithm.

Preprocessing: Generate Ellipsoidal Representation

Like most modern PSC algorithms, EMPSC does not estimate alignment quality in the initial alignment step by comparing the structures of two proteins atom by atom, but by comparing their abstract models. In EMPSC, ellipsoid models, rather than all the residues of the SSEs, represent the overall 3D structure of a protein. Step (1) of the EMPSC algorithm generates the ellipsoidal representation for each segment of the target protein. First, in step (1a), we use the DSSP tool to identify the SSEs (α-helix, β-sheet) of the protein. In step (1b), contiguous runs of the remaining residues are clustered into segments; that is, the remaining residues of the protein (the coil/loop information) are clustered into a set of new segments (coil sub-segments) according to the adjacency of residues. After that, each segment is represented by a 3D ellipsoidal model. Fig 2 depicts the relationship between a residue chain and its 3D ellipsoidal model. PCA (Principal Component Analysis) [23] is applied to find the 3 orthogonal eigenvectors and the 3 respective eigenvalues of each segment. After step (1c), the protein is decomposed into a set of residue segments and their ellipsoidal representations, as shown in Fig 3.
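Steps (1a) and (1b) can be illustrated with a minimal Python sketch. It assumes that the DSSP output has already been reduced to a string with one code per residue, and that only `H` (α-helix) and `E` (β-strand) count as SSEs; the input format, the label names and the function name `segment_by_dssp` are assumptions made for illustration, not the EMPSC implementation.

```python
from itertools import groupby

# Assumption: only DSSP codes 'H' (alpha-helix) and 'E' (beta-strand) are SSEs;
# every other code is treated as coil/loop.
SSE_CODES = {"H": "helix", "E": "strand"}


def segment_by_dssp(dssp_string):
    """Split a per-residue DSSP string into contiguous segments.

    Returns a list of (label, start, end) tuples with 0-based, inclusive
    residue indices. Adjacent coil residues are merged into one coil
    sub-segment, as in step (1b).
    """
    labels = [SSE_CODES.get(code, "coil") for code in dssp_string]
    segments = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, position, position + length - 1))
        position += length
    return segments
```

Each returned segment is the unit that step (1c) would then summarize by one ellipsoid.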

Fig 2. A 3D ellipsoid enclosing a residue chain, and its 3 eigenvectors.

Fig 3. The process of generating the ellipsoidal representation. (a) The example protein. (b) The protein after processing by DSSP, with the SSEs identified. (c) The further processing of the remaining segments.
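The ellipsoid of one segment can be sketched as a PCA of its Cα coordinates: the centroid gives the center, and the eigenvectors and eigenvalues of the 3x3 covariance matrix give the axes and their spreads. The sketch below is an illustration only. It uses a cyclic Jacobi iteration so that it needs no external library (the paper does not say how the eigen-decomposition is computed), it divides the covariance by n, and the names `jacobi_eigen` and `ellipsoid_of` are hypothetical.

```python
import math


def jacobi_eigen(matrix, sweeps=50, tol=1e-12):
    """Eigen-decompose a symmetric 3x3 matrix with cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(3)] for i in range(3)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(3) for j in range(3) if i != j)
        if off < tol:
            break
        for p in range(2):
            for q in range(p + 1, 3):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the off-diagonal entry a[p][q].
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(3):  # update columns p and q of a and v
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(3):  # update rows p and q of a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    values = [a[i][i] for i in range(3)]
    vectors = [[v[k][i] for k in range(3)] for i in range(3)]  # columns of v
    return values, vectors


def ellipsoid_of(points):
    """Return (center, eigenvalues, eigenvectors), largest eigenvalue first."""
    n = len(points)
    center = [sum(p[i] for p in points) / n for i in range(3)]
    # Assumption: population covariance (divide by n, not n - 1).
    cov = [[sum((p[i] - center[i]) * (p[j] - center[j]) for p in points) / n
            for j in range(3)] for i in range(3)]
    values, vectors = jacobi_eigen(cov)
    order = sorted(range(3), key=lambda i: values[i], reverse=True)
    return center, [values[i] for i in order], [vectors[i] for i in order]
```

For a straight, rod-like segment the first eigenvalue dominates, while a bent or curved segment spreads variance into the second and third axes.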

Initial Alignment: Find a Good Superimposed Transformation for the Initial Alignments

Finding good superimposed transformations is the main task in obtaining the initial alignments of EMPSC. The whole process is step (2) of the EMPSC algorithm. In general, the initial alignment search is a filtering process that eliminates impossible or dissimilar segment mappings and finds the top-N best segment pairs, that is, the top-N superimposed transformations between the two compared proteins.

First, in step (2a), we generate all possible initial alignments from the compared proteins. In this step, every pair of SSE segments (α-helix and β-sheet only) from the compared proteins can be the center of the new coordinates, and the remaining coil sub-segments are used in the biochemical filtering of step (2b).

Before estimating the quality of every initial alignment in step (2d), EMPSC first checks the local structure of each initial alignment; that is, EMPSC calculates the similarity between the two mapped SSEs (α-helix or β-sheet) and their at most four surrounding coil sub-segments. In step (2b), EMPSC filters the initial alignments with a heuristic filtering function. Conceptually, we could define one integrated filtering function that combines all the factors that are effective for judging a good initial alignment. Instead, we currently implement several successive filtering processes that filter out unmatched or dissimilar pairs. These processes include three kinds of filter: the type filter, the mass filter, and the biochemical filter. The type filter ensures that the secondary structure types of the mapped segments are the same (such as α-α or β-β). The mass filter ensures that the residue counts of the two mapped segments differ by fewer than four residues. The biochemical filter checks the similarity of biochemical properties between two segments. We found that the biochemical properties of the coil segments before or after the SSEs (α-helix and β-sheet), rather than those of the SSEs themselves, are beneficial for finding initial alignments. Therefore, in this filtering process, EMPSC ensures that the biochemical features of the surrounding coil sub-segments are similar. The biochemical feature is currently defined as the ratios of hydrophobic residues, polar uncharged residues and polar charged residues in the compared segments. The biochemical filter is detailed as follows.

The biochemical filter:

Given two compared residue sequences A and B, the biochemical_diff score is defined as follows:

$$\mathrm{biochemical\_diff}(A, B) = \sum_{i} \lvert a_i - b_i \rvert$$

where a1, b1 are the ratios of the residues that are hydrophobic,

a2, b2 are the ratios of the residues that are polar and uncharged,

a3, b3 are the ratios of the residues that are polar and charged,

corresponding to A and B, respectively.

If biochemical_diff is greater than the threshold (the empirical value is 0.7), the candidate pair is filtered out.
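The three filters of step (2b) can be sketched together as below. Several points are assumptions and not taken from the paper: the grouping of the 20 amino acids into the three classes (the paper does not list the members of each class), the use of absolute differences in `biochemical_diff`, and the dictionary layout of a segment. The thresholds (fewer than four residues, 0.7) are the values stated in the text.

```python
# Assumption: one conventional grouping of the 20 standard amino acids.
HYDROPHOBIC = set("AVLIMFWPG")
POLAR_UNCHARGED = set("STCYNQ")
POLAR_CHARGED = set("DEKRH")


def class_ratios(sequence):
    """Ratios of hydrophobic, polar uncharged and polar charged residues."""
    n = len(sequence)
    if n == 0:
        return (0.0, 0.0, 0.0)
    return tuple(sum(res in group for res in sequence) / n
                 for group in (HYDROPHOBIC, POLAR_UNCHARGED, POLAR_CHARGED))


def biochemical_diff(seq_a, seq_b):
    """Sum of absolute differences of the three class ratios (assumed form)."""
    return sum(abs(a - b) for a, b in zip(class_ratios(seq_a), class_ratios(seq_b)))


def type_filter(type_a, type_b):
    """Mapped segments must have the same secondary structure type."""
    return type_a == type_b


def mass_filter(length_a, length_b, max_diff=4):
    """Residue counts must differ by fewer than max_diff residues."""
    return abs(length_a - length_b) < max_diff


def passes_filters(seg_a, seg_b, threshold=0.7):
    """Apply the filters in sequence.

    Each segment is a dict with hypothetical keys: 'type', 'length', and
    'coil' (one-letter residues of the surrounding coil sub-segments).
    """
    return (type_filter(seg_a["type"], seg_b["type"])
            and mass_filter(seg_a["length"], seg_b["length"])
            and biochemical_diff(seg_a["coil"], seg_b["coil"]) <= threshold)
```

Running the cheap type and mass checks first means the residue counting of the biochemical filter is only done for pairs that survive them.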

The remaining candidate SSE segment mappings are those that pass all filtering criteria. In step (2c), EMPSC aligns the geometric centers and the 3 primary eigenvectors of the candidate mapped segments, which generates new coordinates for the two compared proteins, as in Fig 4.
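One way to read the center-eigenvector transformation of step (2c) is as a change of coordinates: each protein is re-expressed in the frame whose origin is the center of its mapped segment and whose axes are that segment's three eigenvectors, so that two proteins transformed this way share a common frame. The sketch below assumes the axes are orthonormal and ordered by decreasing eigenvalue; the function names are hypothetical. Eigenvectors are only defined up to sign, so a full implementation would also have to try or fix the sign choices, which this sketch ignores.

```python
def to_local_frame(point, center, axes):
    """Express a 3D point in the frame given by `center` and three orthonormal `axes`."""
    shifted = [point[i] - center[i] for i in range(3)]
    # Project the shifted point onto each axis in turn.
    return [sum(shifted[i] * axis[i] for i in range(3)) for axis in axes]


def transform_protein(ca_coords, center, axes):
    """Re-express all C-alpha coordinates of a protein in the local frame."""
    return [to_local_frame(point, center, axes) for point in ca_coords]
```

Applying `transform_protein` to protein A with the ellipsoid of its segment, and to protein B with the ellipsoid of the mapped segment, superimposes the two segments' centers and axes.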

In step (2d), a fast global alignment estimation based on geometric hashing estimates the quality of the initial alignments. Geometric hashing is a fast way to compare 3D structures. Taking the geometric center of the mapped SSEs as the origin, the position of each Cα atom of one target protein is transformed into the polar coordinate system. In the preprocessing of geometric hashing, every Cα atom is put into the hash table according to the distance between the Cα atom and the origin. We also transform the other protein into the new coordinates, and calculate the estimated alignment score. The estimated scoring function of the global alignment is described in Fig 5. Finally, only the top-N candidates are selected for further refinement; that is, we keep the top-N superimposed transformations as the good initial alignments.

Fig 4. Align the 3 eigenvectors of two ellipsoids according to their magnitude.

Global Alignment Estimation based on Geometric Hashing

Assume O_A, O_B are the origins of the new coordinates for proteins A and B, respectively.

R_A and R_B are two residues in proteins A and B, respectively.

The hashing function is

$$\mathrm{hash}(R_i) = \mathrm{dist}(R_i, O_i) \bmod \mathrm{bin\_size} = \mathrm{bin\_index\_of\_}R_i,$$

where i can be A or B.

To avoid collisions, the bin size of the hash table is larger than the diameter of most proteins.

The scoring function of the global alignment estimation is defined as

$$\mathrm{Score}(P_A, P_B) = \sum_{R_A \in P_A} \max_{R_B \in P_B} \left\{ \left( d_c - \mathrm{dist}(R_A, R_B) \right)^2 \;\middle|\; \mathrm{hash}(R_A) = \mathrm{hash}(R_B) \wedge \mathrm{dist}(R_A, R_B) < d_c \right\},$$

where d_c is the distance cutoff for alignment construction.

A higher score implies a lower r.m.s.d. and a higher number of corresponding residues.

Fig 5. The global alignment estimation based on geometric hashing
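The estimation can be sketched in a few lines once both proteins are expressed in the shared coordinate frame. The sketch rests on assumptions that go beyond the text: distances to the origin are truncated to whole ångströms before the modulo, the table has 512 bins, a residue of A contributes the largest squared margin (d_c - dist)^2 over the residues of B in the same bin, and a residue with no partner contributes 0. The function names are hypothetical, and `math.dist` needs Python 3.8 or later.

```python
import math
from collections import defaultdict


def radial_bin(point, table_size):
    """Bin index from the distance of a point to the origin of the shared frame."""
    # Assumption: distance truncated to whole angstroms, then taken modulo the table size.
    return int(math.sqrt(sum(c * c for c in point))) % table_size


def estimate_score(coords_a, coords_b, d_c=6.0, table_size=512):
    """Estimate alignment quality of two proteins given in the same frame."""
    table = defaultdict(list)
    for point in coords_b:  # preprocessing: hash every C-alpha of protein B
        table[radial_bin(point, table_size)].append(point)
    score = 0.0
    for point in coords_a:
        best = 0.0
        # Only residues of B in the same radial bin are examined.
        for other in table.get(radial_bin(point, table_size), []):
            d = math.dist(point, other)
            if d < d_c:
                best = max(best, (d_c - d) ** 2)
        score += best
    return score
```

Because only one bin is probed per residue, the estimate is fast but can miss a close partner that falls just across a bin boundary; a variant could also probe the neighbouring bins at a small extra cost.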

Refinement and Final Evaluation

Step (3) applies the same refinement method as most quick-n-dirty PSC algorithms. The least-squares method is applied in the refinement process, and step (3) repeatedly refines the initial alignment solutions until the number of corresponding residues converges. In this step, the algorithm iteratively tunes the global alignment center by superimposing the two proteins using dynamic programming. Finally, in step (4), EMPSC outputs the tuned alignments of the top-N candidates from step (2) as the N alternative solutions.
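The dynamic programming half of one refinement iteration, together with the final evaluation, can be sketched as follows. The sketch assumes the two Cα traces are already superimposed; the least-squares re-superposition that would follow each pass is omitted. The gain d_c - dist for a matched pair and the zero gap penalty are assumptions, and the function names are hypothetical.

```python
import math


def dp_align(coords_a, coords_b, d_c=6.0):
    """Sequential alignment of two superimposed C-alpha traces.

    Pairs closer than d_c gain (d_c - dist); skipped residues gain nothing.
    Returns the list of equivalent (index_a, index_b) pairs.
    """
    n, m = len(coords_a), len(coords_b)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(coords_a[i - 1], coords_b[j - 1])
            gain = d_c - d if d < d_c else float("-inf")
            best[i][j] = max(best[i - 1][j], best[i][j - 1], best[i - 1][j - 1] + gain)
    # Trace back from the corner to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        d = math.dist(coords_a[i - 1], coords_b[j - 1])
        if d < d_c and best[i][j] == best[i - 1][j - 1] + (d_c - d):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def rmsd(coords_a, coords_b, pairs):
    """Root-mean-square deviation over the equivalent pairs."""
    if not pairs:
        return 0.0
    total = sum(math.dist(coords_a[i], coords_b[j]) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))
```

In a full refinement loop, the pairs returned by `dp_align` would drive a new least-squares superposition, and the loop would stop once `len(pairs)` no longer changes; `len(pairs)` and `rmsd` are the two quantities reported in the final evaluation.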

Complexity Analysis

In data preprocessing, EMPSC clusters the protein into a set of ellipsoids with the DSSP tool and the proposed ellipsoid clustering method. The ellipsoid clustering is very fast (less than 1 second), and its time complexity is O(r), where r is the number of residues of the segments. In the initial alignment stage, the time complexity of EMPSC is O(e log e + pn) with the scoring function based on the fast O(n) hash function, where n is the number of residues in the protein, e is the number of segments in the protein, p is the number of candidate mapped SSE segment pairs, and p is much smaller than e. The complexity of the refinement stage is O(Cn^2), where C is the number of iterations before the refinement process converges for each initial alignment. In our observation, the number of iterations in EMPSC is usually less than 10. The discussion section shows how the refinement process affects the execution time of EMPSC.

Experiments and Results

Three experiments test EMPSC under different conditions of the protein structure comparison problem. In each experiment, EMPSC provides at most 10 alternative solutions; that is, the Top-10 initial alignments in EMPSC are selected. The results show the efficiency and effectiveness of EMPSC in comparison with Dali, CE, VAST, ProSup, and FLASH. The results of Dali and ProSup come from the original papers. The results of CE and FLASH were gathered in our experimental environment, and they are consistent with the original papers. The computing environment for the experiments consists of a dual Pentium-4 Xeon 3.06 GHz CPU and 2 GB of DRAM. None of the tested programs is parallelized.

One-against-all search for structural neighbors

As in previous research, we choose cAMP-dependent protein kinase for the one-against-all search for structural neighbors. To compare with the existing results of Dali, CE, VAST, ProSup, and FLASH, the parameter dc (the distance cutoff for alignment construction) is set to 6 Å. For every method, we list the maximal number of corresponding residues and the minimal r.m.s.d. Since the CE and FLASH programs are available, we also list the execution times of CE and FLASH in our computing environment. In terms of r.m.s.d. and number of corresponding residues, EMPSC performs as well as the previous methods.

Table 1. An experimental set of structural neighbors of cAMP-dependent protein kinase (1atp:E) identified by different PSC methods

A sample set of structural neighbors of cAMP-dependent protein kinase (1atp:E)(336)

Protein       Dali        CE               VAST        ProSup      FLASH            EMPSC
(residues)    rmsd/#res   rmsd/#res/sec    rmsd/#res   rmsd/#res   rmsd/#res/sec    rmsd/#res/sec
2cpk:E(336)   0.4 / 336   0.37/336/4.9     0.4 / 334   0.4 / 336   0.37/336/0.38    0.37/336/6.1
1apm:E(341)   0.3 / 336   0.33/336/4.94    0.3 / 334   0.3 / 336   0.33/336/0.51    0.32/336/6.19
1cdk:A(343)   0.4 / 336   0.38/336/4.95    0.4 / 334   0.4 / 336   0.38/336/0.39    0.38/336/6.36
1ydt:E(334)   0.5 / 334   0.45/336/4.5     0.5 / 334   0.5 / 334   0.45/336/0.31    0.45/334/6.03
1bkx:A(337)   0.8 / 334   0.76/336/4.61    0.7 / 314   0.8 / 336   0.76/336/0.09    0.75/334/6.16
1bx6:_(337)   1.0 / 334   1.01/336/4.67    1.0 / 314   1.0 / 336   1.01/336/0.43    1.01/334/6.09
1stc:E(334)   1.1 / 334   1.1/336/4.98     1.1 / 333   1.1 / 334   1.1/336/0.07     1.09/334/5.25
1cmk:E(350)   2.0 / 335   2/336/5.82       2.0 / 331   1.5 / 316   1.72/330/0.77    1.71/330/6.72
1daw:A(327)   3.1 / 267   2.77/266/10.99   2.8 / 259   2.0 / 239   1.87/250/0.53    1.92/252/5.91
1qmz:C(296)   2.5 / 259   2.07/252/6.94    2.3 / 233   1.9 / 239   1.9/251/0.67     1.96/253/5.38
1day:A(327)   2.7 / 263   2.61/262/9.96    2.9 / 262   2.0 / 239   1.96/252/0.49    1.98/253/5.99
1koa:_(447)   2.8 / 261   2.7/258/10.81    2.4 / 225   2.1 / 233   2.16/249/0.42    2.14/249/8.26
1jnk:_(346)   2.8 / 253   2.49/194/11.97   3.0 / 240   2.2 / 220   2.19/242/0.29    2.23/244/6.21
1gag:A(300)   2.8 / 265   2.87/267/7.78    2.7 / 247   2.3 / 232   2.36/251/0.67    2.46/254/5.37
1bl7:A(351)   3.5 / 254   3.14/246/8.79    3.1 / 223   2.3 / 220   2.39/235/0.54    2.4/236/6.33
1cja:B(327)   4.7 / 165   4.19/165/10.9    -           2.7 / 115   2.85/143/0.48    3.01/149/6.36
1e7v:A(850)   4.0 / 159   4.43/165/43.51   -           2.8 / 116   3/142/0.75       3.1/155/15.59
1bo1:B(318)   3.9 / 138   3.9/145/12.25    -           3.0 / 103   2.98/136/0.16    2.9/135/5.71
1b40:A(517)   3.4 / 45    5.68/83/21.4     -           2.9 / 57    3/107/0.67       3.36/105/9.44
1lar:B(533)   2.6 / 34    5.77/123/23.11   -           3.0 / 66    3.07/88/0.79     3.21/86/10.14

10 difficult cases

In this experiment, we use a well-known data set, the 10 difficult cases [25] reported by Fisher, 1996. Table 2 displays all the structure alignment results for the 10 difficult cases. EMPSC performs worse in the case (1ten:_(89) vs. 3hhr:B(195)).

Table 2. Comparison of different structure alignment results for 10 difficult cases

10 difficult cases (Fisher 1996)

Protein 1     Protein 2     Dali        CE               VAST        ProSup      FLASH            EMPSC
(residues)    (residues)    rmsd/#res   rmsd/#res/sec    rmsd/#res   rmsd/#res   rmsd/#res/sec    rmsd/#res/sec
1bge:B(159)   2gmf:A(121)   3.3 / 94    4.02/102/2.59    2.3 / 71    2.4 / 87    -/-/- (a)        2.56/95/0.44
1cew:I(108)   1mol:A(94)    2.3 / 81    2.34/81/2.07     2.0 / 71    1.9 / 76    1.92/79/0.07     2.11/81/0.49
1cid:_(177)   2rhe:_(114)   3.1 / 96    2.97/98/2.4      2.0 / 78    2.3 / 84    2.24/94/0.24     2.23/94/1.19
1crl:_(534)   1ede:_(310)   3.6 / 212   3.91/220/16.29   3.7 / 186   2.6 / 161   2.49/191/0.79    2.7/199/9.3
1fxi:A(96)    1ubq:_(76)    2.5 / 52    2.79/64/1.79     2.1 / 48    2.6 / 54    2.47/62/0.03     2.56/63/0.47
1ten:_(89)    3hhr:B(195)   1.9 / 86    1.9/87/2.14      1.5 / 76    1.7 / 85    1.73/86/0.21     2.2/76/1.01
1tie:_(166)   4fgf:_(124)   3.1 / 114   2.86/115/2.23    1.6 / 76    2.4 / 104   2.28/108/0.29    2.44/113/1.15
2sim:_(381)   1nsb:A(390)   3.2 / 289   2.99/276/9.24    4.2 / 299   2.6 / 248   2.61/276/7.8     2.71/282/8.96
2aza:A(129)   1paz:_(120)   3.0 / 82    2.9/85/1.94      2.1 / 70    2.6 / 82    2.34/81/0.1      2.22/82/0.88
3hla:B(99)    2rhe:_(114)   3.0 / 74    3.46/85/2.49     2.3 / 58    2.7 / 71    2.94/79/0.09     2.75/77/0.65

(a) This result is available in the original FLASH paper, but we could not get any result while running the FLASH program provided by the authors.

Special cases in global alignment – dissimilar protein comparisons

In this experiment, we test dissimilar but comparable proteins. When two proteins are dissimilar, or quite different within the same family, EMPSC can still obtain better results than the ProSup and FLASH methods. These proteins are selected from the first experiment, because they belong to the same family. Table 3 displays the selected proteins and the results. According to these results, EMPSC performs comparably in both the number of corresponding residues and r.m.s.d.

The "–" notation in the FLASH column of Table 3 means that FLASH did not find any statistically significant solution. Because FLASH uses a statistical method to increase its computing speed in choosing candidates, it can obtain very good solutions for similar proteins. For dissimilar proteins, however, the statistical significance evaluation rejects further processing. Generally speaking, only globally similar proteins pass the statistical significance measurement; even when some significant local alignment information exists, FLASH does not process it. Comparing the global structure starting from a local structure alignment is the advantage of the EMPSC algorithm. The EMPSC algorithm performs worst in the following three cases: (1e7v:A(850) vs. 1jnk:_(346)), (1e7v:A(850) vs. 1day:A(327)) and (1bo1:B(318) vs. 1day:A(327)).

Table 3. Comparison of dissimilar cases of proteins with ProSup, FLASH, and EMPSC

Protein 1     Protein 2     CE               ProSup      FLASH            EMPSC
(residues)    (residues)    rmsd/#res/sec    rmsd/#res   rmsd/#res/sec    rmsd/#res/sec
1cja:B(327)   1daw:A(327)   3.73/157/8.15    3.0 / 131   -/-/- (a)        3.02/153/5.83
1cja:B(327)   1qmz:C(296)   3.58/153/7.28    3.0 / 133   2.81/150/0.39    2.83/151/5.26
1cja:B(327)   1day:A(327)   3.79/157/8.14    3.1 / 131   -/-/-            3.04/152/5.71
1cja:B(327)   1koa:_(447)   4.6/173/15.81    3.0 / 130   2.92/150/0.8     3.14/160/7.9
1cja:B(327)   1jnk:_(346)   4.24/162/12.93   2.9 / 133   3.09/140/0.83    3.07/157/6.22
1e7v:A(850)   1daw:A(327)   3.59/135/28.94   3.1 / 110   3.19/147/1.93    2.99/144/15.02
1e7v:A(850)   1qmz:C(296)   4.34/156/45.17   2.8 / 119   2.78/139/1.5     3.15/147/13.61
1e7v:A(850)   1day:A(327)   4.25/148/42.39   3.0 / 110   -/-/-            3.48/97/14.92
1e7v:A(850)   1koa:_(447)   4.31/163/77.09   3.2 / 115   2.99/146/1.1     3.17/144/20.84
1e7v:A(850)   1jnk:_(346)   3.91/143/58.76   3.0 / 120   2.89/152/1.31    3.23/88/15.93
1bo1:B(318)   1daw:A(327)   3.75/144/9.01    3.1 / 121   3.13/129/0.39    2.86/136/5.58
1bo1:B(318)   1qmz:C(296)   3.6/145/8.55     2.8 / 123   2.72/133/0.28    2.71/134/4.99
1bo1:B(318)   1day:A(327)   4.04/151/10.05   2.9 / 123   2.85/139/0.39    3.4/100/5.52
1bo1:B(318)   1koa:_(447)   4.03/146/16.97   2.8 / 118   -/-/-            2.81/130/7.77
1bo1:B(318)   1jnk:_(346)   3.96/151/16.28   2.8 / 114   -/-/-            3.21/139/5.9

(a) "-/-/-" represents that FLASH doesn't provide any solution.

Discussion

Efficiency and Number of Alternative Solutions

Although the previous section lists the execution time of every comparison in Table 1, Table 2 and Table 3, the performance relationship between CE, FLASH and EMPSC is hard to see from the tables. Therefore, we add up the residue counts of the two compared proteins and plot execution time against total residues, as in Fig 6. The figure shows that EMPSC is faster than CE, especially for comparisons of large protein structures. However, EMPSC appears slower than FLASH.

[Figure: plot of Execution Time (seconds) against Total Residues of Compared Proteins (x axis from 0 to 1600). Legend: EMPSC, FLASH, Trend (CE), Trend (EMPSC), Trend (FLASH).]

Fig 6. The execution time of CE, FLASH and EMPSC, given different total residues of compared proteins. The trend line for each method is a polynomial regression of order 2, provided by the MS Excel Trend function.

To find out whether EMPSC can be sped up further, we ran more experiments with different numbers of alternative solutions. We repeated the experiments of the previous section with Top-3 and Top-5 alternative solutions and compared them with the Top-10 results and with FLASH. The detailed results are listed in Table 4 in the appendix, and the execution time for the different numbers of alternative solutions is plotted in Fig 7. The diagram shows that the execution time of EMPSC is proportional to the number of alternative solutions. After profiling our EMPSC program, we found that EMPSC spends most of its execution time in the alignment refinement process. We also found that when FLASH provides alternative solutions, it spends about the same execution time as EMPSC; two data points of FLASH in Fig 7 show such cases. This conclusion applies to any PSC algorithm that claims to be fast but provides only one solution (like FAST [15]), except for algorithms with hash-based alignment refinement.

According to our observations, EMPSC is a good choice for solving protein structure comparison problems. In addition, we conclude that further enhancement of PSC algorithms should focus on the alignment refinement process.

[Figure: plot of Execution Time (seconds) against Total Residues of Compared Proteins (x axis from 0 to 1600). Legend: FLASH, EMPSC Top-3, EMPSC Top-5, EMPSC Top-10, Trend (FLASH), Trend (EMPSC Top-3), Trend (EMPSC Top-5), Trend (EMPSC Top-10).]

Fig 7. The execution time of FLASH and of EMPSC with Top-3, Top-5 and Top-10 alternative solutions. The trend line for each method is a polynomial regression of order 2, provided by the MS Excel Trend function.

Characteristic of EMPSC Algorithm

The proposed EMPSC algorithm has three major features that we believe make it a good choice among PSC algorithms. First, the ellipsoidal representation provides a good summary of the 3D information of a residue segment. In particular, two ellipsoidal models can easily be mapped onto each other via transformations or rotations of their coordinate systems. As noted in the introduction, representing an α-helix by a single vector is appropriate because the α-helix is hard to bend. A single vector representing a β-sheet, however, drops some structural information: because a β-sheet is usually bent or curved, the length of the identified β-sheet seriously affects how well the derived vector represents it. With the ellipsoidal model, EMPSC can effectively abstract a curved β-sheet, because the three orthogonal eigenvectors of the ellipsoid keep more information about the spatial distribution of the residues. This is an advantage of EMPSC over the vector representation in the FLASH algorithm. In addition, the ellipsoidal model supports not only the abstraction of α-helix and β-sheet structures but can also represent loop or coil structures. Since the experiments above show that EMPSC is a good PSC solution in comparison with previous algorithms, we are convinced that the ellipsoidal representation abstracts 3D structure information at least as well as the others (SAP’s residue pair, CE’s AFP, ProSup’s seed, and even FLASH’s SSE).
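To make the ellipsoidal abstraction concrete, the sketch below derives three orthogonal axes for a segment from the covariance matrix of its residue coordinates. This is an illustration under our own assumptions, not the EMPSC implementation: we assume one 3D point per residue (for example the Cα atom), take the eigenvectors of the covariance matrix as the ellipsoid axes, and use the square roots of the eigenvalues as relative semi-axis lengths. All function names are hypothetical.

```python
import math


def covariance_3d(points):
    """Return (centroid, 3x3 covariance matrix) of a list of (x, y, z) points."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    cov = [[sum((p[i] - centroid[i]) * (p[j] - centroid[j]) for p in points) / n
            for j in range(3)] for i in range(3)]
    return centroid, cov


def jacobi_eigen(matrix, max_sweeps=50, tol=1e-18):
    """Eigen-decompose a symmetric 3x3 matrix with cyclic Jacobi rotations.

    Returns (eigenvalues, eigenvectors) sorted by decreasing eigenvalue;
    eigenvectors[k] is the unit vector paired with eigenvalues[k].
    """
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(3) for j in range(3) if i != j)
        if off < tol:
            break
        for p in range(2):
            for q in range(p + 1, 3):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes a[p][q]; |theta| <= pi/4.
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.copysign(math.pi / 4.0, a[p][q])
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(3):  # update columns p and q of A and V
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(3):  # update rows p and q of A
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    order = sorted(range(3), key=lambda i: a[i][i], reverse=True)
    values = [a[i][i] for i in order]
    vectors = [[v[k][i] for k in range(3)] for i in order]
    return values, vectors


def ellipsoid_axes(points):
    """Return (center, relative semi-axis lengths, unit axes) for a segment."""
    center, cov = covariance_3d(points)
    values, vectors = jacobi_eigen(cov)
    radii = [math.sqrt(max(val, 0.0)) for val in values]
    return center, radii, vectors
```

For a nearly straight segment the first axis dominates and behaves like the single representing vector, while for a curved β-sheet the second and third axes retain how far the residues spread away from that main direction.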

Second, EMPSC provides a platform into which different filters can be plugged for different purposes. Through different combinations of filters, EMPSC can filter the candidate mapping segment pairs according to the requirements of domain experts. In our current experimental results, the combination of the type filter, mass filter, and biochemical filter achieves good accuracy and efficiency in most cases. We also found that the biochemical filter is especially effective for comparing similar proteins of the same family. Besides these three filters, we also tried an eigenvector filter, which makes sure that the eigenvectors [26,27] of the mapped segments are similar. Unfortunately, the eigenvector filter did not show any further improvement, so we do not use it in the current EMPSC algorithm.
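The plug-in idea can be sketched as a list of predicates applied to every candidate segment pair. The code below is a hypothetical illustration only: the `Segment` fields, the meaning given to the type and mass filters, and the ratio threshold are our assumptions, and the biochemical filter is omitted because its definition is not given in this section.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical summary of one segment (an SSE or a coil/loop)."""
    kind: str        # e.g. "H" (helix), "E" (strand), "C" (coil/loop)
    n_residues: int  # used here as a stand-in for the segment's "mass"


SegmentFilter = Callable[[Segment, Segment], bool]


def type_filter(a: Segment, b: Segment) -> bool:
    """Keep pairs whose segment kinds match (assumed meaning of the type filter)."""
    return a.kind == b.kind


def make_mass_filter(max_ratio: float = 1.5) -> SegmentFilter:
    """Keep pairs of comparable size; the ratio threshold is illustrative."""
    def mass_filter(a: Segment, b: Segment) -> bool:
        small, large = sorted((a.n_residues, b.n_residues))
        return small > 0 and large / small <= max_ratio
    return mass_filter


def candidate_pairs(segs1: Sequence[Segment], segs2: Sequence[Segment],
                    filters: Sequence[SegmentFilter]) -> List[Tuple[Segment, Segment]]:
    """Return every segment pair that passes all plugged-in filters."""
    return [(a, b) for a, b in product(segs1, segs2)
            if all(f(a, b) for f in filters)]


if __name__ == "__main__":
    protein1 = [Segment("H", 12), Segment("E", 6)]
    protein2 = [Segment("H", 10), Segment("H", 30), Segment("E", 5)]
    for a, b in candidate_pairs(protein1, protein2, [type_filter, make_mass_filter()]):
        print(a, "<->", b)
```

Because each filter is just a function of a segment pair, adding or removing one (such as the eigenvector filter mentioned above) only changes the list passed to `candidate_pairs`.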

Third, like traditional PSC algorithms, EMPSC is good at global structure comparison of similar proteins, and it also provides useful information in local structure comparison of dissimilar proteins in the same family. That is, local alignment is viable with the EMPSC algorithm. Fig 8 and Fig 9 show that even when the global alignment is dissimilar, two proteins may still share common local structures of biological significance. Detecting similar local structures in proteins is as important as searching for similar global structures in traditional PSC problems.

To view the results of the EMPSC algorithm, we developed a tool that outputs the comparison results in MolScript [28] format, and we now provide it as a web service. Fig 8 and Fig 9, mentioned above, are sample outputs. Instead of superimposing the structures of the two proteins, we draw the results in vertically tiled windows. In these pictures, the yellow part indicates the SSE chosen as the alignment center, and the red part indicates the corresponding residues in each compared protein.

(a) 1daw:A (b) 1cja:B

Fig 8. Structure comparison results for protein 1daw:A and protein 1cja:B with the EMPSC algorithm (distance between aligned residues ≤ 6 Å). The yellow part marks the SSE chosen as the alignment center. The red part marks the corresponding residues in each protein.

(a) 1qmz:C (b) 1bo1:B

Fig 9. Structure comparison results for protein 1qmz:C and protein 1bo1:B with the EMPSC algorithm (distance between aligned residues ≤ 6 Å). The yellow part marks the SSE chosen as the alignment center. The red part marks the corresponding residues in each protein.
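The 6 Å criterion used in Fig 8 and Fig 9 can be illustrated with a short sketch. It assumes that the two structures have already been superimposed and that a list of candidate residue correspondences is available; it then keeps the pairs within the cutoff and reports their RMSD. The function names and the input layout are hypothetical, and the sketch does not claim to be how EMPSC selects the residues it draws in red.

```python
import math


def aligned_within_cutoff(coords_a, coords_b, pairs, cutoff=6.0):
    """Keep residue pairs (i, j) whose superimposed distance is <= cutoff (in Å)."""
    return [(i, j) for i, j in pairs
            if math.dist(coords_a[i], coords_b[j]) <= cutoff]


def rmsd(coords_a, coords_b, pairs):
    """Root-mean-square distance over the given residue pairs."""
    if not pairs:
        raise ValueError("RMSD is undefined for an empty alignment")
    total = sum(math.dist(coords_a[i], coords_b[j]) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))


if __name__ == "__main__":
    # Toy coordinates (Å) for three residues of each protein after superposition.
    protein_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    protein_b = [(0.0, 0.0, 3.0), (3.0, 4.0, 0.0), (10.0, 0.0, 8.0)]
    candidates = [(0, 0), (1, 1), (2, 2)]
    kept = aligned_within_cutoff(protein_a, protein_b, candidates)
    print(len(kept), "aligned residues, RMSD =", round(rmsd(protein_a, protein_b, kept), 2))
```

The pair count and RMSD produced this way correspond in form to the #res and rmsd entries reported in Table 4.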

Further Development

In the future, we will further revise the EMPSC algorithm in the following aspects. First, we will investigate whether the amino acid types of residues are helpful for EMPSC. Second, SSE-based PSC algorithms, like FLASH and EMPSC, have to rely on SSE identification tools [15]. When the compared proteins share similar global structures but have dramatically different SSE identifications, as in Fig 10, it is hard for SSE-based algorithms to find a good alignment center. A possible solution is to segment the longer SSEs or connect some shorter SSEs. This can be done by modifying the SSE identification tools, like DSSP, without modifying the original SSE-based PSC algorithms. Third, EMPSC has potential for local structure comparison, but the optimization of local structure alignment results should be investigated further. Building on the above, we will further enhance the global and local alignment ability of EMPSC in order to develop multiple protein structure alignment in the near future.
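The proposed remedy of segmenting longer SSEs or connecting shorter ones can be sketched as a post-processing step on the output of an SSE identification tool. The code below is a hypothetical illustration: the tuple layout, the function names, and the `max_len` and `max_gap` thresholds are our assumptions and are not taken from DSSP or EMPSC.

```python
from typing import List, Tuple

# (kind, first residue index, last residue index), both indices inclusive.
Sse = Tuple[str, int, int]


def split_long(sses: List[Sse], max_len: int = 12) -> List[Sse]:
    """Cut every SSE longer than max_len residues into consecutive pieces."""
    out = []
    for kind, start, end in sses:
        while end - start + 1 > max_len:
            out.append((kind, start, start + max_len - 1))
            start += max_len
        out.append((kind, start, end))
    return out


def merge_short(sses: List[Sse], max_gap: int = 2) -> List[Sse]:
    """Join neighbouring SSEs of the same kind separated by at most max_gap residues."""
    merged: List[Sse] = []
    for kind, start, end in sorted(sses, key=lambda s: s[1]):
        if merged and merged[-1][0] == kind and start - merged[-1][2] - 1 <= max_gap:
            merged[-1] = (kind, merged[-1][1], max(end, merged[-1][2]))
        else:
            merged.append((kind, start, end))
    return merged


if __name__ == "__main__":
    print(split_long([("H", 1, 30)]))
    print(merge_short([("E", 1, 4), ("E", 6, 9), ("H", 12, 20), ("E", 22, 25)]))
```

Applying both steps to the two proteins before comparison would give the SSE-based stage more comparable segments to choose alignment centers from, without changing the comparison algorithm itself.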

Fig 10. The proteins with similar global structures but dramatically different SSE identifications – protein 1apt

(left) and protein 1bxo (right).

References

1. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of

hydrogen-bonded and geometrical features. Biopolymers 1983;22(12):2577-2637.

2. Brändén C-I, Tooze J. Introduction to protein structure. New York: Garland Pub.; 1999. xiv, 410 p.

3. Brown NP, Orengo CA, Taylor WR. A protein structure comparison methodology. Comput

Chem 1996;20(3):359-380.

4. Gibrat JF, Madej T, Bryant SH. Surprising similarities in structure comparison. Curr Opin

Struct Biol 1996;6(3):377-385.

5. Holm L, Sander C. Mapping the protein universe. Science 1996;273(5275):595-603.

6. Koehl P. Protein structure similarities. Curr Opin Struct Biol 2001;11(3):348-353.

7. Orengo C. Classification of protein folds. Curr Opin Struct Biol 1994;4(3):429-440.

8. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne

PE. The Protein Data Bank. Nucleic Acids Res 2000;28(1):235-242.

9. Holm L, Sander C. Dali: a network tool for protein structure comparison. Trends Biochem Sci

1995;20(11):478-480.

10. Gerstein M, Levitt M. Using iterative dynamic programming to obtain accurate pairwise and

multiple alignments of protein structures. Proc Int Conf Intell Syst Mol Biol 1996;4:59-67.

11. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the

amino acid sequence of two proteins. J Mol Biol 1970;48(3):443-453.

12. Subbiah S, Laurents DV, Levitt M. Structural similarity of DNA-binding domains of

bacteriophage repressors and the globin core. Curr Biol 1993;3(3):141-148.

13. Madej T, Gibrat JF, Bryant SH. Threading a database of protein cores. Proteins

1995;23(3):356-369.

14. Vriend G, Sander C. Detection of common three-dimensional substructures in proteins. Proteins

1991;11(1):52-58.

15. Zhu J, Weng Z. FAST: a novel protein structure alignment algorithm. Proteins

2005;58(3):618-627.

16. Can T, Wang YF. CTSS: A Robust and Efficient Method for Protein Structure Alignment Based

on Local Geometrical and Biological Features. Proc IEEE Comput Soc Bioinform Conf

2003;2:169-179.

17. Chang P-K, Chen C-C, Ouhyoung M. A Tool for Structure Alignment of Molecules. IEEE Sixth

International Symposium on Multimedia Software Engineering - Special Session on

Bioinformatics 2004:354-361.

18. Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension

(CE) of the optimal path. Protein Eng 1998;11(9):739-747.

19. Taylor WR. Protein structure comparison using iterated double dynamic programming. Protein

Sci 1999;8(3):654-665.

20. Lackner P, Koppensteiner WA, Domingues FS, Sippl MJ. Automated large scale evaluation of

protein structure predictions. Proteins 1999;Suppl 3:7-14.

21. Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS. ProSup: a refined tool for protein

structure alignment. Protein Eng 2000;13(11):745-752.

22. Shih ES, Hwang MJ. Protein structure comparison by probability-based matching of secondary

structure elements. Bioinformatics 2003;19(6):735-741.

23. Lesk AM. Protein architecture : a practical approach. Oxford England ; New York: IRL Press;

1991. xiv, 287 p.

24. Zhang Z. Iterative point matching for registration of free-form curves and surfaces. Int J

Comput Vision 1994;13(2):119-152.

25. Fischer D, Elofsson A, Rice D, Eisenberg D. Assessing the performance of fold recognition

methods by means of a comprehensive benchmark. Pac Symp Biocomput 1996:300-318.

26. Bezdek JC, Pal MR, Keller J, Krisnapuram R. Fuzzy Models and Algorithms for Pattern

Recognition and Image Processing: Kluwer Academic Publishers; 1999. 792 p.

27. Frigui H, Krishnapuram R. A Robust Competitive Clustering Algorithm With Applications in

Computer Vision. IEEE Trans Pattern Anal Mach Intell 1999;21(5):450-465.

Appendix

Table 4. The results of repeated experiments using EMPSC with Top-3, Top-5, Top-10 alternative solutions

respectively.

Protein 1 (residues) | Protein 2 (residues) | EMPSC Top-3 | EMPSC Top-5 | EMPSC Top-10

Each EMPSC column reports rmsd/#res/sec.

Structural neighbors of cAMP-dependent protein kinase

1atp:E(336) 2cpk:E(336) 0.37/336/2.12 0.37/336/3.2 0.37/336/6.1

1atp:E(336) 1apm:E(341) 0.32/336/2.12 0.32/336/3.2 0.32/336/6.19

1atp:E(336) 1cdk:A(343) 0.38/336/2.15 0.38/336/3.3 0.38/336/6.36

1atp:E(336) 1ydt:E(334) 0.45/334/2.22 0.45/334/3.2 0.45/334/6.03

1atp:E(336) 1bkx:A(337) 0.75/334/2.31 0.75/334/3.4 0.75/334/6.16

1atp:E(336) 1bx6:_(337) 1.01/334/2.11 1.01/334/3.2 1.01/334/6.09

1atp:E(336) 1stc:E(334) 1.09/334/2.04 1.09/334/3.2 1.09/334/5.25

1atp:E(336) 1cmk:E(350) 1.71/330/2.13 1.71/330/3.5 1.71/330/6.72

1atp:E(336) 1daw:A(327) 1.92/252/2.08 1.92/252/3.1 1.92/252/5.91

1atp:E(336) 1qmz:C(296) 1.96/253/1.85 1.96/253/2.9 1.96/253/5.38

1atp:E(336) 1day:A(327) 1.98/253/2.06 1.98/253/3.2 1.98/253/5.99

1atp:E(336) 1koa:_(447) 2.14/249/2.92 2.14/249/4.3 2.14/249/8.26

1atp:E(336) 1jnk:_(346) 2.23/244/2.13 2.23/244/3.3 2.23/244/6.21

1atp:E(336) 1gag:A(300) 2.46/254/1.82 2.46/254/3 2.46/254/5.37

1atp:E(336) 1bl7:A(351) 2.4/236/2.16 2.4/236/3.4 2.4/236/6.33

1atp:E(336) 1cja:B(327) 3.01/149/2.01 3.01/149/3.1 3.01/149/6.36

1atp:E(336) 1e7v:A(850) 3.2/70/5.39 3.1/155/8.2 3.1/155/15.59

1atp:E(336) 1bo1:B(318) 3.4/69/2.05 3.6/92/3 2.9/135/5.71

1atp:E(336) 1b40:A(517) 3.36/105/3.29 3.36/105/5.3 3.36/105/9.44

1atp:E(336) 1lar:B(533) 3.21/86/2.91 3.21/86/5 3.21/86/10.14

10 difficult cases (Fischer 1996)

1bge:B(159) 2gmf:A(121) 2.56/95/0.33 2.56/95/0.3 2.56/95/0.44

1cew:I(108) 1mol:A(94) 2.11/81/0.19 2.11/81/0.3 2.11/81/0.49

1cid:_(177) 2rhe:_(114) 2.16/93/0.4 2.23/94/0.6 2.23/94/1.19

1crl:_(534) 1ede:_(310) 2.7/199/3.42 2.7/199/5 2.7/199/9.3

1fxi:A(96) 1ubq:_(76) 2.56/63/0.15 2.56/63/0.2 2.56/63/0.47

1ten:_(89) 3hhr:B(195) 2.2/76/0.35 2.2/76/0.5 2.2/76/1.01

1tie:_(166) 4fgf:_(124) 3.28/61/0.41 3.28/61/0.6 2.44/113/1.15

2sim:_(381) 1nsb:A(390) 2.67/280/3.68 2.67/280/5.2 2.71/282/8.96

2aza:A(129) 1paz:_(120) 2.25/82/0.3 2.25/82/0.5 2.22/82/0.88

3hla:B(99) 2rhe:_(114) 2.87/43/0.23 2.8/79/0.4 2.75/77/0.65

Protein family comparisons

1cja:B(327) 1daw:A(327) 3.2/72/1.92 2.9/152/3 3.02/153/5.83

1cja:B(327) 1qmz:C(296) 2.83/151/1.94 2.83/151/2.8 2.83/151/5.26

1cja:B(327) 1day:A(327) 2.92/145/1.97 3.04/152/3 3.04/152/5.71

1cja:B(327) 1koa:_(447) 3.14/160/2.7 3.14/160/4.2 3.14/160/7.9

1cja:B(327) 1jnk:_(346) 3.07/157/2.23 3.07/157/3.2 3.07/157/6.22

1e7v:A(850) 1daw:A(327) 3.4/77/5.27 3.52/83/8 2.99/144/15.02

1e7v:A(850) 1qmz:C(296) 3.2/93/4.73 3.2/93/7.2 3.15/147/13.61

1e7v:A(850) 1day:A(327) 3.5/79/5.27 3.5/79/8 3.48/97/14.92

1e7v:A(850) 1koa:_(447) 3.67/79/7.48 3.67/79/11.3 3.17/144/20.84

1e7v:A(850) 1jnk:_(346) 3.23/88/5.63 3.23/88/8.5 3.23/88/15.93

1bo1:B(318) 1daw:A(327) 3.6/97/1.99 3.6/97/3.4 2.86/136/5.58

1bo1:B(318) 1qmz:C(296) 2.71/134/1.85 2.71/134/2.6 2.71/134/4.99

1bo1:B(318) 1day:A(327) 3.4/100/1.99 3.4/100/2.9 3.4/100/5.52

1bo1:B(318) 1koa:_(447) 3.42/66/2.6 3.42/66/4.4 2.81/130/7.77

1bo1:B(318) 1jnk:_(346) 3.2/138/2.08 3.2/138/3.1 3.21/139/5.9
