A Tool for Structure Alignment of Molecules
Pei-Ken Chang, Chien-Cheng Chen and Ming Ouhyoung
Department of Computer Science and Information Engineering, National Taiwan University
{zick, ccchen}@cmlab.csie.ntu.edu.tw, ming@csie.ntu.edu.tw
Abstract
In this paper, we propose a novel tool that aligns two
molecules (not just proteins) based on their 3D
structural data and lets the user inspect the resulting
alignment visually. Most existing tools are
designed only for aligning proteins. Our tool is
developed to address structural features shared
between protein structures and tRNA structures, that is,
molecular mimicry, even though these are two very
different types of molecules.
To align two molecules A and B, geometric
hashing is first applied to find a global initial matching
of approximately overlapping atoms, so that parts of
molecule A are matched to parts of molecule B.
Next, a fine-tuning process based on local
optimization of the overlapping parts applies
the Iterative Closest Point (ICP) algorithm until the
number of overlapping atoms within a given distance
threshold cannot be increased any further. The results
show that our method can structurally align
two molecules and is not restricted to aligning two
proteins. In addition, our tool outperforms the other
tools in our comparison in terms of RMSD and
number of matched atom pairs.
1. Introduction
Search engines for 3D models have been developed
in recent years [1] [2]. Can similar techniques
be applied to molecules? If so, the benefit could be
great: because large numbers of protein structures
can now be determined by high-throughput machines,
classifying proteins into families and assigning
functions to novel proteins have become major tasks
in recent years. The Protein Data Bank (PDB)
currently contains more than 25,000 structures, and it is
estimated that the number of structures in the PDB
may exceed 35,000 by 2005. Although proteins have
been grouped on the basis of structural
similarities in the FSSP [3], CATH [4], and SCOP [5]
databases, much effort is still being put into
finding similarities among proteins. Moreover, the
rapid growth in the amount of protein structural data
far exceeds the ability of experimental
techniques to identify the locations and key amino
acids of active sites. Although the structural genomics
initiative (SGI) proposes to solve 10,000 protein 3D
structures in this decade, many biological
functions still remain unknown.
Alignment tools reveal the structural
similarity between proteins, as well as their
functional and evolutionary relationships. Holm and
Sander [6] noted that structural similarities among
distantly related proteins are often preserved in the
course of evolution, even when very little similarity
remains at the sequence level.
An interesting related problem is molecular
mimicry [7], in which
a protein and a nucleic acid share a similar
substructure, and sometimes the similarity even extends
to their interactions. Nissen et al. [8] indicated that
the structure of Elongation Factor-G is similar to that
of the complex of Elongation Factor-Tu and tRNA.
Selmer et al. [9] reported that Ribosomal Recycling
Factor resembles tRNA. In addition, exploiting 3D
structural data is a key factor in structure-based
drug design (SBDD), where predicting
protein functions and possible active sites in proteins
has become quite popular, especially as a
front end to molecular docking [10] [11] or
when alternative active sites are sought.
This paper is organized as follows. Related
work is discussed in Section 2. The geometric
hashing and ICP algorithms we use are
detailed in Section 3. Experimental results are
provided in Section 4, and the conclusion is given in
Section 5.
2. Previous Work
Structure alignment based on 3D
structure is computationally hard in general: Lathrop
[12] showed that the related protein threading problem
is NP-complete, so heuristics are used to simplify the
problem, and better methods for structure
alignment are still needed. Fischer et al. [13] used
geometric hashing with a Cα-only representation of
protein structure, and a follow-up is described by Tsai
et al. [14]. Their method is based on preprocessing and
recognition algorithms of complexity O(n^3), where n is
the number of residues of interest. Later, Pennec and
Ayache [15] [16] introduced a 3D reference frame
attached to each residue, which reduces the complexity
of recognition to O(n^2). Shindyalov and Bourne [17]
proposed a method based on combinatorial
extension (CE) of an alignment path defined by
aligned fragment pairs (AFPs), rather than the more
conventional techniques that use dynamic
programming and Monte Carlo optimization.
Combinations of AFPs that represent possible
continuous alignment paths are selectively extended or
discarded, leading to a single optimal alignment.
Zemla [18] proposed the LGA (local-global alignment)
algorithm, in which the longest continuous segment is
first found and a second step called GDT (global
distance test) is then applied. This yields both the
longest segment of residues under a selected RMSD
(root mean square deviation) cutoff and the largest set
of equivalent residues that deviate by less than a given
distance threshold. Blankenbecler et al. [19] proposed
using fuzzy alignment variables and iterative
minimization of a cost function. Milik et al. [20] used
graph matching, representing atoms as nodes and bond
distances as edge labels. Their search method is based on
comparing local structural features of proteins that
share a common biochemical function, and so does not
depend on the overall similarity of the structures and
sequences of the compared proteins.
The survey above shows that all of these
methods are concerned with proteins and reduce the
complexity of alignment by exploiting features of
proteins or segments of the aligned one-dimensional
sequence. Therefore, they cannot solve the general
molecule alignment problem unless the tools are
modified.
3. Algorithms
In this paper, we propose a tool to align two
molecules based on their 3D structural data. The
alignment problem between two molecules A and B is
solved in two steps: geometric hashing and a
fine-tuning process. First, geometric hashing globally
finds an initial matching of approximately overlapping
atoms, so that parts of molecule A are matched to parts
of molecule B. Second, the fine-tuning process performs
local optimization of the overlapping parts: the
Iterative Closest Point (ICP) algorithm is applied until
the number of overlapping atoms within a given distance
threshold cannot be increased any further.
3.1. Geometric Hashing: Step One
The geometric hashing algorithm is used to
structurally align two molecules. It
is a technique originally developed in
computer vision for object recognition and can easily
be parallelized [21] [22]. In short, the
algorithm consists of two stages:
preprocessing and recognition. The basic idea is to
store in a database, at preprocessing time, a redundant
representation of the models that is invariant under
rigid transformation. The representation of the query
object computed at recognition time will then share
similarities with that of matching database models.
Matching is possible even when the recognizable
database objects have undergone transformations or
when only partial information is present.
The two molecules of interest are often both
proteins, so we illustrate the method for this
case first. In other cases, e.g. molecular mimicry,
the two molecules are of different types and the
calculation differs slightly, as described
later.
The three atoms N, Cα and C in each amino acid
form a triangle that uniquely defines the position and
orientation of the amino acid in the three-dimensional
structure of a protein, since the N-Cα and
Cα-C bond lengths are fixed and the N-Cα-C bond angle
is also constant. For alignment, the
correspondence between two triplets of points in
three-dimensional space is sufficient to uniquely
determine a rigid transformation. We can therefore
choose a single residue as a basis. A basis is calculated
by the following steps and illustrated in Figure 1(a).
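1. Normalize $\overrightarrow{NC_\alpha}$ to $\vec{e}_1$.
2. $\vec{e}_2 = \dfrac{\vec{e}_1 \times \overrightarrow{C_\alpha C}}{\lVert \vec{e}_1 \times \overrightarrow{C_\alpha C} \rVert}$
3. $\vec{e}_3 = \vec{e}_2 \times \vec{e}_1$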
The geometric hashing algorithm has two phases,
preprocessing and recognition. Because a model can be
represented in many different
reference coordinates, its coordinate information relative
to each of its reference frames is encoded in the
preprocessing phase and stored in a large memory, in
this case a hash table. The contents of the hash table
are independent of the scene and thus can be computed
offline to reduce the time needed for recognition.
The memory is accessed using geometric
information that is invariant to the object’s pose and
computed directly from the scene. During the
recognition phase, the method accesses the previously
constructed hash table using the indices of the encoded
coordinate information of the input object and finds
their common spatial features.
In the preprocessing phase, we calculate one
basis for each residue and use it to generate coordinates
for each atom in the protein. In the recognition phase, we
choose a reference frame of protein B. For each
reference frame of protein A in the hash table,
we accumulate the number of matched atoms by
checking whether pairs of atoms are close enough.
We set a threshold distance MatchThres (a value of
1 to 2 Å is appropriate), beyond which atoms are not
considered a match: an atom scores 1 if there is
an atom of the other protein within MatchThres, and 0
otherwise.
The process is repeated with each reference frame of
protein B until all the reference frames of the two
proteins have been tested.
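The preprocessing and recognition phases can be illustrated with a small voting sketch. It assumes the atom coordinates have already been expressed in each reference frame; the grid-cell hashing with cell size equal to MatchThres, the function names, and the toy data are assumptions for illustration, not details given in the paper.

```python
import math
from collections import defaultdict

def cell_of(p, size):
    """Grid cell (hash key) containing point p."""
    return tuple(math.floor(c / size) for c in p)

def build_table(model_frames, match_thres):
    """Preprocessing: store every atom of A, in every frame of A, under its cell."""
    table = defaultdict(list)
    for frame_id, coords in model_frames.items():
        for p in coords:
            table[cell_of(p, match_thres)].append((frame_id, p))
    return table

def vote(table, query_coords, match_thres):
    """Recognition: for one frame of B, count matched atoms per frame of A."""
    scores = defaultdict(int)
    for q in query_coords:
        cx, cy, cz = cell_of(q, match_thres)
        hit = set()  # frames of A with at least one atom within match_thres of q
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for frame_id, p in table.get((cx + dx, cy + dy, cz + dz), ()):
                        if frame_id not in hit and math.dist(p, q) <= match_thres:
                            hit.add(frame_id)
        for frame_id in hit:
            scores[frame_id] += 1  # score 1 per matched atom, 0 otherwise
    return dict(scores)

# Toy data: atom coordinates of A in two of its frames, and of B in one frame.
model_frames = {"a0": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 3.0, 0.0)],
                "a1": [(10.0, 10.0, 10.0)]}
query = [(0.2, 0.0, 0.0), (1.4, 0.3, 0.0), (5.0, 5.0, 5.0)]
table = build_table(model_frames, 1.0)
scores = vote(table, query, 1.0)
```

Using a cell size equal to the threshold means that any atom within MatchThres of a query atom must lie in one of the 27 neighbouring cells, so only those cells need to be inspected.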
When aligning two different kinds of
molecules, the algorithm is slightly modified in how it
creates the bases. For each atom whose coordinate is
P, select two atoms connected with that atom, and let
the coordinates of these two atoms be Q1 and Q2
respectively. The rule for constructing the basis is
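1. Normalize $\overrightarrow{PQ_1}$ to $\vec{e}_1$.
2. $\vec{e}_2 = \dfrac{\vec{e}_1 \times \overrightarrow{PQ_2}}{\lVert \vec{e}_1 \times \overrightarrow{PQ_2} \rVert}$
3. $\vec{e}_3 = \vec{e}_2 \times \vec{e}_1$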
and is illustrated in Figure 1(b). The origin of the
new coordinate frame is P. If an atom is connected
with n atoms, n × (n − 1) coordinate
frames would be made for this atom. The number of
coordinate frames constructed in this way is so large
that execution is inefficient. To reduce the
execution time, the criteria for selecting atoms to
create bases are listed in Table 1. With these criteria
we calculate two bases for each residue and four bases
for each nucleotide. In proteins, the
“CA” atom is
on the backbone and carries the side chain, and
the “CB” atom is the atom attached to it. In nucleic
acids, the “C4*” and “C3*” atoms occupy positions
similar to that of the “CA” atom in proteins, and
the “O4*” and “C2*” atoms occupy positions similar to
that of the “CB” atom in proteins. This is illustrated in
Figure 2.
Table 1. The rule for selecting atoms to
construct coordinate frames.
Type of molecule   Atom at P   Atom at Q1
Proteins           “CA”        “CB”
Nucleic acids      “C4*”       “O4*”
Nucleic acids      “C3*”       “C2*”
Figure 1. Calculation of a basis. (a) The protein structure. (b) The general molecule structure.
Figure 2. A sketch of molecules to explain the rule for coordinate frame construction. (a) Amino
acid. (b) Nucleotide.
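The frame counts quoted in Section 3.1 can be checked with a short sketch. It assumes that Q2 ranges over the remaining atoms bonded to P once P and Q1 are fixed by Table 1; this reading is an assumption, chosen because it reproduces the stated two bases per residue and four per nucleotide. The bond lists below are hypothetical toy inputs keyed by atom name.

```python
from itertools import permutations

# (P, Q1) atom-name pairs taken from Table 1.
SELECTION = {("CA", "CB"), ("C4*", "O4*"), ("C3*", "C2*")}

def all_frames(bonds):
    """Every (P, Q1, Q2) with Q1 != Q2 both bonded to P: n * (n - 1) frames per atom."""
    return [(p, q1, q2)
            for p, neighbours in bonds.items()
            for q1, q2 in permutations(neighbours, 2)]

def selected_frames(bonds):
    """Keep only the frames whose (P, Q1) pair is allowed by the selection rule."""
    return [f for f in all_frames(bonds) if (f[0], f[1]) in SELECTION]

# Toy heavy-atom bond lists for one amino acid and for part of a nucleotide sugar.
amino = {"N": ["CA"], "CA": ["N", "C", "CB"], "C": ["CA", "O"], "O": ["C"], "CB": ["CA"]}
sugar = {"C4*": ["C3*", "O4*", "C5*"], "C3*": ["C4*", "C2*", "O3*"]}

frames_at_ca = [f for f in all_frames(amino) if f[0] == "CA"]  # 3 * 2 = 6 frames
amino_selected = selected_frames(amino)                        # 2 frames
sugar_selected = selected_frames(sugar)                        # 4 frames
```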
3.2. Fine Tuning Process: Step Two
Once geometric hashing has produced an
approximate global alignment, a fine-tuning
process based on local optimization of the
overlapping parts follows. This step is necessary
because the 3D structural data in the PDB inevitably
contain measurement error from determining atom
positions by X-ray crystallography, and because
geometric hashing provides only an initial
alignment. The Iterative Closest Point (ICP) algorithm
[23] [24] is chosen for the fine tuning. As illustrated
in Figure 3, ICP is applied repeatedly until the number
of overlapping atoms within a given distance threshold
can be increased no more.
The ICP algorithm addresses a key
registration problem: given two three-dimensional
shapes, estimate the optimal translation
and rotation that register the two shapes by minimizing
the mean square distance between them. The algorithm
is guaranteed to converge to a local minimum of a mean
square objective function [23]. In our implementation,
we select the 100 rigid transformations that lead to
the largest numbers of overlapped pairs. The results
show that ICP indeed increases the number of atoms
matched.
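The fine-tuning loop can be sketched as follows. This is an illustrative stand-in rather than the tool's code: the rigid fit uses Horn's closed-form unit-quaternion solution with a shifted power iteration for the dominant eigenvector, the closest-atom search is brute force, and all function names and the three-pair minimum are assumptions.

```python
import math

def best_rigid_transform(src, dst):
    """Least-squares rot, t with rot * src[i] + t ~ dst[i] (unit-quaternion method)."""
    n = len(src)
    cs = [sum(p[i] for p in src) / n for i in range(3)]
    cd = [sum(q[i] for q in dst) / n for i in range(3)]
    s = [[sum((p[i] - cs[i]) * (q[j] - cd[j]) for p, q in zip(src, dst))
          for j in range(3)] for i in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    m = [[sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
         [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
         [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
         [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz]]
    # Shifted power iteration: converges to the eigenvector of the largest eigenvalue.
    shift = max(sum(abs(v) for v in row) for row in m)
    q = [1.0, 0.3, 0.2, 0.1]  # arbitrary start, assumed not orthogonal to the answer
    for _ in range(500):
        q = [sum(m[i][j] * q[j] for j in range(4)) + shift * q[i] for i in range(4)]
        norm = math.sqrt(sum(v * v for v in q))
        if norm == 0.0:
            q = [1.0, 0.0, 0.0, 0.0]  # degenerate input: fall back to the identity
            break
        q = [v / norm for v in q]
    w, x, y, z = q
    rot = [[w*w + x*x - y*y - z*z, 2*(x*y - w*z), 2*(x*z + w*y)],
           [2*(x*y + w*z), w*w - x*x + y*y - z*z, 2*(y*z - w*x)],
           [2*(x*z - w*y), 2*(y*z + w*x), w*w - x*x - y*y + z*z]]
    t = [cd[i] - sum(rot[i][j] * cs[j] for j in range(3)) for i in range(3)]
    return rot, t

def transform(rot, t, p):
    return tuple(sum(rot[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def overlapped_pairs(moving, fixed, thres):
    """Pair each moving atom with its closest fixed atom if within thres."""
    pairs = []
    for p in moving:
        q = min(fixed, key=lambda f: math.dist(p, f))
        if math.dist(p, q) <= thres:
            pairs.append((p, q))
    return pairs

def fine_tune(moving, fixed, thres, max_rounds=50):
    """Repeat ICP steps until the number of overlapped atoms stops increasing."""
    current = list(moving)
    best = len(overlapped_pairs(current, fixed, thres))
    for _ in range(max_rounds):
        pairs = overlapped_pairs(current, fixed, thres)
        if len(pairs) < 3:
            break  # too few correspondences for a stable rigid fit
        rot, t = best_rigid_transform([p for p, _ in pairs], [q for _, q in pairs])
        candidate = [transform(rot, t, p) for p in current]
        count = len(overlapped_pairs(candidate, fixed, thres))
        if count <= best:
            break
        current, best = candidate, count
    return current, best
```

Only atoms already within the threshold contribute to each fit, so the refinement is local: it starts from the approximate alignment and extends the overlapped region step by step.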
Figure 3. The flow chart for fine tuning
process.
4. Experimental Results
4.1. The Molecular Alignment Problem
Our tool can compare
two molecules that belong to different types. The data
and the problem of molecular mimicry (Figure 4,
Figure 5) were provided by Mr. Han Liang, a graduate
student in Professor Laura Landweber’s group in
the Dept. of Ecology and Evolutionary Biology, Princeton
University [25].
One data set [8] consists of EFG (Elongation
Factor-G) and EF-tu/tRNA (the complex of Elongation
Factor-Tu and tRNA); the orientations of the
original data are almost the same. The other data set [9]
consists of RRF (Ribosomal Recycling Factor) and
tRNA, which are not in the same orientation
originally. The alignment results for these two data sets
are shown in Figure 6 and Figure 7.
After calculation by our tool, the rotation matrix
between EFG and EF-tu/tRNA is
$$\begin{pmatrix} 0.984929 & 0.15437 & 0.0780061 \\ 0.160734 & 0.983483 & 0.0832201 \\ 0.0638711 & 0.0945041 & 0.993473 \end{pmatrix}$$
and the translation vector is
$$(6.83966,\ 0.935684,\ 2.09762).$$
For the case of RRF and tRNA, the rotation matrix
is
$$\begin{pmatrix} 0.626458 & 0.293322 & 0.722159 \\ 0.776578 & 0.314407 & 0.545962 \\ 0.0669089 & 0.902835 & 0.42475 \end{pmatrix}$$
and the translation vector is
$$(33.2718,\ 65.1501,\ 36.6722).$$
(Entries are given as magnitudes, in the order in which they appear in the source; any minus signs are not legible there.)
Figure 4. EFG vs. EF-tu/tRNA complex
(Nissen et al. 1995 show that the binding to the
ribosome is at the same place and orientation).
This picture is from Professor Laura
Landweber’s group, Dept. of Ecology and Evolutionary
Biology, Princeton University,
and the orientation is manually selected.
Figure 5. RRF vs. tRNA (Selmer et al. 1999
show that the binding to the ribosome is at a
different place and orientation); again,
the orientation is manually selected.
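A rotation matrix and translation vector of the kind reported in Section 4.1 act on atom coordinates as x' = Rx + t. The sketch below applies such a transformation and checks that a 3×3 matrix is a proper rotation. The convention (rotate first, then translate), the function names, and the example values are assumptions for illustration; the example matrix is hypothetical and is not one of the matrices reported in the paper.

```python
def apply_transform(rot, t, p):
    """Return rot * p + t for a row-major 3x3 matrix rot, translation t and point p."""
    return tuple(sum(rot[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def is_rotation(rot, tol=1e-6):
    """True if rot has orthonormal rows and determinant +1."""
    for i in range(3):
        for j in range(3):
            d = sum(rot[i][k] * rot[j][k] for k in range(3))
            if abs(d - (1.0 if i == j else 0.0)) > tol:
                return False
    det = (rot[0][0] * (rot[1][1] * rot[2][2] - rot[1][2] * rot[2][1])
           - rot[0][1] * (rot[1][0] * rot[2][2] - rot[1][2] * rot[2][0])
           + rot[0][2] * (rot[1][0] * rot[2][1] - rot[1][1] * rot[2][0]))
    return abs(det - 1.0) <= tol

# Hypothetical example: a 90-degree rotation about the z axis followed by a shift.
rot = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = (1.0, 2.0, 3.0)
moved = apply_transform(rot, t, (1.0, 0.0, 0.0))
```

A check such as `is_rotation` is a cheap way to confirm that a matrix copied from a report or a log file still has its signs and entry order intact before it is applied to a structure.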
4.2. Comparison with Other Alignment Tools
To compare with other tools, we use
the same set of proteins as in the paper of
Blankenbecler et al. [19]. Note that other protein
alignment methods usually rely on
1D sequence alignment of the proteins, and they
are optimized for proteins, focusing only on matching
backbone Cα atoms. Our tool makes no such
assumption and works for arbitrary molecules,
including tRNA. Still, for comparison purposes, we use
the same set of six protein pairs. Figure 8 shows that
our tool compares favorably with the other methods:
Figure 8(a) is reproduced from Blankenbecler et al.
[19], in which the Yale [26], Dali [27] [28], CE [17] and
Lund [19] methods are compared, while Figure 8(b)
shows the results of our tool for the same data.
Our method is better in two respects:
1. Given a fixed RMSD for pairs of matched
atoms, our method matches the largest number of
backbone Cα atoms;
2. Given a fixed number of matched Cα atoms, our method
has the lowest RMSD.
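The two quantities used in this comparison, the number of matched atom pairs N and their RMSD, can be computed as in the sketch below. The greedy one-to-one pairing rule and the function names are assumptions for illustration; the paper does not specify how pairs are selected for the reported figures.

```python
import math

def match_atoms(a_coords, b_coords, thres):
    """Greedy one-to-one pairing of atoms closer than thres, shortest pairs first."""
    candidates = sorted(
        (math.dist(p, q), i, j)
        for i, p in enumerate(a_coords)
        for j, q in enumerate(b_coords)
        if math.dist(p, q) <= thres
    )
    used_a, used_b, pairs = set(), set(), []
    for d, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j, d))
    return pairs

def rmsd(pairs):
    """Root mean square deviation over the matched pairs."""
    if not pairs:
        raise ValueError("no matched pairs")
    return math.sqrt(sum(d * d for _, _, d in pairs) / len(pairs))

# Toy coordinates after superposition, in angstroms (hypothetical values).
a_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
b_coords = [(0.0, 0.0, 0.3), (1.0, 0.4, 0.0), (9.0, 9.0, 9.0)]
pairs = match_atoms(a_coords, b_coords, 1.0)
n_matched = len(pairs)
value = rmsd(pairs)
```

Reporting N and RMSD together matters because either one can be improved trivially at the expense of the other: a looser threshold raises N and RMSD together, and a tighter one lowers both.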
In terms of computation cost, the major cost is in
the first step, geometric hashing. In the case of
proteins, the coordinate frames are generated from the
amino acid Cα atoms only, and thus the computation
cost is low. For the six pairs of target proteins, with
MatchThres = 5.0 Å, the alignment calculations take
from 6 seconds to 47 seconds. Table 2 shows the
computation time on a Pentium-4 3 GHz PC.
In the case of molecules such as RNA and DNA,
the nucleic acid has a carbon ring in its base, and
therefore the number of possible coordinate frames
tends to be much larger than for proteins.
Consequently, the computation time is longer. In the
case of RRF vs. tRNA, where tRNA has over 1000 atoms,
the computation time is around 24 minutes, while in the
case of EFG vs. the EF-tu/tRNA complex (over 4000
atoms), the computation time can be as long as 36
hours on the same 3 GHz PC. Even so, our tool can
still solve this problem, the important
problem of molecular mimicry. As far as we
know, our method is the first to solve this kind of
problem, because our algorithm is sequence
independent and does not use knowledge of 1D
sequence similarity between the molecules.
5. Conclusion
A novel tool has been developed to align two molecules
based on 3D structural data. Compared with other
algorithms, our tool takes more computation time to
align two molecules; however, other tools may
be restricted to aligning two proteins. The experiments,
conducted on data from the PDB,
demonstrate that the proposed tool is useful and
versatile.
The first experiment is the molecular alignment
problem. Given two molecules, our tool generates
the rotation matrix and translation vector that
optimally align them. In our
experiments, the results are the same no matter how
the molecules are randomly placed, at different
locations and in different orientations.
Figure 6: Alignment of two molecules using our tool for EFG vs. EF-tu/tRNA complex, where the
atom number is over 4000 and the computation time is about 36 hours on a Pentium-4 3GHz PC.
Figure 7: Alignment of two molecules using our tool for RRF vs. tRNA, where the atom number
is over 1000 and the computation time is about 24 minutes on a Pentium-4 3GHz PC.
Figure 8: Alignment results for a set of protein pairs in terms of RMSD of matched atom pairs
and number of aligned atoms (N). In this figure, (a) is from Blankenbecler et al. fuzzy alignment
method. The results from Yale (red squares), Dali (green triangles), CE (blue circles), and Lund
method (solid lines) are also given in their paper. (b) is from our tool as a comparison. It shows
that our results are better as compared with other methods.
Table 2: Computation time of alignment of six pairs of proteins, where MatchThres means the
threshold used in initial geometric hashing, while the other columns are in seconds.
MatchThres (Å)  8DFR-4DFRa  1MBD-1MBA  1TIE-4FGF  1CID-2RHE  7FABl2-1REIa  1FXIa-1UBQ
1.0             7           5          4          3          2             1
1.5             9           6          5          5          2             1
2.0             11          7          6          5          3             1
2.5             13          10         8          7          3             2
3.0             18          12         9          9          4             3
3.5             22          16         13         11         5             3
4.0             30          20         15         13         7             3
4.5             37          26         20         18         8             6
5.0             47          34         24         22         10            6
In the second experiment, several protein pairs are
used to compare our results with those of four popular
alignment methods, namely the Yale [26], Dali [27] [28],
CE [17] and Lund [19] methods. Our tool performs
best in terms of RMSD and number of matched atom
pairs.
6. References
[1] D.Y. Chen, X.P. Tian, Y.T. Shen, and M. Ouhyoung, “On
visual similarity based 3D model retrieval”, Comput. Graph.
Forum, 22(3), 2003, pp. 223-232.
[2] T. Funkhouser, P. Min, M. Kazhdan, J. Chen, A.
Halderman, D. Dobkin, and D. Jacobs, “A search engine for
3D models”, ACM T. Graphics, 22(1), Jan. 2003, pp. 83-105.
[3] L. Holm and C. Sander, “Touring protein fold space with
Dali/FSSP”, Nucl. Acids Res., 26, 1998, pp. 316-319.
[4] C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B.
Swindells, and J.M. Thornton, “CATH - a hierarchic
classification of protein domain structures”, Structure, 5(8),
Aug. 1997, pp. 1093-1108.
[5] A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia,
“SCOP: a structural classification of proteins for the
investigation of sequences and structures”, J. Mol. Biol., 247,
1995, pp. 536-540.
[6] L. Holm and C. Sander, “Mapping the protein universe”,
Science, 273, Aug. 1996, pp. 595-602.
[7] P. Nissen, M. Kjeldgaard, and J. Nyborg,
“Macromolecular mimicry”, EMBO J., 19, 2000, pp. 489-495.
[8] P. Nissen, M. Kjeldgaard, S. Thirup, G. Polekhina, L.
Reshetnikova, B.F.C. Clark, and J. Nyborg, “Crystal
structure of the ternary complex of Phe-tRNA^Phe, EF-Tu, and
a GTP analog”, Science, 270, Dec. 1995, pp. 1464-1472.
[9] M. Selmer, S. Al-Karadaghi, G. Hirokawa, A. Kaji, and A.
Liljas, “Crystal structure of Thermotoga maritima ribosome
recycling factor: a tRNA mimic”, Science, 286, Dec. 1999,
pp. 2349-2352.
[10] V. Cappello, A. Tramontano, and U. Koch,
“Classification of proteins based on the properties of the
ligand-binding site: the case of adenine-binding proteins”,
Proteins, 47(2), May 2002, pp. 106-115.
[11] G.R. Smith and M.J. Sternberg, “Prediction of protein-
protein interactions by docking methods”, Curr. Opin. Struct.
Biol., 12(1), Feb. 2002, pp. 28-35.
[12] R.H. Lathrop, “The protein threading problem with
sequence amino acid interaction preferences is NP-complete”,
Protein Eng., 7, 1994, pp. 1059-1068.
[13] D. Fischer, O. Bachar, R. Nussinov, and H. Wolfson,
“An efficient automated computer vision based technique for
detection of three dimensional structural motifs in proteins”,
J. Biomol. Struct. Dyn., 9(4), Feb. 1992, pp. 769-789.
[14] C.J. Tsai, S.L. Lin, H. Wolfson, and R. Nussinov,
“Techniques for searching for structural similarities between
protein cores, protein surfaces and between protein-protein
interfaces”, Techniques in Protein Chemistry, VII, 1996, pp.
419-429.
[15] X. Pennec and N. Ayache, “An O(n^2) algorithm for 3D
substructure matching of proteins”, Shape and Pattern
Matching in Computational Biology - Proc. First Int.
Workshop, 1994, pp. 25-40.
[16] X. Pennec and N. Ayache, “A geometric algorithm to
find small but highly similar 3D substructures in proteins”,
Bioinformatics, 14(6), 1998, pp. 516-522.
[17] I.N. Shindyalov and P.E. Bourne, “Protein structure
alignment by incremental combinatorial extension (CE) of
the optimal path”, Protein Eng., 11(9), Sep. 1998, pp. 739-
747.
[18] A. Zemla, “LGA: A method for finding 3D similarities
in protein structures”, Nucleic Acids Res., 31(13), Jul. 2003,
pp. 3370-3374.
[19] R. Blankenbecler, M. Ohlsson, C. Peterson, and M.
Ringner, “Matching protein structures with fuzzy
alignments”, Proc. Natl. Acad. Sci. USA., 100(21), Oct. 2003,
pp. 11936-11940.
[20] M. Milik, S. Szalma, and K.A. Olszewski, “Common
structural cliques: a tool for protein structure and function
analysis”, Protein Eng., 16(8), Aug. 2003, pp. 543-552.
[21] Y. Lamdan and H.J. Wolfson, “Geometric hashing: a
general and efficient model-based recognition scheme”,
Proceedings of the Second ICCV, 1988, pp. 238-249.
[22] H.J. Wolfson and I. Rigoutsos, “Geometric hashing: an
overview”, IEEE Comput. Sci. Eng., 4, 1997, pp. 10-21.
[23] P.J. Besl and N.D. McKay, “A method for registration of
3-D shapes”, IEEE T. Pattern Anal., 14, 1992, pp. 239-256.
[24] Z. Zhang, “Iterative point matching for registration of
free-form curves and surfaces”, Int. J. Comput. Vision, 13(2),
1994, pp. 119-152.
[25] H. Liang and L.F. Landweber, “Computational tests of
molecular mimicry between tRNA and protein translation
factors”, submitted, 2004.
[26] M. Gerstein and M. Levitt, “Using iterative dynamic
programming to obtain accurate pairwise and multiple
alignments of protein structures”, Proc. Int. Conf. Intell. Syst.
Mol. Biol., 4, 1996, pp. 59-67.
[27] L. Holm and C. Sander, “Protein structure comparison
by alignment of distance matrices”, J. Mol. Biol., 233, 1993,
pp. 123-138.
[28] L. Holm and J. Park, “DaliLite workbench for protein
structure comparison”, Bioinformatics, 16, 2000, pp. 566-567.
... Comparison of structures can be done indirectly, for example in DALI [6] are compared not actual structures, but their distance matrices. Skeletons of proteins are compared in algorithms CE [4], MaxSub [7], 3dSEARCH [8]. Probabilistic and statistic methods are used in MATRAS [11] and Lgscore [10]. ...
Article
A new algorithm is proposed for protein alignment based on their 3D structures. Initial protein matching uses comparison of their secondary structure elements. Each element of secondary structure is described by a vector representing its principal axis of inertia. All possible pairs of vectors from both proteins are chosen to compute similarity function values. A number of matched vector pairs with the best values of similarity function are considered as initial matches. Then, for all initial matches Fractional Iterative Closest Point algorithm is applied to improve correspondence between proteins.
... The purpose of PSC is to identify maxima equivalent C α atoms upon which to align the 3D structures of compared proteins optimally. Previously proposed PSC algorithms exploit many different computing approaches including Monte Carlo [9], dynamic programming [7, 17, 23, 24], 3D clustering [25], graph theory [29], spline approximation [4] and geometric hashing [5]. Today some non-sequential PSC algorithms are proposed [14, 27]. ...
Conference Paper
Full-text available
Many protein structure comparison methods use secondary structure information to do fast structure similarity search for initial alignment finding and refine the results from possible optimal candidate solutions by iteratively dynamic programming to optimize the final results. In this paper, we develop a method, Ellipsoidal Model Protein Structure Comparison, based on the concept of secondary structure elements alignment followed by iteratively refinement. In order to utilize all possible structure information to obtain alternative solutions for further analysis, we use ellipsoidal model to represent not only mainly -helices and -sheets, but the remaining fragments for structural alignment. Different heuristic filters and geometric hashing based global alignment estimation are applied for quick finding better initial alignments. We also provide top-N solutions without increasing extra computational time rather than only best solution in the previous works. Now, we provide the online web service, Ballerina (http://ballerina.csie.ntu.edu.tw/), for protein structure comparison.
... Second, we next superimposed two partner structures for each protein–RNA mimicry case (EF-G vs. EF-Tu–tRNA; RRF vs. tRNA). The optimal orientations of two structures in the superimposition were calculated by the BIND2 program, which applied geometric hashing to find globally maximal matching of two molecules (Chang et al. 2004). To be more cautious, the superimpositions determined manually were then used to confirm the best geometry alignment of two partner structures. ...
Article
Full-text available
With rapidly increasing availability of three-dimensional structures, one major challenge for the post-genome era is to infer the functions of biological molecules based on their structural similarity. While quantitative studies of structural similarity between the same type of biological molecules (e.g., protein vs. protein) have been carried out intensively, the comparable study of structural similarity between different types of biological molecules (e.g., protein vs. RNA) remains unexplored. Here we have developed a new bioinformatics approach to quantitatively study the structural similarity between two different types of biopolymers--proteins and RNA--based on the spatial distribution of conserved elements. We applied it to two previously proposed tRNA-protein mimicry pairs whose functional relatedness between two molecules has been recently determined experimentally. Our method detected the biologically meaningful signals, which are consistent with experimental evidence.
... The purpose of PSC is to identify maxima equivalent C α atoms upon which to align the 3D structures of compared proteins optimally. Previously proposed PSC algorithms exploit many different computing approaches including Monte Carlo[9], dynamic programming[7,17,23,24], 3D clustering[25], graph theory[29], spline approximation[4]and geometric hashing[5]. Today some non-sequential PSC algorithms are proposed[14,27]. ...
Article
Full-text available
A methodology for performing sequence-free comparison of functional sites in protein structures is introduced. The method is based on a new notion of similarity among superimposed groups of amino acid residues that evaluates both geometry and physico-chemical properties. The method is specifically designed to handle disconnected and sparsely distributed sets of residues. A genetic algorithm is employed to find the superimposition of protein segments that maximizes their similarity. The method was evaluated by performing an all-to-all comparison on two separate sets of ligand-binding sites, comprising 47 protein-FAD (Flavin-Adenine Dinucleotide) and 64 protein-NAD (Nicotinamide-Adenine Dinucleotide) complexes, and comparing the results with those of an existing sequence-based structural alignment tool (TM-Align). The quality of the two methodologies is judged by the methods’ capacity to, among other, correctly predict the similarities in the protein-ligand contact patterns of each pair of binding sites. The results show that using a sequence-free method significantly improves over the sequence-based one, resulting in 23 significant binding-site homologies being detected by the new method but ignored by the sequence-based one.
Article
Full-text available
This paper proposes a new method EMPSC for the well-known PSC (Protein Structure Comparison) problem. The proposed method EMPSC is a protein structural alignment algorithm based on ellipsoidal model abstraction. We segment the protein 3D structure into two different kinds of structures, including Secondary Structure Elements recognized by DSSP 1 and other coil/loop structures. These SSEs will be the initial alignment center for obtaining the transformation coordinate systems. Different heuristic filters and geometric hashing based global alignment estimation are used for quick finding better initial alignments. In the refined alignment stage of analysis, a standard refinement algorithm is invoked to fine-tune the alignment outputted by the first stage. Our experimental results reveal that EMPSC generally achieves comparable accuracy and better performance in comparison with the existing PSC algorithms. Moreover, we analyzed the factors that affect the EMPSC performance and SSE-based PSC algorithms. Further investigation in multiple protein structure comparison and local structure comparison will be continued.
Conference Paper
Structure alignment could help to find shape similarities between proteins and guide structure classification and fold recognition. Common substructure detection and extraction are especially important, for which could guide the biologist to discover binding site or active site. We represent each segment of alpha-carbon backbone by using dihedral angles and curve moment invariants. Then, local and global structure alignment could be performed by iterative closest point algorithm. Maximum common substructures between a pair of proteins or within a protein could be found. Active sites also could be detected by the proposed algorithm.
Conference Paper
A web-based three-dimensional (3D) protein retrieval system is available for protein structure data, covering the entire PDB and FSSP datasets. In this system, we use a visual-based matching method to compare protein structures from multiple viewpoints. Each query takes less than three seconds, with 90% accuracy on average. Availability: The web-based query interface and downloadable files can be accessed via http://3d.csie.ntu.edu.tw/ProteinRetrieval/ Supplementary information: Further details of the proposed method are available at http://graphics.csie.ntu.edu.tw/~jsyeh/3Dprotein/
Article
A large number of 3D models are created and available on the Web, since more and more 3D modelling and digitizing tools are developed for ever increasing applications. The techniques for content-based 3D model retrieval then become necessary. In this paper, a visual similarity-based 3D model retrieval system is proposed. This approach measures the similarity among 3D models by visual similarity, and the main idea is that if two 3D models are similar, they also look similar from all viewing angles. Therefore, one hundred orthogonal projections of an object, excluding symmetry, are encoded both by Zernike moments and Fourier descriptors as features for later retrieval. The visual similarity-based approach is robust against similarity transformation, noise, model degeneracy etc., and provides 42%, 94% and 25% better performance (precision-recall evaluation diagram) than three other competing approaches: (1) the spherical harmonics approach developed by Funkhouser et al., (2) the MPEG-7 Shape 3D descriptors, and (3) the MPEG-7 Multiple View Descriptor. The proposed system is on the Web for practical trial use (http://3d.csie.ntu.edu.tw), and the database contains more than 10,000 publicly available 3D models collected from WWW pages. Furthermore, a user friendly interface is provided to retrieve 3D models by drawing 2D shapes. The retrieval is fast enough on a server with Pentium IV 2.4 GHz CPU, and it takes about 2 seconds and 0.1 seconds for querying directly by a 3D model and by hand drawn 2D shapes, respectively. Categories and Subject Descriptors (according to ACM CCS): H.3.1 [Information Storage and Retrieval]: Indexing Methods
Article
A heuristic method has been developed for registering two sets of 3-D curves obtained by using an edge-based stereo system, or two dense 3-D maps obtained by using a correlation-based stereo system. Geometric matching in general is a difficult unsolved problem in computer vision. Fortunately, in many practical applications, some a priori knowledge exists which considerably simplifies the problem. In visual navigation, for example, the motion between successive positions is usually approximately known. From this initial estimate, our algorithm computes observer motion with very good precision, which is required for environment modeling (e.g., building a Digital Elevation Map). Objects are represented by a set of 3-D points, which are considered as the samples of a surface. No constraint is imposed on the form of the objects. The proposed algorithm is based on iteratively matching points in one set to the closest points in the other. A statistical method based on the distance distribution is used to deal with outliers, occlusion, appearance and disappearance, which allows us to do subset-subset matching. A least-squares technique is used to estimate 3-D motion from the point correspondences, which reduces the average distance between points in the two sets. Both synthetic and real data have been used to test the algorithm, and the results show that it is efficient and robust, and yields an accurate motion estimate.
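The core loop described in this abstract (match each point to its closest counterpart, then solve a least-squares rigid motion, and repeat) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes NumPy, uses brute-force nearest-neighbour search, solves the least-squares step with the SVD-based Kabsch method, and omits the statistical outlier handling the abstract mentions. The function names `best_rigid_transform` and `icp` are hypothetical.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst.

    src and dst are (n, 3) arrays of already-paired points (Kabsch method).
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def icp(src, dst, max_iter=50, tol=1e-9):
    """Iteratively align src to dst; returns (aligned points, mean distance)."""
    cur = np.asarray(src, dtype=float).copy()
    dst = np.asarray(dst, dtype=float)
    prev_err = np.inf
    for _ in range(max_iter):
        # Squared distances between every current point and every target point.
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        err = np.sqrt(d2[np.arange(len(cur)), nearest]).mean()
        if abs(prev_err - err) < tol:
            break                               # no further improvement
        R, t = best_rigid_transform(cur, dst[nearest])
        cur = cur @ R.T + t
        prev_err = err
    return cur, err
```

Like any closest-point scheme, this sketch only converges to the right answer when the starting pose is roughly correct, which is why the abstract relies on an approximately known initial motion estimate.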
Article
This chapter discusses techniques for searching for structural similarities among protein cores, among protein surfaces and among protein–protein interfaces. The chapter shows that, for comparisons of protein structures, incorporating connectivity considerations into the matching procedure further improves the geometric hashing technique. The protein structure comparison techniques developed so far fall into three categories: (1) techniques based on dynamic programming, (2) techniques that match the 3D structures of fragments of contiguous amino acids, and (3) techniques derived from computer vision. Geometric hashing is a highly efficient tool for structural comparisons of proteins and for docking. Because it originates in computer vision, it matches unconnected points in space. This enables matching protein structures in a manner that is entirely independent of their amino acid sequence order, and carrying out docking of a ligand onto a receptor surface. It avoids a time-consuming search of the entire conformational space by matching the points in a transformation-invariant manner. Consequently, high-quality matches are obtained in short times.
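The transformation-invariant matching described in this abstract can be illustrated with a small sketch of plain geometric hashing on 3D points. It is a simplified example under stated assumptions, not the chapter's method: it assumes NumPy, enumerates every ordered triple of points as a basis (practical only for tiny point sets), uses an arbitrary bin size, and leaves out the connectivity considerations the chapter adds. The names `frame_coords`, `build_table` and `vote` are hypothetical.

```python
import itertools
from collections import Counter, defaultdict

import numpy as np


def frame_coords(points, i, j, k):
    """Coordinates of all points in the orthonormal frame set by points i, j, k.

    Returns None when the three basis points are (nearly) collinear.
    """
    origin = points[i]
    x = points[j] - origin
    x_len = np.linalg.norm(x)
    if x_len < 1e-9:
        return None
    x = x / x_len
    z = np.cross(x, points[k] - origin)
    z_len = np.linalg.norm(z)
    if z_len < 1e-9:
        return None
    z = z / z_len
    y = np.cross(z, x)
    return (points - origin) @ np.stack([x, y, z]).T


def quantize(coords, bin_size):
    """Map each coordinate triple to an integer bin usable as a hash key."""
    return [tuple(v) for v in np.round(coords / bin_size).astype(int).tolist()]


def build_table(model, bin_size=0.25):
    """Preprocessing: hash every non-basis point under every valid basis."""
    table = defaultdict(set)
    for basis in itertools.permutations(range(len(model)), 3):
        coords = frame_coords(model, *basis)
        if coords is None:
            continue
        for idx, key in enumerate(quantize(coords, bin_size)):
            if idx not in basis:
                table[key].add(basis)
    return table


def vote(table, scene, scene_basis, bin_size=0.25):
    """Recognition: count, per model basis, the scene bins it also contains."""
    coords = frame_coords(scene, *scene_basis)
    if coords is None:
        return Counter()
    keys = {key for idx, key in enumerate(quantize(coords, bin_size))
            if idx not in scene_basis}
    votes = Counter()
    for key in keys:
        for basis in table.get(key, ()):
            votes[basis] += 1
    return votes
```

Because the hashed coordinates are expressed relative to a basis taken from the points themselves, a rotated and translated copy of the model votes for the same table entries, with no search over rotations or translations.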
Article
To facilitate understanding of, and access to, the information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. It also provides, for each entry, links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references. Two search facilities are available. The homology search permits users to enter a sequence and obtain a list of any structures to which it has significant levels of sequence similarity. The key word search finds, for a word entered by the user, matches from both the text of the scop database and the headers of Brookhaven Protein Databank structure files. The database is freely accessible on the World Wide Web (WWW) with an entry point at URL http://scop.mrc-lmb.cam.ac.uk/scop/. scop: an old English poet or minstrel (Oxford English Dictionary); ckon: pile, accumulation (Russian Dictionary).
Article
As the number of 3D models available on the Web grows, there is an increasing need for a search engine to help people find them. Unfortunately, traditional text-based search techniques are not always effective for 3D data. In this paper, we investigate new shape-based search methods. The key challenges are to develop query methods simple enough for novice users and matching algorithms robust enough to work for arbitrary polygonal models. We present a web-based search engine system that supports queries based on 3D sketches, 2D sketches, 3D models, and/or text keywords. For the shape-based queries, we have developed a new matching algorithm that uses spherical harmonics to compute discriminating similarity measures without requiring repair of model degeneracies or alignment of orientations. It provides 46-245% better performance than related shape matching methods during precision-recall experiments, and it is fast enough to return query results from a repository of 20,000 models in under a second. The net result is a growing interactive index of 3D models available on the Web (i.e., a Google for 3D models).