MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms

Gemma L. Holliday*, Daniel E. Almonacid¹, Gail J. Bartlett, Noel M. O’Boyle¹, James W. Torrance, Peter Murray-Rust¹, John B. O. Mitchell¹ and Janet M. Thornton
EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK and ¹Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
Received August 4, 2006; Revised September 18, 2006; Accepted October 1, 2006
ABSTRACT
MACiE (Mechanism, Annotation and Classification in Enzymes) is a database of enzyme reaction mechanisms, and is publicly available as a web-based data resource. This paper presents the first release of a web-based search tool to explore enzyme reaction mechanisms in MACiE. We also present Version 2 of MACiE, which doubles the dataset available (from Version 1). MACiE can be accessed from http://www.ebi.ac.uk/thornton-srv/databases/MACiE/
INTRODUCTION
Enzymes are proteins that catalyse the repertoire of chemical reactions found in nature, and as such are vitally important molecules. What is so fascinating about these proteins is that they have a wonderful diversity and can carry out highly complex chemical conversions under physiological conditions and retain their stereospecificity and regiospecificity, unlike many organic chemical reactions. They range in size and can have molecular weights of several thousand to several million Daltons, and still they can catalyse reactions on molecules as small as carbon dioxide or nitrogen, or as large as a complete chromosome.
Although enzymes are large molecules, the actual catalysis only takes place in a small cavity, the active site. It is here that a small number of amino acid residues contribute to catalytic function, and where the substrates bind. With the advent of structure determination methods for proteins and by using clever chemical/biochemical experimental design, scientists have been able to propose catalytic mechanisms for many enzymes. Although a great deal of knowledge exists for enzymes, including their structures, gene sequences, mechanisms, metabolic pathways and kinetic data, it tends to be spread between many different databases and throughout the literature. Most web resources relating to enzymes [such as BRENDA (1), KEGG (2), the IUBMB Enzyme Nomenclature website (http://www.chem.qmul.ac.uk/iubmb/enzyme/) (3) and IntEnz (4)] focus on the overall reaction, accompanied in some cases by a textual or graphical description of the mechanism. However, this does not allow for detailed in silico searching of the chemical steps which take place in the reaction. MACiE (5) combines detailed stepwise mechanistic information [including 2-D animations (6)], a wide coverage of both chemical space and the protein structure universe, and the chemical intelligence of the Chemical Markup Language for Reactions (CMLReact) (7). This usefully complements both the mechanistic detail of the Structure–Function Linkage Database (SFLD) for a small number of rather ‘promiscuous’ enzyme superfamilies (8) and the wider coverage with less chemical detail provided by EzCatDB (9), which also contains a limited number of 3D animations. Entries in MACiE are linked, where appropriate, to all of these related data resources.
*To whom correspondence should be addressed. Tel: +44 1223 492535; Fax: +44 1223 494486; Email: gemma@ebi.ac.uk
Present address: Gail J. Bartlett, Division of Mathematical Biology, National Institute of Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK
© 2006 The Author(s).
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Published online 1 November 2006. Nucleic Acids Research, 2007, Vol. 35, Database issue D515–D520
doi:10.1093/nar/gkl774
DATASET AND CONTENT
The dataset for MACiE version 2 was devised to increase the enzyme reaction space coverage of MACiE while trying to keep structural homology to a minimum. Each entry added in the new version was selected so that it fulfils the following criteria:
(i) The EC sub-subclass was not previously in MACiE.
(ii) There is a three-dimensional crystal structure of the enzyme deposited in the Protein Data Bank (wwPDB) (10).
(iii) There is a mechanism available from the primary literature which explains most of the observed experimental results.
(iv) The enzyme is unique at the H level of the CATH code (11), unless the homologue already in MACiE has a significantly different chemical mechanism.
Using the above criteria, MACiE was expanded from 100 entries in version 1 to a total of 202 entries, which span 199 EC numbers (version 1 spanned 96 EC numbers) and cover a total of 862 reaction steps. There are almost 4000 EC numbers defined, but the number of different reaction mechanisms needed to bring about all these overall transformations is not clear. For example, the serine protease family of proteins has many different substrates, but the mechanisms are broadly similar. In contrast, the β-lactamase enzymes, which have the same EC number, have four completely different mechanisms. Within the EC code, the fourth digit usually defines the substrate specificity, which can be very variable in large enzyme families, but the reaction mechanisms for enzymes with the same first three digits are usually essentially the same. In total there are 224 EC sub-subclasses, with only 181 having known structures (12). Of these, MACiE covers 158, i.e. 87%. However, there are probably many more mechanisms that are yet to be defined or discovered.
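The coverage figure quoted above can be checked with a short calculation. The following is a minimal sketch in Python; the function name `coverage_percent` is an illustrative assumption, and only the counts 158, 181 and 224 come from the text.

```python
# Counts quoted in the text; everything else here is illustrative.
TOTAL_SUB_SUBCLASSES = 224      # EC sub-subclasses defined
WITH_KNOWN_STRUCTURE = 181      # sub-subclasses with a known structure
COVERED_BY_MACIE = 158          # of those, covered by MACiE version 2


def coverage_percent(covered: int, total: int) -> float:
    """Return covered/total expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * covered / total


macie_coverage = coverage_percent(COVERED_BY_MACIE, WITH_KNOWN_STRUCTURE)
print(f"MACiE covers {macie_coverage:.0f}% of sub-subclasses with known structures")
```

The denominator is the number of sub-subclasses with known structures (181), not the full set of 224, because criterion (ii) requires a deposited crystal structure.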
As can be seen from Figure 1, MACiE covers a good
proportion of the EC reaction space, with an average relative
difference between the size of corresponding EC classes
of 4%, with the transferases having the largest difference.
When the coverage with respect to EC code present in
the PDB is examined, it can be seen that MACiE again
represents the coverage of enzymes with known structures
very well, with an average relative difference between the
corresponding EC classes in MACiE of 5%.
All entries in MACiE contain overall reaction annotation including the information detailed in Table 1. Each elementary reaction or step within an entry is fully annotated, as detailed in Figure 2; this includes comments that have been added by the annotators. An extension of the content from MACiE Version 1 is the addition of inferred return steps. These are explicitly labelled as inferred in the comment field and are necessary to return the enzyme to a state where it is ready to undergo another round of catalysis.
There is sometimes more than one proposed mechanism that is consistent with the available experimental data. In MACiE, we have attempted not only to choose the best supported mechanism, but also where possible to annotate enzymes with reasonable alternative mechanisms. Unfortunately, in the current release such annotations are only available as comments on the stage or overall reaction, although future releases of MACiE will include full entries for these alternatives.
Further details of the annotation process and a glossary of terms used can be found on the MACiE website (http://www.ebi.ac.uk/thornton-srv/databases/MACiE/documentation/ and http://www.ebi.ac.uk/thornton-srv/databases/MACiE/glossary.html, respectively).
DATABASE STRUCTURE
The challenge with MACiE has been to capture and usefully represent all the different catalytic steps that occur during the course of an enzymatic reaction. These reactions may consist of any number of steps, and in MACiE we have reactions ranging from 1 step to 16 steps. The representation of these reactions has evolved from a flat file entered in a commercially available chemical database program (ISIS/Base) to the highly structured and powerful CMLReact (7), which is an application of XML (the eXtensible Markup Language). The final step in this evolution has been the conversion of the CMLReact into the relational database format of MySQL.
The hierarchical structure of CMLReact facilitates this conversion. The conversion relies on the CML Schema and requires the MACiE entries to be consistent with the Schema, which adds an internal consistency check to our authoring process.
Figure 1. EC wheels showing the EC coverage of MACiE Version 2 (left), the complete EC space (centre) and the coverage of EC space in the PDB by unique EC serial numbers (right).
Table 1. Overall reaction annotation content
Catalysis and reaction specific information | Non-catalysis specific information
Enzyme name (common IUPAB/JCBN name) | PDB code
EC code | Non-catalytic domain CATH code
Catalytic residues involved | Non-catalytic UniProt code
Cofactors involved | Species name (common and scientific)
Reactants and products | Other database identifiers, e.g. EzCatDB, SFLD, etc.
Catalytic domain CATH code | Literature references
Catalytic UniProt code |
Bonds involved, formed, cleaved, changed in order |
Reactive centres |
Overall reaction comments |
Each CML tag-type becomes an MySQL table; each tag
becomes a row in that MySQL table; each attribute of that
tag corresponds to a column in the MySQL table. The tree
structure of the CML is preserved in the MySQL version;
for each row of each table, there are columns specifying
which row of which other table corresponds to the row’s
parent tag in the CML version.
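The mapping just described (tag type to table, tag to row, attribute to column, plus parent references) can be illustrated with a small sketch. This is not the MACiE conversion code: the XML fragment is invented, SQLite from the Python standard library stands in for MySQL, and the column names `parent_table`, `parent_row` and the `attr_` prefix are assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented, simplified CMLReact-like fragment (not a real MACiE entry).
CML = """
<reactionScheme id="demo">
  <reaction id="step1">
    <reactant id="r1" title="substrate"/>
    <product id="p1" title="intermediate"/>
  </reaction>
  <reaction id="step2">
    <reactant id="r2" title="intermediate"/>
  </reaction>
</reactionScheme>
"""


def load(xml_text, conn):
    """Store each tag type as a table, each tag as a row, each attribute as a column."""
    root = ET.fromstring(xml_text)
    # First pass: collect the attribute names used by each tag type.
    columns = {}
    for elem in root.iter():
        known = columns.setdefault(elem.tag, [])
        for name in elem.attrib:
            if name not in known:
                known.append(name)
    for tag, attrs in columns.items():
        cols = ["row_id INTEGER PRIMARY KEY", "parent_table TEXT", "parent_row INTEGER"]
        cols += [f'"attr_{a}" TEXT' for a in attrs]
        conn.execute(f'CREATE TABLE "{tag}" ({", ".join(cols)})')

    # Second pass: insert rows, recording which row of which table is the parent.
    def insert(elem, parent_table, parent_row):
        names = list(elem.attrib)
        col_sql = "".join(f', "attr_{n}"' for n in names)
        marks = ", ?" * len(names)
        cur = conn.execute(
            f'INSERT INTO "{elem.tag}" (parent_table, parent_row{col_sql}) '
            f"VALUES (?, ?{marks})",
            [parent_table, parent_row, *elem.attrib.values()],
        )
        row_id = cur.lastrowid
        for child in elem:
            insert(child, elem.tag, row_id)

    insert(root, None, None)


conn = sqlite3.connect(":memory:")
load(CML, conn)
rows = conn.execute(
    "SELECT attr_id, parent_table, parent_row FROM reactant ORDER BY row_id"
).fetchall()
print(rows)
```

Because every row records its parent table and parent row, the original tree can be rebuilt from the tables, which is the property the text attributes to the MySQL version.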
The CML version of MACiE, which is the official archive
version, is available from the website as individual entries,
and the new website uses the relational version of MACiE
to perform the online analysis and searching.
DATABASE FEATURES
The original release of MACiE contained static images and annotation for the overall reaction and each step associated with the mechanism; it also included an animated reaction mechanism for approximately half the reactions then in MACiE. Links to various related resources, such as the RCSB PDB (13), IUBMB nomenclature database, CATH, EzCatDB, PDBSum (14), BRENDA, the Catalytic Site Atlas (15), KEGG and the Enzyme Structures Database, were also included. This new release extends these links to include the Macromolecular Structures Database (MSD) (16), SFLD, UniProt (17), and replaces the IUBMB nomenclature database links with links to IntEnz. The new features in MACiE are detailed in the following sections.
Searching MACiE
There are two levels of search implemented in MACiE. The basic level searches are implemented from the main page (http://www.ebi.ac.uk/thornton-srv/databases/MACiE) and are mainly for accessing the entries from the top level, i.e. for searching entries in MACiE by EC code, enzyme name, etc. The complex searches are all available from the query pages of MACiE (http://www.ebi.ac.uk/thornton-srv/databases/MACiE/queryMACiE.html) and are mainly for searching for specific mechanisms, mechanism components or residues and their functions in the reaction steps, although there are some overall reaction searches implemented as well. Table 2 lists the searches available in MACiE and the Supplementary Data contain a detailed listing of the searches available.
Table 2. Searches available in MACiE
Basic | Complex
MACiE entry identifier | Species name (overall annotation)
Current EC codes | Overall reactants and products
Obsolete EC codes | Reaction comments (overall reactions and steps)
Catalytic Domain CATH codes | Amino acid residues (up to six residues)
All CATH codes | Step mechanisms and/or mechanism components (single and combinations of)
PDB code | Chemical changes
Enzyme name | Chemical changes with mechanism or mechanism components
Catalytic Domain UniProt Codes | Chemical changes with amino acid residues
All UniProt Codes | Amino acid residues with mechanism or mechanism components
 | Chemical changes with amino acid residues and mechanisms or mechanism components
 | Alternative mechanisms
Figure 2. An example of the annotation found in a MACiE entry. Reaction shown corresponds to fructose-bisphosphate aldolase (entry 52).
Figure 3. EC code search heuristics.
The following sections describe searching by EC code,
PDB code or enzyme name, all of which use heuristics to
extend the coverage of MACiE.
EC code. The EC code search implemented in MACiE is detailed in Figure 3 and can be accessed at any point in the scheme shown. The search for current EC numbers will always walk up the EC code tree until it finds a match, no matter at what level the search is entered. Thus the search will always return a result. However, it should be noted that the higher up the EC hierarchy the search has gone, the less likely it is that the returned mechanism will be a match to the query. As the EC code of enzymes may change over time, a search for obsolete EC codes has also been implemented. It works in the same way as the current EC code search, although it will not always return a result.
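The walk up the EC code tree can be sketched as follows. This is an illustration of the idea only, not the MACiE implementation: the index contents, the entry labels and the function name `search_current_ec` are invented for the example.

```python
# Toy index, invented for illustration: EC code -> entry labels.
TOY_INDEX = {
    "4.1.2.13": ["entry-A"],
    "4.1.2.4": ["entry-B"],
    "3.5.2.6": ["entry-C"],
}


def search_current_ec(query, index):
    """Walk up the EC tree from the queried level until some entry matches.

    Returns (level, entries): level 4 is a serial-number match, 3 a
    sub-subclass match, 2 a subclass match, 1 a class match, 0 no match.
    """
    parts = query.split(".")
    for level in range(len(parts), 0, -1):
        prefix = parts[:level]
        matches = sorted(
            entry
            for code, entries in index.items()
            if code.split(".")[:level] == prefix
            for entry in entries
        )
        if matches:
            return level, matches
    return 0, []


print(search_current_ec("4.1.2.13", TOY_INDEX))  # exact serial-number match
print(search_current_ec("4.1.2.99", TOY_INDEX))  # falls back to sub-subclass 4.1.2
print(search_current_ec("3.1.1.1", TOY_INDEX))   # falls back to class 3
```

The returned level makes the caveat in the text explicit: a match found at level 1 or 2 is much less likely to share the query's mechanism than one found at level 3 or 4. In this toy index a class with no entries returns level 0; the statement that the search always returns a result presumably relies on every EC class being represented in MACiE.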
If no matches are found at the serial number level of the
EC code, an advanced search option will allow the user to
search for a structural homologue of an enzyme with a
given EC code, which is shown in Figure 4 and described
below. This advanced search option takes the entered EC
code and finds the PDB codes of all of the matches to that
EC code in the Catalytic Site Atlas (CSA). A homology
search is then performed on those PDB codes for a match
in MACiE. This homology search is described in more detail
in the following section.
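As a rough illustration of the advanced search, the two lookups can be chained as below. The dictionaries are placeholders invented for this sketch (including the PDB codes and entry label); the real search queries the CSA and MACiE databases.

```python
# Placeholder data, invented for illustration only.
CSA_PDB_BY_EC = {"3.5.2.6": ["1abc", "2xyz"]}       # EC code -> PDB codes listed in the CSA
MACIE_ENTRY_BY_HOMOLOGUE = {"2xyz": "entry-A"}      # PDB code -> MACiE entry it is homologous to


def advanced_ec_search(ec_code, csa_by_ec, macie_by_pdb):
    """Map an EC code to CSA structures, then keep those with a homologue in MACiE."""
    hits = {}
    for pdb_code in csa_by_ec.get(ec_code, []):
        entry = macie_by_pdb.get(pdb_code)
        if entry is not None:
            hits[pdb_code] = entry
    return hits


print(advanced_ec_search("3.5.2.6", CSA_PDB_BY_EC, MACIE_ENTRY_BY_HOMOLOGUE))
```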
The CSA is a database of catalytic residues in proteins of known structure. It contains much less mechanistic information than MACiE, but has a considerably wider coverage of protein structures than MACiE does. This wider coverage is partly because the CSA contains not only manually annotated entries, but also contains entries that are automatically annotated based on sequence alignment to the manual entries.
Figure 4. Advanced EC search heuristics.
Figure 5. PDB search heuristics.
Figure 6. Enzyme name search heuristics.
PDB code. There are over 19 000 crystal structures relating to enzymes deposited in the PDB. As MACiE entries require extensive literature searching and analysis, only a small fraction of these PDB entries are covered explicitly, 202 in total. However, we have used the CSA to identify homologues of these enzymes, extending this coverage to 7528 PDB codes. Figure 5 details the search performed in MACiE, when a protein structure described by a PDB code is entered.
Although the entries returned by this search will be homologues, this does not guarantee that the mechanism and the catalytic residue assignments are the same. This is because the homology method (see below) can retrieve very distant relatives. Owing to this limitation, all homologous entries are compared by EC code, and when there is a divergence between the MACiE entry and the homologue at the serial number level, this is clearly indicated to the user. We also list the amino acid residues that are annotated as catalytic in both MACiE and the CSA. Thus it is clear if there is any difference between EC numbers and catalytic residues. If the EC number differs but the catalytic residues between query and homologue are of identical types, it can be inferred that the mechanisms are likely to be the same, but where both differ, the mechanisms are unlikely to be transferable. From the results page we link both to the MACiE entry and the CSA entry.
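The comparison of EC numbers and catalytic residues described above amounts to a small decision rule. The sketch below is one possible reading of it; the function name, the returned dictionary and the use of residue-type multisets are assumptions, and the case of identical EC numbers with differing residues is not discussed in the text, so it is only flagged.

```python
from collections import Counter


def compare_homologue(macie_ec, homologue_ec, macie_residues, homologue_residues):
    """Flag EC divergence and residue differences between a MACiE entry and a homologue.

    Residues are given as lists of residue types, e.g. ["Ser", "Lys", "Glu"].
    """
    same_ec = macie_ec == homologue_ec
    same_types = Counter(macie_residues) == Counter(homologue_residues)
    if same_ec:
        note = "EC numbers agree at the serial number level"
    elif same_types:
        note = "EC numbers differ, residue types match: mechanism likely the same"
    else:
        note = "EC numbers and residues both differ: mechanism unlikely to be transferable"
    return {"ec_diverges": not same_ec, "residues_differ": not same_types, "note": note}


result = compare_homologue("3.5.2.6", "3.5.2.8", ["Ser", "Lys", "Glu"], ["Glu", "Ser", "Lys"])
print(result["note"])
```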
Homology in MACiE. We have been working to bring MACiE and the CSA closer together. This includes using the CSA to determine homologues (those enzymes which are evolutionarily related) of entries in MACiE. The CSA finds homologues using a PSI-BLAST search (with an E-value cut-off of 0.0005 and five iterations) against all sequences currently in the PDB, plus all sequences in a non-redundant subset of UniProt. The UniProt sequences are included purely in order to increase the range of the PSI-BLAST search by bridging gaps between distantly related sequences in the PDB; only sequences occurring in the PDB are retrieved for entry into the CSA. In the CSA, and thus MACiE, homologous entries are only included if the residues which align with the catalytic residues in the parent literature entry are identical in residue type. In other words, there must be no mutations at the catalytic residue positions. There are, however, a few exceptions to this rule:
(i) In order to allow for the many active site mutants in the
PDB, one (and only one) catalytic residue per site can be
different in type from the equivalent in the parent
literature entry. This is only permissible if all residue
spacing is identical to that in the parent literature entry,
and there are at least two catalytic residues.
(ii) Sites with only one catalytic residue are permitted to be
mutant provided that the residue number is identical to
that in the parent entry.
(iii) Fuzzy matching of residues is permitted within the
following groups: [V,L,I], [F,W,Y], [S,T], [D,E], [K,R],
[D,N], [E,Q], [N,Q]. This fuzzy matching cannot be used
in combination with rules (i) or (ii) above.
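The acceptance rule and its three exceptions can be expressed as a single predicate. The sketch below is an interpretation rather than the CSA code: a site is modelled as a list of (one-letter residue type, residue number) pairs in aligned order, and "residue spacing" is assumed to mean the differences between consecutive residue numbers. The function names are illustrative.

```python
# Fuzzy-matching groups listed in rule (iii).
FUZZY_GROUPS = [set("VLI"), set("FWY"), set("ST"), set("DE"),
                set("KR"), set("DN"), set("EQ"), set("NQ")]


def fuzzy_equal(a, b):
    """True if two residue types are identical or share a fuzzy group."""
    return a == b or any(a in g and b in g for g in FUZZY_GROUPS)


def spacing(site):
    """Differences between consecutive residue numbers (assumed meaning of 'spacing')."""
    numbers = [number for _, number in site]
    return [b - a for a, b in zip(numbers, numbers[1:])]


def accept_homologue(parent, candidate):
    """Apply the identity rule and exceptions (i)-(iii) to aligned catalytic sites."""
    if len(parent) != len(candidate) or not parent:
        return False
    mismatches = [i for i, (p, c) in enumerate(zip(parent, candidate)) if p[0] != c[0]]
    if not mismatches:
        return True                      # default rule: identical residue types
    if all(fuzzy_equal(parent[i][0], candidate[i][0]) for i in mismatches):
        return True                      # rule (iii), used on its own
    if len(parent) == 1:
        return parent[0][1] == candidate[0][1]   # rule (ii): same residue number
    # Rule (i): exactly one mutant residue, identical spacing, at least two residues.
    return len(mismatches) == 1 and spacing(parent) == spacing(candidate)


parent_site = [("S", 70), ("K", 73), ("E", 166)]
print(accept_homologue(parent_site, [("A", 70), ("K", 73), ("E", 166)]))
```

Because rule (i) is only reached when exactly one position differs by exact type, and rule (iii) only when every differing position is a fuzzy match, the two are never combined, as the text requires.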
Figure 8. Frequency distribution of amino acid residues. This shows the frequency of catalytic amino acid residues in MACiE (blue), versus the frequency of residues in MACiE (cyan), versus the frequency of residues in the wwPDB (red). The frequency of catalytic amino acid residues in MACiE is calculated by taking the number of residues (of a given type) annotated in MACiE divided by the total number of annotated residues in MACiE, multiplied by 100.
Figure 7. Growth of MACiE. This shows the growth in the number of EC codes (blue), EC sub-sub classes (cyan) and catalytic domain CATH codes (red) in MACiE.
Enzyme name. This is currently implemented as a partial string match; thus entering ‘beta’ will return all the β-lactamases and betaine-aldehyde dehydrogenase. If no results are returned from the partial name search, then the name search heuristics (shown in Figure 6) are implemented. This search utilizes the IntEnz database (4). MACiE searches for a name in IntEnz, either a synonym, alternative name or common name, and returns the EC code of that name. The EC code is then used to search MACiE. If no matches are found to the sub-subclass level of the EC code, the user is offered an advanced EC code search (see Figure 4).
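The two-stage name search can be sketched as below. The entry labels and the synonym table are invented stand-ins for MACiE and IntEnz, and the function name `search_name` is an assumption; only the overall flow (partial match first, then name to EC code to entries) follows the text.

```python
# Toy stand-ins, invented for illustration: entry label -> (enzyme name, EC code).
TOY_ENTRIES = {
    "entry-1": ("beta-lactamase", "3.5.2.6"),
    "entry-2": ("betaine-aldehyde dehydrogenase", "1.2.1.8"),
    "entry-3": ("fructose-bisphosphate aldolase", "4.1.2.13"),
}
# Toy stand-in for an IntEnz synonym lookup: name -> EC code.
TOY_SYNONYMS = {"penicillinase": "3.5.2.6"}


def search_name(query, entries, synonyms):
    """Partial, case-insensitive name match; fall back to a synonym -> EC lookup."""
    needle = query.lower()
    hits = sorted(label for label, (name, _) in entries.items() if needle in name.lower())
    if hits:
        return hits
    ec_code = synonyms.get(needle)
    if ec_code is None:
        return []
    # Try the full EC code first, then the sub-subclass (first three digits).
    for level in (4, 3):
        prefix = ec_code.split(".")[:level]
        hits = sorted(label for label, (_, ec) in entries.items()
                      if ec.split(".")[:level] == prefix)
        if hits:
            return hits
    return []  # at this point the text offers the advanced EC code search


print(search_name("beta", TOY_ENTRIES, TOY_SYNONYMS))
print(search_name("penicillinase", TOY_ENTRIES, TOY_SYNONYMS))
```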
Statistics
The other major development in MACiE has been the
inclusion of database statistics that are all generated on the
fly from the SQL tables. A full listing of the statistics
available can be found in the Supplementary Data. The
growth of MACiE is shown in Figure 7 in terms of EC
coverage and CATH coverage.
The statistics in MACiE can also be used to examine the
function and distribution of amino acid residues (G.L. Holliday,
D.E. Almonacid, J.M. Thornton and J.B.O. Mitchell,
manuscript in preparation) (see Figure 8), the distribution of
mechanism and mechanism components and the bond order
changes occurring in each step of the reaction.
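The residue frequency defined in the caption of Figure 8 (count of a residue type divided by the total number of annotated residues, multiplied by 100) can be computed as in this minimal sketch. The residue list is invented for illustration and does not reflect MACiE data.

```python
from collections import Counter

# Invented list of annotated catalytic residues (one item per annotation).
annotated = ["His", "Asp", "His", "Ser", "Glu", "His", "Asp", "Cys"]


def residue_frequencies(residues):
    """Return {residue type: percentage of all annotated residues}."""
    counts = Counter(residues)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {res: 100.0 * n / total for res, n in counts.items()}


freqs = residue_frequencies(annotated)
for residue, percent in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{residue}: {percent:.1f}%")
```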
FUTURE DEVELOPMENTS
MACiE is a continually developing resource, and in the future we hope to include 3D data, together with various statistics and searches related to the analysis of these data. We will also continue to extend the coverage of MACiE to include alternative reaction mechanisms that have been suggested for various enzymes, as well as new mechanisms. Finally, we intend to build a user interface that will allow chemical diagrams to be drawn and used to search MACiE, to provide a more usable entry process, and to implement the classification of enzyme mechanisms that we are developing.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
We would like to thank the EPSRC (G.L.H. and J.B.O.M.), BBSRC (G.J.B. and J.M.T.: CASE studentship in association with Roche Products Ltd; N.M.O.B. and J.B.O.M.: grant BB/C51320X/1), the Wellcome Trust, EMBL, IBM (G.L.H. and J.M.T.), the Chilean Government’s Ministerio de Planificación y Cooperación and the Cambridge Overseas Trust (D.E.A.) for funding and Unilever for supporting the Centre for Molecular Science Informatics. J.W.T. is funded by a European Molecular Biology Laboratory studentship, and is also affiliated with Cambridge University Department of Chemistry. Funding to pay the Open Access publication charges for this article was provided by the Wellcome Trust.
Conflict of interest statement. None declared.
REFERENCES
1. Schomburg,I., Chang,A., Ebeling,C., Gremse,M., Heldt,C., Huhn,G.
and Schomburg,D. (2004) BRENDA, the enzyme database: updates
and major new developments. Nucleic Acids Res., 32,
D431–D433.
2. Kanehisa,M., Goto,S., Kawashima,S., Okuno,Y. and Hattori,M. (2004)
The KEGG resource for deciphering the genome. Nucleic Acids Res.,
32, D277–D280.
3. IUBMB (2005) Recommendations of the Nomenclature Committee of
the International Union of Biochemistry and Molecular Biology on the
nomenclature and classification of enzyme-catalysed reactions.
4. Fleischmann,A., Darsow,M., Degtyarenko,K., Fleischmann,W.,
Boyce,S., Axelsen,K., Bairoch,A., Schomburg,D., Tipton,K.F. and
Apweiler,R. (2004) IntEnz, the integrated relational enzyme database.
Nucleic Acids Res., 32, D434–D437.
5. Holliday,G.L., Bartlett,G.J., Almonacid,D.E., O’Boyle,N.M.,
Murray-Rust,P., Thornton,J.M. and Mitchell,J.B.O. (2005) MACiE: a
database of enzyme reaction mechanisms. Bioinformatics, 21,
4315–4316.
6. Holliday,G.L., Mitchell,J.B.O. and Murray-Rust,P. (2004) CMLSnap:
animated reaction mechanisms. Internet J. Chem., 7, Article 4.
7. Holliday,G.L., Murray-Rust,P. and Rzepa,H.S. (2006) Chemical
Markup, XML, and the World Wide Web. 6. CMLReact, an
XML vocabulary for chemical reactions. J. Chem. Inf. Model., 46,
145–157.
8. Pegg,S.C.-H., Brown,S.D., Ojha,S., Seffernick,J., Meng,E.C.,
Morris,J.H., Chang,P.J., Huang,C.C., Ferrin,T.E. and Babbitt,P.C.
(2006) Leveraging enzyme structure–function relationships for
functional inference and experimental design: the Structure–Function
Linkage Database. Biochemistry, 45, 2545–2555.
9. Nagano,N. (2005) EzCatDB: the Enzyme Catalytic-mechanism
DataBase. Nucleic Acids Res., 33, D407–D412.
10. Berman,H.M., Henrick,K. and Nakamura,H. (2003) Announcing
the worldwide Protein Data Bank. Nature Struct. Biol.,
10, 980.
11. Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and
Thornton,J.M. (1997) CATH—a hierarchic classification of protein
domain structures. Structure, 5, 1093–1108.
12. Martin,A.C. (2004) PDBSprotEC: a Web-accessible database linking
PDB chains to EC numbers via SwissProt. Bioinformatics, 20,
986–988.
13. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N.,
Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data
Bank. Nucleic Acids Res., 28, 235–242.
14. Laskowski,R.A., Chistyakov,V.V. and Thornton,J.M. (2005)
PDBsum more: new summaries and analyses of the known 3D
structures of proteins and nucleic acids. Nucleic Acids Res., 33,
D266–D268.
15. Porter,C.T., Bartlett,G.J. and Thornton,J.M. (2004) The Catalytic Site
Atlas: a resource of catalytic sites and residues identified in enzymes
using structural data. Nucleic Acids Res., 32, D129–D133.
16. Golovin,A., Oldfield,T.J., Tate,J.G., Velankar,S., Barton,G.J.,
Boutselakis,H., Dimitropoulos,D., Fillon,J., Hussain,A., Ionides,J.M.
et al. (2004) E-MSD: an integrated data resource for bioinformatics.
Nucleic Acids Res., 32, D211–D216.
17. Bairoch,A., Apweiler,R., Wu,C.H., Barker,W.C., Boeckmann,B.,
Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M. et al. (2005)
The Universal Protein Resource (UniProt). Nucleic Acids Res., 33,
D154–D159.
D520 Nucleic Acids Research, 2007, Vol. 35, Database issue
... To check the prediction on zinc dependence of the enzymes that catalyze formation and breakdown of chemical bonds, as well as group transfer reactions, we have analyzed the involvement of transition metals as cofactors in different enzyme classes according to the enzyme descriptions in the International Union of Biochemistry and Molecular Biology (IUBMB) Enzyme Nomenclature, as documented, among others, in the ExporEnz database [177,178]. Since we needed specific information on metals as catalysts, and not just as structural elements, we searched the MACiE (Mechanism, Annotation and Classification in Enzymes) database [179,180] and its recently described Metal-MACiE supplement, a database of metal-based reaction mechanisms [181,182]. Andreini and coworkers [181] compared metal-containing enzymes listed in these databases with metal-containing proteins in the PDB and showed that the relative occurrence of catalytic metals in Metal-MACiE matched well that in the PDB. ...
... An interesting observation in this section is the absence of Fe in any RNA structures, and the authors' interpretation that iron is likely to catalyze RNA cleavage makes sense. However, the data that are cited in support of this idea, on ribozyme activation by iron (ref.179), logically suggest the opposite of their argument, namely, that iron is tolerated by RNA molecules, at least molecules like ribozymes that seem to be most relevant for the RNA World.Authors' response: This is a very important point, so we have revisited the experimental protocols in ref.[332] (no.179 in the original manuscript). ...
... However, the data that are cited in support of this idea, on ribozyme activation by iron (ref.179), logically suggest the opposite of their argument, namely, that iron is tolerated by RNA molecules, at least molecules like ribozymes that seem to be most relevant for the RNA World.Authors' response: This is a very important point, so we have revisited the experimental protocols in ref.[332] (no.179 in the original manuscript). In these experiments, it was checked whether diverse metals could activate a particular ribozyme, which was pre-selected for the ability to be activated by Mn 2+ , Co 2+ , Ni 2+ , Zn 2+ and Cd 2+ . ...
Chapter
Full-text available
The accompanying article (A.Y. Mulkidjanian, Biology Direct 4:26) puts forward a detailed hypothesis on the role of zinc sulfide (ZnS) in the origin of life on Earth. The hypothesis suggests that life emerged within compartmentalized, photosynthesizing ZnS formations of hydrothermal origin (the Zn world), assembled in sub-aerial settings on the surface of the primeval Earth.
... In the present work, we compiled 2 datasets macie-254 and csalit-688. Macie-254 is derived from the MACiE (mechanism, annotation and classification in enzymes) database [37], which provides manually curated list of catalytic residues with their putative roles in mechanistic steps of an enzymatic reaction. From 335 MACiE entries, enzymes having catalytic site defined in single pdb chain were used to prepare a non-redundant set of 254 proteins at 60% sequence identity using CD-HIT [38]. ...
Article
Full-text available
Background: Knowledge of catalytic residues can play an essential role in elucidating mechanistic details of an enzyme. However, experimental identification of catalytic residues is a tedious and time-consuming task, which can be expedited by computational predictions. Despite significant development in active-site prediction methods, one of the remaining issues is ranked positions of putative catalytic residues among all ranked residues. In order to improve ranking of catalytic residues and their prediction accuracy, we have developed a meta-approach based method CSmetaPred. In this approach, residues are ranked based on the mean of normalized residue scores derived from four well-known catalytic residue predictors. The mean residue score of CSmetaPred is combined with predicted pocket information to improve prediction performance in meta-predictor, CSmetaPred_poc. Results: Both meta-predictors are evaluated on two comprehensive benchmark datasets and three legacy datasets using Receiver Operating Characteristic (ROC) and Precision Recall (PR) curves. The visual and quantitative analysis of ROC and PR curves shows that meta-predictors outperform their constituent methods and CSmetaPred_poc is the best of evaluated methods. For instance, on CSAMAC dataset CSmetaPred_poc (CSmetaPred) achieves highest Mean Average Specificity (MAS), a scalar measure for ROC curve, of 0.97 (0.96). Importantly, median predicted rank of catalytic residues is the lowest (best) for CSmetaPred_poc. Considering residues ranked ≤20 classified as true positive in binary classification, CSmetaPred_poc achieves prediction accuracy of 0.94 on CSAMAC dataset. Moreover, on the same dataset CSmetaPred_poc predicts all catalytic residues within top 20 ranks for ~73% of enzymes. Furthermore, benchmarking of prediction on comparative modelled structures showed that models result in better prediction than only sequence based predictions. 
These analyses suggest that CSmetaPred_poc is able to rank putative catalytic residues at lower (better) ranked positions, which can facilitate and expedite their experimental characterization. Conclusions: The benchmarking studies showed that employing meta-approach in combining residue-level scores derived from well-known catalytic residue predictors can improve prediction accuracy as well as provide improved ranked positions of known catalytic residues. Hence, such predictions can assist experimentalist to prioritize residues for mutational studies in their efforts to characterize catalytic residues. Both meta-predictors are available as webserver at: http://14.139.227.206/csmetapred/ .
... M-CSA, as well as MACiE (22)(23)(24) and CSA (25)(26)(27), were created to capture and organize these and other kinds of mechanistic data available in the literature, and to make them available in a standardised and computer readable format for the community. Additionally, these datasets have been helpful to explore overall themes related to enzyme mechanisms such as the evolution of new chemical function and the roles of specific catalytic residues, cofactors, and metal ions in the chemistry of life (28)(29)(30). ...
Article
Full-text available
M-CSA (Mechanism and Catalytic Site Atlas) is a database of enzyme active sites and reaction mechanisms that can be accessed at www.ebi.ac.uk/thornton-srv/m-csa. Our objectives with M-CSA are to provide an open data resource for the community to browse known enzyme reaction mechanisms and catalytic sites, and to use the dataset to understand enzyme function and evolution. M-CSA results from the merging of two existing databases, MACiE (Mechanism, Annotation and Classification in Enzymes), a database of enzyme mechanisms, and CSA (Catalytic Site Atlas), a database of catalytic sites of enzymes. We are releasing M-CSA as a new website and underlying database architecture. At the moment, M-CSA contains 961 entries, 423 of these with detailed mechanism information, and 538 with information on the catalytic site residues only. In total, these cover 81% (195/241) of third level EC numbers with a PDB structure, and 30% (840/2793) of fourth level EC numbers with a PDB structure, out of 6028 in total. By searching for close homologues, we are able to extend M-CSA coverage of PDB and UniProtKB to 51 993 structures and to over five million sequences, respectively, of which about 40% and 30% have a conserved active site.
... FunTree allows users to explore the evolution of enzyme function through sequence, structure, phylogenetic, and functional information. Structural protein domains from CATH are searched against the MACiE database [53] to identify those with enzymatic function. MACiE uses expert manual curation to identify residues involved in catalytic functions. ...
Chapter
This chapter describes the generation of the data in the CATH-Gene3D online resource and how it can be used to study protein domains and their evolutionary relationships. Methods will be presented for: comparing protein structures, recognizing homologs, predicting domain structures within protein sequences, and subclassifying superfamilies into functionally pure families, together with a guide on using the webpages.
... Information in the MACiE [28] database (https://www.ebi.ac.uk/thornton-srv/databases/MACiE/) and the scientific literature [29] (Fig 2) reveals differences in sequence motifs between the Class A (MACiE entry M0002), C (M0257) and D (M0210) serine beta-lactamases, involving residues that perform the catalytic mechanism of action. ...
Article
Beta-lactamases represent the main bacterial mechanism of resistance to beta-lactam antibiotics and are a significant challenge to modern medicine. We have developed an automated classification and analysis protocol that exploits structure- and sequence-based approaches and which allows us to propose a grouping of serine beta-lactamases that more consistently captures and rationalizes the existing three classification schemes: Classes (A, C and D, which vary in their implementation of the mechanism of action); Types (that largely reflect evolutionary distance measured by sequence similarity); and Variant groups (which largely correspond with the Bush-Jacoby clinical groups). Our analysis platform exploits a suite of in-house and public tools to identify Functional Determinants (FDs), i.e. residue sites, responsible for conferring different phenotypes between different classes, different types and different variants. We focused on Class A beta-lactamases, the most highly populated and clinically relevant class, to identify FDs implicated in the distinct phenotypes associated with different Class A Types and Variants. We show that our FunFHMMer method can separate the known beta-lactamase classes and identify those positions likely to be responsible for the different implementations of the mechanism of action in these enzymes. Two novel algorithms, ASSP and SSPA, allow detection of FD sites likely to contribute to the broadening of the substrate profiles. Using our approaches, we recognise 151 Class A types in UniProt. Finally, we used our beta-lactamase FunFams and ASSP profiles to detect 4 novel Class A types in microbiome samples. Our platforms have been validated by literature studies, in silico analysis and some targeted experimental verification. Although developed for the serine beta-lactamases, they could be used to classify and analyse any diverse protein superfamily where sub-families have diverged over both long and short evolutionary timescales.
Preprint
Data-driven modeling has emerged as a new paradigm for biocatalyst design and discovery. Biocatalytic databases that integrate enzyme structure and function data are in urgent need. Here, we described IntEnzyDB as an integrated structure-kinetics database for facile statistical modeling and machine learning. IntEnzyDB employs a relational architecture with flattened data structure, which allows rapid data operation. This architecture also makes it easy for IntEnzyDB to incorporate more types of enzyme function data. IntEnzyDB contains enzyme kinetics and structure data from six enzyme commission classes. Using 1019 enzyme structure-kinetics pairs, we investigated the efficiency-perturbing propensity for mutations that are close or distal to the active site. The statistical results show that efficiency-enhancing mutations are globally encoded; deleterious mutations are much more likely to occur in close mutations than in distal mutations. Finally, we described a web interface that allows public users to access enzymology data stored in IntEnzyDB. IntEnzyDB will provide a computational facility for data-driven modeling in biocatalysis and molecular evolution.
Article
Data-driven modeling has emerged as a new paradigm for biocatalyst design and discovery. Biocatalytic databases that integrate enzyme structure and function data are in urgent need. Here we describe IntEnzyDB as an integrated structure-kinetics database for facile statistical modeling and machine learning. IntEnzyDB employs a relational database architecture with a flattened data structure, which allows rapid data operation. This architecture also makes it easy for IntEnzyDB to incorporate more types of enzyme function data. IntEnzyDB contains enzyme kinetics and structure data from six enzyme commission classes. Using 1050 enzyme structure-kinetics pairs, we investigated the efficiency-perturbing propensities of mutations that are close or distal to the active site. The statistical results show that efficiency-enhancing mutations are globally encoded and that deleterious mutations are much more likely to occur in close mutations than in distal mutations. Finally, we describe a web interface that allows public users to access enzymology data stored in IntEnzyDB. IntEnzyDB will provide a computational facility for data-driven modeling in biocatalysis and molecular evolution.
Article
Metal coordination with proteinaceous ligands has greatly expanded the chemical toolbox of proteins and their biological roles. The structure and function of natural metalloproteins have been determined according to the physicochemical properties of metal ions bound to the active sites. Concurrently, amino acid sequences are optimized for metal coordination geometry and/or dedicated action of metal ions in proteinaceous environments. In some occasions, however, natural enzymes exhibit promiscuous reactivity with more than one designated metal ion, under in vitro and/or in vivo conditions. In this review, we discuss selected examples of metalloenzymes that bind various first-row, mid- to late-transition metal ions for their native catalytic activities. From these examples, we suggest that environmental, inorganic, and biochemical factors, such as bioavailability, native organism, cellular compartment, reaction mechanism, binding affinity, protein sequence, and structure, might be responsible for determining metal selectivity or promiscuity. The current work proposes how natural metalloproteins might have emerged and adapted for specific metal incorporation under the given circumstances and may provide insights into the design and engineering of de novo metalloproteins.
Chapter
Three-dimensional (3D) motifs are patterns of local structure associated with function, typically based on residues in binding or catalytic sites. Protein structures of unknown function can be annotated by comparing them to known 3D motifs. Many methods have been developed for identifying 3D motifs and for searching structures for their occurrence. Approaches vary in the type and amount of input evidence, how the motifs are described and matched, whether the results include a measure of statistical significance, and how the motifs relate to function. Compared to algorithm development, less progress has been made in providing publicly searchable databases of 3D motifs that are both functionally specific and cover a broad range of functions. A roadblock has been the difficulty of generating detailed structure-function classifications; instead, automated, large-scale studies have relied upon pre-existing classifications of either structure or function. Complementary to 3D motif methods are approaches focused on molecular surface descriptions, global structure (fold) comparisons, predicting interactions with other macromolecules, and identifying physiological substrates by docking databases of small molecules.
Book
Systems biology focuses on complex interactions within biological systems, using a holistic approach. Why is the whole greater than the sum of its parts? Because the parts interact, making the whole an emergent characteristic of the parts and their interactions with each other. High-throughput studies of biological systems are rapidly accumulating a wealth of 'omics'-scale data. Visualization is a key aspect of both the analysis and understanding of these data. It is common to describe biological processes as pathway diagrams. The pathway nodes represent the participating molecules in the biological process (genes, proteins, metabolites etc.) and edges connecting the nodes describe the relationship between the participants (reactions, interactions etc.). In this thesis, I have focused on metabolic pathways describing the metabolic processes of an organism. Metabolic pathways are series of chemical reactions occurring within a cell. Although all chemical reactions are technically reversible, conditions in the cell are often such that it is thermodynamically more favorable for a reaction to flow in one direction. High-throughput technologies exist for measuring expression of genes, and abundances of proteins and metabolites. Transcriptomics datasets are freely available from online databases, notably ArrayExpress and GEO, where datasets can be searched based on tissue of interest, disease of interest, organism of interest etc. Pathway diagrams are also available from various online databases; among them are WikiPathways and Reactome. Pathway diagrams can be used to integrate and co-analyze the different layers of data to have a complete overview of the biological process. Pathway analysis software is available for performing such analyses. PathVisio is a widely adopted pathway editing, visualization, and statistics tool. PathVisio can furthermore be used for drawing pathway diagrams. 
The genes, proteins, and metabolites in the pathway diagrams can be annotated with unique identifiers from online databases. To visualize data on the diagrams, the uploaded data must also be annotated with database identifiers. There are various online gene, protein, and metabolite databases. Identifiers from almost any of them can be used to annotate diagrams and datasets. PathVisio works together with BridgeDB to make the mapping between database identifiers easier, and identifier mapping databases are available which can map the gene or gene product related identifiers of one gene product from many online databases to each other. Such a mapping database is also available for metabolites. These identifier mapping databases are what allow mapping data onto the diagram and visualizing it using colours. However, by visualizing data about nodes alone, we are missing a key component to complete the picture: the data about interactions. Not many experimental techniques exist to measure metabolic fluxes, i.e. the reactions that actually occur in the cell as an end result of the transcriptional, translational, and regulatory effects in a cell. Metabolic fluxes are therefore often estimated through modelling. Mathematical models are created in which equations represent the reactions in the in-silico cell. There are various techniques for analysing these mathematical models to obtain metabolic flux values through the different reactions in the model. Even though mathematical models are an excellent tool for simulating the dynamic reactions occurring within cells, they are notoriously difficult to correct, share, and update. Pathway diagrams, on the other hand, are widely considered useful for representing a process, while maintaining the knowledge about the topology of the process. 
Creating pathway diagrams of mathematical models would not only allow modellers to better understand and update their models; it would also enable modellers and biologists to collaborate better and share knowledge. In this thesis we describe a software plugin for PathVisio that makes this workflow possible. The PathSBML plugin was developed in collaboration with Sriharsha Pamu as part of the Google Summer of Code 2013 program where I served as a mentor. It converts computational models commonly encoded in the Systems Biology Markup Language (SBML) to pathway diagrams encoded in the Graphical Pathway Markup Language (GPML) format used by WikiPathways and PathVisio. The plugin also allows a direct import of models available from the open access database Biomodels.org. This enables visualizing a model as a pathway diagram, running that model on the online Biomodels website or in other modelling software and visualizing the model output on the model’s diagram as described in this thesis. However, enabling flux visualization required development of more components. In order to visualize flux data on the reactions and interactions of a pathway in PathVisio, the possibility to annotate the lines signifying such interactions in a pathway was created. Changes to the core of the PathVisio software and the data model for saving a pathway diagram were made in order to allow that. This enabled storing the annotation information about reactions/interactions, similar to how that was already possible for the nodes of the pathway diagrams, i.e. the genes, proteins, and metabolites. For mapping uploaded data onto the diagram an identifier mapping database is needed as described above, which is why a new BridgeDb derby database was created for mapping reaction and interaction identifiers from the different online data sources. The mappings were obtained from the Open Access Database Rhea. 
Additionally, the IntViz plugin for PathVisio was developed in collaboration with Rhizhou Guo from the Eindhoven University of Technology as part of his Master’s thesis. This plugin adds Visualization options for interactions. Rule based and gradient based visualization options are now available for visualizing data on the reactions and interactions in a pathway. This plugin also has a slider feature that allows visualizing time series data by sliding through time. However, in order to include flux data in pathway analysis and perform a meaningful analysis on a genome scale level, a large number of pathways with annotated interactions are necessary. Most interactions in WikiPathways pathways are not annotated yet, but the pathways in Reactome are. A Java based converter was created that converts Reactome pathways to the GPML format. This allows Reactome to take advantage of the community curation model of the WikiPathways community, in addition to performing pathway analysis using PathVisio, and newly including flux data, additionally to transcriptomics, proteomics, and metabolomics data. This allows combined statistical pathway analysis (combined enrichment scores) and the results to be quantitatively visualized using PathVisio. This integration will give a more complete overview of key players in a given biological process. This thesis has extended the pathway analysis software PathVisio’s capabilities by a complete toolset, enabling the integrations and visualization of interaction data. It has added to the wealth of knowledge available through WikiPathways by adding the human and plant collections of Reactome pathways. This improves pathway analysis capabilities by adding new genes, and new proteins to WikiPathways’ already large collection of genes, proteins and metabolites, in addition to interaction annotations. 
These interaction annotations could be mined to automatically annotate other interactions between the same participants in other pathways in WikiPathways. It has also opened the PathVisio software to the modelling community, allowing them to visualize their models and results dynamically. The best way to make an analysis reliable and repeatable is to automate it. In this thesis I also developed PathVisioRPC, an XML based Remote Procedure Call interface for PathVisio, that allows users to directly call PathVisio functions to draw and annotate biological pathways, visualize data on them and perform pathway statistics from within different programming languages. The entire analysis workflow can be automated by writing a script calling the relevant PathVisio functions, creating the possibility for easy integration of Pathway Analysis into Data Analysis Pipelines. This is further demonstrated in this thesis, by creating a pathway analysis module for the existing microarray analysis pipeline ArrayAnalysis.org. The final chapter of this thesis applies the principle of combining flux and gene expression data to investigate differences in the metabolism of metabolically unhealthy obese adults in comparison to metabolically healthy obese adults. The flux data originated from flux balance analysis of a model describing the flux in adipose tissue in the absorptive state, whereas pre-existing array data sets comparing the adipose tissue of metabolically unhealthy obese adults with metabolically healthy obese adults were used for gene expression. Pathway analysis was performed to identify the pathways that were significantly affected in metabolically unhealthy obese adults in comparison to metabolically healthy obese adults. Fourteen pathways were found to be significantly different. 
These fourteen pathways were merged into a network and all the pathways were found to be connected through three central genes, FASN, ACACA, and ACACB, and the microRNAs and transcription factors that target these genes. All three genes were downregulated. The flux data confirm that FASN, ACACA, and ACACB might be important regulators, as non-zero fluxes were obtained for the reactions catalysed by the enzymes encoded by these genes, by performing flux balance analysis using the metabolic model for the adipose tissue. This indicates that the reactions catalysed by the enzymes encoded by these genes are active in the adipose tissue, since they carry fluxes. The networks were further enriched with drugs and diseases. The disease associations helped to identify other diseases that people with metabolic syndrome will be prone to develop, such as cardiomyopathy, mental retardation, obesity, and insulin resistance. The drug associations helped to identify drugs currently in use for other diseases, amongst which are Cerulenin, Fomepizole, Mecasermin, Mefloquine, Nedocromil and Quercetin, which have clinical effects that would be desirable in treating metabolic syndrome. The content described in this thesis is a step towards the complete picture of a biological process and enables integration and visualization of metabolic fluxes from mathematical modelling on interactions alongside experimental measurements of genes, proteins, and metabolites on nodes of pathway diagrams or pathway representations of the models themselves.
Article
The Macromolecular Structure Database (MSD) group (http://www.ebi.ac.uk/msd/) continues to enhance the quality and consistency of macromolecular structure data in the Protein Data Bank (PDB) and to work towards the integration of various bioinformatics data resources. We have implemented a simple form-based interface that allows users to query the MSD directly. The MSD ‘atlas pages’ show all of the information in the MSD for a particular PDB entry. The group has designed new search interfaces aimed at specific areas of interest, such as the environment of ligands and the secondary structures of proteins. We have also implemented a novel search interface that begins to integrate separate MSD search services in a single graphical tool. We have worked closely with collaborators to build a new visualization tool that can present both structure and sequence data in a unified interface, and this data viewer is now used throughout the MSD services for the visualization and presentation of search results. Examples showcasing the functionality and power of these tools are available from tutorial webpages (http://www.ebi.ac.uk/msd-srv/docs/roadshow_tutorial/).
Article
Reactions with many steps can be represented by a single XML-based table of the atoms, bonds and electrons. For each step the complete Chemical Markup Language representation of all components is given. These snapshots can then be combined to give an animated description of the complete reaction, both in "2D" chemical structure diagrams and in three dimensions. Here we demonstrate the method's power with enzymatic reactions. Preprint submitted to the Internet Journal of Chemistry and archived as a PRE-REFEREED PREPRINT under the Journal's ROMEO-GREEN policy. The manuscript is an HTML + SVG hyperdocument of many components. The main paper is deposited but the hyperlinks have not been added so will appear broken. A major theme of the article is the animation of reactions using SVG and for this the reader should view the individual document components. (As a last resort they may download the ZIP file, unpack it, and view it in a modern browser). If they do not have SVG they should install a plugin, e.g. from http://www.adobe.com/svg. The MAIN PAPER is paper.html THE MATERIAL IS COPYRIGHT AND MAY NOT CURRENTLY BE ALTERED OR REDISTRIBUTED. For more information, and animated demos, see http://wwmm.ch.cam.ac.uk/moin/CmlSnap
Article
The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.
Article
mentation will be kept publicly available and the distribution sites will mirror the PDB archive using identical contents and subdirectory structure. However, each member of the wwPDB will be able to develop its own web site, with a unique view of the primary data, providing a variety of tools and resources for the global community. An Advisory Board consisting of appointees from the wwPDB, the International Union of Crystallography and the International Council on Magnetic Resonance in Biological Systems will provide guidance through annual meetings with the wwPDB consortium. This board is responsible for reviewing and determining policy as well as providing a forum for resolving issues related to the wwPDB. Specific details about the Advisory Board can be found in the wwPDB charter, available on the wwPDB web site. The RCSB is the 'archive keeper' of wwPDB. It has sole write access to the PDB archive and control over directory structure and contents, as well as responsibility for distributing new PDB identifiers to all deposition sites. The PDB archive is a collection of flat files in the legacy PDB file format 3 and in the mmCIF 4 format that follows the PDB exchange dictionary (http://deposit.pdb.org/mmcif/). This dictionary describes the syntax and semantics of PDB data that are processed and exchanged during the process of data annotation. It was designed to provide consistency in data produced in structure laboratories, processed by the wwPDB members and used in bioinformatics applications. The PDB archive does not include the websites, browsers, software and database query engines developed by researchers worldwide. The members of the wwPDB will jointly agree to any modifications or extensions to the PDB exchange dictionary. As data technology progresses, other data formats (such as XML) and delivery methods may be included in the official PDB archive if all the wwPDB members concur on the alteration. 
Any new formats will follow the naming and description conventions of the PDB exchange dictionary. In addition, the legacy PDB format would not be modified unless there is a compelling reason for a change. Should such a situation occur, all three wwPDB members would have to agree on the changes and give the structural biology community 90 days advance notice. The creation of the wwPDB formalizes the international character of the PDB and ensures that the archive remains single and uniform. It provides a mechanism to ensure consistent data for software developers and users worldwide. We hope that this will encourage individual creativity in developing tools for presenting structural data, which could benefit the scientific research community in general.
Chapter
In 1998, members of the Research Collaboratory for Structural Bioinformatics became the managers of the Protein Data Bank archive. This chapter details the systems used for the deposition, annotation and distribution of the data in the archive. This chapter is also available as HTML from the International Tables Online site hosted by the IUCr.
Article
Protein evolution gives rise to families of structurally related proteins, within which sequence identities can be extremely low. As a result, structure-based classifications can be effective at identifying unanticipated relationships in known structures and in optimal cases function can also be assigned. The ever increasing number of known protein structures is too large to classify all proteins manually, therefore, automatic methods are needed for fast evaluation of protein structures. We present a semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures (CATH). The four main levels of our classification are protein class (C), architecture (A), topology (T) and homologous superfamily (H). Class is the simplest level, and it essentially describes the secondary structure composition of each domain. In contrast, architecture summarises the shape revealed by the orientations of the secondary structure units, such as barrels and sandwiches. At the topology level, sequential connectivity is considered, such that members of the same architecture might have quite different topologies. When structures belonging to the same T-level have suitably high similarities combined with similar functions, the proteins are assumed to be evolutionarily related and put into the same homologous superfamily. Analysis of the structural families generated by CATH reveals the prominent features of protein structure space. We find that nearly a third of the homologous superfamilies (H-levels) belong to ten major T-levels, which we call superfolds, and furthermore that nearly two-thirds of these H-levels cluster into nine simple architectures. A database of well-characterised protein structure families, such as CATH, will facilitate the assignment of structure-function/evolution relationships to both known and newly determined protein structures.
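The four-level hierarchy described in this abstract can be sketched as a small data structure. The following is a minimal illustration, not CATH's own software: the class name `CathCode`, the helper `same_level`, and the example identifiers are hypothetical, and the dotted four-number notation (C.A.T.H) is assumed here as the way such a classification is usually written rather than taken from the abstract.

```python
from dataclasses import dataclass

# Order of the four main CATH levels, as listed in the abstract.
LEVELS = ("C", "A", "T", "H")


@dataclass(frozen=True)
class CathCode:
    """One domain classification: class, architecture, topology, superfamily."""
    klass: int          # C: secondary structure composition
    architecture: int   # A: overall shape (e.g. barrels, sandwiches)
    topology: int       # T: sequential connectivity (fold)
    superfamily: int    # H: homologous superfamily

    @classmethod
    def parse(cls, text: str) -> "CathCode":
        """Parse a dotted code such as '3.40.50.720' (assumed notation)."""
        parts = text.strip().split(".")
        if len(parts) != 4:
            raise ValueError(f"expected four dot-separated numbers, got {text!r}")
        return cls(*(int(p) for p in parts))

    def prefix(self, level: str) -> str:
        """Return the code truncated at the given level ('C', 'A', 'T' or 'H')."""
        depth = LEVELS.index(level) + 1
        values = (self.klass, self.architecture, self.topology, self.superfamily)
        return ".".join(str(v) for v in values[:depth])


def same_level(a: CathCode, b: CathCode, level: str) -> bool:
    """True if two domains share every level down to and including `level`."""
    return a.prefix(level) == b.prefix(level)


if __name__ == "__main__":
    # Illustrative identifiers only; they are not taken from the abstract.
    d1 = CathCode.parse("3.40.50.720")
    d2 = CathCode.parse("3.40.50.300")
    print(d1.prefix("T"))                # 3.40.50
    print(same_level(d1, d2, "T"))       # True: same topology
    print(same_level(d1, d2, "H"))       # False: different superfamily
```

Comparing prefixes rather than single fields reflects the nesting of the hierarchy: two domains are in the same topology only if they also share class and architecture.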