
MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms

February 2007
Nucleic Acids Research 35(Database issue):D515-20


DOI:10.1093/nar/gkl774

Source
PubMed

License
CC BY-NC 2.0

Authors:

Gemma Holliday

Medicines Discovery Catapult

Daniel Almonacid

Digbi Health

Gail Bartlett

University of Bristol




MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms

Gemma L. Holliday*, Daniel E. Almonacid, Gail J. Bartlett, Noel M. O’Boyle, James W. Torrance, Peter Murray-Rust, John B. O. Mitchell and Janet M. Thornton

EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK and Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK

Received August 4, 2006; Revised September 18, 2006; Accepted October 1, 2006

ABSTRACT

MACiE (Mechanism, Annotation and Classification in Enzymes) is a database of enzyme reaction mechanisms, and is publicly available as a web-based data resource. This paper presents the first release of a web-based search tool to explore enzyme reaction mechanisms in MACiE. We also present Version 2 of MACiE, which doubles the dataset available (from Version 1). MACiE can be accessed from http://www.ebi.ac.uk/thornton-srv/databases/MACiE/

INTRODUCTION

Enzymes are proteins that catalyse the repertoire of chemical reactions found in nature, and as such are vitally important molecules. What is so fascinating about these proteins is that they have a wonderful diversity and can carry out highly complex chemical conversions under physiological conditions and retain their stereospecificity and regiospecificity, unlike many organic chemical reactions. They range in size and can have molecular weights of several thousand to several million Daltons, and still they can catalyse reactions on molecules as small as carbon dioxide or nitrogen, or as large as a complete chromosome.

Although enzymes are large molecules, the actual catalysis only takes place in a small cavity, the active site. It is here that a small number of amino acid residues contribute to catalytic function, and where the substrates bind. With the advent of structure determination methods for proteins and by using clever chemical/biochemical experimental design, scientists have been able to propose catalytic mechanisms for many enzymes. Although a great deal of knowledge exists for enzymes, including their structures, gene sequences, mechanisms, metabolic pathways and kinetic data, it tends to be spread between many different databases and throughout the literature. Most web resources relating to enzymes [such as BRENDA (1), KEGG (2), the IUBMB Enzyme Nomenclature website (http://www.chem.qmul.ac.uk/iubmb/enzyme/) (3) and IntEnz (4)] focus on the overall reaction, accompanied in some cases by a textual or graphical description of the mechanism. However, this does not allow for detailed in silico searching of the chemical steps which take place in the reaction. MACiE (5) combines detailed stepwise mechanistic information [including 2-D animations (6)], a wide coverage of both chemical space and the protein structure universe, and the chemical intelligence of the Chemical Markup Language for Reactions (CMLReact) (7). This usefully complements both the mechanistic detail of the Structure–Function Linkage Database (SFLD) for a small number of rather ‘promiscuous’ enzyme superfamilies (8) and the wider coverage with less chemical detail provided by EzCatDB (9), which also contains a limited number of 3D animations. Entries in MACiE are linked, where appropriate, to all of these related data resources.

DATASET AND CONTENT

The dataset for MACiE version 2 was devised to increase the enzyme reaction space coverage of MACiE while trying to keep structural homology to a minimum. Each entry added in the new version was selected so that it fulfils the following criteria:

(i) The EC sub-subclass was not previously in MACiE.

(ii) There is a three-dimensional crystal structure of the enzyme deposited in the Protein Data Bank (wwPDB) (10).

(iii) There is a mechanism available from the primary literature which explains most of the observed experimental results.

(iv) The enzyme is unique at the H level of the CATH code (11), unless the homologue already in MACiE has a significantly different chemical mechanism.

*To whom correspondence should be addressed. Tel: +44 1223 492535; Fax: +44 1223 494486; Email: gemma@ebi.ac.uk

Present address: Gail J. Bartlett, Division of Mathematical Biology, National Institute of Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK

© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Published online 1 November 2006. Nucleic Acids Research, 2007, Vol. 35, Database issue D515–D520. doi:10.1093/nar/gkl774

Using the above criteria MACiE was expanded from 100 entries in version 1 to a total of 202 entries, which span 199 EC numbers (version 1 spanned 96 EC numbers) and cover a total of 862 reaction steps. There are almost 4000 EC numbers defined, but the number of different reaction mechanisms needed to bring about all these overall transformations is not clear. For example, the serine protease family of proteins has many different substrates, but the mechanisms are broadly similar. In contrast the β-lactamase enzymes, which have the same EC number, have four completely different mechanisms. Within the EC code, the fourth digit usually defines the substrate specificity, which can be very variable in large enzyme families, but the reaction mechanisms for enzymes with the same first three digits are usually essentially the same. In total there are 224 EC sub-subclasses, with only 181 having known structures (12). Of these MACiE covers 158, i.e. 87%. However, there are probably many more mechanisms that are yet to be defined or discovered.
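The coverage figure quoted above follows directly from the counts given in the text. A minimal Python check is shown here; it uses only the numbers stated in this section, and the variable names are illustrative.

```python
# Counts quoted in the text (variable names are illustrative).
sub_subclasses_total = 224           # EC sub-subclasses defined
sub_subclasses_with_structure = 181  # those with a known structure
sub_subclasses_in_macie = 158        # those covered by MACiE version 2

# Coverage is measured against the structurally characterised sub-subclasses.
coverage = sub_subclasses_in_macie / sub_subclasses_with_structure
print(f"Coverage of sub-subclasses with known structure: {coverage:.0%}")
```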

As can be seen from Figure 1, MACiE covers a good proportion of the EC reaction space, with an average relative difference between the size of corresponding EC classes of 4%, with the transferases having the largest difference. When the coverage with respect to EC code present in the PDB is examined, it can be seen that MACiE again represents the coverage of enzymes with known structures very well, with an average relative difference between the corresponding EC classes in MACiE of 5%.

All entries in MACiE contain overall reaction annotation including the information detailed in Table 1. Each elementary reaction or step within an entry is fully annotated as detailed in Figure 2; this includes comments that have been added by the annotators. An extension of the content from MACiE Version 1 is the addition of inferred return steps. These are explicitly labelled as being inferred in the comment field and are necessary to return the enzyme to a state where it is ready to undergo another round of catalysis.

There is sometimes more than one proposed mechanism that is consistent with the available experimental data. In MACiE, we have attempted not only to choose the best supported mechanism, but also where possible to annotate enzymes with reasonable alternative mechanisms. Unfortunately, in the current release such annotations are only available as comments on the stage or overall reaction, although future releases of MACiE will include full entries for these alternatives.

Further details of the annotation process and a glossary of terms used can be found on the MACiE website (http://www.ebi.ac.uk/thornton-srv/databases/MACiE/documentation/ and http://www.ebi.ac.uk/thornton-srv/databases/MACiE/glossary.html, respectively).

DATABASE STRUCTURE

The challenge with MACiE has been to capture and usefully represent all the different catalytic steps that occur during the course of an enzymatic reaction. These reactions may consist of any number of steps, and in MACiE we have reactions ranging from 1 step to 16 steps. The representation of these reactions has evolved from a flat file entered in a commercially available chemical database program (ISIS/Base) to the highly structured and powerful CMLReact (7), which is an application of XML (the eXtensible Markup Language). The final step in this evolution has been the conversion of the CMLReact into the relational database format of MySQL.

CMLReact has a hierarchical structure, facilitating its conversion into the relational database format of MySQL. The conversion relies on the CML Schema and requires the MACiE entries to be consistent with the Schema, which adds an internal consistency check into our authoring process.

Figure 1. EC wheels showing the EC coverage of MACiE Version 2 (left), the complete EC space (centre) and the coverage of EC space in the PDB by unique EC serial numbers (right).

Table 1. Overall reaction annotation content

Catalysis and reaction specific information        | Non-catalysis specific information
---------------------------------------------------+------------------------------------------------------
Enzyme name (common IUPAB/JCBN name)               | PDB code
EC code                                            | Non-catalytic domain CATH code
Catalytic residues involved                        | Non-catalytic UniProt code
Cofactors involved                                 | Species name (common and scientific)
Reactants and products                             | Other database identifiers, e.g. EzCatDB, SFLD, etc.
Catalytic domain CATH code                         | Literature references
Catalytic UniProt code                             |
Bonds involved, formed, cleaved, changed in order  |
Reactive centres                                   |
Overall reaction comments                          |


Each CML tag-type becomes a MySQL table; each tag becomes a row in that MySQL table; each attribute of that tag corresponds to a column in the MySQL table. The tree structure of the CML is preserved in the MySQL version; for each row of each table, there are columns specifying which row of which other table corresponds to the row’s parent tag in the CML version.
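The mapping described above can be illustrated with a minimal Python sketch. It is not the MACiE conversion code: the XML fragment, its tag and attribute names, and the column names row_id, parent_table and parent_row are all assumptions made for illustration, and an in-memory SQLite database stands in for MySQL so that the example is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy document; tag and attribute names are illustrative, not the real schema.
SAMPLE = """
<reactionList id="entry1">
  <reaction id="step1" title="proton transfer">
    <reactant id="r1" ref="substrate"/>
    <product id="p1" ref="intermediate"/>
  </reaction>
  <reaction id="step2" title="elimination">
    <reactant id="r2" ref="intermediate"/>
  </reaction>
</reactionList>
"""

def flatten(root):
    """Return {tag type: [rows]}; each row records its parent table and row."""
    tables = defaultdict(list)

    def visit(elem, parent_table, parent_row):
        row_id = len(tables[elem.tag]) + 1
        row = {"row_id": row_id, "parent_table": parent_table,
               "parent_row": parent_row, **elem.attrib}
        tables[elem.tag].append(row)
        for child in elem:
            visit(child, elem.tag, row_id)

    visit(root, None, None)
    return tables

def load(tables, conn):
    """Create one SQL table per tag type and insert one row per tag."""
    for tag, rows in tables.items():
        # One column per attribute seen on this tag type, plus the pointers.
        columns = sorted({key for row in rows for key in row})
        column_sql = ", ".join(f'"{c}"' for c in columns)
        conn.execute(f'CREATE TABLE "{tag}" ({column_sql})')
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f'INSERT INTO "{tag}" VALUES ({placeholders})',
            [tuple(row.get(c) for c in columns) for row in rows],
        )

conn = sqlite3.connect(":memory:")
load(flatten(ET.fromstring(SAMPLE)), conn)

# Recover the steps of the entry by following the parent pointers.
steps = conn.execute(
    "SELECT id, title FROM reaction "
    "WHERE parent_table = ? AND parent_row = ? ORDER BY row_id",
    ("reactionList", 1),
).fetchall()
print(steps)
```

Because every row carries a pointer to its parent, the original tree can be rebuilt from the tables, which is the property the text relies on when it says the tree structure is preserved.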

The CML version of MACiE, which is the official archive version, is available from the website as individual entries, and the new website uses the relational version of MACiE to perform the online analysis and searching.

DATABASE FEATURES

The original release of MACiE contained static images and annotation for the overall reaction and each step associated with the mechanism; it also included an animated reaction mechanism for approximately half the reactions then in MACiE. Links to various related resources, such as the RCSB PDB (13), IUBMB nomenclature database, CATH, EzCatDB, PDBSum (14), BRENDA, the Catalytic Site Atlas (15), KEGG and the Enzyme Structures Database, were also included. This new release extends these links to include the Macromolecular Structures Database (MSD) (16), SFLD, UniProt (17), and replaces the IUBMB nomenclature database links with links to IntEnz. The new features in MACiE are detailed in the following sections.

Searching MACiE

There are two levels of search implemented in MACiE. The basic level searches are implemented from the main page (http://www.ebi.ac.uk/thornton-srv/databases/MACiE) and are mainly for accessing the entries from the top level, i.e. for searching entries in MACiE by EC code, enzyme name, etc. The complex searches are all available from the query pages of MACiE (http://www.ebi.ac.uk/thornton-srv/databases/MACiE/queryMACiE.html) and are mainly for searching for specific mechanisms, mechanism components or residues and their functions in the reaction steps, although there are some overall reaction searches implemented as well. Table 2 lists the searches available in MACiE and the Supplementary Data contain a detailed listing of the searches available.

Table 2. Searches available in MACiE

Basic                           | Complex
--------------------------------+------------------------------------------------------------------------------------
MACiE entry identifier          | Species name (overall annotation)
Current EC codes                | Overall reactants and products
Obsolete EC codes               | Reaction comments (overall reactions and steps)
Catalytic Domain CATH codes     | Amino acid residues (up to six residues)
All CATH codes                  | Step mechanisms and/or mechanism components (single and combinations of)
PDB code                        | Chemical changes
Enzyme name                     | Chemical changes with mechanism or mechanism components
Catalytic Domain UniProt Codes  | Chemical changes with amino acid residues
All UniProt Codes               | Amino acid residues with mechanism or mechanism components
                                | Chemical changes with amino acid residues and mechanisms or mechanism components
                                | Alternative mechanisms

Figure 2. An example of the annotation found in a MACiE entry. Reaction shown corresponds to fructose-bisphosphate aldolase (entry 52).

Figure 3. EC code search heuristics.

The following sections describe searching by EC code, PDB code or enzyme name, all of which use heuristics to extend the coverage of MACiE.

EC code. The EC code search implemented in MACiE is detailed in Figure 3 and can be accessed at any point in the scheme shown. The search for current EC numbers will always walk up the EC code tree until it finds a match, no matter at what level the search is entered. Thus the search will always return a result. As the EC code of enzymes may change over time, a search for obsolete EC codes has also been implemented, although this search will not always return a result. However, it should be noted that the higher up the EC hierarchy the search has gone, the less likely it is that the returned mechanism will be a match to the query. The obsolete EC code search works in the same way as the current EC code search.
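The walk up the EC code tree can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the MACiE implementation: the function name ec_search, the returned level number and the toy set of EC codes are all hypothetical, and the search is only guaranteed to return a result if every top-level EC class is represented in the set being searched.

```python
def ec_search(query, ec_codes):
    """Return (level, matches) at the most specific EC level with a hit.

    Level 4 is the serial number, 3 the sub-subclass, 2 the subclass and
    1 the class; (0, []) means that nothing matched even at class level.
    """
    # Accept partial queries such as "4.1.2" or "4.1.2.-".
    parts = [p for p in query.split(".") if p not in ("", "-")]
    for level in range(len(parts), 0, -1):
        prefix = parts[:level]
        matches = sorted(
            code for code in ec_codes
            if code.split(".")[:level] == prefix
        )
        if matches:
            return level, matches
    return 0, []

# Toy set of EC codes standing in for the database contents.
toy_codes = ["3.5.2.6", "4.1.2.13", "3.4.21.4", "1.1.1.1"]

exact = ec_search("3.4.21.4", toy_codes)      # matched at serial number level
sibling = ec_search("4.1.2.99", toy_codes)    # falls back to the sub-subclass
distant = ec_search("3.9.9.9", toy_codes)     # falls back to the class
print(exact, sibling, distant)
```

Returning the level alongside the matches makes it possible to warn the user when the match was only found high up the hierarchy, where the returned mechanism is less likely to be relevant to the query.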

If no matches are found at the serial number level of the EC code, an advanced search option will allow the user to search for a structural homologue of an enzyme with a given EC code, which is shown in Figure 4 and described below. This advanced search option takes the entered EC code and finds the PDB codes of all of the matches to that EC code in the Catalytic Site Atlas (CSA). A homology search is then performed on those PDB codes for a match in MACiE. This homology search is described in more detail in the following section.
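The two-stage lookup described above (EC code to PDB codes via the CSA, then PDB codes to MACiE entries via homology) can be sketched as follows. This is a hedged illustration only: the function name, both lookup tables and every identifier in them are placeholders, and the homology step is assumed to have been precomputed into a table mapping PDB codes to MACiE entries.

```python
def advanced_ec_search(ec_code, csa_pdb_by_ec, macie_entry_by_pdb):
    """Map an EC code to MACiE entries via the PDB codes listed for it.

    csa_pdb_by_ec:      EC code -> PDB codes annotated with that EC code
    macie_entry_by_pdb: PDB code -> homologous MACiE entry (precomputed)
    Returns {MACiE entry: [PDB codes that led to it]}.
    """
    hits = {}
    for pdb_code in csa_pdb_by_ec.get(ec_code, []):
        entry = macie_entry_by_pdb.get(pdb_code)
        if entry is not None:
            hits.setdefault(entry, []).append(pdb_code)
    return hits

# Placeholder data: neither the EC code nor the PDB codes are real.
csa_pdb_by_ec = {"9.9.9.1": ["0aaa", "0bbb", "0ccc"]}
macie_entry_by_pdb = {"0aaa": "entry-A", "0ccc": "entry-A", "0ddd": "entry-B"}

hits = advanced_ec_search("9.9.9.1", csa_pdb_by_ec, macie_entry_by_pdb)
print(hits)
```

Keeping the PDB codes that led to each hit allows the results page to show which structures support the proposed homologue.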

The CSA is a database of catalytic residues in proteins of known structure. It contains much less mechanistic information than MACiE, but has a considerably wider coverage of protein structures than MACiE does. This wider coverage is partly because the CSA contains not only manually annotated entries, but also contains entries that are automatically annotated based on sequence alignment to the manual entries.

Figure 4. Advanced EC search heuristics.

Figure 5. PDB search heuristics.

Figure 6. Enzyme name search heuristics.

PDB code. There are over 19 000 crystal structures relating to enzymes deposited in the PDB. As MACiE entries require extensive literature searching and analysis, only a small fraction of these PDB entries are covered explicitly, 202 in total. However, we have used the CSA to identify homologues of these enzymes, extending this coverage to 7528 PDB codes. Figure 5 details the search performed in MACiE, when a protein structure described by a PDB code is entered.

Although the entries returned by this search will be homologues, this does not guarantee that the mechanism and the catalytic residue assignments are the same. This is because the homology method (see below) can retrieve very distant relatives. Owing to this limitation, all homologous entries are compared by EC code, and when there is a divergence between the MACiE entry and the homologue at the serial number level, this is clearly indicated to the user. We also list the amino acid residues that are annotated as catalytic in both MACiE and the CSA. Thus it is clear if there is any difference between EC numbers and catalytic residues. If the EC number differs but the catalytic residues between query and homologue are of identical types, it can be inferred that the mechanisms are likely to be the same, but where both differ, the mechanisms are unlikely to be transferable. From the results page we link both to the MACiE entry and the CSA entry.
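The inference rule in the preceding paragraph can be written as a small decision function. The sketch below is an assumption-laden illustration: the function name, the use of three-letter residue types and the wording of the returned messages are ours, and the case in which the EC numbers agree but the residue types differ is not covered by the text, so it is flagged for manual inspection.

```python
def compare_homologue(macie_ec, homologue_ec, macie_residues, homologue_residues):
    """Classify a MACiE entry/homologue pair by EC number and residue types."""
    same_ec = macie_ec == homologue_ec  # compared at serial number level
    # Compare residue types as multisets, ignoring residue numbering.
    same_residues = sorted(macie_residues) == sorted(homologue_residues)

    if same_ec and same_residues:
        return "EC number and catalytic residue types agree"
    if same_residues:
        return "EC number differs, residue types identical: mechanism likely the same"
    if same_ec:
        # Not addressed in the text; an assumption of this sketch.
        return "EC number agrees, residue types differ: inspect manually"
    return "EC number and residue types differ: mechanism unlikely to be transferable"

# Toy comparisons with made-up residue lists.
likely = compare_homologue("3.4.21.4", "3.4.21.5",
                           ["Ser", "His", "Asp"], ["Asp", "His", "Ser"])
unlikely = compare_homologue("3.4.21.4", "3.4.22.2",
                             ["Ser", "His", "Asp"], ["Cys", "His"])
print(likely)
print(unlikely)
```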

Homology in MACiE. We have been working to bring MACiE and the CSA closer together. This includes using the CSA to determine homologues (those enzymes which are evolutionarily related) of entries in MACiE. The CSA finds homologues using a PSI-BLAST search (with an E-value cut-off of 0.0005 and five iterations) against all sequences currently in the PDB, plus all sequences in a non-redundant subset of UniProt. The UniProt sequences are included purely in order to increase the range of the PSI-BLAST search by bridging gaps between distantly related sequences in the PDB; only sequences occurring in the PDB are retrieved for entry into the CSA. In the CSA, and thus MACiE, homologous entries are only included if the residues which align with the catalytic residues in the parent literature entry are identical in residue type. In other words, there must be no mutations at the catalytic residue positions. There are, however, a few exceptions to this rule:

(i) In order to allow for the many active site mutants in the PDB, one (and only one) catalytic residue per site can be different in type from the equivalent in the parent literature entry. This is only permissible if all residue spacing is identical to that in the parent literature entry, and there are at least two catalytic residues.

(ii) Sites with only one catalytic residue are permitted to be mutant provided that the residue number is identical to that in the parent entry.

(iii) Fuzzy matching of residues is permitted within the following groups: [V,L,I], [F,W,Y], [S,T], [D,E], [K,R], [D,N], [E,Q], [N,Q]. This fuzzy matching cannot be used in combination with rules (i) or (ii) above.
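The acceptance rule and its three exceptions can be expressed as a short Python predicate. This is a sketch under assumptions, not the CSA code: sites are given as aligned lists of (one-letter residue type, residue number) pairs, 'residue spacing' is interpreted as the differences between consecutive residue numbers, and all function names are hypothetical.

```python
# Groups within which residue types may be treated as equivalent (rule iii).
FUZZY_GROUPS = [set("VLI"), set("FWY"), set("ST"), set("DE"),
                set("KR"), set("DN"), set("EQ"), set("NQ")]

def fuzzy_equal(a, b):
    """True if two residue types are identical or share a fuzzy group."""
    return a == b or any(a in g and b in g for g in FUZZY_GROUPS)

def spacing(site):
    """Differences between consecutive residue numbers (our interpretation)."""
    numbers = [number for _, number in site]
    return [b - a for a, b in zip(numbers, numbers[1:])]

def accept_homologue(parent, candidate):
    """Decide whether an aligned candidate site is accepted as a homologue."""
    if len(parent) != len(candidate):
        return False
    pairs = list(zip(parent, candidate))
    # Base rule: every aligned residue is identical in type.
    if all(p[0] == c[0] for p, c in pairs):
        return True
    # Rule (iii): fuzzy matching at every position, used on its own.
    if all(fuzzy_equal(p[0], c[0]) for p, c in pairs):
        return True
    mismatches = sum(p[0] != c[0] for p, c in pairs)
    if len(parent) >= 2:
        # Rule (i): exactly one mutant residue, with identical spacing.
        return mismatches == 1 and spacing(parent) == spacing(candidate)
    # Rule (ii): a single-residue site may be mutant if the number matches.
    return parent[0][1] == candidate[0][1]

# Toy site with made-up numbering.
parent = [("S", 10), ("H", 40), ("D", 75)]
print(accept_homologue(parent, [("T", 12), ("H", 41), ("E", 80)]))  # fuzzy only
print(accept_homologue(parent, [("A", 10), ("H", 40), ("D", 75)]))  # one mutant
print(accept_homologue(parent, [("A", 10), ("H", 40), ("E", 75)]))  # mutant + fuzzy
```

The third call is rejected because it would need rule (i) and rule (iii) together, which the text does not permit.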

Figure 7. Growth of MACiE. This shows the growth in the number of EC codes (blue), EC sub-sub classes (cyan) and catalytic domain CATH codes (red) in MACiE.

Figure 8. Frequency distribution of amino acid residues. This shows the frequency of catalytic amino acid residues in MACiE (blue), versus the frequency of residues in MACiE (cyan), versus the frequency of residues in the wwPDB (red). The frequency of catalytic amino acid residues in MACiE is calculated by taking the number of residues (of a given type) annotated in MACiE divided by the total number of annotated residues in MACiE, multiplied by 100.


Enzyme name. This is currently implemented as a partial string match, thus entering ‘beta’ will return all the β-lactamases and betaine-aldehyde dehydrogenase. If no results are returned from the partial name search, then the name search heuristics (shown in Figure 6) are implemented.

This search utilizes the IntEnz database (4). MACiE searches for a name in IntEnz, either a synonym, alternative name or common name, and returns the EC code of that name. The EC code is then used to search MACiE. If no matches are found to the sub-subclass level of the EC code, the user is offered an advanced EC code search (see Figure 4).
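The name search and its fallback can be sketched as below. This is an illustration rather than the MACiE implementation: both lookup tables are small stand-ins (a name-to-EC mapping for MACiE and a synonym-to-EC mapping for IntEnz), matching is assumed to be case-insensitive, and the EC comparison is simplified to a single test at sub-subclass level.

```python
def name_search(query, macie_ec_by_name, intenz_ec_by_name):
    """Partial name match first, then an EC lookup through synonym names."""
    q = query.lower()
    direct = sorted(name for name in macie_ec_by_name if q in name.lower())
    if direct:
        return "partial name match", direct

    # Fallback: resolve the name to an EC code via the synonym table.
    ec = intenz_ec_by_name.get(q)
    if ec is None:
        return "no match", []
    prefix = ec.split(".")[:3]  # compare down to the sub-subclass
    by_ec = sorted(
        name for name, code in macie_ec_by_name.items()
        if code.split(".")[:3] == prefix
    )
    if by_ec:
        return "EC match via synonym", by_ec
    return "offer advanced EC search", []

# Stand-in tables; real contents would come from MACiE and IntEnz.
macie_ec_by_name = {
    "beta-lactamase": "3.5.2.6",
    "betaine-aldehyde dehydrogenase": "1.2.1.8",
    "fructose-bisphosphate aldolase": "4.1.2.13",
}
intenz_ec_by_name = {"penicillinase": "3.5.2.6"}

by_name = name_search("beta", macie_ec_by_name, intenz_ec_by_name)
by_synonym = name_search("penicillinase", macie_ec_by_name, intenz_ec_by_name)
print(by_name)
print(by_synonym)
```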

Statistics

The other major development in MACiE has been the inclusion of database statistics that are all generated on the fly from the SQL tables. A full listing of the statistics available can be found in the Supplementary Data. The growth of MACiE is shown in Figure 7 in terms of EC coverage and CATH coverage.

The statistics in MACiE can also be used to examine the function and distribution of amino acid residues (G.L. Holliday, D.E. Almonacid, J.M. Thornton and J.B.O. Mitchell, manuscript in preparation) (see Figure 8), the distribution of mechanism and mechanism components and the bond order changes occurring in each step of the reaction.
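The catalytic residue frequency defined in the caption to Figure 8 (the number of annotated residues of a given type, divided by the total number of annotated residues, multiplied by 100) can be computed as in the following sketch. The function name and the input list are illustrative; the data are made up and are not MACiE statistics.

```python
from collections import Counter

def residue_frequencies(annotated_residues):
    """Percentage of annotated catalytic residues belonging to each type."""
    counts = Counter(annotated_residues)
    total = sum(counts.values())
    return {residue: 100 * n / total for residue, n in counts.items()}

# Made-up annotations, one item per annotated catalytic residue.
toy_annotations = ["His", "His", "Asp", "Ser", "Asp", "His", "Glu", "Lys"]
frequencies = residue_frequencies(toy_annotations)
for residue, percent in sorted(frequencies.items()):
    print(f"{residue}: {percent:.1f}%")
```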

FUTURE DEVELOPMENTS

MACiE is a continually developing resource, and in the future we hope to include 3D data, which will incorporate various statistics and searches related to the analysis of these data. We will also continue to extend the coverage of MACiE to include alternative reaction mechanisms that have been suggested for various enzymes, as well as new mechanisms. Finally, we intend to build a user interface that allows chemical diagrams to be drawn and used to search MACiE, to make the entry process more usable, and to implement the classification of enzyme mechanisms that we are developing.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We would like to thank the EPSRC (G.L.H. and J.B.O.M.), BBSRC (G.J.B. and J.M.T.: CASE studentship in association with Roche Products Ltd; N.M.O.B. and J.B.O.M.: grant BB/C51320X/1), the Wellcome Trust, EMBL, IBM (G.L.H. and J.M.T.), the Chilean Government’s Ministerio de Planificación y Cooperación and the Cambridge Overseas Trust (D.E.A.) for funding and Unilever for supporting the Centre for Molecular Science Informatics. J.W.T. is funded by a European Molecular Biology Laboratory studentship, and is also affiliated with Cambridge University Department of Chemistry. Funding to pay the Open Access publication charges for this article was provided by the Wellcome Trust.

Conflict of interest statement. None declared.

REFERENCES

1. Schomburg,I., Chang,A., Ebeling,C., Gremse,M., Heldt,C., Huhn,G. and Schomburg,D. (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res., 32, D431–D433.

2. Kanehisa,M., Goto,S., Kawashima,S., Okuno,Y. and Hattori,M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.

3. IUBMB (2005) Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzyme-catalysed reactions.

4. Fleischmann,A., Darsow,M., Degtyarenko,K., Fleischmann,W., Boyce,S., Axelsen,K., Bairoch,A., Schomburg,D., Tipton,K.F. and Apweiler,R. (2004) IntEnz, the integrated relational enzyme database. Nucleic Acids Res., 32, D434–D437.

5. Holliday,G.L., Bartlett,G.J., Almonacid,D.E., O’Boyle,N.M., Murray-Rust,P., Thornton,J.M. and Mitchell,J.B.O. (2005) MACiE: a database of enzyme reaction mechanisms. Bioinformatics, 21, 4315–4316.

6. Holliday,G.L., Mitchell,J.B.O. and Murray-Rust,P. (2004) CMLSnap: animated reaction mechanisms. Internet J. Chem., 7, Article 4.

7. Holliday,G.L., Murray-Rust,P. and Rzepa,H.S. (2006) Chemical Markup, XML, and the World Wide Web. 6. CMLReact, an XML vocabulary for chemical reactions. J. Chem. Inf. Model., 46, 145–157.

8. Pegg,S.C.-H., Brown,S.D., Ojha,S., Seffernick,J., Meng,E.C., Morris,J.H., Chang,P.J., Huang,C.C., Ferrin,T.E. and Babbitt,P.C. (2006) Leveraging enzyme structure–function relationships for functional inference and experimental design: the Structure–Function Linkage Database. Biochemistry, 45, 2545–2555.

9. Nagano,N. (2005) EzCatDB: the Enzyme Catalytic-mechanism DataBase. Nucleic Acids Res., 33, D407–D412.

10. Berman,H.M., Henrick,K. and Nakamura,H. (2003) Announcing the worldwide Protein Data Bank. Nature Struct. Biol., 10, 980.

11. Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.

12. Martin,A.C. (2004) PDBSprotEC: a Web-accessible database linking PDB chains to EC numbers via SwissProt. Bioinformatics, 20, 986–988.

13. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

14. Laskowski,R.A., Chistyakov,V.V. and Thornton,J.M. (2005) PDBsum more: new summaries and analyses of the known 3D structures of proteins and nucleic acids. Nucleic Acids Res., 33, D266–D268.

15. Porter,C.T., Bartlett,G.J. and Thornton,J.M. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res., 32, D129–D133.

16. Golovin,A., Oldfield,T.J., Tate,J.G., Velankar,S., Barton,G.J., Boutselakis,H., Dimitropoulos,D., Fillon,J., Hussain,A., Ionides,J.M. et al. (2004) E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res., 32, D211–D216.

17. Bairoch,A., Apweiler,R., Wu,C.H., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M. et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154–D159.


On the origin of life in the zinc world: 2. Validation of the hypothesis on the photosynthesizing zinc sulfide edifices as cradles of life on earth

Chapter

Full-text available

Jan 2011

The accompanying article (A.Y. Mulkidjanian, Biology Direct 4:26) puts forward a detailed hypothesis on the role of zinc sulfide (ZnS) in the origin of life on Earth. The hypothesis suggests that life emerged within compartmentalized, photosynthesizing ZnS formations of hydrothermal origin (the Zn world), assembled in sub-aerial settings on the surface of the primeval Earth.

CSmetaPred: A consensus method for prediction of catalytic residues

Article

Full-text available

Dec 2017
BMC BIOINFORMATICS

Background: Knowledge of catalytic residues can play an essential role in elucidating mechanistic details of an enzyme. However, experimental identification of catalytic residues is a tedious and time-consuming task, which can be expedited by computational predictions. Despite significant development in active-site prediction methods, one of the remaining issues is ranked positions of putative catalytic residues among all ranked residues. In order to improve ranking of catalytic residues and their prediction accuracy, we have developed a meta-approach based method CSmetaPred. In this approach, residues are ranked based on the mean of normalized residue scores derived from four well-known catalytic residue predictors. The mean residue score of CSmetaPred is combined with predicted pocket information to improve prediction performance in meta-predictor, CSmetaPred_poc. Results: Both meta-predictors are evaluated on two comprehensive benchmark datasets and three legacy datasets using Receiver Operating Characteristic (ROC) and Precision Recall (PR) curves. The visual and quantitative analysis of ROC and PR curves shows that meta-predictors outperform their constituent methods and CSmetaPred_poc is the best of evaluated methods. For instance, on CSAMAC dataset CSmetaPred_poc (CSmetaPred) achieves highest Mean Average Specificity (MAS), a scalar measure for ROC curve, of 0.97 (0.96). Importantly, median predicted rank of catalytic residues is the lowest (best) for CSmetaPred_poc. Considering residues ranked ≤20 classified as true positive in binary classification, CSmetaPred_poc achieves prediction accuracy of 0.94 on CSAMAC dataset. Moreover, on the same dataset CSmetaPred_poc predicts all catalytic residues within top 20 ranks for ~73% of enzymes. Furthermore, benchmarking of prediction on comparative modelled structures showed that models result in better prediction than only sequence based predictions. 
These analyses suggest that CSmetaPred_poc is able to rank putative catalytic residues at lower (better) ranked positions, which can facilitate and expedite their experimental characterization. Conclusions: The benchmarking studies showed that employing meta-approach in combining residue-level scores derived from well-known catalytic residue predictors can improve prediction accuracy as well as provide improved ranked positions of known catalytic residues. Hence, such predictions can assist experimentalist to prioritize residues for mutational studies in their efforts to characterize catalytic residues. Both meta-predictors are available as webserver at: http://14.139.227.206/csmetapred/ .

Mechanism and Catalytic Site Atlas (M-CSA): A database of enzyme reaction mechanisms and active sites

Article

Full-text available

Nov 2017
NUCLEIC ACIDS RES

M-CSA (Mechanism and Catalytic Site Atlas) is a database of enzyme active sites and reaction mechanisms that can be accessed at www.ebi.ac.uk/thornton-srv/m-csa. Our objectives with M-CSA are to provide an open data resource for the community to browse known enzyme reaction mechanisms and catalytic sites, and to use the dataset to understand enzyme function and evolution. M-CSA results from the merging of two existing databases, MACiE (Mechanism, Annotation and Classification in Enzymes), a database of enzyme mechanisms, and CSA (Catalytic Site Atlas), a database of catalytic sites of enzymes. We are releasing M-CSA as a new website and underlying database architecture. At the moment, M-CSA contains 961 entries, 423 of these with detailed mechanism information, and 538 with information on the catalytic site residues only. In total, these cover 81% (195/241) of third level EC numbers with a PDB structure, and 30% (840/2793) of fourth level EC numbers with a PDB structure, out of 6028 in total. By searching for close homologues, we are able to extend M-CSA coverage of PDB and UniProtKB to 51 993 structures and to over five million sequences, respectively, of which about 40% and 30% have a conserved active site.

CATH-Gene3D: Generation of the Resource and Its Use in Obtaining Structural and Functional Annotations for Protein Sequences

Chapter

Feb 2017
Meth Mol Biol

This chapter describes the generation of the data in the CATH-Gene3D online resource and how it can be used to study protein domains and their evolutionary relationships. Methods will be presented for: comparing protein structures, recognizing homologs, predicting domain structures within protein sequences, and subclassifying superfamilies into functionally pure families, together with a guide on using the webpages.

Novel Computational Protocols for Functionally Classifying and Characterising Serine Beta-Lactamases

Article

Full-text available

Jun 2016
PLOS COMPUT BIOL

Beta-lactamases represent the main bacterial mechanism of resistance to beta-lactam antibiotics and are a significant challenge to modern medicine. We have developed an automated classification and analysis protocol that exploits structure- and sequence-based approaches and which allows us to propose a grouping of serine beta-lactamases that more consistently captures and rationalizes the existing three classification schemes: Classes, (A, C and D, which vary in their implementation of the mechanism of action); Types (that largely reflect evolutionary distance measured by sequence similarity); and Variant groups (which largely correspond with the Bush-Jacoby clinical groups). Our analysis platform exploits a suite of in-house and public tools to identify Functional Determinants (FDs), i.e. residue sites, responsible for conferring different phenotypes between different classes, different types and different variants. We focused on Class A beta-lactamases, the most highly populated and clinically relevant class, to identify FDs implicated in the distinct phenotypes associated with different Class A Types and Variants. We show that our FunFHMMer method can separate the known beta-lactamase classes and identify those positions likely to be responsible for the different implementations of the mechanism of action in these enzymes. Two novel algorithms, ASSP and SSPA, allow detection of FD sites likely to contribute to the broadening of the substrate profiles. Using our approaches, we recognise 151 Class A types in UniProt. Finally, we used our beta-lactamase FunFams and ASSP profiles to detect 4 novel Class A types in microbiome samples. Our platforms have been validated by literature studies, in silico analysis and some targeted experimental verification. Although developed for the serine beta-lactamases they could be used to classify and analyse any diverse protein superfamily where sub-families have diverged over both long and short evolutionary timescales.

IntEnzyDB: an Integrated Structure-Kinetics Enzymology Database

Preprint

Jul 2022

Data-driven modeling has emerged as a new paradigm for biocatalyst design and discovery. Biocatalytic databases that integrate enzyme structure and function data are urgently needed. Here, we describe IntEnzyDB as an integrated structure-kinetics database for facile statistical modeling and machine learning. IntEnzyDB employs a relational architecture with a flattened data structure, which allows rapid data operation. This architecture also makes it easy for IntEnzyDB to incorporate more types of enzyme function data. IntEnzyDB contains enzyme kinetics and structure data from six Enzyme Commission classes. Using 1019 enzyme structure-kinetics pairs, we investigated the efficiency-perturbing propensity of mutations that are close or distal to the active site. The statistical results show that efficiency-enhancing mutations are globally encoded, whereas deleterious mutations are much more likely to occur close to the active site than distal to it. Finally, we describe a web interface that allows public users to access enzymology data stored in IntEnzyDB. IntEnzyDB will provide a computational facility for data-driven modeling in biocatalysis and molecular evolution.

IntEnzyDB: an Integrated Structure–Kinetics Enzymology Database

Article

Oct 2022

Data-driven modeling has emerged as a new paradigm for biocatalyst design and discovery. Biocatalytic databases that integrate enzyme structure and function data are urgently needed. Here we describe IntEnzyDB as an integrated structure-kinetics database for facile statistical modeling and machine learning. IntEnzyDB employs a relational database architecture with a flattened data structure, which allows rapid data operation. This architecture also makes it easy for IntEnzyDB to incorporate more types of enzyme function data. IntEnzyDB contains enzyme kinetics and structure data from six Enzyme Commission classes. Using 1050 enzyme structure-kinetics pairs, we investigated the efficiency-perturbing propensities of mutations that are close or distal to the active site. The statistical results show that efficiency-enhancing mutations are globally encoded and that deleterious mutations are much more likely to occur close to the active site than distal to it. Finally, we describe a web interface that allows public users to access enzymology data stored in IntEnzyDB. IntEnzyDB will provide a computational facility for data-driven modeling in biocatalysis and molecular evolution.

Emergence of metal selectivity and promiscuity in metalloenzymes

Article

May 2019

Metal coordination with proteinaceous ligands has greatly expanded the chemical toolbox of proteins and their biological roles. The structure and function of natural metalloproteins have been determined according to the physicochemical properties of metal ions bound to the active sites. Concurrently, amino acid sequences are optimized for metal coordination geometry and/or dedicated action of metal ions in proteinaceous environments. On some occasions, however, natural enzymes exhibit promiscuous reactivity with more than one designated metal ion, under in vitro and/or in vivo conditions. In this review, we discuss selected examples of metalloenzymes that bind various first-row, mid- to late-transition metal ions for their native catalytic activities. From these examples, we suggest that environmental, inorganic, and biochemical factors, such as bioavailability, native organism, cellular compartment, reaction mechanism, binding affinity, protein sequence, and structure, might be responsible for determining metal selectivity or promiscuity. The current work proposes how natural metalloproteins might have emerged and adapted for specific metal incorporation under the given circumstances and may provide insights into the design and engineering of de novo metalloproteins.

3D motifs

Chapter

Apr 2017

Three-dimensional (3D) motifs are patterns of local structure associated with function, typically based on residues in binding or catalytic sites. Protein structures of unknown function can be annotated by comparing them to known 3D motifs. Many methods have been developed for identifying 3D motifs and for searching structures for their occurrence. Approaches vary in the type and amount of input evidence, how the motifs are described and matched, whether the results include a measure of statistical significance, and how the motifs relate to function. Compared to algorithm development, less progress has been made in providing publicly searchable databases of 3D motifs that are both functionally specific and cover a broad range of functions. A roadblock has been the difficulty of generating detailed structure-function classifications; instead, automated, large-scale studies have relied upon pre-existing classifications of either structure or function. Complementary to 3D motif methods are approaches focused on molecular surface descriptions, global structure (fold) comparisons, predicting interactions with other macromolecules, and identifying physiological substrates by docking databases of small molecules.

Towards the Complete Picture: Combining modelling and experimental data in a systems biology approach

Book

Jan 2017

Anwesha Bohler

Systems biology focuses on complex interactions within biological systems, using a holistic approach. Why is the whole greater than the sum of its parts? Because the parts interact, making the whole an emergent characteristic of the parts and their interactions with each other. High-throughput studies of biological systems are rapidly accumulating a wealth of 'omics'-scale data. Visualization is a key aspect of both the analysis and understanding of these data. It is common to describe biological processes as pathway diagrams. The pathway nodes represent the participating molecules in the biological process (genes, proteins, metabolites, etc.) and edges connecting the nodes describe the relationship between the participants (reactions, interactions, etc.). In this thesis, I have focused on metabolic pathways describing the metabolic processes of an organism. Metabolic pathways are series of chemical reactions occurring within a cell. Although all chemical reactions are technically reversible, conditions in the cell are often such that it is thermodynamically more favorable for a reaction to flow in one direction. High-throughput technologies exist for measuring expression of genes, and abundances of proteins and metabolites. Transcriptomics datasets are freely available from online databases, notably ArrayExpress and GEO, where datasets can be searched by tissue, disease, or organism of interest. Pathway diagrams are also available from various online databases, among them WikiPathways and Reactome. Pathway diagrams can be used to integrate and co-analyze the different layers of data to give a complete overview of the biological process. Pathway analysis software is available for performing such analyses. PathVisio is a widely adopted pathway editing, visualization, and statistics tool. PathVisio can furthermore be used for drawing pathway diagrams. 
The genes, proteins, and metabolites in the pathway diagrams can be annotated with unique identifiers from online databases. To visualize data on the diagrams, the uploaded data must also be annotated with database identifiers. There are various online gene, protein, and metabolite databases. Identifiers from almost any of them can be used to annotate diagrams and datasets. PathVisio works together with BridgeDb to make the mapping between database identifiers easier, and identifier mapping databases are available which can map the gene or gene product related identifiers of one gene product from many online databases to each other. Such a mapping database is also available for metabolites. These identifier mapping databases are what allow data to be mapped onto the diagram and visualized using colours. However, by visualizing data about nodes alone, we are missing a key component to complete the picture: the data about interactions. Few experimental techniques exist to measure metabolic fluxes, i.e. the reactions that actually occur in the cell as an end result of the transcriptional, translational, and regulatory effects in a cell. Metabolic fluxes are therefore often estimated through modelling. Mathematical models are created in which equations represent the reactions in the in-silico cell. There are various techniques for analysing these mathematical models to obtain metabolic flux values through the different reactions in the model. Even though mathematical models are an excellent tool for simulating the dynamic reactions occurring within cells, they are notoriously difficult to correct, share, and update. Pathway diagrams, on the other hand, are widely considered useful for representing a process, while maintaining the knowledge about the topology of the process. 
Creating pathway diagrams of mathematical models would not only allow modellers to better understand and update their models; it would also enable modellers and biologists to collaborate better and share knowledge. In this thesis we describe a software plugin for PathVisio that makes this workflow possible. The PathSBML plugin was developed in collaboration with Sriharsha Pamu as part of the Google Summer of Code 2013 program, where I served as a mentor. It converts computational models commonly encoded in the Systems Biology Markup Language (SBML) to pathway diagrams encoded in the Graphical Pathway Markup Language (GPML) format used by WikiPathways and PathVisio. The plugin also allows a direct import of models available from the open access database Biomodels.org. This enables visualizing a model as a pathway diagram, running that model on the online Biomodels website or in other modelling software, and visualizing the model output on the model’s diagram as described in this thesis. However, enabling flux visualization required development of more components. In order to visualize flux data on the reactions and interactions of a pathway in PathVisio, the possibility to annotate the lines signifying such interactions in a pathway was created. Changes to the core of the PathVisio software and the data model for saving a pathway diagram were made in order to allow that. This enabled storing the annotation information about reactions/interactions, similar to how that was already possible for the nodes of the pathway diagrams, i.e. the genes, proteins, and metabolites. For mapping uploaded data onto the diagram an identifier mapping database is needed as described above, which is why a new BridgeDb Derby database was created for mapping reaction and interaction identifiers from the different online data sources. The mappings were obtained from the open access database Rhea. 
Additionally, the IntViz plugin for PathVisio was developed in collaboration with Rhizhou Guo from the Eindhoven University of Technology as part of his Master’s thesis. This plugin adds visualization options for interactions. Rule-based and gradient-based visualization options are now available for visualizing data on the reactions and interactions in a pathway. This plugin also has a slider feature that allows visualizing time series data by sliding through time. However, in order to include flux data in pathway analysis and perform a meaningful analysis on a genome-scale level, a large number of pathways with annotated interactions are necessary. Most interactions in WikiPathways pathways are not annotated yet, but the pathways in Reactome are. A Java-based converter was created that converts Reactome pathways to the GPML format. This allows Reactome to take advantage of the community curation model of the WikiPathways community, in addition to performing pathway analysis using PathVisio, and newly including flux data alongside transcriptomics, proteomics, and metabolomics data. This allows combined statistical pathway analysis (combined enrichment scores) and the results to be quantitatively visualized using PathVisio. This integration will give a more complete overview of key players in a given biological process. This thesis has extended the capabilities of the pathway analysis software PathVisio with a complete toolset, enabling the integration and visualization of interaction data. It has added to the wealth of knowledge available through WikiPathways by adding the human and plant collections of Reactome pathways. This improves pathway analysis capabilities by adding new genes and new proteins to WikiPathways’ already large collection of genes, proteins and metabolites, in addition to interaction annotations. 
These interaction annotations could be mined to automatically annotate other interactions between the same participants in other pathways in WikiPathways. The thesis has also opened the PathVisio software to the modelling community, allowing them to visualize their models and results dynamically. The best way to make an analysis reliable and repeatable is to automate it. In this thesis I also developed PathVisioRPC, an XML-based Remote Procedure Call interface for PathVisio that allows users to directly call PathVisio functions to draw and annotate biological pathways, visualize data on them and perform pathway statistics from within different programming languages. The entire analysis workflow can be automated by writing a script calling the relevant PathVisio functions, creating the possibility for easy integration of pathway analysis into data analysis pipelines. This is further demonstrated in this thesis by creating a pathway analysis module for the existing microarray analysis pipeline ArrayAnalysis.org. The final chapter of this thesis applies the principle of combining flux and gene expression data to investigate differences in the metabolism of metabolically unhealthy obese adults in comparison to metabolically healthy obese adults. The flux data originated from flux balance analysis of a model describing the flux in adipose tissue in the absorptive state, whereas pre-existing array data sets comparing the adipose tissue of metabolically unhealthy obese adults with that of metabolically healthy obese adults were used for gene expression. Pathway analysis was performed to identify the pathways that were significantly affected in metabolically unhealthy obese adults in comparison to metabolically healthy obese adults. Fourteen pathways were found to be significantly different. 
These fourteen pathways were merged into a network, and all the pathways were found to be connected through three central genes (FASN, ACACA, and ACACB) and through microRNAs and transcription factors that target these genes. All three genes were downregulated. The flux data confirm that FASN, ACACA, and ACACB might be important regulators, as non-zero fluxes were obtained for the reactions catalysed by the enzymes encoded by these genes when flux balance analysis was performed using the metabolic model for the adipose tissue. This indicates that the reactions catalysed by the enzymes these genes encode are active in the adipose tissue, since they carry flux. The networks were further enriched with drugs and diseases. The disease associations helped to identify other diseases that people with metabolic syndrome will be prone to develop, such as cardiomyopathy, mental retardation, obesity, and insulin resistance. The drug associations helped to identify drugs currently in use for other diseases, amongst which are Cerulenin, Fomepizole, Mecasermin, Mefloquine, Nedocromil and Quercetin, which have clinical effects that would be desirable in treating metabolic syndrome. The content described in this thesis is a step towards the complete picture of a biological process and enables integration and visualization of metabolic fluxes from mathematical modelling on interactions alongside experimental measurements of genes, proteins, and metabolites on nodes of pathway diagrams or pathway representations of the models themselves.

E-MSD: An integrated data resource for bioinformatics

Article

Jan 2004
NUCLEIC ACIDS RES

The Macromolecular Structure Database (MSD) group (http://www.ebi.ac.uk/msd/) continues to enhance the quality and consistency of macromolecular structure data in the Protein Data Bank (PDB) and to work towards the integration of various bioinformatics data resources. We have implemented a simple form-based interface that allows users to query the MSD directly. The MSD ‘atlas pages’ show all of the information in the MSD for a particular PDB entry. The group has designed new search interfaces aimed at specific areas of interest, such as the environment of ligands and the secondary structures of proteins. We have also implemented a novel search interface that begins to integrate separate MSD search services in a single graphical tool. We have worked closely with collaborators to build a new visualization tool that can present both structure and sequence data in a unified interface, and this data viewer is now used throughout the MSD services for the visualization and presentation of search results. Examples showcasing the functionality and power of these tools are available from tutorial webpages (http://www.ebi.ac.uk/msd-srv/docs/roadshow_tutorial/).

CMLSnap: Animated reaction mechanisms

Article

Oct 2004

Reactions with many steps can be represented by a single XML-based table of the atoms, bonds and electrons. For each step the complete Chemical Markup Language representation of all components is given. These snapshots can then be combined to give an animated description of the complete reaction, both in "2D" chemical structure diagrams and in three dimensions. Here we demonstrate the method's power with enzymatic reactions. Preprint submitted to the Internet Journal of Chemistry and archived as a PRE-REFEREED PREPRINT under the Journal's ROMEO-GREEN policy. The manuscript is an HTML + SVG hyperdocument of many components. The main paper is deposited but the hyperlinks have not been added so will appear broken. A major theme of the article is the animation of reactions using SVG and for this the reader should view the individual document components. (As a last resort they may download the ZIP file, unpack it, and view it in a modern browser). If they do not have SVG they should install a plugin, e.g. from http://www.adobe.com/svg. The MAIN PAPER is paper.html THE MATERIAL IS COPYRIGHT AND MAY NOT CURRENTLY BE ALTERED OR REDISTRIBUTED. For more information, and animated demos, see http://wwmm.ch.cam.ac.uk/moin/CmlSnap

The Protein Data Bank

Article

Feb 2000

The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.

Berman, H, Henrick, K and Nakamura, H. Announcing the worldwide Protein Data Bank. Nat Struct Biol 10: 980

Article

Jan 2004
Nat Struct Biol

mentation will be kept publicly available and the distribution sites will mirror the PDB archive using identical contents and subdirectory structure. However, each member of the wwPDB will be able to develop its own web site, with a unique view of the primary data, providing a variety of tools and resources for the global community. An Advisory Board consisting of appointees from the wwPDB, the International Union of Crystallography and the International Council on Magnetic Resonance in Biological Systems will provide guidance through annual meetings with the wwPDB consortium. This board is responsible for reviewing and determining policy as well as providing a forum for resolving issues related to the wwPDB. Specific details about the Advisory Board can be found in the wwPDB charter, available on the wwPDB web site. The RCSB is the 'archive keeper' of wwPDB. It has sole write access to the PDB archive and control over directory structure and contents, as well as responsibility for distributing new PDB identifiers to all deposition sites. The PDB archive is a collection of flat files in the legacy PDB file format 3 and in the mmCIF 4 format that follows the PDB exchange dictionary (http://deposit.pdb.org/mmcif/). This dictionary describes the syntax and semantics of PDB data that are processed and exchanged during the process of data annotation. It was designed to provide consistency in data produced in structure laboratories, processed by the wwPDB members and used in bioinformatics applications. The PDB archive does not include the websites, browsers, software and database query engines developed by researchers worldwide. The members of the wwPDB will jointly agree to any modifications or extensions to the PDB exchange dictionary. As data technology progresses, other data formats (such as XML) and delivery methods may be included in the official PDB archive if all the wwPDB members concur on the alteration. 
Any new formats will follow the naming and description conventions of the PDB exchange dictionary. In addition, the legacy PDB format would not be modified unless there is a compelling reason for a change. Should such a situation occur, all three wwPDB members would have to agree on the changes and give the structural biology community 90 days advance notice. The creation of the wwPDB formalizes the international character of the PDB and ensures that the archive remains single and uniform. It provides a mechanism to ensure consistent data for software developers and users worldwide. We hope that this will encourage individual creativity in developing tools for presenting structural data, which could benefit the scientific research community in general.

Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes

Article

Jan 1992

E. C. Webb

The Protein Data Bank

Article

Jan 2000

Helen Berman

The Protein Data Bank/ Nucleic Acids Research

Article

Jan 2000

The Protein Data Bank

Article

Dec 1999
NUCLEIC ACIDS RES

The Protein Data Bank, 1999–

Chapter

Jan 2001

In 1998, members of the Research Collaboratory for Structural Bioinformatics became the managers of the Protein Data Bank archive. This chapter details the systems used for the deposition, annotation and distribution of the data in the archive. This chapter is also available as HTML from the International Tables Online site hosted by the IUCr.

CATH—A Hierarchic Classification of Protein Domain Structures

Article

Sep 1997

Protein evolution gives rise to families of structurally related proteins, within which sequence identities can be extremely low. As a result, structure-based classifications can be effective at identifying unanticipated relationships in known structures and in optimal cases function can also be assigned. The ever increasing number of known protein structures is too large to classify all proteins manually, therefore, automatic methods are needed for fast evaluation of protein structures. We present a semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures (CATH). The four main levels of our classification are protein class (C), architecture (A), topology (T) and homologous superfamily (H). Class is the simplest level, and it essentially describes the secondary structure composition of each domain. In contrast, architecture summarises the shape revealed by the orientations of the secondary structure units, such as barrels and sandwiches. At the topology level, sequential connectivity is considered, such that members of the same architecture might have quite different topologies. When structures belonging to the same T-level have suitably high similarities combined with similar functions, the proteins are assumed to be evolutionarily related and put into the same homologous superfamily. Analysis of the structural families generated by CATH reveals the prominent features of protein structure space. We find that nearly a third of the homologous superfamilies (H-levels) belong to ten major T-levels, which we call superfolds, and furthermore that nearly two-thirds of these H-levels cluster into nine simple architectures. A database of well-characterised protein structure families, such as CATH, will facilitate the assignment of structure-function/evolution relationships to both known and newly determined protein structures.
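
The four-level hierarchy described in this abstract can be illustrated with a small sketch. It assumes the commonly used dotted numbering of CATH identifiers (one integer per level, in the order C.A.T.H), which the abstract itself does not spell out. The names `CathCode` and `deepest_shared_level` are hypothetical and are not part of any CATH or MACiE software, and the codes in the usage note are purely illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Level names in hierarchical order, as listed in the abstract.
LEVELS = ("class", "architecture", "topology", "homologous_superfamily")


@dataclass(frozen=True)
class CathCode:
    """Hypothetical container for a four-level C.A.T.H identifier."""
    c: int
    a: int
    t: int
    h: int

    @classmethod
    def parse(cls, text: str) -> "CathCode":
        # Assumption: identifiers are written as four dot-separated integers.
        parts = text.strip().split(".")
        if len(parts) != 4:
            raise ValueError(f"expected four dot-separated levels, got {text!r}")
        return cls(*(int(p) for p in parts))

    def prefix(self, depth: int) -> Tuple[int, ...]:
        """Return the first `depth` levels (1 = C, 2 = C.A, and so on)."""
        return (self.c, self.a, self.t, self.h)[:depth]


def deepest_shared_level(x: CathCode, y: CathCode) -> Optional[str]:
    """Name the deepest level at which two domains still agree, or None."""
    shared = None
    for depth, name in enumerate(LEVELS, start=1):
        if x.prefix(depth) != y.prefix(depth):
            break
        shared = name
    return shared
```

For example, `deepest_shared_level(CathCode.parse("3.40.50.720"), CathCode.parse("3.40.50.300"))` reports that the two domains agree down to the topology level but sit in different homologous superfamilies.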
