Mueller, L.A., Zhang, P. & Rhee, S.Y. AraCyc: A biochemical pathway database for Arabidopsis. Plant Physiol. 132, 453-460

July 2003
Plant Physiology 132(2):453-60

DOI:10.1104/pp.102.017236

AraCyc: A Biochemical Pathway Database for Arabidopsis

Lukas A. Mueller*, Peifen Zhang, and Seung Y. Rhee

The Arabidopsis Information Resource, Department of Plant Biology, Carnegie Institution of Washington, 260 Panama Street, Stanford, California 94305

AraCyc is a database containing biochemical pathways of Arabidopsis, developed at The Arabidopsis Information Resource (http://www.arabidopsis.org). The aim of AraCyc is to represent Arabidopsis metabolism as completely as possible with a user-friendly Web-based interface. It presently features more than 170 pathways that include information on compounds, intermediates, cofactors, reactions, genes, proteins, and protein subcellular locations. The database uses Pathway Tools software, which lets users view the database at any level of detail, from a bird’s eye view of all pathways down to the individual chemical structures of the compounds. The database was built using Pathway Tools’ Pathologic module with MetaCyc, a collection of pathways from more than 150 species, as a reference database. This initial build was manually refined and annotated. More than 20 plant-specific pathways, including carotenoid, brassinosteroid, and gibberellin biosyntheses, have been added from the literature. A list of more than 40 plant pathways will be added in the coming months. The quality of the initial, automatic build of the database was compared with the manually improved version, and with EcoCyc, an Escherichia coli database using the same software system that has been manually annotated for many years. In addition, a Perl interface, PerlCyc, was developed that allows programmers to access Pathway Tools databases from the popular Perl language. AraCyc is available at the tools section of The Arabidopsis Information Resource Web site (http://www.arabidopsis.org/tools/aracyc).

The genome of the flowering plant Arabidopsis was the first plant genome to be fully sequenced (Arabidopsis Genome Initiative, 2000). Initially, approximately 26,000 genes were identified in the genomic sequence, based on different computational methods, and were assigned to functional categories (Arabidopsis Genome Initiative, 2000). About 9% of these genes have been studied experimentally (Arabidopsis Genome Initiative, 2000), and about 32% of all genes in Arabidopsis could not yet be assigned to any functional category (Reiser et al., 2002). From the initial annotation of the genome, it has been estimated that about 4,000 genes may be involved in metabolism (Arabidopsis Genome Initiative, 2000).

In this work, we used Pathway Tools software (Karp et al., 2002a) to build a database for Arabidopsis metabolism. The software allows automatic generation of pathway databases using functional assignment of genes and also allows manual editing of pathways through a graphical user interface. Although most of the functional annotations were derived computationally, we hypothesized that there was enough information to build an initial metabolism database, which could be used to facilitate manual literature curation of genes involved in metabolism.

The Pathway Tools software suite is a comprehensive system to identify, curate, store, and publish biochemical pathways on the Web in the form of pathway genome databases (PGDBs; Karp et al., 2002a). PGDBs contain the entire genomic information of an organism, including its metabolic compounds, reactions, biochemical pathways, enzymes, and enzyme complexes. There are three components in the Pathway Tools: (a) Pathologic, which allows a new PGDB to be built from data sets consisting essentially of gene annotations; (b) Pathway/Genome Editor, which allows pathways to be edited and new pathways to be added; and (c) Pathway/Genome Navigator, which allows users to query and browse the database, both locally and on the Web. The Pathologic analysis predicts the pathways of an organism using a reference PGDB from which pathways are extracted using a pathway-scoring algorithm (Paley and Karp, 2002). The reference PGDB used in this work is MetaCyc (http://metacyc.org; Karp et al., 2002b), a metabolic-pathway database that describes 449 curated pathways and 1,115 enzymes occurring in 158 organisms.

The Pathway Tools system has been applied extensively to annotate microbial genomes and has been optimized to a point where it exceeds expert analyses in comprehensiveness and matches expert analyses in accuracy (Paley and Karp, 2002). However, it was unknown how well it would handle a eukaryotic genome. The software had been applied previously to only one eukaryote, yeast. Because of this limited exposure to eukaryotic organisms, we expected a lower accuracy of the initial database build as compared with prokaryotic databases. A eukaryotic genome is not only more complex but also far larger in scale. A typical bacterium, such as Escherichia coli, contains 4.6 million bp of DNA and has on the order of 4,392 genes; the E. coli pathway/genome database, EcoCyc, lists 164 different pathways and 914 enzymes. In comparison, Arabidopsis has a genome of 125 million bases and comprises more than 26,000 genes, which corresponds to more than 20 times the amount of DNA and almost six times the number of genes in E. coli. The resulting pathway/genome database can therefore be expected to be many-fold more complex than EcoCyc. In addition, eukaryotes have subcellular compartments, many different cell types, and elaborate life cycles with a complex series of developmental stages, which qualitatively increases the complexity of biochemical processes.

In this paper, we describe how AraCyc was initially built, we compare the quality of the resulting database to the version of AraCyc that has been improved through manual verification and annotation, and we compare the overall quality of AraCyc to EcoCyc. We also describe what adaptations to the Pathway Tools software were necessary to better accommodate a eukaryotic organism.

This work was supported by the National Science Foundation (grant no. DBI-9978564) and by the National Institutes of Health (grant no. R01-GM65466-01).

* Corresponding author; e-mail mueller@acoma.stanford.edu; fax 650-325-6857.

Article, publication date, and citation information can be found at www.plantphysiol.org/cgi/doi/10.1104/pp.102.017236.

RESULTS

Pathologic Analysis

The Pathologic module of Pathway Tools was run using Arabidopsis enzyme annotations that were obtained from the Arabidopsis sequencing project (Arabidopsis Genome Initiative, 2000), which were edited manually to remove extraneous words and characters that could interfere with the enzyme name-matching software. A total of about 6,000 genes were retained and formatted for input into Pathologic according to the Pathway Tools documentation (P. Karp and S. Paley, unpublished data). Pathologic recognizes enzyme functions using an enzyme name-matching program and a database of enzyme names and synonyms, and extracts corresponding pathways from the MetaCyc database using a pathway scoring algorithm (see “Materials and Methods”; Paley and Karp, 2002).
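The name-matching and scoring rules themselves are not spelled out in this section (they are described in Paley and Karp, 2002), so the following Python sketch only illustrates the general idea: clean up annotated enzyme names, look them up in a table of names and synonyms, and score each reference pathway by the fraction of its reactions with at least one matched enzyme. The synonym table, reaction and pathway names, gene identifiers, and the 0.5 threshold are all hypothetical and are not taken from Pathologic or MetaCyc.

```python
import re

# Hypothetical name -> reaction table (stand-in for the enzyme name/synonym database).
NAME_TO_REACTION = {
    "hexokinase": "RXN-A",
    "glucokinase": "RXN-A",  # synonym that maps to the same reaction
    "phosphoglucose isomerase": "RXN-B",
    "pyruvate kinase": "RXN-D",
}

# Hypothetical reference pathways: pathway name -> ordered list of reactions.
REFERENCE_PATHWAYS = {
    "toy glycolysis": ["RXN-A", "RXN-B", "RXN-C", "RXN-D"],
    "toy fermentation": ["RXN-D", "RXN-E", "RXN-F"],
}


def normalize(name):
    """Lowercase a name and drop qualifiers such as 'putative' or '-like'."""
    name = name.lower()
    name = re.sub(r"\b(putative|probable)\b", " ", name)
    name = re.sub(r"[-\s]+like\b", " ", name)
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return " ".join(name.split())


def match_reactions(annotations):
    """Turn {gene: annotation text} into {reaction: set of matched genes}."""
    matched = {}
    for gene, text in annotations.items():
        reaction = NAME_TO_REACTION.get(normalize(text))
        if reaction is not None:
            matched.setdefault(reaction, set()).add(gene)
    return matched


def score_pathways(matched, threshold=0.5):
    """Keep pathways whose fraction of matched reactions reaches the threshold."""
    scores = {}
    for pathway, reactions in REFERENCE_PATHWAYS.items():
        hits = sum(1 for reaction in reactions if reaction in matched)
        score = hits / len(reactions)
        if score >= threshold:
            scores[pathway] = score
    return scores


annotations = {
    "gene1": "Putative hexokinase",
    "gene2": "glucokinase",
    "gene3": "Pyruvate kinase, putative",
    "gene4": "kinase",  # too generic to match any reaction
}
matched = match_reactions(annotations)
predicted = score_pathways(matched)
print(predicted)
```

In this toy input, the generic annotation "kinase" matches nothing, which mirrors how generically named enzymes cannot be placed in a pathway.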

Overall Statistics of the Initial Build

Pathologic recognized 1,858 enzymes for which it knew a defined function (roughly 7% of the total number of genes in the genome), and another 1,650 gene products (6.3% of the genome) were identified as putative enzymes (Table I). The putative enzymes comprised both enzymes annotated with generic names such as “kinase,” for which the precise function was unknown, and plant-specific enzymes that were not in MetaCyc, such as “gibberellin oxidase.”

In total, AraCyc contained 173 pathways after the initial build (Table I), containing 767 enzymes and 1,132 reactions (or 750 unique reactions if a reaction that occurs in several pathways is counted only once). One or more enzymes were annotated to 611 (342 unique) reactions, whereas 521 reactions, or 45% (408 unique, or 54%), lacked enzyme annotations. There were thus 883 enzymes with a defined function that were not attributed to any pathway; this was the case for many generic enzymes such as cytochrome P450s, where the reaction is not specific enough to place it in a pathway.
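As a quick arithmetic check, the counts for the initial build can be recombined in a few lines of Python. The numbers are the ones quoted in the text; the helper name `pct` and the use of 26,000 as the genome gene count (the approximate figure from the introduction) are our own choices. Recomputed percentages can differ by about a point from the published, rounded values.

```python
GENOME_GENES = 26_000  # approximate gene count quoted in the introduction


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


defined_enzymes = 1_858
putative_enzymes = 1_650
annotated_total, missing_total = 611, 521      # reactions, counted per pathway
annotated_unique, missing_unique = 342, 408    # reactions, counted once

total_reactions = annotated_total + missing_total     # 1,132 in the text
unique_reactions = annotated_unique + missing_unique  # 750 in the text

print("defined enzymes, % of genome:", pct(defined_enzymes, GENOME_GENES))
print("putative enzymes, % of genome:", pct(putative_enzymes, GENOME_GENES))
print("missing annotations, % of total reactions:", pct(missing_total, total_reactions))
print("missing annotations, % of unique reactions:", pct(missing_unique, unique_reactions))
```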

Statistics of Manual Editing of AraCyc

AraCyc has been manually edited since the automatic build. Curation includes deleting inappropriate pathways, adding missing pathways, and updating existing pathways (for more details, see “Analysis of Pathways”). Twenty-two pathways (or 12.7% of the original 173) were manually deleted from the original Pathway Tools analysis. Among these were low-scoring pathways with few enzymes annotated to them (6 pathways), pathways that were thought not to occur in plants (12 pathways), and close variants of other pathways in the database (4 pathways), which were merged. The complete list of deleted pathways is available on-line (http://www.Arabidopsis.org/tools/aracyc/aracyc.deleted.pathways.html). Five pathways in the database are questionable and are “on hold,” meaning that they may be deleted in the future. Deleting them would bring the total pathways deleted to 27 or 15% of the initial set. Twenty-three new pathways comprising 194 (185 unique) reactions with 212 gene annotations were added, containing 90 (86 unique) missing enzyme annotations (46%, 46% unique). Some pathways that were retrieved from MetaCyc were incomplete. Most notably, the pathway “isopentenyl diphosphate biosynthesis, mevalonate-independent” consisted of only two reactions. The pathway has been completed with four additional reactions (not all of the reactions in the pathway are currently known).

Table I. Summary data for the AraCyc data sets and comparison with EcoCyc

The number of pathways, reactions, genes, and missing enzyme annotations in pathways are given for AraCyc, the initial build of AraCyc, and EcoCyc for comparison. AraCyc has approximately 42% of reactions with missing annotations, down from 45% for the initial build. EcoCyc has only approximately 7% missing annotations. Twenty-two pathways were deleted from the initial build, and 23 new ones were added. More pathways will be added in the future.

                                                AraCyc                  AraCyc Initial Build    EcoCyc
Pathways (excluding superpathways)              174                     173                     164
Reactions, total                                1,096                   1,132                   845
Reactions, unique                               833                     750                     706
Unique genes associated with pathways           958                     767                     695
Missing enzyme annotations, total               469 (42%)               521 (45%)               60 (7%)
Missing enzyme annotations, unique              403 (48%)               408 (54%)               52 (7%)
Genes per annotated reaction                    2.2                     2.2                     1.06
Overlapping pathways with EcoCyc                76                      79                      164
Reactions in overlapping set                    458 (403 unique)        467 (434 unique)        462
Missing enzyme annotations in overlapping set   164 (38%; 151 unique)   204 (43%; 186 unique)   11 (2.3%)
Pathways added manually since initial build     23                      –                       –
Pathways deleted from initial build             22                      –                       –

In total, the AraCyc database presently contains 174 pathways containing 1,096 reactions (833 unique). Of the reactions in the pathways, 469 (43%) are missing enzyme annotations (403 or 48% unique). A total of 958 genes were annotated to one or more reactions. The automatic building process missed mainly annotations of reactions catalyzed by large enzyme complexes with complex subunit compositions such as pyruvate dehydrogenase and ketoglutarate dehydrogenase. The reactions were missed not because Pathologic could not handle them, but rather because the enzyme names in the input files were not always accurately specified.

AraCyc contains an average of 2.2 genes per annotated reaction. This may seem to be a low number, just roughly twice the number of E. coli genes per reaction. However, there were big differences in the number of genes annotated per reaction among the different pathway categories. In Energy Metabolism, the average number was 3.3 genes/reaction, in Degradation 2.5, in Intermediary Metabolism 2.4, and in Biosynthesis 2.07. Between these categories, there were also large differences in the number of reactions that were lacking annotations, indicating that not all of the pathway categories are equally well understood: In the Energy Metabolism category, only 16.5% of reactions lacked annotations, compared with 29% in Biosynthesis, 41% in Intermediary Metabolism, and 58% in Degradation. For example, the glycolysis pathway has no reactions lacking annotations and has an average of 5.1 genes annotated to a reaction. Some reactions in that pathway have more than a dozen annotated genes. In E. coli, glycolysis has an average of 1.6 and a maximum of three genes per reaction. This suggests that certain pathway categories, such as the Energy Metabolism category, have a higher potential degree of regulation than the other pathway categories.

Comparison with EcoCyc

To compare these benchmarks with a database that has been manually curated over a long period of time, we compared AraCyc with the EcoCyc database (Karp et al., 2002c). EcoCyc is a database specific for the metabolism of E. coli and has been manually curated since the mid-1990s. It contains 164 pathways (not counting super-pathways) comprising 845 reactions (706 unique). EcoCyc contains only 60 missing enzyme annotations (54 unique), which means that more than 93% of all reactions have at least one enzyme annotation.

Pathways conserved between AraCyc and EcoCyc have a higher percentage of annotated reactions. We analyzed the 76 pathways (excluding super-pathways) that AraCyc shares with EcoCyc, and we found that they contain a total of 458 reactions (403 unique reactions), 164 of which were missing enzyme annotations (151 unique) in AraCyc. The percentage of missing enzyme annotations in these pathways is therefore only 36% in AraCyc, compared with 43% when all of the pathways in AraCyc are considered. In EcoCyc, these pathways contain only 11 missing annotations (10 unique), or 2%. The pathways that occur in both AraCyc and EcoCyc are therefore a subset that is much better described than the other pathways. A look at the pathways shows that they mostly belong to central, conserved metabolism and include pathways such as glycolysis and biosynthesis of amino acids. The pathways that occur in AraCyc but not in EcoCyc (98 pathways) had 636 reactions (518 unique) and 315 missing annotations (278 unique), or 49.5% (53% unique) missing enzyme annotations.

Analysis of Experimental Evidence for Genes in AraCyc

To estimate how many of the annotations that were used to build the pathways were based solely on predictions from sequence similarity, we counted how many genes had a gene symbol synonym, which we took as a rough indicator that a gene has been characterized experimentally. A list of gene symbol aliases for each locus was obtained from The Arabidopsis Information Resource (TAIR) FTP site (ftp://ftp.arabidopsis.org/Genes/). Of the 958 unique genes currently annotated to pathways in AraCyc, only 155 had synonyms (16%). A large fraction of the gene annotations used in the Pathologic analysis are likely to be based on sequence similarity alone; the accuracy of the functional annotation based on sequence is difficult to estimate and can be verified only with future experimentation.
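The counting step can be sketched in Python as follows. The layout of the alias file is an assumption (one tab-delimited line per alias, locus identifier first, symbol second), and the locus identifiers and symbols in the example are made up; the actual TAIR file format may differ.

```python
import csv
import io


def genes_with_symbol(pathway_genes, alias_text):
    """Return (count, percent) of pathway genes that have at least one alias."""
    reader = csv.reader(io.StringIO(alias_text), delimiter="\t")
    loci_with_alias = {
        row[0].strip().upper()
        for row in reader
        if len(row) >= 2 and row[1].strip()
    }
    genes = {gene.upper() for gene in pathway_genes}
    hits = genes & loci_with_alias
    return len(hits), round(100.0 * len(hits) / len(genes), 1)


# Hypothetical alias lines: locus <TAB> gene symbol (empty symbol = no alias).
alias_text = "AT0G00001\tABC1\nAT0G00002\t\nAT0G00003\tXYZ2\n"
pathway_genes = ["AT0G00001", "AT0G00002", "AT0G00003", "AT0G00004"]

count, percent = genes_with_symbol(pathway_genes, alias_text)
print(count, percent)

# The published counts give the 16% quoted in the text.
print(round(100.0 * 155 / 958, 1))
```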

Analysis of Pathways in AraCyc

The Pathway Tools classification hierarchy defines four categories of pathways at its top level: Biosynthesis, Intermediary Metabolism, Degradation, and Energy Metabolism. In AraCyc, these categories contain 73, 27, 50, and 15 pathways, respectively (Fig. 1). The Biosynthesis class contains the largest number of pathways, largely due to pathways that we added manually since the initial build. In comparison, the largest class in MetaCyc is Degradation. This is probably due to the many bacterial degradation pathways that have been characterized. Overall, however, the distribution of pathways among these classes is very similar between the curated version of AraCyc and EcoCyc.
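The class shares plotted in Figure 1 follow directly from these counts, as the short Python sketch below shows (the function name `class_shares` is ours). Note that the four counts sum to 165, and the 44.2% quoted for Biosynthesis in the Figure 1 legend corresponds to 73 of 165.

```python
def class_shares(counts):
    """Percentage of pathways in each top-level class, to one decimal place."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Pathway counts per top-level class in the curated AraCyc, as given in the text.
aracyc_counts = {
    "Biosynthesis": 73,
    "Intermediary Metabolism": 27,
    "Degradation": 50,
    "Energy Metabolism": 15,
}

shares = class_shares(aracyc_counts)
for name, share in shares.items():
    print(f"{name}: {share}%")
```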

In the Biosynthesis category of AraCyc, all amino acid biosyntheses have been inferred by Pathologic except the biosynthesis for Glu. The biosyntheses of Tyr and Phe that were inferred did not correspond to the plant versions; in plants, the biosynthesis of these amino acids is assumed to go through arogenate, instead of prephenate (Jung et al., 1986). Noticeably missing were biosynthesis of phospholipids and the mevalonate pathway. The latter is important as a precursor for terpenoid biosynthesis and was copied from MetaCyc to AraCyc manually. MetaCyc also classifies phytoalexin, flavonoid, and mevalonate metabolism under the “Fatty Acid and Lipids” class. These three pathways were also inferred in AraCyc, but subsequently moved to the newly created “Secondary Metabolite Biosynthesis” class, which was added under Biosynthesis. Apart from the secondary metabolites (flavonoids, phytoalexins) inferred under the Fatty Acid and Lipids class, no plant secondary metabolite pathways were in MetaCyc and therefore could not be identified by Pathologic. Therefore, the following pathways have been added manually: carotenoid biosynthesis, camalexin biosynthesis, and phenylpropanoid ester biosynthesis.

The Flavonoid biosynthetic pathway had to be modified extensively; the original pathway contained errors and was not very comprehensive. The phytoalexin pathway is almost an exact copy of that initial flavonoid pathway and will probably be deleted in the future. Chlorophyll biosynthesis was newly created under “heme biosynthesis.” Conspicuously, “NAD biosynthesis” was not inferred and was added manually from MetaCyc. Polyisoprenoid metabolism was moved to the “Terpenoid Biosynthesis” class under the “Secondary Metabolites” class. Both pyrimidine and purine biosyntheses were inferred correctly. In addition, under the newly created class “Plant Hormone Biosynthesis,” we added cytokinin, brassinosteroid, jasmonic acid, gibberellin, and abscisic acid biosynthesis pathways. The biosynthesis of ethylene was inferred correctly by Pathologic and moved to the Plant Hormones class.

The Energy Metabolism class contained 15 pathways, including glycolysis (2 instances, of which one [glycolysis 2] was deleted as a duplicate variant), the tricarboxylic acid cycle, and the Calvin cycle. In plants, glycolysis can use pyrophosphate instead of ATP for the phosphorylation of Fru. These plant-specific features will be added to AraCyc manually in the future. Several fermentation pathways were also inferred, most of which have been deleted due to insufficient evidence; the fermentation pathways usually contained some of the preceding glycolysis reactions that obviously had good evidence, but the actual fermentation reactions had no enzyme matches. The two fermentation pathways that were not deleted from the database, “Glc fermentation” and “anaerobic fermentation,” had good evidence for the fermentation-specific reactions. In general, fermentation reactions seem to be less well studied in Arabidopsis than other metabolic processes; some of the fermentation pathways we deleted due to insufficient evidence may have to be restored in the future as we learn more about them.

Intermediary Metabolism contained 27 pathways, including carnitine metabolism. Interestingly, although there is a carnitine metabolism pathway in MetaCyc, there is no carnitine biosynthetic pathway. Carnitine accumulates in many plants (Panter and Mudd, 1969), although its presence in Arabidopsis is uncertain. The 50 pathways in the Degradation class did not include the amino acid degradation pathways for Gln, His, Phe, and Pro. Pathologic found evidence for several pathways for xenobiotics degradation such as “pentachlorophenol degradation pathway.” Most of these pathways are known to exist in certain bacteria but are unlikely in plants. Not all have been deleted from the database yet, because some contain a large number of enzyme annotations. These pathways could potentially be present in Arabidopsis but remain to be characterized; they could therefore represent pathways discovered by Pathologic.

Figure 1. Comparison of pathway distribution between AraCyc, AraCyc initial build, MetaCyc, and EcoCyc. The numbers of pathways in the different top-level classifications (Biosynthesis, Energy Metabolism, Intermediary Metabolism, and Degradation) are shown as pie charts. The major class in AraCyc is the biosynthesis class with 73 or 44.2% of pathways, up from 38.4% in the initial build. The major class in MetaCyc is the degradation class, probably reflecting the many bacterial degradation pathways that are known. The distribution within these classes in EcoCyc is, however, very similar to AraCyc.

Modification of the Controlled Vocabularies of Pathway Tools for AraCyc

Because the Pathway Tools software has been used primarily to describe metabolism of prokaryotic organisms, the descriptions of intracellular structures in the database were limited and had to be extended for use with Arabidopsis. The cellular compartment ontology consisted of only five different keywords: periplasm, membrane, inner-membrane, outer-membrane, and mitochondria. We extended this vocabulary to represent eukaryotic structures and plant organelles; it now comprises 35 terms, including chloroplast, the inner structures of the chloroplast, endoplasmic reticulum, nucleus, etc. The complete list can be found on-line (http://www.arabidopsis.org/tools/aracyc/intracellular.html). These modifications were also adopted by the MetaCyc database. In the future, it may be desirable to integrate the Gene Ontology (The Gene Ontology Consortium, 2001; http://www.geneontology.org) system into the Pathway Tools.

Another limitation of the Pathway Tools software is the lack of support for different tissues and developmental stages, for which TAIR has developed the necessary ontologies. For example, a pathway may be active only in a subset of tissues and/or at certain developmental stages, but this information cannot yet be captured in the database.

Modification of the Classification Hierarchies in Pathway Tools

We modified the chemical compound hierarchy to better accommodate plant metabolism, adding Plant Hormones and Secondary Metabolites as new classes.

PerlCyc

Pathway Tools is written in Lisp, a powerful language that is popular in the artificial intelligence community. The most popular language in biology is probably Perl, due to its simplicity and built-in string handling features such as regular expressions. To facilitate access to the internal Pathway Tools functions for tasks such as automated queries and batch-loading of data, we implemented a Perl module called perlcyc.pm. The module is available for download at http://www.arabidopsis.org/tools/aracyc/perlcyc. PerlCyc allows the user to write small programs in Perl that formulate more complex queries, such as: How many reactions have multiple enzyme annotations that include enzymes located in both the cytoplasm and the chloroplast?
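To make the logic of such a query concrete, here is a minimal sketch written in Python over an in-memory data structure. It does not use PerlCyc or any Pathway Tools function; the `Enzyme` and `Reaction` classes, the gene names, and the location labels are hypothetical stand-ins for whatever the database would return.

```python
from dataclasses import dataclass, field


@dataclass
class Enzyme:
    gene: str
    locations: set = field(default_factory=set)


@dataclass
class Reaction:
    name: str
    enzymes: list = field(default_factory=list)


def reactions_spanning(reactions, loc_a, loc_b):
    """Names of reactions with several enzymes that together cover both locations."""
    result = []
    for reaction in reactions:
        if len(reaction.enzymes) < 2:
            continue  # the query asks for multiple enzyme annotations
        seen = set().union(*(enzyme.locations for enzyme in reaction.enzymes))
        if loc_a in seen and loc_b in seen:
            result.append(reaction.name)
    return result


# Hypothetical records for illustration only.
reactions = [
    Reaction("rxn1", [Enzyme("geneA", {"cytoplasm"}), Enzyme("geneB", {"chloroplast"})]),
    Reaction("rxn2", [Enzyme("geneC", {"cytoplasm"}), Enzyme("geneD", {"cytoplasm"})]),
    Reaction("rxn3", [Enzyme("geneE", {"cytoplasm", "chloroplast"})]),
]

spanning = reactions_spanning(reactions, "cytoplasm", "chloroplast")
print(len(spanning), spanning)
```

A real script would replace the hand-built list with records fetched from the database; only the filtering step is shown here.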

DISCUSSION

We have built a database for Arabidopsis metabolic pathways using the Pathway Tools software. The automatic build was edited by manual curation and addition of Arabidopsis-specific pathways. The database contains 174 pathways (excluding super-pathways) comprising more than 1,000 reactions and 958 different enzyme annotations. The database contains 822 metabolic compounds. Our aim is to represent the metabolism of Arabidopsis in AraCyc to the extent that it is known through ongoing manual curation efforts. AraCyc will help to pinpoint the gaps in our understanding of Arabidopsis metabolism and help researchers to fill them. The database is also a tool for the annotation of Arabidopsis biochemical enzymes, a resource for researchers who want to explore Arabidopsis metabolism, and a tool for teaching plant metabolism.

A noteworthy feature of Pathway Tools is the integrated expression viewer that allows expression data from microarray or DNA chip experiments to be visualized on the metabolic overview diagram (Fig. 2). In this example, we took data from a previously published microarray experiment (Arabidopsis Functional Genomics Consortium experiment no. 10615; Ramonell et al., 2002). A number of differentially expressed enzymes can clearly be distinguished. The expression viewer is also available through the TAIR Web site.
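The general idea behind such an overlay can be sketched in Python: read a tab-delimited file of expression values, derive the scale from the data, and assign each reaction a color from the expression of its genes. This is not the Pathway Tools implementation. The two-column file layout, the gene and reaction names, the averaging of several genes per reaction, and the blue-to-red color ramp are all assumptions made for illustration.

```python
import csv
import io


def read_expression(text):
    """Parse 'gene<TAB>value' lines; lines without a numeric value are skipped."""
    values = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 2:
            continue
        try:
            values[row[0].strip()] = float(row[1])
        except ValueError:
            continue  # e.g. a header line
    return values


def color_for(value, lo, hi):
    """Linear blue (low) to red (high) ramp, returned as a hex color string."""
    t = 0.5 if hi == lo else (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))
    red = round(255 * t)
    blue = round(255 * (1 - t))
    return f"#{red:02x}00{blue:02x}"


def reaction_colors(reaction_genes, expression):
    """Color each reaction by the mean expression of its measured genes."""
    lo, hi = min(expression.values()), max(expression.values())
    colors = {}
    for reaction, genes in reaction_genes.items():
        levels = [expression[g] for g in genes if g in expression]
        if levels:
            colors[reaction] = color_for(sum(levels) / len(levels), lo, hi)
    return colors


# Hypothetical input file and gene-to-reaction assignments.
text = "gene\tlog_ratio\ng1\t-2.0\ng2\t2.0\ng3\t0.0\n"
reaction_genes = {"rxnA": ["g1"], "rxnB": ["g2"], "rxnC": ["g9"]}

expression = read_expression(text)
colors = reaction_colors(reaction_genes, expression)
print(colors)
```

Reactions whose genes have no measurement (rxnC here) are simply left uncolored, and the scale endpoints are taken from the minimum and maximum of the uploaded values.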

A comparison with EcoCyc shows that the AraCyc database has many more reactions lacking annotations than EcoCyc. EcoCyc has only 7% of reactions lacking annotations as compared with 43% for AraCyc. This may reflect the research priorities of the Arabidopsis community to some extent. Other areas of research, such as development and disease resistance, seem to be studied more extensively in this organism than metabolism.

For primary plant metabolism, Arabidopsis should be an excellent model system for other dicots. For secondary metabolites, Arabidopsis can only be a model for the 36 secondary metabolites it has been shown to produce (Chapple et al., 1994). They fall into four classes: flavonoids, hydroxycinnamic acid esters, glucosinolates, and indole phytoalexins. Important classes of plant secondary metabolites such as alkaloids and terpene secondary metabolites have not yet been identified in Arabidopsis. Whether these compounds are not produced by Arabidopsis or have not yet been detected is an open question. The Arabidopsis genomic sequence reveals sequences homologous to enzymes that are involved in the biosynthesis of terpenes and alkaloids (Arabidopsis Genome Initiative, 2000). Recently, two myrcene/(E)-beta-ocimene synthases have been cloned from Arabidopsis that had enzymatic activity when expressed in E. coli (Bohlmann et al., 2000). For the biosynthesis of alkaloids, there are 10 enzymes annotated as berberine bridge enzyme in Arabidopsis, although it is not known if they are expressed or can form active enzymes. Conversely, not all pathways that are known to operate in Arabidopsis are well characterized. In the biosynthetic pathway of camalexin, the major indole phytoalexin in Arabidopsis, only one gene has been cloned, and even the precursor molecule is uncertain. Clearly, more research is needed to define the metabolic complement of Arabidopsis.

How complete is AraCyc now and when will it be finished? One way to estimate the completeness is to compare the number of estimated metabolic enzymes to the number of enzymes stored in AraCyc. It has been estimated that approximately 4,000 enzymes are involved in metabolism in Arabidopsis (Arabidopsis Genome Initiative, 2000). However, this number should be considered as an upper limit, because it is likely to include kinases, phosphatases, etc., that are specific for proteins and not for small-molecule metabolism. In AraCyc, Pathologic identified 1,850 enzymes with a defined biochemical function and a further 1,650 probable enzymes (again, most of which were annotated to imprecise functions such as “kinase,” which may not be specific to small-molecule metabolism), for a total of 3,500 enzymes. Presently, there are 958 different enzymes annotated to one or more pathways. Hence, AraCyc could presently be considered one-fourth complete to the extent of what is known. Considering that roughly one-half of the reactions do not have annotations, just filling in the missing reactions should bring completeness to one-half (assuming that the average number of enzymes per reaction is similar for the missing annotations). The rest of the enzymes would probably be in pathways that are not yet in AraCyc. Again assuming that these additional pathways have a distribution of reactions and annotations similar to the present ones, the complete AraCyc database reflecting the current knowledge would have an upper limit of just more than 300 pathways.
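The back-of-the-envelope estimate above can be reproduced in a few lines of Python. The inputs are the figures from the text; the function and variable names are ours, and the doubling step encodes the stated assumption that roughly one-half of the reactions lack annotations and that the missing reactions would carry a similar number of enzymes.

```python
def completeness_estimate(estimated_enzymes, annotated_enzymes, pathways):
    """Reproduce the rough completeness and pathway-count estimate."""
    # Fraction of the estimated enzymes that are already placed in pathways.
    now = annotated_enzymes / estimated_enzymes
    # Filling the roughly one-half of reactions without annotations doubles the count.
    filled_enzymes = 2 * annotated_enzymes
    after_filling = filled_enzymes / estimated_enzymes
    # Same enzymes-per-pathway ratio assumed for pathways not yet in the database.
    enzymes_per_pathway = filled_enzymes / pathways
    projected_pathways = estimated_enzymes / enzymes_per_pathway
    return now, after_filling, projected_pathways


now, after_filling, projected_pathways = completeness_estimate(
    estimated_enzymes=1_850 + 1_650,  # 3,500 enzymes identified by Pathologic
    annotated_enzymes=958,
    pathways=174,
)
print(f"complete now: {now:.0%}")
print(f"after filling missing reactions: {after_filling:.0%}")
print(f"projected number of pathways: {projected_pathways:.0f}")
```

Under these assumptions the projection comes out at a little over 300 pathways, in line with the upper limit given in the text.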

Figure 2. Overlaying expression data on the overview diagram. The overview diagram gives a bird’s eye view of all of the pathways in the database. The pathways are shown as glyphs consisting of nodes, which represent the metabolites, and lines, which represent the reactions. Expression data can be uploaded as a simple tab-delimited file. The lines representing the reactions are painted in a color relative to the expression level, with a dynamically generated scale depicted on the right side of the screen. For this example, data from a published data set (Ramonell et al., 2002), downloaded from the Arabidopsis Functional Genomics Consortium site, were used.

In another attempt to estimate completeness, we compared the compounds in AraCyc with compounds identified in a metabolic profiling experiment. In the experiment, which analyzed the metabolites found in leaves, hundreds of compounds were resolved, of which 94 could be identified (Fiehn et al., 2000). Of these 94 compounds, 24 were not found in AraCyc. This seems like a large fraction considering that the metabolic profiling experiment identified relatively simple low-M_r compounds and did not detect the many complex molecules that are present in plant cells. However, 15 of the 94 compounds were classified by the authors as “uncommon plant metabolites” that had never before been seen in plants. These 15 compounds were all part of the 24 missing compounds; the nine remaining compounds, which included mostly sugars that are likely involved in cell wall biosynthesis, point us to pathways that we will have to add to AraCyc in the near future. Additional profiling experiments will be a great help in verifying AraCyc completeness in the future, when the technology will allow more compounds to be identified.
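The counts in this comparison can be turned into coverage figures with a short calculation. The sketch below uses only the numbers quoted above. The two coverage ratios are derived here for illustration and are not reported in the text, and the variable names are illustrative.

```python
IDENTIFIED = 94  # compounds identified by Fiehn et al. (2000)
MISSING = 24     # identified compounds not found in AraCyc
UNCOMMON = 15    # "uncommon plant metabolites", all among the missing ones

found = IDENTIFIED - MISSING         # compounds already in AraCyc
remaining_gaps = MISSING - UNCOMMON  # missing compounds that are not "uncommon"

# Coverage over all identified compounds, and over the subset that
# excludes the uncommon metabolites.
coverage_all = found / IDENTIFIED
coverage_common = found / (IDENTIFIED - UNCOMMON)

print(f"found in AraCyc: {found}, remaining gaps: {remaining_gaps}")
print(f"coverage (all identified): {coverage_all:.0%}")
print(f"coverage (excluding uncommon metabolites): {coverage_common:.0%}")
```

Seen this way, the nine remaining compounds are the practical gap, and coverage of the commonly observed metabolites is noticeably higher than the raw 24-of-94 figure suggests.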

In the coming months, we will add approximately 30 pathways (refer to http://www.Arabidopsis.org/tools/aracyc), with a focus on carbohydrate and lipid biosynthesis, bringing the total of pathways to more than 200. Of course, many pathways are presently in a canonical form and will have to be extended to reflect the peculiarities of Arabidopsis metabolism. For example, the genes that are known to be involved in the biosynthesis of anthocyanin pigments, which is relatively well-studied in Arabidopsis, account for the biosynthesis of cyanidin 3-glucoside. The major anthocyanin in Arabidopsis, however, has been shown to be cyanidin 3-O-[2-O(2-O-(sinapoyl)-β-D-xylopyranosyl)-6-O-(4-O-(β-D-glucopyranosyl)-p-coumaroyl-β-D-glucopyranoside] 5-O-[6-O-(malonyl) β-D-glucopyranoside] (Bloor and Abrahams, 2002), which is a long way from cyanidin 3-glucoside.

CONCLUSIONS

AraCyc still has a way to go to be on a par with databases such as EcoCyc in annotation quality. At least partially, this may be because the metabolism of Arabidopsis is not as well described as the metabolism of E. coli. The Pathway Tools system has, however, permitted us to construct a relatively high-quality, comprehensive database of Arabidopsis metabolism in a short time, forming an excellent basis for further refinement through manual corrections, curation, and experimentation. Pathways that are added to AraCyc are also added to the MetaCyc database, so that these pathways will be available for future database builds for other plant species. AraCyc is available on the TAIR Web site (Huala et al., 2001).

MATERIALS AND METHODS

Pathway Tools Installation

The Pathway Tools software was downloaded from the Web via a link provided by SRI International (Menlo Park, CA). For information on obtaining Pathway Tools, contact ptools-info@ai.sri.com. The installation was performed according to the instructions provided. The hardware consisted of a SunBlade 100 workstation from Sun Microsystems (Palo Alto, CA) running Solaris 8. Pathway Tools can be run with an Oracle database backend or with flat files for data persistence. In this work, the flat-file version was used. The choice between the two modes is completely transparent to the user. The flat-file version is easier to install and is cheaper because it does not require the purchase of an Oracle license.

Initial Build of AraCyc

The steps in generating AraCyc are described below and outlined as a flowchart in Figure 3.

Input Files

The Institute for Genomic Research’s Arabidopsis genome annotation data (http://www.tigr.org) were manually edited to include only enzyme names. Enzymes labeled as “putative” or “similar to” were also included in the data set. Any string that might interfere with the enzyme name-matching algorithm of Pathologic was removed. These strings included descriptions of subcellular locations or gene names following the enzyme name. The edited list was then formatted into the Pathologic-specific file format, which requires one file per chromosome describing its genes and one file describing the number and nature of the chromosomes (such as whether each chromosome is circular or linear; P. Karp and S. Paley, unpublished data). Only nuclear-encoded genes were included in the data set.
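The editing described here was done by hand. For illustration only, a scripted approximation of the same clean-up might look like the sketch below. The two regular expressions, the list of location words, the function name clean_enzyme_name, and the example annotations are all hypothetical. They do not reproduce the actual edits or any Pathologic input format.

```python
import re

# Hypothetical patterns: a trailing subcellular location after a comma,
# and a trailing gene name in parentheses. The real edits were manual.
LOCATION_SUFFIX = re.compile(
    r",\s*(chloroplast|mitochondrial|cytosolic|peroxisomal)\b.*$",
    re.IGNORECASE,
)
GENE_NAME_SUFFIX = re.compile(r"\s*\([A-Za-z0-9-]+\)\s*$")


def clean_enzyme_name(annotation: str) -> str:
    """Strip trailing location and gene-name strings from an annotation.

    Qualifiers such as "putative" are left in place, because entries
    labeled that way were kept in the data set.
    """
    name = annotation.strip()
    name = LOCATION_SUFFIX.sub("", name)
    name = GENE_NAME_SUFFIX.sub("", name)
    return name.strip()


# Made-up example annotations.
examples = [
    "putative malate dehydrogenase, mitochondrial",
    "alcohol dehydrogenase (ADH1)",
    "malate dehydrogenase (MDH1), chloroplast",
]
cleaned = [clean_enzyme_name(text) for text in examples]
for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```

A rule-based pass of this kind would still need manual review, because annotation strings vary far more than two patterns can capture.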

Running Pathologic

Pathologic imports the genes and proteins described by the input files into a new database that is structured using the Pathway Tools schema and then matches the enzymes listed in the annotated genome against the enzymes required by every pathway in the reference pathway database, MetaCyc (http://metacyc.org; Karp et al., 2002b). The program assesses the pathways using a pathway-scoring algorithm, and only those pathways with significant scores are imported into the new PGDB. The scoring and pathway import algorithms have been described elsewhere (Paley and Karp, 2002).

Figure 3. Building AraCyc. AraCyc was built using a selection of The Institute for Genomic Research gene models that were annotated as enzymes or putative enzymes. These annotations were formatted into a Pathologic-specific format according to the documentation for Pathway Tools (P. Karp and S. Paley, unpublished data) and then analyzed with Pathologic, using MetaCyc as a reference database. The resulting database, AraCyc initial build, was then manually curated, resulting in AraCyc.

Pathologic generates reports that summarize the amount of evidence supporting each pathway predicted to be present in the new PGDB and that list the “pathway holes,” i.e. the enzymes missing from each predicted pathway. This information can help the curator decide on which of the pathways imported by Pathologic should be kept in the database.
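The matching and reporting steps can be illustrated with a deliberately simplified sketch. It is not the Pathologic scoring algorithm (see Paley and Karp, 2002). The fraction-of-reactions score, the 0.5 threshold, the names PathwayReport and assess_pathways, and the toy pathway data are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PathwayReport:
    pathway: str
    score: float            # fraction of required enzymes found in the genome
    holes: tuple[str, ...]  # required enzymes with no match ("pathway holes")


def assess_pathways(reference, genome_enzymes, threshold=0.5):
    """Score each reference pathway against the annotated enzyme names.

    reference maps a pathway name to the enzyme names it requires.
    Pathways scoring below the (hypothetical) threshold are not imported.
    """
    known = {name.lower() for name in genome_enzymes}
    reports = []
    for pathway, required in reference.items():
        if not required:
            continue  # nothing to match against
        holes = tuple(e for e in required if e.lower() not in known)
        score = 1 - len(holes) / len(required)
        if score >= threshold:
            reports.append(PathwayReport(pathway, score, holes))
    return sorted(reports, key=lambda r: r.score, reverse=True)


# Toy data, not taken from MetaCyc.
reference = {
    "toy glycolysis": ["hexokinase", "phosphofructokinase", "pyruvate kinase"],
    "toy degradation": ["enzyme A", "enzyme B", "enzyme C"],
}
genome = ["Hexokinase", "pyruvate kinase", "enzyme A"]

reports = assess_pathways(reference, genome)
for report in reports:
    print(report.pathway, f"{report.score:.2f}", "holes:", list(report.holes))
```

In this toy run only the first pathway passes the threshold, and its single hole is the kind of entry a curator would then check against the literature.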

Modifying the Object Class Structures

The Pathway Tools classification hierarchy for biosynthetic pathways and chemical compounds was modified to accommodate plant pathways using the built-in editing tool in Pathway Tools, GKB-Editor. In addition, some attributes, such as the subcellular location attribute of enzymes, were modified for use with eukaryotic and plant cells, using the GKB-Editor.

Manual Annotation

The manual curation process includes both editing existing pathways and adding new pathways. Information from the literature is collected and added to the pathway, reaction, compound, enzyme, and gene frames. For a pathway, we add a summary of what it does and a short description of its significance. Regarding reactions, we add, if known, EC number, free energy, whether the reaction is novel or hypothetical, and whether it is spontaneous in vitro or in vivo. For compounds, chemical structures are added if they are not already in the database. For enzymes, subcellular location, native M_r, subunit composition, subunit M_r, known cofactors, activators, and inhibitors are added. The K_m, K_i, optimum pH, and optimum temperature of an enzyme are added if known. If an enzyme is a complex of multiple subunits, comments on the role of each subunit are added. For enzyme isoforms, we capture the substrate specificity, tissue/cell type, and developmental stage specificity. Genes are linked to TAIR locus detail pages by their locus identification. Finally, synonyms of pathways, reactions, enzymes, genes, and compounds are added, and literature citations are provided by entering PubMed identifiers.
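The kind of information collected for each enzyme can be pictured as a simple record. The sketch below is a hypothetical Python structure, not the Pathway Tools frame schema: the class name, field names, and the helper missing_fields are assumptions, and only a subset of the attributes listed above is shown. The example uses chalcone synthase (EC 2.3.1.74, locus AT5G13930) with most fields deliberately left empty.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class EnzymeAnnotation:
    """Hypothetical curation record; not the Pathway Tools schema."""

    name: str
    locus_id: str  # used to link the gene to its TAIR locus page
    ec_number: Optional[str] = None
    subcellular_location: Optional[str] = None
    cofactors: list[str] = field(default_factory=list)
    inhibitors: list[str] = field(default_factory=list)
    optimum_ph: Optional[float] = None
    pubmed_ids: list[str] = field(default_factory=list)


def missing_fields(record: EnzymeAnnotation) -> list[str]:
    """List the attributes a curator has not yet filled in."""
    return [
        f.name
        for f in fields(record)
        if getattr(record, f.name) in (None, [], "")
    ]


# Illustrative record with most attributes still to be curated.
chs = EnzymeAnnotation(
    name="chalcone synthase",
    locus_id="AT5G13930",
    ec_number="2.3.1.74",
)
todo = missing_fields(chs)
print("still to curate:", todo)
```

A helper like missing_fields gives a quick picture of how much of a record remains to be curated from the literature.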

ACKNOWLEDGMENTS

We thank Peter Karp, Suzanne Paley, John Pick, Cindy Krieger, and Pepe Romero from SRI for their help in carrying out this work and Peter Karp, Leonore Reiser, and Margarita Garcia-Hernandez for critically reading the manuscript.

This is Carnegie Institution of Washington, DPB, publication no. 1623.

Received November 5, 2002; returned for revision December 11, 2002; accepted February 7, 2003.

LITERATURE CITED

Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815

Bloor SJ, Abrahams S (2002) The structure of the major anthocyanin in Arabidopsis thaliana. Phytochemistry 59: 343–346

Bohlmann J, Martin D, Oldham NJ, Gershenzon J (2000) Terpenoid secondary metabolism in Arabidopsis thaliana: cDNA cloning, characterization, and functional expression of a myrcene/(E)-beta-ocimene synthase. Arch Biochem Biophys 375: 261–269

Chapple C, Shirley B, Zook M, Hammerschmidt R, Somerville S (1994) Secondary metabolism in Arabidopsis. In E Meyerowitz, C Somerville, eds, Arabidopsis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, pp 989–1030

Fiehn O, Kopka J, Trethewey RN, Willmitzer L (2000) Identification of uncommon plant metabolites based on calculation of elemental compositions using gas chromatography and quadrupole mass spectrometry. Anal Chem 72: 3573–3580

Gene Ontology Consortium (2001) Creating the gene ontology resource: design and implementation. Genome Res 11: 1425–1433

Huala E, Dickerman A, Garcia-Hernandez M, Weems D, Reiser L, LaFond F, Hanley D, Kiphart D, Zhuang J, Huang W et al. (2001) The Arabidopsis Information Resource (TAIR): a comprehensive database and Web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res 29: 102–105

Jung E, Zamir LO, Jensen RA (1986) Chloroplasts of higher plants synthesize L-phenylalanine via L-arogenate. Proc Natl Acad Sci USA 83: 7231–7235

Karp P, Paley S, Romero P (2002a) The Pathway Tools software. Bioinformatics 18 (Suppl 1): S225–S232

Karp PD, Riley M, Paley SM, Pellegrini-Toole A (2002b) The MetaCyc Database. Nucleic Acids Res 30: 59–61

Karp PD, Riley M, Saier M, Paulsen IT, Collado-Vides J, Paley SM, Pellegrini-Toole A, Bonavides C, Gama-Castro S (2002c) The EcoCyc Database. Nucleic Acids Res 30: 56–58

Paley SM, Karp PD (2002) Evaluation of computational metabolic-pathway predictions for Helicobacter pylori. Bioinformatics 18: 715–724

Panter RA, Mudd JB (1969) Carnitine levels in some higher plants. FEBS Lett 5: 169–170

Ramonell K, Zhang B, Ewing R, Chen Y, Xu D, Stacey G, Somerville S (2002) Microarray analysis of chitin elicitation in Arabidopsis thaliana. Mol Plant Pathol 3: 301–311

Reiser L, Mueller LA, Rhee SY (2002) Surviving in a sea of data: a survey of plant genome data resources and issues in building data management systems. Plant Mol Biol 48: 59–74

Mueller et al.

460 Plant Physiol. Vol. 132, 2003

Stress Knowledge Map: A knowledge graph resource for systems biology analysis of plant stress responses

Preprint

Full-text available

Nov 2023

Stress Knowledge Map (SKM, https://skm.nib.si) is a publicly available resource containing two complementary knowledge graphs describing current knowledge of biochemical, signalling, and regulatory molecular interactions in plants: a highly curated model of plant stress signalling (PSS, 543 reactions) and a large comprehensive knowledge network (CKN, 488,390 interactions). Both were constructed by domain experts through systematic curation of diverse literature and database resources. SKM provides a single entrypoint for plant stress response investigations and the related growth tradeoffs. SKM provides interactive exploration of current knowledge. PSS is also formulated as qualitative and quantitative models for systems biology, and thus represents a starting point of a plant digital twin. Here, we describe the features of SKM and show, through two case studies, how it can be used for complex analyses, including systematic hypothesis generation, design of validation experiments, or to gain new insights into experimental observations in plant biology.

Integrated Proteomics and Metabolomics of Safflower Petal Wilting and Seed Development

Article

Full-text available

Mar 2024

Safflower (Carthamus tinctorius L.) is an ancient oilseed crop of interest due to its diversity of end-use industrial and food products. Proteomic and metabolomic profiling of its organs during seed development, which can provide further insights on seed quality attributes to assist in variety and product development, has not yet been undertaken. In this study, an integrated proteome and metabolic analysis have shown a high complexity of lipophilic proteins and metabolites differentially expressed across organs and tissues during seed development and petal wilting. We demonstrated that these approaches successfully discriminated safflower reproductive organs and developmental stages with the identification of 2179 unique compounds and 3043 peptides matching 724 unique proteins. A comparison between cotyledon and husk tissues revealed the complementarity of using both technologies, with husks mostly featuring metabolites (99%), while cotyledons predominantly yielded peptides (90%). This provided a more complete picture of mechanisms discriminating the seed envelope from what it protected. Furthermore, we showed distinct molecular signatures of petal wilting and colour transition, seed growth, and maturation. We revealed the molecular makeup shift occurring during petal colour transition and wilting, as well as the importance of benzenoids, phenylpropanoids, flavonoids, and pigments. Finally, our study emphasizes that the biochemical mechanisms implicated in the growing and maturing of safflower seeds are complex and far-reaching, as evidenced by AraCyc, PaintOmics, and MetaboAnalyst mapping capabilities. This study provides a new resource for functional knowledge of safflower seed and potentially further enables the precision development of novel products and safflower varieties with biotechnology and molecular farming applications.

Multi-label classification with XGBoost for metabolic pathway prediction

Article

Full-text available

Feb 2024
BMC BIOINFORMATICS

Background Metabolic pathway prediction is one possible approach to address the problem in system biology of reconstructing an organism’s metabolic network from its genome sequence. Recently there have been developments in machine learning-based pathway prediction methods that conclude that machine learning-based approaches are similar in performance to the most used method, PathoLogic which is a rule-based method. One issue is that previous studies evaluated PathoLogic without taxonomic pruning which decreases its performance. Results In this study, we update the evaluation results from previous studies to demonstrate that PathoLogic with taxonomic pruning outperforms previous machine learning-based approaches and that further improvements in performance need to be made for them to be competitive. Furthermore, we introduce mlXGPR, a XGBoost-based metabolic pathway prediction method based on the multi-label classification pathway prediction framework introduced from mlLGPR. We also improve on this multi-label framework by utilizing correlations between labels using classifier chains. We propose a ranking method that determines the order of the chain so that lower performing classifiers are placed later in the chain to utilize the correlations between labels more. We evaluate mlXGPR with and without classifier chains on single-organism and multi-organism benchmarks. Our results indicate that mlXGPR outperform other previous pathway prediction methods including PathoLogic with taxonomic pruning in terms of hamming loss, precision and F1 score on single organism benchmarks. Conclusions The results from our study indicate that the performance of machine learning-based pathway prediction methods can be substantially improved and can even outperform PathoLogic with taxonomic pruning.

Chemical Composition of Commercial Cannabis

Article

Jan 2024
J AGR FOOD CHEM

Lignin Biosynthesis Gene Expression Is Associated with Age-related Resistance of Winter Squash to Phytophthora capsici

Article

Full-text available

Sep 2023

The Oomycete plant pathogen, Phytophthora capsici , causes root, crown, and fruit rot of winter squash ( Cucurbita moschata ) and limits production. Some C. moschata cultivars develop age-related resistance (ARR), whereby fruit develop resistance to P. capsici 14 to 21 days postpollination (DPP) because of thickened exocarp; however, wounding negates ARR. We uncovered the genetic mechanisms of ARR of two C. moschata cultivars, Chieftain and Dickenson Field, that exhibit ARR at 14 and 21 DPP, respectively, using RNA sequencing. The sequencing was conducted using RNA samples from ‘Chieftain’ and ‘Dickenson Field’ fruit at 7, 10, 14, and 21 DPP. A differential expression and subsequent gene set enrichment analysis revealed an overrepresentation of upregulated genes in functional categories relevant to cell wall structure biosynthesis, cell wall modification/organization, transcription regulation, and metabolic processes. A pathway enrichment analysis detected upregulated genes in cutin, suberin monomer, and phenylpropanoid biosynthetic pathways. A further analysis of the expression profile of genes in those pathways revealed upregulation of genes in monolignol biosynthesis and lignin polymerization in the resistant fruit peel. Our findings suggest a shift in gene expression toward the physical strengthening of the cell wall associated with ARR to P. capsici . These findings provide candidate genes for developing Cucurbita cultivars with resistance to P. capsici and improve fruit rot management in Cucurbita species.

Identification of gene function based on models capturing natural variability of Arabidopsis thaliana lipid metabolism

Article

Full-text available

Aug 2023

Lipids play fundamental roles in regulating agronomically important traits. Advances in plant lipid metabolism have until recently largely been based on reductionist approaches, although modulation of its components can have system-wide effects. However, existing models of plant lipid metabolism provide lumped representations, hindering detailed study of component modulation. Here, we present the Plant Lipid Module (PLM) which provides a mechanistic description of lipid metabolism in the Arabidopsis thaliana rosette. We demonstrate that the PLM can be readily integrated in models of A. thaliana Col-0 metabolism, yielding accurate predictions (83%) of single lethal knock-outs and 75% concordance between measured transcript and predicted flux changes under extended darkness. Genome-wide associations with fluxes obtained by integrating the PLM in diel condition- and accession-specific models identify up to 65 candidate genes modulating A. thaliana lipid metabolism. Using mutant lines, we validate up to 40% of the candidates, paving the way for identification of metabolic gene function based on models capturing natural variability in metabolism.

Machine Learning-Assisted Approaches in Modernized Plant Breeding Programs

Article

Full-text available

Mar 2023

In the face of a growing global population, plant breeding is being used as a sustainable tool for increasing food security. A wide range of high-throughput omics technologies have been developed and used in plant breeding to accelerate crop improvement and develop new varieties with higher yield performance and greater resilience to climate changes, pests, and diseases. With the use of these new advanced technologies, large amounts of data have been generated on the genetic architecture of plants, which can be exploited for manipulating the key characteristics of plants that are important for crop improvement. Therefore, plant breeders have relied on high-performance computing, bioinformatics tools, and artificial intelligence (AI), such as machine-learning (ML) methods, to efficiently analyze this vast amount of complex data. The use of bigdata coupled with ML in plant breeding has the potential to revolutionize the field and increase food security. In this review, some of the challenges of this method along with some of the opportunities it can create will be discussed. In particular, we provide information about the basis of bigdata, AI, ML, and their related subgroups. In addition, the bases and functions of some learning algorithms that are commonly used in plant breeding, three common data integration strategies for the better integration of different breeding datasets using appropriate learning algorithms, and future prospects for the application of novel algorithms in plant breeding will be discussed. The use of ML algorithms in plant breeding will equip breeders with efficient and effective tools to accelerate the development of new plant varieties and improve the efficiency of the breeding process, which are important for tackling some of the challenges facing agriculture in the era of climate change.

Meta-analysis of the space flight and microgravity response of the Arabidopsis plant transcriptome

Article

Full-text available

Mar 2023

Spaceflight presents a multifaceted environment for plants, combining the effects on growth of many stressors and factors including altered gravity, the influence of experiment hardware, and increased radiation exposure. To help understand the plant response to this complex suite of factors this study compared transcriptomic analysis of 15 Arabidopsis thaliana spaceflight experiments deposited in the National Aeronautics and Space Administration's GeneLab data repository. These data were reanalyzed for genes showing significant differential expression in spaceflight versus ground controls using a single common computational pipeline for either the microarray or the RNA-seq datasets. Such a standardized approach to analysis should greatly increase the robustness of comparisons made between datasets. This analysis was coupled with extensive cross-referencing to a curated matrix of metadata associated with these experiments. Our study reveals that factors such as analysis type (i.e., microarray versus RNA-seq) or environmental and hardware conditions have important confounding effects on comparisons seeking to define plant reactions to spaceflight. The metadata matrix allows selection of studies with high similarity scores, i.e., that share multiple elements of experimental design, such as plant age or flight hardware. Comparisons between these studies then helps reduce the complexity in drawing conclusions arising from comparisons made between experiments with very different designs.

Genomic Designing for Nutraceuticals in Brassica juncea: Advances and Future Prospects

Chapter

Dec 2023

Genomic Designing for Nutraceuticals in Brassica juncea: Advances and Future Prospects

Chapter

Jul 2023

The Arabidopsis Information Resource (TAIR): A comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant

Article

Full-text available

Feb 2001
NUCLEIC ACIDS RES

Arabidopsis thaliana, a small annual plant belonging to the mustard family, is the subject of study by an estimated 7000 researchers around the world. In addition to the large body of genetic, physiological and biochemical data gathered for this plant, it will be the first higher plant genome to be completely sequenced, with completion expected at the end of the year 2000. The sequencing effort has been coordinated by an international collaboration, the Arabidopsis Genome Initiative (AGI). The rationale for intensive investigation of Arabidopsis is that it is an excellent model for higher plants. In order to maximize use of the knowledge gained about this plant, there is a need for a comprehensive database and information retrieval and analysis system that will provide user-friendly access to Arabidopsis information. This paper describes the initial steps we have taken toward realizing these goals in a project called The Arabidopsis Information Resource (TAIR) (www.arabidopsis.org).

The MetaCyc database

Article

Full-text available

Feb 2002
NUCLEIC ACIDS RES

MetaCyc is a metabolic-pathway database that describes 445 pathways and 1115 enzymes occurring in 158 organisms. MetaCyc is a review-level database in that a given entry in MetaCyc often integrates information from multiple literature sources. The pathways in MetaCyc were determined experimentally, and are labeled with the species in which they are known to occur based on literature references examined to date. MetaCyc contains extensive commentary and literature citations. Applications of MetaCyc include pathway analysis of genomes, metabolic engineering and biochemistry education. MetaCyc is queried using the Pathway Tools graphical user interface, which provides a wide variety of query operations and visualization tools. MetaCyc is available via the World Wide Web at http://ecocyc.org/ecocyc/metacyc.html, and is available for local installation as a binary program for the PC and the Sun workstation, and as a set of flatfiles. Contact metacyc-info{at}ai.sri.com for information on obtaining a local copy of MetaCyc.

The EcoCyc database

Article

Full-text available

Feb 2002
NUCLEIC ACIDS RES

EcoCyc is an organism-specific pathway/genome database that describes the metabolic and signal-transduction pathways of Escherichia coli, its enzymes, its transport proteins and its mechanisms of transcriptional control of gene expression. EcoCyc is queried using the Pathway Tools graphical user interface, which provides a wide variety of query operations and visualization tools. EcoCyc is available at http://ecocyc.org/.

Microarray analysis of chitin elicitation in Arabidopsis thaliana

Article

Sep 2002
MOL PLANT PATHOL

Summary Chitin oligomers, released from fungal cell walls by endochitinase, induce defence and related cellular responses in many plants. However, little is known about chitin responses in the model plant Arabidopsis. We describe here a large-scale characterization of gene expression patterns in Arabidopsis in response to chitin treatment using an Arabidopsis microarray consisting of 2375 EST clones representing putative defence-related and regulatory genes. Transcript levels for 71 ESTs, representing 61 genes, were altered three-fold or more in chitin-treated seedlings relative to control seedlings. A number of transcripts exhibited altered accumulation as early as 10 min after exposure to chitin, representing some of the earliest changes in gene expression observed in chitin-treated plants. Included among the 61 genes were those that have been reported to be elicited by various pathogen-related stimuli in other plants. Additional genes, including genes of unknown function, were also identified, broadening our understanding of chitin-elicited responses. Among transcripts with enhanced accumulation, one cluster was enriched in genes with both the W-box promoter element and a novel regulatory element. In addition, a number of transcripts had decreased abundance, encoding several proteins involved in cell wall strengthening and wall deposition. The chalcone synthase promoter element was identified in the upstream regions of these genes, suggesting that pathogen signals may suppress the expression of some genes. These data indicate that Arabidopsis should be an excellent model to elucidate the mechanisms of chitin elicitation in plant defence.

Creating the Gene Ontology Resource: Design and Implementation

Article

Aug 2001
GENOME RES

The Gene Ontology Consortium

The exponential growth in the volume of accessible biological information has generated a confusion of voices surrounding the annotation of molecular information about genes and their products. The Gene Ontology (GO) project seeks to provide a set of structured vocabularies for specific biological domains that can be used to describe gene products in any organism. This work includes building three extensive ontologies to describe molecular function, biological process, and cellular component, and providing a community database resource that supports the use of these ontologies. The GO Consortium was initiated by scientists associated with three model organism databases: SGD, the Saccharomyces Genome database; FlyBase, the Drosophila genome database; and MGD/GXD, the Mouse Genome Informatics databases. Additional model organism database groups are joining the project. Each of these model organism information systems is annotating genes and gene products using GO vocabulary terms and incorporating these annotations into their respective model organism databases. Each database contributes its annotation files to a shared GO data resource accessible to the public at http://www.geneontology.org/. The GO site can be used by the community both to recover the GO vocabularies and to access the annotated gene product data sets from the model organism databases. The GO Consortium supports the development of the GO database resource and provides tools enabling curators and researchers to query and manipulate the vocabularies. We believe that the shared development of this molecular annotation resource will contribute to the unification of biological information.

Chloroplasts of higher plants synthesize L-phenylalanine via L-arogenate

Article

Nov 1986

The specific enzymological route of L-phenylalanine biosynthesis has not been established in any higher plant system. The possible pathway routes that have been identified in microorganisms utilize either phenylpyruvate or L-arogenate as a unique intermediate. We now report the presence of arogenate dehydratase (which converts L-arogenate to L-phenylalanine) in cultured-cell populations of Nicotiana silvestris. Prephenate dehydratase (which converts prephenate to phenylpyruvate) was not detected. Arogenate dehydratase was also found in washed spinach chloroplasts, and these data add to emerging evidence in support of the existence in the plastidial compartment of a complete assembly of enzymes comprising aromatic amino acid biosynthesis. Arogenate dehydratase from tobacco and spinach were both specific for L-arogenate, inhibited by L-phenylalanine, and activated by L-tyrosine. Apparent Km values for L-arogenate (0.3 X 10(-3) M), pH optima (pH 8.5-9.5), and temperature optima for catalysis (32-34 degrees C) were also similar.

Terpenoid Secondary Metabolism in Arabidopsis thaliana: cDNA Cloning, Characterization, and Functional Expression of a Myrcene/(E)-??-Ocimene Synthase

Article

Apr 2000

The Arabidopsis genome project has recently reported sequences with similarity to members of the terpene synthase (TPS) gene family of higher plants. Surprisingly, several Arabidopsis terpene synthase-like sequences (AtTPS) share the most identity with TPS genes that participate in secondary metabolism in terpenoid-accumulating plant species. Expression of a putative Arabidopsis terpene synthase gene, designated AtTPS03, was demonstrated by amplification of a 392-bp cDNA fragment using primers designed to conserved regions of plant terpene synthases. Using the AtTPS03 fragment as a hybridization probe, a second AtTPS cDNA, designated AtTPS10, was isolated from a jasmonate-induced cDNA library. The partial AtTPS10 cDNA clone contained an open reading frame of 1665 bp encoding a protein of 555 amino acids. Functional expression of AtTPS10 in Escherichia coli yielded an active monoterpene synthase enzyme, which converted geranyl diphosphate (C(10)) into the acyclic monoterpenes beta-myrcene and (E)-beta-ocimene and small amounts of cyclic monoterpenes. Based on sequence relatedness, AtTPS10 was classified as a member of the TPSb subfamily of angiosperm monoterpene synthases. Sequence comparison of AtTPS10 with previously cloned monoterpene synthases suggests independent events of functional specialization of terpene synthases during the evolution of terpenoid secondary metabolism in gymnosperms and angiosperms. Functional characterization of the AtTPS10 gene was prompted by the availability of Arabidopsis genome sequences. Although Arabidoposis has not been reported to form terpenoid secondary metabolites, the unexpected expression of TPS genes belonging to the TPSb subfamily in this species strongly suggests that terpenoid secondary metabolism is active in the model system Arabidopsis.

Identification of Uncommon Plant Metabolites Based on Calculation of Elemental Compositions Using Gas Chromatography and Quadrupole Mass Spectrometry

Article

Sep 2000

Unknown compounds in polar fractions of Arabidopsis thaliana crude leaf extracts were identified on the basis of calculations of elemental compositions obtained from gas chromatography/low-resolution quadrupole mass spectrometric data. Plant metabolites were methoximated and silylated prior to analysis. All known peaks were used as internal references to construct polynomial recalibration curves from raw mass spectrometric data. Mass accuracies of 0.005 ± 0.003 amu and isotope ratio errors of 0.5 ± 0.3% (A + 1/A) and 0.3 ± 0.2% (A + 2/A), respectively, could be achieved. Both masses and isotope ratios were combined when the elemental compositions of unknown peaks were calculated. After calculation, compound identities were elucidated by searching metabolic databases, interpreting spectra, and, finally, by comparison with reference compounds. Sum formulas of more than 70 peaks were determined throughout single GC/MS chromatograms. Exact masses were confirmed by high-resolution mass spectrometric data. More than 15 uncommon plant metabolites were identified, some of which are novel in Arabidopsis, such as tartronate semialdehyde, citramalic acid, allothreonine, or glycolic amide.

The structure of the major anthocyanin in Arabidopsis thaliana

Article

Mar 2002
PHYTOCHEMISTRY

The major anthocyanin in the leaves and stems of Arabidopsis thaliana has been isolated and shown to be cyanidin 3-O-[2-O(2-O-(sinapoyl)-beta-D-xylopyranosyl)-6-O-(4-O-(beta-D-glucopyranosyl)-p-coumaroyl-beta-D-glucopyranoside] 5-O-[6-O-(malonyl) beta-D-glucopyranoside]. This anthocyanin is a glucosylated version of one of the anthocyanins found in the flowers of the closely related Matthiola incana.

Surviving in a sea of data: A survey of plant genome data resources and issues in building data management systems

Article

Feb 2002

Exponential growth of data, largely from whole-genome analyses, has changed the way biologists think about and handle data. Optimal use of these data requires effective methods to analyze and manage these data sets. Computers, software and the World Wide Web are now integral components of biological discovery. Understanding how information is obtained, processed and annotated in public databases allows researchers to effectively organize, analyze and export their own data into these databases. In this review we focus largely on two areas related to management of genomic data. We cite examples of resources available in the public domain and describe some of the software for data management systems currently available for plant research. In addition, we discuss a few concepts of data management from the perspective of an individual or group that wishes to provide data to the public databases, to use the information in the public databases more efficiently, or to develop a database to manage large data sets internally or for public access. These concepts include data descriptions, exchange format, curation, attribution, and database implementation.
