Mueller, L.A., Zhang, P. & Rhee, S.Y. AraCyc: A biochemical pathway database for Arabidopsis. Plant Physiol. 132, 453-460

AraCyc: A Biochemical Pathway Database for Arabidopsis¹
Lukas A. Mueller*, Peifen Zhang, and Seung Y. Rhee
The Arabidopsis Information Resource, Department of Plant Biology, Carnegie Institution of Washington,
260 Panama Street, Stanford, California 94305
AraCyc is a database containing biochemical pathways of Arabidopsis, developed at The Arabidopsis Information Resource (http://www.arabidopsis.org). The aim of AraCyc is to represent Arabidopsis metabolism as completely as possible with a user-friendly Web-based interface. It presently features more than 170 pathways that include information on compounds, intermediates, cofactors, reactions, genes, proteins, and protein subcellular locations. The database uses Pathway Tools software, which allows the users to visualize a bird’s eye view of all pathways in the database down to the individual chemical structures of the compounds. The database was built using Pathway Tools’ Pathologic module with MetaCyc, a collection of pathways from more than 150 species, as a reference database. This initial build was manually refined and annotated. More than 20 plant-specific pathways, including carotenoid, brassinosteroid, and gibberellin biosyntheses, have been added from the literature. A list of more than 40 plant pathways will be added in the coming months. The quality of the initial, automatic build of the database was compared with the manually improved version, and with EcoCyc, an Escherichia coli database using the same software system that has been manually annotated for many years. In addition, a Perl interface, PerlCyc, was developed that allows programmers to access Pathway Tools databases from the popular Perl language. AraCyc is available at the tools section of The Arabidopsis Information Resource Web site (http://www.arabidopsis.org/tools/aracyc).
The genome of the flowering plant Arabidopsis was the first plant genome to be fully sequenced (Arabidopsis Genome Initiative, 2000). Initially, approximately 26,000 genes were identified in the genomic sequence, based on different computational methods, and were assigned to functional categories (Arabidopsis Genome Initiative, 2000). About 9% of these genes have been studied experimentally (Arabidopsis Genome Initiative, 2000), and about 32% of all genes in Arabidopsis could not yet be assigned to any functional category (Reiser et al., 2002). From the initial annotation of the genome, it has been estimated that about 4,000 genes may be involved in metabolism (Arabidopsis Genome Initiative, 2000).
In this work, we used Pathway Tools software (Karp et al., 2002a) to build a database for Arabidopsis metabolism. The software allows automatic generation of pathway databases using functional assignment of genes and also allows manual editing of pathways through a graphical user interface. Although most of the functional annotations were derived computationally, we hypothesized that there was enough information to build an initial metabolism database, which could be used to facilitate manual literature curation of genes involved in metabolism.
The Pathway Tools software suite is a comprehensive system to identify, curate, store, and publish biochemical pathways on the Web in the form of pathway genome databases (PGDBs; Karp et al., 2002a). PGDBs contain the entire genomic information of an organism, including its metabolic compounds, reactions, biochemical pathways, enzymes, and enzyme complexes. There are three components in the Pathway Tools: (a) Pathologic, which allows a new PGDB to be built from data sets consisting essentially of gene annotations; (b) Pathway/Genome Editor, which allows pathways to be edited and new pathways to be added; and (c) Pathway/Genome Navigator, which allows users to query and browse the database, both locally and on the Web. The Pathologic analysis predicts the pathways of an organism using a reference PGDB from which pathways are extracted using a pathway-scoring algorithm (Paley and Karp, 2002). The reference PGDB used in this work is MetaCyc (http://metacyc.org; Karp et al., 2002b), a metabolic-pathway database that describes 449 curated pathways and 1,115 enzymes occurring in 158 organisms.
The Pathway Tools system has been applied extensively to annotate microbial genomes and has been optimized to a point where it exceeds expert analyses in comprehensiveness and matches expert analyses in accuracy (Paley and Karp, 2002). However, it was unknown how well it would handle a eukaryotic genome. The software had been applied previously to only one eukaryote, yeast. Because of this limited exposure to eukaryotic organisms, we expected a lower accuracy of the initial database build as compared with prokaryotic databases. A eukaryotic genome is not only more complex but also differs enormously in scale. A typical bacterium, such as Escherichia coli, contains 4.6 million bp of DNA and has some 4,392 genes; the E. coli pathway/genome database, EcoCyc, lists 164 different pathways and 914 enzymes. In comparison, Arabidopsis has a genome of 125 million bases and comprises more than 26,000 genes, which corresponds to roughly 27 times the amount of DNA and almost six times the number of genes in E. coli. The resulting pathway/genome database can therefore be expected to be many-fold more complex than EcoCyc. In addition, eukaryotes have subcellular compartments, many different cell types, and elaborate life cycles with a complex series of developmental stages, which qualitatively increases the complexity of biochemical processes.
In this paper, we describe how AraCyc was initially built, we compare the quality of the resulting database to the version of AraCyc that has been improved through manual verification and annotation, and we compare the overall quality of AraCyc to EcoCyc. We also describe what adaptations to the Pathway Tools software were necessary to better accommodate a eukaryotic organism.

¹ This work was supported by the National Science Foundation (grant no. DBI–9978564) and by the National Institutes of Health (grant no. R01–GM6546601).
* Corresponding author; e-mail mueller@acoma.stanford.edu; fax 650–325–6857.
Article, publication date, and citation information can be found at www.plantphysiol.org/cgi/doi/10.1104/pp.102.017236.
Plant Physiology, June 2003, Vol. 132, pp. 453–460, www.plantphysiol.org © 2003 American Society of Plant Biologists

RESULTS
Pathologic Analysis
The Pathologic module of Pathway Tools was run using Arabidopsis enzyme annotations that were obtained from the Arabidopsis sequencing project (Arabidopsis Genome Initiative, 2000), which were edited manually to remove extraneous words and characters that could interfere with the enzyme name-matching software. A total of about 6,000 genes were retained and formatted for input into Pathologic according to the Pathway Tools documentation (P. Karp and S. Paley, unpublished data). Pathologic recognizes enzyme functions using an enzyme name-matching program and a database of enzyme names and synonyms, and extracts corresponding pathways from the MetaCyc database using a pathway scoring algorithm (see Materials and Methods; Paley and Karp, 2002).
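
The name-matching and scoring steps can be illustrated with a minimal Python sketch. It is not the Pathologic implementation: the synonym table, the reaction and pathway identifiers, the gene names, the list of noise words, and the scoring rule (the fraction of a pathway's reactions that have at least one matched enzyme) are all assumptions made for illustration. The actual algorithm is described by Paley and Karp (2002).

```python
import re

# Hypothetical synonym table: normalized enzyme name -> reaction identifier.
ENZYME_SYNONYMS = {
    "hexokinase": "RXN-HEXOKINASE",
    "glucokinase": "RXN-HEXOKINASE",
    "pyruvate kinase": "RXN-PYRUVATE-KINASE",
}

# Hypothetical reference pathways: pathway name -> reactions it contains.
REFERENCE_PATHWAYS = {
    "glycolysis": ["RXN-HEXOKINASE", "RXN-PYRUVATE-KINASE", "RXN-ENOLASE"],
    "toy fermentation": ["RXN-PYRUVATE-KINASE", "RXN-LACTATE-DH", "RXN-ALCOHOL-DH"],
}

# Assumed list of extraneous words that would interfere with matching.
NOISE_WORDS = {"putative", "probable", "similar", "to", "like"}


def normalize(name):
    """Lower-case a free-text annotation and strip punctuation and noise words."""
    name = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(w for w in name.split() if w not in NOISE_WORDS)


def match_reactions(annotations):
    """Map each recognized reaction to the genes whose annotation matched it."""
    matched = {}
    for gene, description in annotations.items():
        reaction = ENZYME_SYNONYMS.get(normalize(description))
        if reaction is not None:
            matched.setdefault(reaction, []).append(gene)
    return matched


def score_pathways(matched):
    """Score each reference pathway by the fraction of its reactions matched."""
    return {
        name: sum(rxn in matched for rxn in reactions) / len(reactions)
        for name, reactions in REFERENCE_PATHWAYS.items()
    }


# Hypothetical input annotations (gene -> free-text description).
annotations = {
    "gene-A": "putative hexokinase",
    "gene-B": "Pyruvate kinase-like",
    "gene-C": "unknown protein",
}
matched = match_reactions(annotations)
scores = score_pathways(matched)
```

In this toy setting a pathway would be retained when its score exceeds some threshold; choosing that threshold is exactly the kind of decision that the manual curation described below revisits.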
Overall Statistics of the Initial Build
Pathologic recognized 1,858 enzymes for which it knew a defined function (roughly 7% of the total number of genes in the genome), and another 1,650 gene products (6.3% of the genome) were identified as putative enzymes (Table I). The putative enzymes comprised both enzymes annotated with generic names such as “kinase,” for which the precise function was unknown, and enzymes that were specific to plants and not in MetaCyc, such as “gibberellin oxidase.”
In total, AraCyc contained 173 pathways after the initial build (Table I), containing 767 enzymes and 1,132 reactions (or 750 unique reactions if the same reactions in different pathways are counted once). One or more enzymes were annotated to 611 (342 unique) reactions, whereas 521 reactions, or 45% (408 unique, or 54%), lacked enzyme annotations. There were thus 883 enzymes with a defined function that were not attributed to any pathway; this was the case for many generic enzymes such as cytochrome P450s, where the reaction is not specific enough to place it in a pathway.
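
The distinction between total and unique reaction counts can be made concrete with a short sketch. The pathway, reaction, and gene identifiers below are hypothetical toy data, not AraCyc content, and a reaction is assumed to lack an annotation when no gene is assigned to it.

```python
def reaction_stats(pathways, enzymes):
    """Count reactions in total (per pathway occurrence) and uniquely.

    pathways: dict mapping pathway name -> list of reaction ids
    enzymes:  dict mapping reaction id -> set of annotated genes
    """
    total = [rxn for reactions in pathways.values() for rxn in reactions]
    unique = set(total)
    missing_total = [rxn for rxn in total if not enzymes.get(rxn)]
    missing_unique = {rxn for rxn in unique if not enzymes.get(rxn)}
    return {
        "total": len(total),
        "unique": len(unique),
        "missing_total": len(missing_total),
        "missing_unique": len(missing_unique),
        "pct_missing_total": 100.0 * len(missing_total) / len(total),
        "pct_missing_unique": 100.0 * len(missing_unique) / len(unique),
    }


# Toy data: reaction r2 is shared by both pathways and has no enzyme.
toy_pathways = {"pathway A": ["r1", "r2", "r3"], "pathway B": ["r2", "r4"]}
toy_enzymes = {"r1": {"g1", "g2"}, "r3": {"g3"}}
stats = reaction_stats(toy_pathways, toy_enzymes)
```

Because a shared reaction is counted once per pathway in the total but only once overall in the unique count, the two percentages generally differ, as they do in Table I.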

Table I. Summary data for the AraCyc data sets and comparison with EcoCyc
The numbers of pathways, reactions, genes, and missing enzyme annotations in pathways are given for AraCyc, the initial build of AraCyc, and EcoCyc for comparison. AraCyc has approximately 42% of reactions with missing annotations, down from 45% for the initial build. EcoCyc has only approximately 7% missing annotations. Twenty-two pathways were deleted from the initial build, and 23 new ones were added. More pathways will be added in the future.

                                               AraCyc                  AraCyc Initial Build    EcoCyc
Pathways (excluding superpathways)             174                     173                     164
Reactions, total                               1,096                   1,132                   845
Reactions, unique                              833                     750                     706
Unique genes associated with pathways          958                     767                     695
Missing enzyme annotations, total              469 (42%)               521 (45%)               60 (7%)
Missing enzyme annotations, unique             403 (48%)               408 (54%)               52 (7%)
Genes per annotated reaction                   2.2                     2.2                     1.06
Overlapping pathways with EcoCyc               76                      79                      164
Reactions in overlapping set                   458 (403 unique)        467 (434 unique)        462
Missing enzyme annotations in overlapping set  164 (38%; 151 unique)   204 (43%; 186 unique)   11 (2.3%)
Pathways added manually since initial build    23
Pathways deleted from initial build            22

Statistics of Manual Editing of AraCyc
AraCyc has been manually edited since the automatic build. Curation includes deleting inappropriate pathways, adding missing pathways, or updating existing pathways (for more details, see Analysis of Pathways). Twenty-two pathways (or 12.7% of the original 173) were manually deleted from the original Pathway Tools analysis. Among these were low-scoring pathways with few enzymes annotated to them (6 pathways), pathways that were thought not to occur in plants (12 pathways), and close variants of other pathways in the database (4 pathways), which were merged. The complete list of deleted pathways is available on-line (http://www.Arabidopsis.org/tools/aracyc/aracyc.deleted.pathways.html). Five pathways in the database are questionable and are on hold, meaning that they may be deleted in the future. Deleting them would bring the total pathways deleted to 27 or 15% of the initial set. Twenty-three new pathways comprising 194 (185 unique) reactions with 212 gene annotations were added, containing 90 (86 unique) missing enzyme annotations (46%, 46% unique). Some pathways that were retrieved from MetaCyc were incomplete. Most notably, the pathway “isopentenyl diphosphate biosynthesis, mevalonate-independent” consisted of only two reactions. The pathway has been completed with four additional reactions (not all of the reactions in the pathway are currently known).
In total, the AraCyc database presently contains 174 pathways containing 1,096 reactions (833 unique). Of the reactions in the pathways, 469 (43%) are missing enzyme annotations (403 or 48% unique). A total of 958 genes were annotated to one or more reactions. The automatic building process missed mainly annotations of reactions catalyzed by large enzyme complexes with complex subunit compositions such as pyruvate dehydrogenase and ketoglutarate dehydrogenase. The reactions were missed not because Pathologic could not handle them, but rather because the enzyme names in the input files were not always accurately specified.
AraCyc contains an average of 2.2 genes per annotated reaction. This may seem to be a low number, just roughly twice the number of E. coli genes per reaction. However, there were big differences in the number of genes annotated per reaction among the different pathway categories. In Energy Metabolism, the average number was 3.3 genes/reaction, in Degradation 2.5, in Intermediary Metabolism 2.4, and in Biosynthesis 2.07. Between these categories, there were also large differences in the number of reactions that were lacking annotations, indicating that not all of the pathway categories are equally well understood: In the Energy Metabolism category, only 16.5% of reactions lacked annotations, compared with 29% in Biosynthesis, 41% in Intermediary Metabolism, and 58% in Degradation. Notably, the glycolysis pathway itself has no reactions lacking annotations and has an average of 5.1 genes annotated to a reaction. Some reactions in that pathway have more than a dozen annotated genes. In E. coli, glycolysis has an average of 1.6 and a maximum of three genes per reaction. This suggests that certain pathway categories, such as the Energy Metabolism category, have a higher potential degree of regulation than the other pathway categories.
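
The per-category figures quoted above are of two kinds: an average taken over annotated reactions only, and a percentage taken over all reactions. A minimal sketch of that bookkeeping, using invented gene counts rather than AraCyc data, could look as follows.

```python
from collections import defaultdict


def category_summary(reactions):
    """Summarize gene annotations per pathway category.

    reactions: iterable of (category, number_of_annotated_genes) pairs,
               one pair per reaction; 0 means no enzyme annotation.
    """
    grouped = defaultdict(list)
    for category, n_genes in reactions:
        grouped[category].append(n_genes)

    summary = {}
    for category, counts in grouped.items():
        annotated = [n for n in counts if n > 0]
        summary[category] = {
            # Average over annotated reactions only.
            "genes_per_annotated_reaction": (
                sum(annotated) / len(annotated) if annotated else 0.0
            ),
            # Percentage over all reactions in the category.
            "pct_lacking_annotation": 100.0 * (len(counts) - len(annotated)) / len(counts),
        }
    return summary


# Invented example counts for two categories.
toy_reactions = [
    ("Energy Metabolism", 5), ("Energy Metabolism", 3),
    ("Energy Metabolism", 0), ("Energy Metabolism", 4),
    ("Degradation", 0), ("Degradation", 2),
]
summary = category_summary(toy_reactions)
```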
Comparison with EcoCyc
To compare these benchmarks with a database that has been manually curated over a long period of time, we compared AraCyc with the EcoCyc database (Karp et al., 2002c). EcoCyc is a database specific for the metabolism of E. coli and has been manually curated since the mid-1990s. It contains 164 pathways (not counting super-pathways) comprising 845 reactions (706 unique). EcoCyc contains only 60 missing enzyme annotations (54 unique), which means that about 93% of all reactions have at least one enzyme annotation.
Pathways conserved between AraCyc and EcoCyc have a higher percentage of annotated reactions. We analyzed the 76 pathways (excluding super-pathways) that AraCyc shares with EcoCyc, and we found that they contain a total of 458 reactions (403 unique reactions), 164 of which were missing enzyme annotations (151 unique) in AraCyc. The percentage of missing enzyme annotations in these pathways is therefore only 36% in AraCyc, compared with 43% when all of the pathways in AraCyc are considered. In EcoCyc, these pathways contain only 11 missing annotations (10 unique) or 2%! The pathways that occur in both AraCyc and EcoCyc are therefore a subset that is much better described than the other pathways. A look at the pathways shows that they mostly belong to central, conserved metabolism and include pathways such as glycolysis and biosynthesis of amino acids. The pathways that occur in AraCyc but not in EcoCyc (98 pathways) had 636 reactions (518 unique) and 315 missing annotations (278 unique), or 49.5% (53% unique) missing enzyme annotations.
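
The comparison of shared and database-specific pathways amounts to splitting the pathway set by membership in the reference database and computing the missing-annotation percentage for each part. The sketch below shows this on toy data; the pathway and reaction names are placeholders, and only the two final percentages use counts quoted in the text.

```python
def split_by_overlap(pathways, reference_names):
    """Split pathways into those shared with a reference database and the rest."""
    shared = {n: r for n, r in pathways.items() if n in reference_names}
    specific = {n: r for n, r in pathways.items() if n not in reference_names}
    return shared, specific


def count_missing(pathways, annotated_reactions):
    """Return (missing, total) reaction counts, counting shared reactions per pathway."""
    total = sum(len(reactions) for reactions in pathways.values())
    missing = sum(
        1
        for reactions in pathways.values()
        for rxn in reactions
        if rxn not in annotated_reactions
    )
    return missing, total


def percent_missing(missing, total):
    return 100.0 * missing / total


# Toy data with placeholder reaction identifiers.
toy_db = {"glycolysis": ["r1", "r2"], "camalexin biosynthesis": ["r3", "r4"]}
toy_reference = {"glycolysis", "TCA cycle"}
toy_annotated = {"r1", "r2", "r3"}

shared, specific = split_by_overlap(toy_db, toy_reference)
shared_counts = count_missing(shared, toy_annotated)
specific_counts = count_missing(specific, toy_annotated)

# Percentages recomputed from the counts quoted in the text.
pct_shared = percent_missing(164, 458)     # pathways shared with EcoCyc
pct_specific = percent_missing(315, 636)   # pathways only in AraCyc
```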
Analysis of Experimental Evidence for Genes in AraCyc
To estimate how many of the annotations that were used to build the pathways were based solely on sequence similarity, we counted how many genes had a gene symbol synonym, taking a gene symbol as a rough indicator that the gene has been studied experimentally. A list of gene symbol aliases for each locus was obtained from The Arabidopsis Information Resource (TAIR) FTP site (ftp://ftp.arabidopsis.org/Genes/). Of the 958 unique genes currently annotated to pathways in AraCyc, only 155 had synonyms (16%). A large fraction of the gene annotations used in the Pathologic analysis are likely to be based on sequence similarity alone; the accuracy of the functional annotation based on sequence is difficult to estimate and can be verified only with future experimentation.
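
A count of this kind could be scripted along the following lines. The file layout (one locus and one symbol per tab-delimited line) and the locus identifiers are assumptions for illustration; the actual format of the TAIR alias file is not described here.

```python
import csv
import io


def loci_with_symbols(alias_lines):
    """Collect loci that have at least one non-empty gene symbol.

    alias_lines: iterable of tab-delimited lines, assumed to be 'locus<TAB>symbol'.
    """
    loci = set()
    for row in csv.reader(alias_lines, delimiter="\t"):
        if len(row) >= 2 and row[1].strip():
            loci.add(row[0].strip().upper())
    return loci


def fraction_with_symbol(pathway_genes, symbol_loci):
    """Fraction of pathway-annotated genes that carry a gene symbol."""
    genes = {g.upper() for g in pathway_genes}
    return len(genes & symbol_loci) / len(genes)


# Hypothetical alias file content and pathway gene list.
alias_text = "LOCUS0001\tABC1\nLOCUS0002\t\nLOCUS0003\tXYZ2\n"
symbol_loci = loci_with_symbols(io.StringIO(alias_text))
toy_fraction = fraction_with_symbol(
    ["locus0001", "LOCUS0002", "LOCUS0003", "LOCUS0004"], symbol_loci
)

# The proportion reported in the text: 155 of 958 genes.
reported_pct = 100.0 * 155 / 958
```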
Analysis of Pathways in AraCyc
The Pathway Tools classification hierarchy defines four categories of pathways at its top level: Biosynthesis, Intermediary Metabolism, Degradation, and Energy Metabolism. In AraCyc, these categories contain 73, 27, 50, and 15 pathways, respectively (Fig. 1). The Biosynthesis class contains the largest number of pathways, largely due to pathways that we added manually since the initial build. In comparison, the largest class in MetaCyc is Degradation. This is probably due to the many bacterial degradation pathways that have been characterized. Overall, however, the distribution of pathways among these classes is very similar in the curated version of AraCyc and in EcoCyc.
In the Biosynthesis category of AraCyc, all amino acid biosyntheses were inferred by Pathologic except the biosynthesis of Glu. The biosyntheses of Tyr and Phe that were inferred did not correspond to the plant versions; in plants, the biosynthesis of these amino acids is assumed to go through arogenate instead of prephenate (Jung et al., 1986). Noticeably missing were biosynthesis of phospholipids and the mevalonate pathway. The latter is important as a source of precursors for terpenoid biosynthesis and was copied from MetaCyc to AraCyc manually. MetaCyc also classifies phytoalexin, flavonoid, and mevalonate metabolism under the Fatty Acid and Lipids class. These three pathways were also inferred in AraCyc, but subsequently moved to the newly created Secondary Metabolite Biosynthesis class, which was added under Biosynthesis. Apart from the secondary metabolites (flavonoids, phytoalexins) inferred under the Fatty Acid and Lipids class, no plant secondary metabolite pathways were in MetaCyc, and none could therefore be identified by Pathologic. The following pathways have therefore been added manually: carotenoid biosynthesis, camalexin biosynthesis, and phenylpropanoid ester biosynthesis.
The Flavonoid biosynthetic pathway had to be modified extensively; the original pathway contained errors and was not very comprehensive. The phytoalexin pathway is almost an exact copy of that initial flavonoid pathway and will probably be deleted in the future. Chlorophyll biosynthesis was newly created under heme biosynthesis. Conspicuously, NAD biosynthesis was not inferred and was added manually from MetaCyc. Polyisoprenoid metabolism was moved to the Terpenoid Biosynthesis class under the Secondary Metabolites class. Both pyrimidine and purine biosyntheses were inferred correctly. In addition, under the newly created class Plant Hormone Biosynthesis, we added cytokinin, brassinosteroid, jasmonic acid, gibberellin, and abscisic acid biosynthesis pathways. The biosynthesis of ethylene was inferred correctly by Pathologic and moved to the Plant Hormones class.
The Energy Metabolism class contained 15 pathways, including glycolysis (2 instances, of which one [glycolysis 2] was deleted as a duplicate variant), the tricarboxylic acid cycle, and the Calvin cycle. In plants, glycolysis can use pyrophosphate instead of ATP for the phosphorylation of Fru. These plant-specific features will be added to AraCyc manually in the future. Several fermentation pathways were also inferred, most of which have been deleted due to insufficient evidence; the fermentation pathways usually contained some of the preceding glycolysis reactions that obviously had good evidence, but the actual fermentation reactions had no enzyme matches. The two fermentation pathways that were not deleted from the database, “Glc fermentation” and “anaerobic fermentation,” had good evidence for the fermentation-specific reactions. In general, fermentation reactions seem to be less well studied in Arabidopsis than other metabolic processes; some of the fermentation pathways we deleted due to insufficient evidence may have to be restored in the future as we learn more about them.
Intermediary Metabolism contained 27 pathways, including carnitine metabolism. Interestingly, although there is a carnitine metabolism pathway in MetaCyc, there is no carnitine biosynthetic pathway. Carnitine accumulates in many plants (Panter and Mudd, 1969), although its presence in Arabidopsis is uncertain. The 50 pathways in the Degradation class did not include the amino acid degradation pathways for Gln, His, Phe, and Pro. Pathologic found evidence for several pathways for xenobiotics degradation such as the pentachlorophenol degradation pathway. Most of these pathways are known to exist in certain bacteria but are unlikely in plants. Not all have been deleted from the database yet, because some contain a large number of enzyme annotations. These pathways could potentially be present in Arabidopsis but remain to be characterized; they could therefore represent pathways discovered by Pathologic.

Figure 1. Comparison of pathway distribution between AraCyc, AraCyc initial build, MetaCyc, and EcoCyc. The numbers of pathways in the different top-level classifications (Biosynthesis, Energy Metabolism, Intermediary Metabolism, and Degradation) are shown as pie charts. The major class in AraCyc is the biosynthesis class with 73 or 44.2% of pathways, up from 38.4% in the initial build. The major class in MetaCyc is the degradation class, probably reflecting the many bacterial degradation pathways that are known. The distribution within these classes in EcoCyc is, however, very similar to AraCyc.

Modification of the Controlled Vocabularies of Pathway Tools for AraCyc
Because the Pathway Tools software has been used primarily to describe metabolism of prokaryotic organisms, the descriptions of intracellular structures in the database were limited and had to be extended for use with Arabidopsis. The cellular compartment ontology consisted of only five different keywords: periplasm, membrane, inner-membrane, outer-membrane, and mitochondria. We extended this vocabulary to represent eukaryotic structures and plant organelles; it now comprises 35 terms, including chloroplast, the inner structures of the chloroplast, endoplasmic reticulum, nucleus, etc. The complete list can be found on-line (http://www.arabidopsis.org/tools/aracyc/intracellular.html). These modifications were also adopted by the MetaCyc database. In the future, it may be desirable to integrate the Gene Ontology (The Gene Ontology Consortium, 2001; http://www.geneontology.org) system into the Pathway Tools.
Another limitation of the Pathway Tools software is the lack of support for different tissues and developmental stages, for which TAIR has developed the necessary ontologies. For example, a pathway may be active only in a subset of tissues and/or at certain developmental stages, but this information cannot yet be captured in the database.
Modification of the Classification Hierarchies in Pathway Tools
We modified the chemical compound hierarchy to better accommodate plant metabolism, adding Plant Hormones and Secondary Metabolites as new classes.
PerlCyc
Pathway Tools is written in Lisp, a powerful language that is popular in the artificial intelligence community. The most popular language in biology is probably Perl, due to its simplicity and built-in string handling features such as regular expressions. To facilitate access to the internal Pathway Tools functions, for example for automated queries and batch-loading of data, we implemented a Perl module called perlcyc.pm. The module is available for download at http://www.arabidopsis.org/tools/aracyc/perlcyc. PerlCyc allows the user to write small programs in Perl that formulate more complex queries, such as: How many reactions have multiple enzyme annotations that include enzymes located in both the cytoplasm and the chloroplast?
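
The logic of such a query is sketched below in Python over plain in-memory dictionaries. This is deliberately not PerlCyc code and does not use any PerlCyc or Pathway Tools function names; the data structures, identifiers, and location terms are hypothetical stand-ins for what a real script would first retrieve from the database.

```python
def reactions_spanning(reaction_enzymes, enzyme_locations,
                       required=("cytoplasm", "chloroplast")):
    """Find reactions with several enzymes whose locations cover all required terms.

    reaction_enzymes: dict mapping reaction id -> list of enzyme ids
    enzyme_locations: dict mapping enzyme id -> set of compartment names
    """
    hits = []
    for reaction, enzymes in reaction_enzymes.items():
        if len(enzymes) < 2:
            continue  # the query asks for multiple enzyme annotations
        seen = set()
        for enzyme in enzymes:
            seen.update(enzyme_locations.get(enzyme, ()))
        if set(required) <= seen:
            hits.append(reaction)
    return sorted(hits)


# Hypothetical records standing in for data fetched from the database.
toy_reaction_enzymes = {
    "rxn-1": ["enzA", "enzB"],
    "rxn-2": ["enzC"],
    "rxn-3": ["enzA", "enzD"],
}
toy_enzyme_locations = {
    "enzA": {"cytoplasm"},
    "enzB": {"chloroplast"},
    "enzC": {"cytoplasm", "chloroplast"},
    "enzD": {"mitochondria"},
}
hits = reactions_spanning(toy_reaction_enzymes, toy_enzyme_locations)
```

Here only rxn-1 qualifies: rxn-2 has a single enzyme annotation, and the enzymes of rxn-3 do not include a chloroplast-located one.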
DISCUSSION
We have built a database for Arabidopsis metabolic pathways using the Pathway Tools software. The automatic build was edited by manual curation and addition of Arabidopsis-specific pathways. The database contains 174 pathways (excluding super-pathways) comprising more than 1,000 reactions and 958 different enzyme annotations. The database contains 822 metabolic compounds. Our aim is to represent the metabolism of Arabidopsis in AraCyc to the extent that it is known through ongoing manual curation efforts. AraCyc will make it possible to pinpoint the gaps in our understanding of Arabidopsis metabolism and will help researchers to fill them. The database is also a tool for the annotation of Arabidopsis biochemical enzymes, a resource for researchers who want to explore Arabidopsis metabolism, and a tool for teaching plant metabolism.
A noteworthy feature of Pathway Tools is the integrated expression viewer that allows expression data from microarray or DNA chip experiments to be visualized on the metabolic overview diagram (Fig. 2). In this example, we took data from a previously published microarray experiment (Arabidopsis Functional Genomics Consortium experiment no. 10615; Ramonell et al., 2002). A number of differentially expressed enzymes can clearly be distinguished. The expression viewer is also available through the TAIR Web site.
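
The idea behind the overlay can be sketched as follows: read a tab-delimited expression file, then assign each reaction a color on a scale spanning the observed expression range. The column layout, the averaging over a reaction's genes, and the blue-to-red scale are all assumptions for this illustration and are not taken from the Pathway Tools viewer.

```python
import csv
import io


def read_expression(lines):
    """Parse 'gene<TAB>value' lines, skipping headers and malformed rows."""
    values = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue
        try:
            values[row[0]] = float(row[1])
        except ValueError:
            continue  # header line or non-numeric value
    return values


def to_color(value, low, high):
    """Map a value linearly onto a blue (low) to red (high) hex color."""
    t = 0.5 if high == low else (value - low) / (high - low)
    t = min(1.0, max(0.0, t))
    red = int(255 * t)
    blue = int(255 * (1.0 - t))
    return f"#{red:02x}00{blue:02x}"


def color_reactions(reaction_genes, expression):
    """Color each reaction by the mean expression of its measured genes."""
    low, high = min(expression.values()), max(expression.values())
    colors = {}
    for reaction, genes in reaction_genes.items():
        measured = [expression[g] for g in genes if g in expression]
        if measured:
            colors[reaction] = to_color(sum(measured) / len(measured), low, high)
    return colors


# Hypothetical expression file and reaction-to-gene mapping.
expression_text = "gene\tlog_ratio\ngA\t-2.0\ngB\t2.0\ngC\t0.0\n"
expression = read_expression(io.StringIO(expression_text))
colors = color_reactions({"r1": ["gA"], "r2": ["gB"], "r4": ["gX"]}, expression)
```

Reactions whose genes have no measurement, such as r4 in the toy data, are simply left uncolored.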
A comparison with EcoCyc shows that the AraCyc database has many more reactions lacking annotations than EcoCyc. In EcoCyc, only 7% of reactions lack annotations, compared with 43% in AraCyc. This may also reflect the research priority in the Arabidopsis community to some extent. Other areas of research, such as development and disease resistance, seem to be studied more extensively in this organism than metabolism.
For primary plant metabolism, Arabidopsis should be an excellent model system for other dicots. For secondary metabolites, Arabidopsis can only be a model for the 36 secondary metabolites it has been shown to produce (Chapple et al., 1994). They fall into four classes: flavonoids, hydroxycinnamic acid esters, glucosinolates, and indole phytoalexins. Important classes of plant secondary metabolites such as alkaloids and terpene secondary metabolites have not yet been identified in Arabidopsis. Whether these compounds are not produced by Arabidopsis or have not yet been detected is an open question. The Arabidopsis genomic sequence reveals sequences homologous to enzymes that are involved in the biosynthesis of terpenes and alkaloids (Arabidopsis Genome Initiative, 2000). Recently, two myrcene/(E)-β-ocimene synthases have been cloned from Arabidopsis that had enzymatic activity when expressed in E. coli (Bohlmann et al., 2000). For the biosynthesis of alkaloids, there are 10 enzymes annotated as “berberine bridge enzyme” in Arabidopsis, although it is not known if they are expressed or can form active enzymes. Conversely, not all pathways that are known to operate in Arabidopsis are well characterized. In the biosynthetic pathway of camalexin, the major indole phytoalexin in Arabidopsis, only one gene has been cloned, and even the precursor molecule is uncertain. Clearly, more research is needed to define the metabolic complement of Arabidopsis.
How complete is AraCyc now and when will it be finished? One way to estimate the completeness is to compare the number of estimated metabolic enzymes to the number of enzymes stored in AraCyc. It has been estimated that approximately 4,000 enzymes are involved in metabolism in Arabidopsis (Arabidopsis Genome Initiative, 2000). However, this number should be considered as an upper limit, because it is likely to include kinases, phosphatases, etc., that are specific for proteins and not for small metabolite metabolism. In AraCyc, Pathologic identified about 1,850 enzymes with a defined biochemical function and a further 1,650 probable enzymes (again, most of which were annotated to imprecise functions such as “kinase,” which may not be specific to small molecule metabolism), for a total of 3,500 enzymes. Presently, there are 958 different enzymes annotated to one or more pathways. Hence, AraCyc could presently be considered one-fourth complete to the extent of what is known. Considering that roughly one-half of the reactions do not have annotations, just filling in the missing reactions should bring completeness to one-half (assuming that the average number of enzymes per reaction is similar for the missing annotations). The rest of the enzymes would probably be in pathways that are not yet in AraCyc. Again assuming that these additional pathways have a distribution of reactions and annotations similar to the present ones, the complete AraCyc database reflecting the current knowledge would have an upper limit of just more than 300 pathways.
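
The back-of-the-envelope estimate above can be reproduced in a few lines. The function below only restates the reasoning of the paragraph; treating the annotated fraction of reactions as exactly one-half is the same rounding the text makes.

```python
def completeness_estimate(enzymes_in_pathways, enzymes_identified,
                          pathways_now, fraction_reactions_annotated):
    """Return (current completeness, completeness after filling gaps, projected pathways)."""
    # Share of identified enzymes already placed in a pathway.
    current = enzymes_in_pathways / enzymes_identified
    # Filling unannotated reactions scales the enzyme count up proportionally,
    # assuming a similar number of enzymes per reaction.
    after_filling = current / fraction_reactions_annotated
    # Remaining enzymes are assumed to sit in new pathways of similar size.
    projected_pathways = pathways_now / after_filling
    return current, after_filling, projected_pathways


current, after_filling, projected = completeness_estimate(
    enzymes_in_pathways=958,
    enzymes_identified=3500,
    pathways_now=174,
    fraction_reactions_annotated=0.5,
)
```

With the numbers from the text, the current completeness comes out at about 27%, rises to about 55% once the unannotated reactions are filled in, and the projected total is roughly 318 pathways, consistent with "just more than 300."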
Figure 2. Overlaying expression data on the overview diagram. The overview diagram gives a bird’s eye view of all of the pathways in the database. The pathways are shown as glyphs consisting of nodes, which represent the metabolites, and lines, which represent the reactions. Expression data can be uploaded as a simple tab-delimited file. The lines representing the reactions are painted in a color relative to the expression level, with a dynamically generated scale depicted on the right side of the screen. For this example, data from a published data set (Ramonell et al., 2002), downloaded from the Arabidopsis Functional Genomics Consortium site, were used.

In another attempt to estimate completeness, we compared the compounds in AraCyc with compounds identified in a metabolic profiling experiment. In the experiment, which analyzed the metabolites found in leaves, hundreds of compounds were resolved, of which 94 could be identified (Fiehn et al., 2000). Of these 94 compounds, 24 were not found in AraCyc. This seems like a large fraction considering that the metabolic profiling experiment identified relatively simple low M_r compounds and did not detect the many complex molecules that are present in plant cells. However, 15 of the 94 compounds were classified by the authors as uncommon plant metabolites that had never before been seen in plants. These 15 compounds were all part of the 24 missing compounds; the nine remaining compounds, which included mostly sugars that are likely involved in cell wall biosynthesis, point us to pathways that we will have to add to AraCyc in the near future. Additional profiling experiments will be a great help in verifying AraCyc completeness in the future, when the technology will allow more compounds to be identified.
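
The comparison with the profiling data is essentially a set difference followed by a split into uncommon and common metabolites. A minimal sketch with placeholder compound names (not the compounds of Fiehn et al., 2000) is given below.

```python
def compare_compounds(profiled, database, uncommon):
    """Split profiled compounds that are absent from the database.

    profiled: set of compound names identified in the profiling experiment
    database: set of compound names present in the pathway database
    uncommon: set of profiled compounds flagged as uncommon metabolites
    """
    missing = profiled - database
    return {
        "missing": missing,
        "missing_uncommon": missing & uncommon,
        # These point to pathways that still need to be added.
        "missing_common": missing - uncommon,
    }


# Placeholder compound names for illustration only.
toy_profiled = {"sucrose", "glucose", "arabinose", "compound-x"}
toy_database = {"sucrose", "glucose"}
toy_uncommon = {"compound-x"}
result = compare_compounds(toy_profiled, toy_database, toy_uncommon)

# Counts reported in the text: 24 missing, 15 of them uncommon.
remaining_common = 24 - 15
```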
In the coming months, we will add approximately 30 pathways (refer to http://www.Arabidopsis.org/tools/aracyc), with a focus on carbohydrate and lipid biosynthesis, bringing the total of pathways to more than 200. Of course, many pathways are presently in a canonical form and will have to be extended to reflect the peculiarities of Arabidopsis metabolism. For example, the genes that are known to be involved in the biosynthesis of anthocyanin pigments, which is relatively well-studied in Arabidopsis, account for the biosynthesis of cyanidin 3-glucoside. The major anthocyanin in Arabidopsis, however, has been shown to be cyanidin (3-O-[2-O(2-O-(sinapoyl-β-D-xylopyranosyl)-6-O-(4-O-(β-D-glucopyranosyl)-p-coumaroyl-β-D-glucopyranoside] 5-O-[6-O-(malonyl) β-D-glucopyranoside]; Bloor and Abrahams, 2002), which is a long way from cyanidin 3-glucoside.
CONCLUSIONS
AraCyc still has a way to go to be on a par with
databases such as EcoCyc in annotation quality. This
may be due, at least in part, to the fact that the
metabolism of Arabidopsis is not as well described as
that of E. coli. The Pathway Tools system
has, however, permitted us to construct a relatively
high-quality, comprehensive database of Arabidopsis
metabolism in a short time, forming an excellent
basis for further refinement through manual corrections,
curation, and experimentation. Pathways that
are added to AraCyc are also added to the MetaCyc
database, so that these pathways will be available for
future database builds for other plant species. AraCyc
is available on the TAIR Web site (Huala et al.,
2001).
MATERIALS AND METHODS
Pathway Tools Installation
The Pathway Tools software was downloaded from the Web via a link
provided by SRI International (Menlo Park, CA). For information on obtaining
Pathway Tools, contact ptools-info@ai.sri.com. The installation was
performed according to the instructions provided. The hardware used consisted
of a SunBlade 100 workstation from Sun Microsystems (Palo Alto, CA),
running Solaris 8. Pathway Tools can be run with an Oracle database
backend or using flat files for data persistence. In this work, the flat
file-based version was used. The two modes of operation are completely
transparent to the user. The flat file version is easier to install and is cheaper
because it does not require the purchase of an Oracle license.
Initial Build of AraCyc
The steps in generating AraCyc are described below and are outlined
as a flowchart in Figure 3.
Input Files
The Institute for Genomic Research's Arabidopsis genome annotation
data (http://www.tigr.org) were manually edited to include only enzyme
names. Enzymes labeled as "putative" or "similar to" were also included in
the data set. Any string that might interfere with the enzyme name-matching
algorithm of Pathologic was removed. These strings included
descriptions of subcellular locations or gene names following the enzyme
name. The edited list was then formatted into a Pathologic-specific file
format, which requires one file per chromosome describing its genes and
one file describing the number and nature of the chromosomes (such as
whether each chromosome is circular or linear; P. Karp and S. Paley,
unpublished data). Only nuclear-encoded genes were included in the data set.
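The editing described above was done manually. Purely as an illustration of the kind of cleanup involved, the Python sketch below strips a subcellular location and a trailing gene name from an annotation string. The list of location words, the regular expressions, and the function name are hypothetical choices for this example and do not reproduce the rules actually applied to the annotation data.

```python
import re

# Hypothetical list of location words; a real cleanup would need a
# curated vocabulary drawn from the annotation data itself.
_LOCATION_RE = re.compile(
    r"[,;]?\s*\(?\b(?:chloroplast|mitochondrial|cytosolic|peroxisomal)\b"
    r"(?:\s+precursor)?\)?",
    re.IGNORECASE,
)

# Assumed convention: a short gene symbol in parentheses at the end of the
# description, e.g. "(ADH1)". Caution: this would also remove a cofactor
# annotation such as "(NADP)", so the output still needs manual review.
_TRAILING_GENE_RE = re.compile(r"\s*\([A-Z][A-Za-z0-9]{1,7}\)\s*$")

def clean_enzyme_name(description):
    """Return the description with location and trailing gene name removed.

    Qualifiers such as "putative" are left in place, because such
    entries were kept in the data set.
    """
    name = description.strip()
    name = _TRAILING_GENE_RE.sub("", name)
    name = _LOCATION_RE.sub("", name)
    # Collapse leftover whitespace and stray punctuation at the ends.
    return " ".join(name.split()).strip(" ,;")
```

For example, `clean_enzyme_name("citrate synthase, mitochondrial precursor (CSY4)")` yields `"citrate synthase"` under these assumed patterns.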
Running Pathologic
Pathologic imports the genes and proteins described by the input files
into a new database that is structured using the Pathway Tools schema and
then matches the enzymes listed in the annotated genome against the
enzymes required by every pathway in a reference pathway database,
MetaCyc (http://metacyc.org; Karp et al., 2002b). The program assesses the
pathways using a pathway-scoring algorithm, and only those pathways with
significant scores are imported into the new PGDB. The scoring and pathway
import algorithms have been described elsewhere (Paley and Karp,
2002).
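The general idea of scoring reference pathways against the enzymes found in a genome can be sketched as follows. This is a deliberately simplified stand-in, not the Pathologic algorithm (for which see Paley and Karp, 2002): it assumes that a pathway's score is simply the fraction of its reactions with at least one matched enzyme, and the 0.5 cutoff, the function names, and the data layout are all hypothetical.

```python
def score_pathway(pathway_reactions, enzymes_by_reaction):
    """Fraction of a pathway's reactions that have at least one matched enzyme.

    pathway_reactions: list of reaction ids in the reference pathway.
    enzymes_by_reaction: dict mapping reaction id -> list of matched genes.
    """
    if not pathway_reactions:
        return 0.0
    matched = sum(1 for rxn in pathway_reactions if enzymes_by_reaction.get(rxn))
    return matched / len(pathway_reactions)

def select_pathways(reference, enzymes_by_reaction, threshold=0.5):
    """Keep reference pathways whose score reaches the (assumed) threshold.

    reference: dict mapping pathway name -> list of reaction ids.
    Returns a dict mapping each retained pathway name to its score.
    """
    selected = {}
    for name, reactions in reference.items():
        score = score_pathway(reactions, enzymes_by_reaction)
        if score >= threshold:
            selected[name] = score
    return selected
```

A plain fraction ignores, for instance, whether a matched enzyme is unique to the pathway or shared with many others, which is one reason a production scoring scheme is more involved than this sketch.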
Figure 3. Building AraCyc. AraCyc was built using a selection of The
Institute for Genomic Research gene models that were annotated as
enzymes or putative enzymes. These annotations were formatted into
a Pathologic-specific format according to the documentation for Pathway
Tools (P. Karp and S. Paley, unpublished data) and then analyzed
with Pathologic, using MetaCyc as a reference database. The resulting
database, AraCyc initial build, was then manually curated, resulting in
AraCyc.
Pathologic generates reports that summarize the amount of evidence
supporting each pathway predicted to be present in the new PGDB and that
list the "pathway holes," i.e. the enzymes missing from each predicted
pathway. This information can help the curator decide which of the
pathways imported by Pathologic should be kept in the database.
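A report of pathway holes of the kind described above could look roughly like the output of the following Python sketch. The report layout, the function names, and the input dictionaries are assumptions for illustration; the actual Pathologic reports are not reproduced here.

```python
def pathway_holes(pathway_reactions, enzymes_by_reaction):
    """Return the reactions of a pathway for which no enzyme was matched."""
    return [rxn for rxn in pathway_reactions if not enzymes_by_reaction.get(rxn)]

def hole_report(reference, enzymes_by_reaction):
    """Build one tab-separated summary line per pathway, sorted by name.

    reference: dict mapping pathway name -> list of reaction ids.
    enzymes_by_reaction: dict mapping reaction id -> list of matched genes.
    """
    lines = []
    for name in sorted(reference):
        reactions = reference[name]
        holes = pathway_holes(reactions, enzymes_by_reaction)
        found = len(reactions) - len(holes)
        hole_text = ", ".join(holes) if holes else "none"
        lines.append(f"{name}\t{found}/{len(reactions)} reactions with enzymes\tholes: {hole_text}")
    return lines
```

A curator could sort such lines by the number of holes to decide which predicted pathways to examine first.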
Modifying the Object Class Structures
The Pathway Tools classification hierarchy for biosynthetic pathways
and chemical compounds was modified to accommodate plant pathways
using the built-in editing tool in Pathway Tools, GKB-Editor. In addition,
some attributes, such as the subcellular location attribute of enzymes, were
modified for use with eukaryotic and plant cells, using the GKB-Editor.
Manual Annotation
The manual curation process includes both editing existing pathways
and adding new pathways. Information from the literature is collected and
added to the pathway, reaction, compound, enzyme, and gene frames. For
a pathway, we add a summary of what it does and a short description of its
significance. Regarding reactions, we add, if known, EC number, free energy,
whether the reaction is novel or hypothetical, and whether it is
spontaneous in vitro or in vivo. For compounds, chemical structures are
added if they are not already in the database. For enzymes, subcellular
location, native M_r, subunit composition, subunit M_r, known cofactors,
activators, and inhibitors are added. The K_m, K_i, optimum pH, and optimum
temperature of an enzyme are added if known. If an enzyme is a complex
of multiple subunits, comments on the role of each subunit are added. For
enzyme isoforms, we capture the substrate specificity, tissue/cell type, and
developmental stage specificity. Genes are linked to TAIR locus detail pages
by their locus identification. Finally, synonyms of pathways, reactions,
enzymes, genes, and compounds are added, and literature citations are
provided by entering PubMed identification.
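To make the list of enzyme attributes above more concrete, the following Python sketch models a curated enzyme record and reports which attributes are still unfilled. The class and field names are invented for this example and are not the slot names of the Pathway Tools schema; the units in the comments are likewise assumptions.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional

@dataclass
class EnzymeAnnotation:
    """Hypothetical record of the enzyme attributes collected during curation."""
    name: str
    locus_id: str                                # locus identifier used to link to TAIR
    subcellular_location: Optional[str] = None
    native_mr: Optional[float] = None            # native relative molecular mass
    subunit_composition: Optional[str] = None
    subunit_mr: Optional[float] = None
    cofactors: List[str] = field(default_factory=list)
    activators: List[str] = field(default_factory=list)
    inhibitors: List[str] = field(default_factory=list)
    km: Optional[float] = None                   # units left to the curator (assumption)
    ki: Optional[float] = None
    optimum_ph: Optional[float] = None
    optimum_temperature: Optional[float] = None  # degrees Celsius assumed
    synonyms: List[str] = field(default_factory=list)
    pubmed_ids: List[str] = field(default_factory=list)

    def missing_fields(self):
        """Names of attributes that are still empty (None or an empty list)."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None or value == []:
                missing.append(f.name)
        return missing
```

Keeping unknown values as `None` rather than as placeholder text mirrors the "if known" wording above: an attribute is recorded only when the literature supports it.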
ACKNOWLEDGMENTS
We thank Peter Karp, Suzanne Paley, John Pick, Cindy Krieger, and Pepe
Romero from SRI for their help in carrying out this work and Peter Karp,
Leonore Reiser, and Margarita Garcia-Hernandez for critically reading the
manuscript.
This is Carnegie Institution of Washington, DPB, publication no. 1623.
Received November 5, 2002; returned for revision December 11, 2002;
accepted February 7, 2003.
LITERATURE CITED
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of
the flowering plant Arabidopsis thaliana. Nature 408: 796–815
Bloor SJ, Abrahams S (2002) The structure of the major anthocyanin in
Arabidopsis thaliana. Phytochemistry 59: 343–346
Bohlmann J, Martin D, Oldham NJ, Gershenzon J (2000) Terpenoid secondary
metabolism in Arabidopsis thaliana: cDNA cloning, characterization,
and functional expression of a myrcene/(E)-beta-ocimene synthase.
Arch Biochem Biophys 375: 261–269
Chapple C, Shirley B, Zook M, Hammerschmidt R, Somerville S (1994)
Secondary metabolism in Arabidopsis. In E Meyerowitz, C Somerville, eds,
Arabidopsis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor,
NY, pp 989–1030
Fiehn O, Kopka J, Trethewey RN, Willmitzer L (2000) Identification of
uncommon plant metabolites based on calculation of elemental compositions
using gas chromatography and quadrupole mass spectrometry.
Anal Chem 72: 3573–3580
Gene Ontology Consortium (2001) Creating the gene ontology resource:
design and implementation. Genome Res 11: 1425–1433
Huala E, Dickerman A, Garcia-Hernandez M, Weems D, Reiser L, LaFond
F, Hanley D, Kiphart D, Zhuang J, Huang W et al. (2001) The Arabidopsis
Information Resource (TAIR): a comprehensive database and
Web-based information retrieval, analysis, and visualization system for a
model plant. Nucleic Acids Res 29: 102–105
Jung E, Zamir LO, Jensen RA (1986) Chloroplasts of higher plants synthesize
L-phenylalanine via L-arogenate. Proc Natl Acad Sci USA 83:
7231–7235
Karp P, Paley S, Romero P (2002a) The Pathway Tools software. Bioinformatics
Suppl 1 18: S225–S232
Karp PD, Riley M, Paley SM, Pellegrini-Toole A (2002b) The MetaCyc
Database. Nucleic Acids Res 30: 59–61
Karp PD, Riley M, Saier M, Paulsen IT, Collado-Vides J, Paley SM,
Pellegrini-Toole A, Bonavides C, Gama-Castro S (2002c) The EcoCyc
Database. Nucleic Acids Res 30: 56–58
Paley SM, Karp PD (2002) Evaluation of computational metabolic-pathway
predictions for Helicobacter pylori. Bioinformatics 18: 715–724
Panter RA, Mudd JB (1969) Carnitine levels in some higher plants. FEBS
Lett 5: 169–170
Ramonell K, Zhang B, Ewing R, Chen Y, Xu D, Stacey G, Somerville S
(2002) Microarray analysis of chitin elicitation in Arabidopsis thaliana. Mol
Plant Pathol 3: 301–311
Reiser L, Mueller LA, Rhee SY (2002) Surviving in a sea of data: a survey
of plant genome data resource and issues in building data management
systems. Plant Mol Biol 48: 59–74
... We have assembled an integrated resource of plant signalling, S tress K nowledge M ap (SKM, https://skm.nib.si ), that provides a single, up-to-date entrypoint for plant response investigations. SKM integrates knowledge on plant molecular interactions and stress specific responses from a wide diversity of sources, combining recent discoveries from journal articles with knowledge already existing in resources such as KEGG (Kanehisa et al ., 2016), STRING (Szklarczyk et al. , 2023), MetaCyc (Caspi et al ., 2016), and AraCyc (Mueller et al. , 2003). SKM extends other aggregated resources (listed in Supplementary (Podpečan et al. , 2019), and ConsensusPathDB (Herwig et al. , 2016), in that it allows conversion of biochemical knowledge to diverse perception into a cellular response, resulting in activation of processes which execute protection against stress (Layer 4). ...
... The majority of these interactions were compiled from peer-reviewed manuscripts with targeted experimental methodology, giving them a high degree of confidence. PSS also contains relevant signalling associated pathways from KEGG (Kanehisa et al. , 2016) and AraCyc (Mueller et al ., 2003). . CC-BY 4.0 International license perpetuity. ...
... These relationships are annotated with the subcellular location and the form of the participant when involved in the reaction (e.g. 'cytoplasm' or 'nucleus' and 'gene' or 'protein').Where applicable, nodes are annotated with their provenance (e.g. a DOI) and additional information such as biological pathways, gene identifiers, descriptions and annotations (TAIR(Berardini et al ., 2015), GoMapMan(Ramšak et al ., 2014)), references to external resources (DOI, PubMed, KEGG(Kanehisa et al ., 2016), MetaCyc(Caspi e t al ., 2016), AraCyc(Mueller et al ., 2003), and ChEBI(Hastings et al. , 2016)), and explanatory statements (such as a quote from the article and the experimental techniques used in the original experiments) ...
Preprint
Full-text available
Stress Knowledge Map (SKM, https://skm.nib.si) is a publicly available resource containing two complementary knowledge graphs describing current knowledge of biochemical, signalling, and regulatory molecular interactions in plants: a highly curated model of plant stress signalling (PSS, 543 reactions) and a large comprehensive knowledge network (CKN, 488,390 interactions). Both were constructed by domain experts through systematic curation of diverse literature and database resources. SKM provides a single entrypoint for plant stress response investigations and the related growth tradeoffs. SKM provides interactive exploration of current knowledge. PSS is also formulated as qualitative and quantitative models for systems biology, and thus represents a starting point of a plant digital twin. Here, we describe the features of SKM and show, through two case studies, how it can be used for complex analyses, including systematic hypothesis generation, design of validation experiments, or to gain new insights into experimental observations in plant biology.
... To integrate quantitative results of identified peptides and metabolites, we employed three free online multiomics-compatible resources, namely, MetaboAnalyst, AraCyc, and PaintOmics. Whilst AraCyc [60] had been used to map identified biomarkers from multiomics plant experiments [61,62], to our knowledge, MetaboAnalyst joint-pathway analysis and PaintOmics [63] have never been applied to plant datasets. The joint pathway analysis module of MetaboAnalyst [64] simultaneously analyzed gene products and metabolites (KEGG or HMDB) of interest within the context of metabolic pathways. ...
Article
Full-text available
Safflower (Carthamus tinctorius L.) is an ancient oilseed crop of interest due to its diversity of end-use industrial and food products. Proteomic and metabolomic profiling of its organs during seed development, which can provide further insights on seed quality attributes to assist in variety and product development, has not yet been undertaken. In this study, an integrated proteome and metabolic analysis have shown a high complexity of lipophilic proteins and metabolites differentially expressed across organs and tissues during seed development and petal wilting. We demonstrated that these approaches successfully discriminated safflower reproductive organs and developmental stages with the identification of 2179 unique compounds and 3043 peptides matching 724 unique proteins. A comparison between cotyledon and husk tissues revealed the complementarity of using both technologies, with husks mostly featuring metabolites (99%), while cotyledons predominantly yielded peptides (90%). This provided a more complete picture of mechanisms discriminating the seed envelope from what it protected. Furthermore, we showed distinct molecular signatures of petal wilting and colour transition, seed growth, and maturation. We revealed the molecular makeup shift occurring during petal colour transition and wilting, as well as the importance of benzenoids, phenylpropanoids, flavonoids, and pigments. Finally, our study emphasizes that the biochemical mechanisms implicated in the growing and maturing of safflower seeds are complex and far-reaching, as evidenced by AraCyc, PaintOmics, and MetaboAnalyst mapping capabilities. This study provides a new resource for functional knowledge of safflower seed and potentially further enables the precision development of novel products and safflower varieties with biotechnology and molecular farming applications.
... The single organism golden dataset consists of six Tier 1 PGDBs from BioCyc which are EcoCyc(v21) [28], HumanCyc(v19.5) [4], AraCyc(v18.5) [29], YeastCyc(v19.5), LeishCyc(v19.5) ...
Article
Full-text available
Background Metabolic pathway prediction is one possible approach to address the problem in system biology of reconstructing an organism’s metabolic network from its genome sequence. Recently there have been developments in machine learning-based pathway prediction methods that conclude that machine learning-based approaches are similar in performance to the most used method, PathoLogic which is a rule-based method. One issue is that previous studies evaluated PathoLogic without taxonomic pruning which decreases its performance. Results In this study, we update the evaluation results from previous studies to demonstrate that PathoLogic with taxonomic pruning outperforms previous machine learning-based approaches and that further improvements in performance need to be made for them to be competitive. Furthermore, we introduce mlXGPR, a XGBoost-based metabolic pathway prediction method based on the multi-label classification pathway prediction framework introduced from mlLGPR. We also improve on this multi-label framework by utilizing correlations between labels using classifier chains. We propose a ranking method that determines the order of the chain so that lower performing classifiers are placed later in the chain to utilize the correlations between labels more. We evaluate mlXGPR with and without classifier chains on single-organism and multi-organism benchmarks. Our results indicate that mlXGPR outperform other previous pathway prediction methods including PathoLogic with taxonomic pruning in terms of hamming loss, precision and F1 score on single organism benchmarks. Conclusions The results from our study indicate that the performance of machine learning-based pathway prediction methods can be substantially improved and can even outperform PathoLogic with taxonomic pruning.
... A common set of essential or conserved "primary" plant metabolites was determined by carefully analyzing the published reaction network and metabolome of several diverse, but well-studied plants, including Arabidopsis thaliana, Solanum lycopersicum (tomato), and Oryza sativa (rice). This process involved comparing and consolidating the metabolite/pathway information found in AraCyc, 49 PlantCyc, 50 the Plant Metabolic Network, 50 and various Kyoto Encyclopedia of Genes and Genomes plant pathway collections. 51 This information, along with the recently published cannabis genome sequence, 52 and other publicly available plant metabolite, protein, and pathway data from PathBank 37 and UniProt 53 (which provided additional details on plant lipids) were used to generate a genomescale compilation of highly conserved or "expected" cannabis metabolites. ...
... A. thalianas pathways related to cell wall structure/composition were downloaded from AraCyc version 15.0 (Mueller et al. 2003). The pathways include cuticular wax biosynthesis, cutin biosynthesis, long-chain fatty acid activation, suberin monomers biosynthesis, esterified suberin biosynthesis, cellulose biosynthesis, homogalacturonan biosynthesis, xylogalacturonan biosynthesis, phenylpropanoid biosynthesis, and xylan biosynthesis. ...
Article
Full-text available
The Oomycete plant pathogen, Phytophthora capsici , causes root, crown, and fruit rot of winter squash ( Cucurbita moschata ) and limits production. Some C. moschata cultivars develop age-related resistance (ARR), whereby fruit develop resistance to P. capsici 14 to 21 days postpollination (DPP) because of thickened exocarp; however, wounding negates ARR. We uncovered the genetic mechanisms of ARR of two C. moschata cultivars, Chieftain and Dickenson Field, that exhibit ARR at 14 and 21 DPP, respectively, using RNA sequencing. The sequencing was conducted using RNA samples from ‘Chieftain’ and ‘Dickenson Field’ fruit at 7, 10, 14, and 21 DPP. A differential expression and subsequent gene set enrichment analysis revealed an overrepresentation of upregulated genes in functional categories relevant to cell wall structure biosynthesis, cell wall modification/organization, transcription regulation, and metabolic processes. A pathway enrichment analysis detected upregulated genes in cutin, suberin monomer, and phenylpropanoid biosynthetic pathways. A further analysis of the expression profile of genes in those pathways revealed upregulation of genes in monolignol biosynthesis and lignin polymerization in the resistant fruit peel. Our findings suggest a shift in gene expression toward the physical strengthening of the cell wall associated with ARR to P. capsici . These findings provide candidate genes for developing Cucurbita cultivars with resistance to P. capsici and improve fruit rot management in Cucurbita species.
... This results in a largely lumped representation of lipid metabolic pathways 5 in existing plant models, reflecting the limited degree of annotation of plant genomes. Hence, despite numerous data sources [22][23][24][25] and pathways databases [26][27][28] the use of automated tools to reconstruct lipid metabolic pathways is not warranted even for the model plant Arabidopsis thaliana (Arabidopsis). This in turn limits the applications of constraint-based modeling to gain better understanding of plant lipid metabolism. ...
Article
Full-text available
Lipids play fundamental roles in regulating agronomically important traits. Advances in plant lipid metabolism have until recently largely been based on reductionist approaches, although modulation of its components can have system-wide effects. However, existing models of plant lipid metabolism provide lumped representations, hindering detailed study of component modulation. Here, we present the Plant Lipid Module (PLM) which provides a mechanistic description of lipid metabolism in the Arabidopsis thaliana rosette. We demonstrate that the PLM can be readily integrated in models of A. thaliana Col-0 metabolism, yielding accurate predictions (83%) of single lethal knock-outs and 75% concordance between measured transcript and predicted flux changes under extended darkness. Genome-wide associations with fluxes obtained by integrating the PLM in diel condition- and accession-specific models identify up to 65 candidate genes modulating A. thaliana lipid metabolism. Using mutant lines, we validate up to 40% of the candidates, paving the way for identification of metabolic gene function based on models capturing natural variability in metabolism.
... Moreover, there are more than 100 databases that illustrate and integrate protein-protein interaction (PPI) networks, such as the Predicted Rice Interactome Network (PRIN) and the Protein-Protein Interaction Database for Maize (PPIM) [35,36]. Lastly, metabolic interaction databases such as AraCyc [37], KEGG [38], and PlantCyc [39] provide validated information on biological pathways in plants. As these databases are projected to be improved in the upcoming years, they will play a key role in omics analysis [31]. ...
Article
Full-text available
In the face of a growing global population, plant breeding is being used as a sustainable tool for increasing food security. A wide range of high-throughput omics technologies have been developed and used in plant breeding to accelerate crop improvement and develop new varieties with higher yield performance and greater resilience to climate changes, pests, and diseases. With the use of these new advanced technologies, large amounts of data have been generated on the genetic architecture of plants, which can be exploited for manipulating the key characteristics of plants that are important for crop improvement. Therefore, plant breeders have relied on high-performance computing, bioinformatics tools, and artificial intelligence (AI), such as machine-learning (ML) methods, to efficiently analyze this vast amount of complex data. The use of bigdata coupled with ML in plant breeding has the potential to revolutionize the field and increase food security. In this review, some of the challenges of this method along with some of the opportunities it can create will be discussed. In particular, we provide information about the basis of bigdata, AI, ML, and their related subgroups. In addition, the bases and functions of some learning algorithms that are commonly used in plant breeding, three common data integration strategies for the better integration of different breeding datasets using appropriate learning algorithms, and future prospects for the application of novel algorithms in plant breeding will be discussed. The use of ML algorithms in plant breeding will equip breeders with efficient and effective tools to accelerate the development of new plant varieties and improve the efficiency of the breeding process, which are important for tackling some of the challenges facing agriculture in the era of climate change.
... When they have been employed in the data analyses, online tools such as the TOAST X-Species Transcriptional Explorer (https://gilroy-qlik.botany.wisc.edu/a/sense/app/ab2250b5-ee3a-4da8-b5da-fe87d5f2dbe6/overview), KnetMiner 23 , Metascape 24 , Ensembl GO 53 , the Kyoto Encyclopedia of Gene and Genomes 56 , AraCyc 57 and Reactome 58 are noted in the text and figure legends. Principal Component Analysis (PCA), Multidimensional Scaling analysis (MDS), t-distributed Stochastic Neighbor Embedding (T-SNE), Weighted Gene Correlation Network Analysis (WGCNA) and K-means statistical analyses were performed using the iDEP.94 ...
Article
Full-text available
Spaceflight presents a multifaceted environment for plants, combining the effects on growth of many stressors and factors including altered gravity, the influence of experiment hardware, and increased radiation exposure. To help understand the plant response to this complex suite of factors this study compared transcriptomic analysis of 15 Arabidopsis thaliana spaceflight experiments deposited in the National Aeronautics and Space Administration's GeneLab data repository. These data were reanalyzed for genes showing significant differential expression in spaceflight versus ground controls using a single common computational pipeline for either the microarray or the RNA-seq datasets. Such a standardized approach to analysis should greatly increase the robustness of comparisons made between datasets. This analysis was coupled with extensive cross-referencing to a curated matrix of metadata associated with these experiments. Our study reveals that factors such as analysis type (i.e., microarray versus RNA-seq) or environmental and hardware conditions have important confounding effects on comparisons seeking to define plant reactions to spaceflight. The metadata matrix allows selection of studies with high similarity scores, i.e., that share multiple elements of experimental design, such as plant age or flight hardware. Comparisons between these studies then helps reduce the complexity in drawing conclusions arising from comparisons made between experiments with very different designs.
Article
Full-text available
Arabidopsis thaliana, a small annual plant belonging to the mustard family, is the subject of study by an estimated 7000 researchers around the world. In addition to the large body of genetic, physiological and biochemical data gathered for this plant, it will be the first higher plant genome to be completely sequenced, with completion expected at the end of the year 2000. The sequencing effort has been coordinated by an international collaboration, the Arabidopsis Genome Initiative (AGI). The rationale for intensive investigation of Arabidopsis is that it is an excellent model for higher plants. In order to maximize use of the knowledge gained about this plant, there is a need for a comprehensive database and information retrieval and analysis system that will provide user-friendly access to Arabidopsis information. This paper describes the initial steps we have taken toward realizing these goals in a project called The Arabidopsis Information Resource (TAIR) (www.arabidopsis.org).
Article
Full-text available
MetaCyc is a metabolic-pathway database that describes 445 pathways and 1115 enzymes occurring in 158 organisms. MetaCyc is a review-level database in that a given entry in MetaCyc often integrates information from multiple literature sources. The pathways in MetaCyc were determined experimentally, and are labeled with the species in which they are known to occur based on literature references examined to date. MetaCyc contains extensive commentary and literature citations. Applications of MetaCyc include pathway analysis of genomes, metabolic engineering and biochemistry education. MetaCyc is queried using the Pathway Tools graphical user interface, which provides a wide variety of query operations and visualization tools. MetaCyc is available via the World Wide Web at http://ecocyc.org/ecocyc/metacyc.html, and is available for local installation as a binary program for the PC and the Sun workstation, and as a set of flatfiles. Contact metacyc-info{at}ai.sri.com for information on obtaining a local copy of MetaCyc.
Article
Full-text available
EcoCyc is an organism-specific pathway/genome database that describes the metabolic and signal-transduction pathways of Escherichia coli, its enzymes, its transport proteins and its mechanisms of transcriptional control of gene expression. EcoCyc is queried using the Pathway Tools graphical user interface, which provides a wide variety of query operations and visualization tools. EcoCyc is available at http://ecocyc.org/.
Article
Summary Chitin oligomers, released from fungal cell walls by endochitinase, induce defence and related cellular responses in many plants. However, little is known about chitin responses in the model plant Arabidopsis. We describe here a large-scale characterization of gene expression patterns in Arabidopsis in response to chitin treatment using an Arabidopsis microarray consisting of 2375 EST clones representing putative defence-related and regulatory genes. Transcript levels for 71 ESTs, representing 61 genes, were altered three-fold or more in chitin-treated seedlings relative to control seedlings. A number of transcripts exhibited altered accumulation as early as 10 min after exposure to chitin, representing some of the earliest changes in gene expression observed in chitin-treated plants. Included among the 61 genes were those that have been reported to be elicited by various pathogen-related stimuli in other plants. Additional genes, including genes of unknown function, were also identified, broadening our understanding of chitin-elicited responses. Among transcripts with enhanced accumulation, one cluster was enriched in genes with both the W-box promoter element and a novel regulatory element. In addition, a number of transcripts had decreased abundance, encoding several proteins involved in cell wall strengthening and wall deposition. The chalcone synthase promoter element was identified in the upstream regions of these genes, suggesting that pathogen signals may suppress the expression of some genes. These data indicate that Arabidopsis should be an excellent model to elucidate the mechanisms of chitin elicitation in plant defence.
Article
The exponential growth in the volume of accessible biological information has generated a confusion of voices surrounding the annotation of molecular information about genes and their products. The Gene Ontology (GO) project seeks to provide a set of structured vocabularies for specific biological domains that can be used to describe gene products in any organism. This work includes building three extensive ontologies to describe molecular function, biological process, and cellular component, and providing a community database resource that supports the use of these ontologies. The GO Consortium was initiated by scientists associated with three model organism databases: SGD, the Saccharomyces Genome database; FlyBase, the Drosophila genome database; and MGD/GXD, the Mouse Genome Informatics databases. Additional model organism database groups are joining the project. Each of these model organism information systems is annotating genes and gene products using GO vocabulary terms and incorporating these annotations into their respective model organism databases. Each database contributes its annotation files to a shared GO data resource accessible to the public at http://www.geneontology.org/. The GO site can be used by the community both to recover the GO vocabularies and to access the annotated gene product data sets from the model organism databases. The GO Consortium supports the development of the GO database resource and provides tools enabling curators and researchers to query and manipulate the vocabularies. We believe that the shared development of this molecular annotation resource will contribute to the unification of biological information.
Article
The specific enzymological route of L-phenylalanine biosynthesis has not been established in any higher plant system. The possible pathway routes that have been identified in microorganisms utilize either phenylpyruvate or L-arogenate as a unique intermediate. We now report the presence of arogenate dehydratase (which converts L-arogenate to L-phenylalanine) in cultured-cell populations of Nicotiana silvestris. Prephenate dehydratase (which converts prephenate to phenylpyruvate) was not detected. Arogenate dehydratase was also found in washed spinach chloroplasts, and these data add to emerging evidence in support of the existence in the plastidial compartment of a complete assembly of enzymes comprising aromatic amino acid biosynthesis. Arogenate dehydratase from tobacco and spinach were both specific for L-arogenate, inhibited by L-phenylalanine, and activated by L-tyrosine. Apparent Km values for L-arogenate (0.3 X 10(-3) M), pH optima (pH 8.5-9.5), and temperature optima for catalysis (32-34 degrees C) were also similar.
Article
The Arabidopsis genome project has recently reported sequences with similarity to members of the terpene synthase (TPS) gene family of higher plants. Surprisingly, several Arabidopsis terpene synthase-like sequences (AtTPS) share the most identity with TPS genes that participate in secondary metabolism in terpenoid-accumulating plant species. Expression of a putative Arabidopsis terpene synthase gene, designated AtTPS03, was demonstrated by amplification of a 392-bp cDNA fragment using primers designed to conserved regions of plant terpene synthases. Using the AtTPS03 fragment as a hybridization probe, a second AtTPS cDNA, designated AtTPS10, was isolated from a jasmonate-induced cDNA library. The partial AtTPS10 cDNA clone contained an open reading frame of 1665 bp encoding a protein of 555 amino acids. Functional expression of AtTPS10 in Escherichia coli yielded an active monoterpene synthase enzyme, which converted geranyl diphosphate (C10) into the acyclic monoterpenes β-myrcene and (E)-β-ocimene and small amounts of cyclic monoterpenes. Based on sequence relatedness, AtTPS10 was classified as a member of the TPSb subfamily of angiosperm monoterpene synthases. Sequence comparison of AtTPS10 with previously cloned monoterpene synthases suggests independent events of functional specialization of terpene synthases during the evolution of terpenoid secondary metabolism in gymnosperms and angiosperms. Functional characterization of the AtTPS10 gene was prompted by the availability of Arabidopsis genome sequences. Although Arabidopsis has not been reported to form terpenoid secondary metabolites, the unexpected expression of TPS genes belonging to the TPSb subfamily in this species strongly suggests that terpenoid secondary metabolism is active in the model system Arabidopsis.
Article
Unknown compounds in polar fractions of Arabidopsis thaliana crude leaf extracts were identified on the basis of calculations of elemental compositions obtained from gas chromatography/low-resolution quadrupole mass spectrometric data. Plant metabolites were methoximated and silylated prior to analysis. All known peaks were used as internal references to construct polynomial recalibration curves from raw mass spectrometric data. Mass accuracies of 0.005 ± 0.003 amu and isotope ratio errors of 0.5 ± 0.3% (A + 1/A) and 0.3 ± 0.2% (A + 2/A) could be achieved. Both masses and isotope ratios were combined when the elemental compositions of unknown peaks were calculated. After calculation, compound identities were elucidated by searching metabolic databases, interpreting spectra, and, finally, by comparison with reference compounds. Sum formulas of more than 70 peaks were determined throughout single GC/MS chromatograms. Exact masses were confirmed by high-resolution mass spectrometric data. More than 15 uncommon plant metabolites were identified, some of which are novel in Arabidopsis, such as tartronate semialdehyde, citramalic acid, allothreonine, or glycolic amide.
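The recalibration step described in this abstract can be illustrated with a minimal sketch. It assumes that each known peak provides a pair of measured and exact masses, and that the mass error is modeled as a low-order polynomial in the measured mass. The function names, the polynomial degree, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the cited work.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest order first."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y, where V is the Vandermonde matrix.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back-substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def fit_recalibration(measured_refs, exact_refs, degree=2):
    """Build a correction function from reference peaks (assumed model:
    exact - measured is a polynomial in the measured mass)."""
    # Center and scale the masses so the normal equations stay well conditioned.
    center = sum(measured_refs) / len(measured_refs)
    scale = max(abs(m - center) for m in measured_refs) or 1.0
    xs = [(m - center) / scale for m in measured_refs]
    errors = [e - m for m, e in zip(measured_refs, exact_refs)]
    coeffs = fit_polynomial(xs, errors, degree)

    def correct(measured_mass):
        x = (measured_mass - center) / scale
        return measured_mass + sum(c * x ** i for i, c in enumerate(coeffs))

    return correct


# Hypothetical reference peaks (measured, exact), for illustration only.
measured = [73.02, 147.03, 217.05, 307.08, 361.10]
exact = [73.047, 147.066, 217.108, 307.158, 361.169]
correct = fit_recalibration(measured, exact, degree=2)
recalibrated_unknown = correct(250.06)  # corrected mass of an unknown peak
```

Fitting the error rather than the mass itself, and rescaling the mass axis before the fit, keeps the small least-squares system numerically stable. A real workflow would also have to choose the polynomial degree and check the residuals at the reference peaks.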
Article
The major anthocyanin in the leaves and stems of Arabidopsis thaliana has been isolated and shown to be cyanidin 3-O-[2-O-(2-O-(sinapoyl)-β-D-xylopyranosyl)-6-O-(4-O-(β-D-glucopyranosyl)-p-coumaroyl-β-D-glucopyranoside] 5-O-[6-O-(malonyl) β-D-glucopyranoside]. This anthocyanin is a glucosylated version of one of the anthocyanins found in the flowers of the closely related Matthiola incana.
Article
Exponential growth of data, largely from whole-genome analyses, has changed the way biologists think about and handle data. Optimal use of these data requires effective methods to analyze and manage these data sets. Computers, software and the World Wide Web are now integral components of biological discovery. Understanding how information is obtained, processed and annotated in public databases allows researchers to effectively organize, analyze and export their own data into these databases. In this review we focus largely on two areas related to management of genomic data. We cite examples of resources available in the public domain and describe some of the software for data management systems currently available for plant research. In addition, we discuss a few concepts of data management from the perspective of an individual or group that wishes to provide data to the public databases, to use the information in the public databases more efficiently, or to develop a database to manage large data sets internally or for public access. These concepts include data descriptions, exchange format, curation, attribution, and database implementation.