BIOINFORMATICS APPLICATIONS NOTE Vol. 26 no. 20 2010, pages 2639–2640
doi:10.1093/bioinformatics/btq436
Systems biology Advance Access publication August 11, 2010
MetDAT: a modular and workflow-based free online pipeline for
mass spectrometry data processing, analysis and interpretation
Ambarish Biswas1, Kalyan C. Mynampati1, Shivshankar Umashankar1, Sheela Reuben1,
Gauri Parab2, Raghuraj Rao1, Velayutham S. Kannan1 and Sanjay Swarup1,2,3,*
1Singapore-Delft Water Alliance, National University of Singapore, Singapore 117576, 2Small Molecules Biology
Laboratory, Department of Biological Sciences, National University of Singapore, Singapore 117543 and 3NUS
Environmental Research Institute (NERI), #02-01, T-Lab Building (TL), 5A Engineering Drive 1, Singapore 117411
*To whom correspondence should be addressed.
Associate Editor: Trey Ideker
ABSTRACT
Summary: Analysis of high-throughput metabolomics experiments is a resource-intensive process that includes pre-processing, pre-treatment and post-processing at each level of the experimental hierarchy. We developed an interactive, user-friendly online software tool called the Metabolite Data Analysis Tool (MetDAT) for mass spectrometry data. It offers a pipeline of tools for file handling, data pre-processing, univariate and multivariate statistical analyses, database searching and pathway mapping. Outputs are produced as text and high-quality images in real time. MetDAT allows users to combine data management and experiment-centric workflows for the optimization of metabolomics methods and metabolite analysis.
Availability: MetDAT is available free for academic use from http://smbl.nus.edu.sg/METDAT2/.
Contact: sanjay@nus.edu.sg
Received on May 3, 2010; revised on July 20, 2010; accepted on
July 25, 2010
1 INTRODUCTION
Metabolomics experiments conducted using mass spectrometry produce spectral outputs amounting to gigabytes of data. This volume results from elaborate experimental set-ups with several replicates, time-series studies and different types of treatment parameters for a single sample. Datasets from such experimental set-ups can be very complex, and querying information from several subsets of these datasets to derive meaningful biological interpretations is very challenging. To address this problem in data handling and analysis, the metabolomics research community has recommended a minimal set of reporting standards and general guidelines for the analysis, interpretation and exchange of metabolomics experiments (Fiehn et al., 2008; Goodacre et al., 2007; Sansone et al., 2007).
Data pre-processing generally involves noise removal and baseline reduction of the raw data. The resulting data are normalized and scaled for univariate and multivariate statistical analysis, and visualization of the resulting output helps users understand and interpret experimental results. A number of standalone data analysis tools for pre-processing, statistical analysis and pathway mapping (MZmine, XCMS, MassTRIX, MetExplore, Bioconductor) are freely available on the web (Jourdan et al., 2010; Katajamaa et al., 2007; Lommen, 2009). However, these tools currently have no means of incorporating information about the experiment, such as its hierarchical data structure. As a result, meaningful inferences based on comparisons at the various levels of the data structure are not straightforward, and their associated effects are not easily interpreted.
Most of the available tools also require continuous user interaction within the programs, leading to delayed output. Here, we present the Metabolite Data Analysis Tool (MetDAT) to address these gaps in currently available software for mass spectrometry data analysis using a systems-centric approach. MetDAT is a web-based tool with an open and integrated system that performs data pre-processing and analysis followed by database searches, with filters to refine the output. MetDAT also provides interactive, customizable modules and user-driven analysis of data at hierarchical levels.
2 METHODS AND IMPLEMENTATION
2.1 Computational model and user interface
MetDAT is a platform-independent web application that works with the most widely used web browsers, such as Mozilla Firefox (above 3.2), Internet Explorer (above 6), Google Chrome and Safari. The web interface of MetDAT is developed using HTML and CGI-Perl scripts. Computational functions of MetDAT are written in the Perl and R languages, while graphics are generated using Gnuplot (Williams et al., 1993) and R packages (R Development Core Team, 2008). The software is hosted on a server running Red Hat Enterprise Linux Server release 5.4.
MetDAT accepts input data as tab-delimited text files in a single zipped file, which the 'Prepare Dataset' module converts into a matrix (illustrated in the sketch at the end of this paragraph). The standalone modules run seamlessly on uploaded data and show results instantaneously, but these results are not saved on MetDAT's server. All MetDAT programs can be used without logging in; however, logged-in users have the benefit of creating custom workflows or using a default one. The workflows are designed so that the data can be cleaned up, analyzed, visualized and interpreted in one smooth pipeline. Two default workflows are provided in MetDAT: one for differential metabolite analysis and another for method optimization. Uploaded datasets, workflows and results are automatically stored on the MetDAT server and can be retrieved later.
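To make this input format concrete, the following is a minimal R sketch of this kind of input handling: a zip archive of per-sample, tab-delimited files assembled into a single intensity matrix. The archive name, the two-column (m/z, intensity) layout and the shared m/z axis are assumptions for illustration; this is not MetDAT's actual code.

```r
# Hypothetical sketch of assembling MetDAT-style input: a zip archive of
# tab-delimited text files (assumed here to hold m/z and intensity columns)
# converted into a features x samples matrix.
files <- unzip("experiment.zip", exdir = tempdir())   # extract member files
spectra <- lapply(files, read.delim, header = FALSE,
                  col.names = c("mz", "intensity"))
mz  <- spectra[[1]]$mz                                # assume a shared m/z axis
mat <- sapply(spectra, function(s) s$intensity)       # one column per sample
rownames(mat) <- mz
colnames(mat) <- basename(files)
```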
MetDAT has no restriction on the number of files that can be processed, although larger datasets require longer processing times. The software combines 23 of the programs most commonly used by metabolomics researchers for pre-processing, pre-treatment, data analysis, visualization, metabolite database searching and pathway mapping. Other programs, such as self-organizing maps, random forests and support vector machines, are not available in this version of the software. The included programs have been streamlined so that beginners can use them readily, while ample controls remain for customization by advanced users.
Fig. 1. Screenshot of the MetDAT v2.0 software. Individual programs are accessible from the left panel. Important functions such as workflow creation, data management, account details and visualization can be accessed via the dashboard. Graphical outputs from the various programs in MetDAT are shown at the bottom.
The user interface provides a dashboard for easy access to the programs and results (Fig. 1). A standalone visualization module, accessible from the dashboard, offers two types of interactive plots: an X–Y plot and a bar chart (a brief plotting sketch follows). These require the Adobe Flash plug-in to be installed in the browser. Algorithms for the programs, references, a user guide and examples are included in the manual on the website for easy reference.
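For a rough sense of what the two plot types convey, here is a static stand-in using base R graphics (MetDAT's own plots are generated with Gnuplot and rendered interactively via Flash); mat is the hypothetical intensity matrix from the earlier input sketch.

```r
# Non-interactive stand-in for the two plot types, using base R graphics;
# 'mat' is the hypothetical features x samples matrix assembled above.
plot(mat[, 1], mat[, 2],
     xlab = "Sample 1 intensity", ylab = "Sample 2 intensity",
     main = "X-Y plot")                                # X-Y scatter of two samples
barplot(rowMeans(mat)[1:20], las = 2, cex.names = 0.6,
        main = "Mean intensity, first 20 features")    # bar chart
```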
2.2 Data processing, analysis and output
The pre-processing and pre-treatment modules in MetDAT include noise removal, baseline correction and normalization or scaling of the data to increase the signal-to-noise ratio (a minimal sketch of these steps follows).
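The sketch below illustrates these steps in R on a features-by-samples matrix mat; the noise threshold, the quantile baseline and the autoscaling choice are illustrative assumptions rather than MetDAT's actual algorithms.

```r
# Minimal sketch of the named pre-processing steps; thresholds and methods
# are assumptions for illustration, not MetDAT's implementation.
noise_floor <- 100                                  # assumed noise threshold
mat[mat < noise_floor] <- 0                         # noise removal
baseline <- apply(mat, 2, quantile, probs = 0.05)   # crude per-sample baseline
mat <- sweep(mat, 2, baseline, "-")                 # baseline correction
mat[mat < 0] <- 0
norm <- sweep(mat, 2, colSums(mat), "/") * 1e6      # total-intensity normalization
scaled <- t(scale(t(norm)))                         # autoscale each feature
```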
Data analysis covers a range of statistical techniques such as analysis of variance, principal component analysis, partial least squares (PLS), partial least squares-discriminant analysis (PLS-DA), linear discriminant analysis and the t-test. The major strength of this software is its ability to analyze data at various hierarchical levels, such as samples, treatments, time points, biological repeats and technical replicates. Datasets for the different hierarchical levels are assigned during data preparation, and users then have the option to perform intensive statistical analysis at any of these levels. Differentially expressed metabolites are listed based on fold change and t-test analysis (see the sketch below).
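A hedged sketch of this fold-change plus t-test listing, assuming a two-group design with six samples (the group labels are invented) and the normalized, positive-valued matrix norm from the pre-processing sketch:

```r
# Rank differential metabolites by fold change and t-test; the design
# below (3 control vs 3 treated columns) is an assumption for illustration.
group <- factor(c("control", "control", "control",
                  "treated", "treated", "treated"))
pvals <- apply(norm, 1, function(x) t.test(x ~ group)$p.value)
fc    <- rowMeans(norm[, group == "treated"]) /
         rowMeans(norm[, group == "control"])
hits  <- which(pvals < 0.05 & abs(log2(fc)) > 1)    # simple illustrative cut-offs
results <- data.frame(mz = rownames(norm)[hits],
                      log2FC = log2(fc)[hits], p = pvals[hits])
```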
The listed metabolites can then be searched against four databases: AraCyc (Mueller et al., 2003), HMDB (Wishart et al., 2009), PlantCyc and KEGG (Kanehisa and Goto, 2000), along with multiple filters for better-targeted results (as sketched below). The database search includes a feature for metabolic pathway identification, and key differential metabolites are highlighted so that users can focus on selected parts of metabolic networks.
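The sketch below conveys the flavor of such a search with a mass-tolerance filter; the tiny local compound table and the tolerance value are stand-ins for the real AraCyc/HMDB/PlantCyc/KEGG back ends that MetDAT queries.

```r
# Illustrative database search with a mass-tolerance filter; the compound
# table is invented for the example (neutral monoisotopic masses).
compounds <- data.frame(name = c("citrate", "malate", "glucose"),
                        mass = c(192.0270, 134.0215, 180.0634))
tol_ppm <- 10                                       # assumed tolerance filter
match_mass <- function(mz) {
  hit <- abs(compounds$mass - mz) / compounds$mass * 1e6 <= tol_ppm
  compounds$name[hit]
}
candidates <- lapply(as.numeric(results$mz), match_mass)  # candidate IDs per feature
```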
3 CONCLUSIONS
Metabolic perturbation studies involve understanding the effects of variation in a statistically robust fashion. This software was created with biologists' needs for data analysis and interpretation in mind. The MetDAT platform simplifies and standardizes data analysis so that it can be used with minimal training. MetDAT provides data processing and analysis, database searches for metabolite identification, pathway mapping and a rich palette of visualization tools. The combination of data management and customizable workflows makes data analysis faster, saving researchers' time.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge the support and contributions of
the Singapore-Delft Water Alliance (SDWA). The research presented
in this work was carried out as part of the SDWA’s Aquatic Science
Centre at Ulu Pandan programme.
Funding: Singapore-Delft Water Alliance (WBS No: R-264-001-002-272).
Conflict of Interest: none declared.
REFERENCES
Fiehn,O. et al. (2008) Quality control for plant metabolomics: reporting MSI-compliant studies. Plant J., 53, 691–704.
Goodacre,R. et al. (2007) Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics, 3, 231–241.
Jourdan,F. et al. (2010) MetExplore: a web server to link metabolomic experiments and genome-scale metabolic networks. Nucleic Acids Res., 38, W132–W137.
Kanehisa,M. and Goto,S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28, 27–30.
Katajamaa,M. et al. (2007) Data processing for mass spectrometry-based metabolomics. J. Chromatogr. A, 1158, 318–328.
Lommen,A. (2009) MetAlign: interface-driven, versatile metabolomics tool for hyphenated full-scan mass spectrometry data preprocessing. Anal. Chem., 81, 3079–3086.
Mueller,L.A. et al. (2003) AraCyc: a biochemical pathway database for Arabidopsis. Plant Physiol., 132, 453–460.
R Development Core Team (2008) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at http://www.R-project.org.
Sansone,S.A. et al. (2007) The metabolomics standards initiative. Nat. Biotechnol., 25, 846–848.
Williams,T. et al. (1993) gnuplot. Available at http://www.gnuplot.info.
Wishart,D.S. et al. (2009) HMDB: the Human Metabolome Database. Nucleic Acids Res., 35, D521–D526.