BIOINFORMATICS APPLICATIONS NOTE Vol. 26 no. 20 2010, pages 2639–2640
doi:10.1093/bioinformatics/btq436
Systems biology Advance Access publication August 11, 2010
MetDAT: a modular and workflow-based free online pipeline for
mass spectrometry data processing, analysis and interpretation
Ambarish Biswas1, Kalyan C. Mynampati1, Shivshankar Umashankar1, Sheela Reuben1,
Gauri Parab2, Raghuraj Rao1, Velayutham S. Kannan1 and Sanjay Swarup1,2,3,*
1Singapore-Delft Water Alliance, National University of Singapore, Singapore 117576, 2Small Molecules Biology
Laboratory, Department of Biological Sciences, National University of Singapore, Singapore 117543 and 3NUS
Environmental Research Institute (NERI), #02-01, T-Lab Building (TL), 5A Engineering Drive 1, Singapore 117411
*To whom correspondence should be addressed.
Associate Editor: Trey Ideker
ABSTRACT
Summary: Analysis of high-throughput metabolomics experiments is a resource-intensive process that includes pre-processing, pre-treatment and post-processing at each level of the experimental hierarchy. We developed an interactive, user-friendly online software tool called the Metabolite Data Analysis Tool (MetDAT) for mass spectrometry data. It offers a pipeline of tools for file handling, data pre-processing, univariate and multivariate statistical analyses, database searching and pathway mapping. Outputs are produced as text and high-quality images in real time. MetDAT allows users to combine data management and experiment-centric workflows for the optimization of metabolomics methods and metabolite analysis.
Availability: MetDAT is available free for academic use from http://smbl.nus.edu.sg/METDAT2/.
Contact: sanjay@nus.edu.sg
Received on May 3, 2010; revised on July 20, 2010; accepted on
July 25, 2010
1 INTRODUCTION
Metabolomics experiments conducted using mass spectrometry produce spectral outputs amounting to gigabytes of data. This volume results from elaborate experimental set-ups with several replicates, time-series studies and different types of treatment parameters for a single sample. Datasets from such experimental set-ups can be very complex, and querying information from several subsets of these datasets to derive meaningful biological interpretations is very challenging. To address this problem in data handling and analysis, the metabolomics research community has recommended a minimal set of reporting standards and general guidelines for the analysis, interpretation and exchange of metabolomics experiments (Fiehn et al., 2008; Goodacre et al., 2007; Sansone et al., 2007).
Data pre-processing generally involves noise removal and baseline reduction of the raw data. The resulting data are normalized and scaled for univariate and multivariate statistical analysis, and visualization of the resulting output helps users understand and interpret experimental results. A number of standalone data analysis tools for pre-processing, statistical analysis and pathway mapping (MZmine, XCMS, MassTRIX, MetExplore, Bioconductor) are freely available on the web (Jourdan et al., 2010; Katajamaa et al., 2007; Lommen, 2009). However, these tools currently have no means of incorporating information about the experiment, such as its hierarchical data structure. As a result, meaningful inferences based on comparisons at the various levels of the data structure are not straightforward, and their associated effects are not easily interpreted.
Most of the available tools also require continuous user interaction within the programs, leading to delayed output. Here, we present the Metabolite Data Analysis Tool (MetDAT) to address these gaps in currently available software for mass spectrometry data analysis using a systems-centric approach. MetDAT is a web-based tool with an open and integrated system that performs data pre-processing and analysis followed by database searches, with filters to refine the output. MetDAT also provides interactive, customizable modules and user-driven analysis of data at hierarchical levels.
2 METHODS AND IMPLEMENTATION
2.1 Computational model and user interface
MetDAT is a platform-independent web application that works with the most widely used web browsers, such as Mozilla Firefox (above 3.2), Internet Explorer (above 6), Google Chrome and Safari. The web interface of MetDAT is developed using HTML and CGI-Perl scripts. Computational functions of MetDAT are written in the Perl and R languages, while graphics are generated using Gnuplot (Williams et al., 1993) and R packages (R Development Core Team, 2008). The software is hosted on a server running Red Hat Enterprise Linux Server release 5.4.
MetDAT accepts input data as tab-delimited text files in a single zipped file, which the 'Prepare Dataset' module converts into a matrix (illustrated in the sketch at the end of this paragraph). The standalone modules run seamlessly on uploaded data and show results instantaneously, but these results are not saved on MetDAT's server. All MetDAT programs can be used without logging in; however, logged-in users have the benefit of creating custom workflows or using a default one. The workflows are designed so that the data can be cleaned up, analyzed, visualized and interpreted in one smooth pipeline. Two default workflows are provided in MetDAT: one for differential metabolite analysis and another for method optimization. Uploaded datasets, workflows and results are automatically stored on the MetDAT server and can be retrieved later.
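To make this input format concrete, the following is a minimal R sketch of this kind of input handling: a zip archive of per-sample, tab-delimited files assembled into a single intensity matrix. The archive name, the two-column (m/z, intensity) layout and the shared m/z axis are assumptions for illustration; this is not MetDAT's actual code.

```r
# Hypothetical sketch of assembling MetDAT-style input: a zip archive of
# tab-delimited text files (assumed here to hold m/z and intensity columns)
# converted into a features x samples matrix.
files <- unzip("experiment.zip", exdir = tempdir())   # extract member files
spectra <- lapply(files, read.delim, header = FALSE,
                  col.names = c("mz", "intensity"))
mz  <- spectra[[1]]$mz                                # assume a shared m/z axis
mat <- sapply(spectra, function(s) s$intensity)       # one column per sample
rownames(mat) <- mz
colnames(mat) <- basename(files)
```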
MetDAT has no restriction on the number of files that can be processed, although larger datasets require longer processing times. The software combines 23 of the programs most commonly used by metabolomics researchers for pre-processing, pre-treatment, data analysis, visualization, metabolite database searching and pathway mapping. Other programs, such as self-organizing maps, random forests and support vector machines, are not available in this version of the software. The included programs have been streamlined so that beginners can use them readily, while ample controls remain for customization by advanced users.
Fig. 1. Screenshot of the MetDAT v2.0 software. Individual programs are accessible from the left panel. Important functions such as workflow creation, data management, account details and visualization can be accessed via the dashboard. Graphical outputs from the various programs in MetDAT are shown at the bottom.
The user interface provides a dashboard for easy access to the programs and results (Fig. 1). A standalone visualization module, accessible from the dashboard, offers two types of interactive plots: an X–Y plot and a bar chart (a brief plotting sketch follows). These require the Adobe Flash plug-in to be installed in the browser. Algorithms for the programs, references, a user guide and examples are included in the manual on the website for easy reference.
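For a rough sense of what the two plot types convey, here is a static stand-in using base R graphics (MetDAT's own plots are generated with Gnuplot and rendered interactively via Flash); mat is the hypothetical intensity matrix from the earlier input sketch.

```r
# Non-interactive stand-in for the two plot types, using base R graphics;
# 'mat' is the hypothetical features x samples matrix assembled above.
plot(mat[, 1], mat[, 2],
     xlab = "Sample 1 intensity", ylab = "Sample 2 intensity",
     main = "X-Y plot")                                # X-Y scatter of two samples
barplot(rowMeans(mat)[1:20], las = 2, cex.names = 0.6,
        main = "Mean intensity, first 20 features")    # bar chart
```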
2.2 Data processing, analysis and output
The pre-processing and pre-treatment modules in MetDAT include noise removal, baseline correction and normalization or scaling of the data to increase the signal-to-noise ratio (a minimal sketch of these steps follows).
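The sketch below illustrates these steps in R on a features-by-samples matrix mat; the noise threshold, the quantile baseline and the autoscaling choice are illustrative assumptions rather than MetDAT's actual algorithms.

```r
# Minimal sketch of the named pre-processing steps; thresholds and methods
# are assumptions for illustration, not MetDAT's implementation.
noise_floor <- 100                                  # assumed noise threshold
mat[mat < noise_floor] <- 0                         # noise removal
baseline <- apply(mat, 2, quantile, probs = 0.05)   # crude per-sample baseline
mat <- sweep(mat, 2, baseline, "-")                 # baseline correction
mat[mat < 0] <- 0
norm <- sweep(mat, 2, colSums(mat), "/") * 1e6      # total-intensity normalization
scaled <- t(scale(t(norm)))                         # autoscale each feature
```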
Data analysis covers a range of statistical techniques such as analysis of variance, principal component analysis, partial least squares (PLS), partial least squares-discriminant analysis (PLS-DA), linear discriminant analysis and the t-test. The major strength of this software is its ability to analyze data at various hierarchical levels, such as samples, treatments, time points, biological repeats and technical replicates. Datasets for the different hierarchical levels are assigned during data preparation, and users then have the option to perform intensive statistical analysis at any of these levels. Differentially expressed metabolites are listed based on fold change and t-test analysis (see the sketch below).
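A hedged sketch of this fold-change plus t-test listing, assuming a two-group design with six samples (the group labels are invented) and the normalized, positive-valued matrix norm from the pre-processing sketch:

```r
# Rank differential metabolites by fold change and t-test; the design
# below (3 control vs 3 treated columns) is an assumption for illustration.
group <- factor(c("control", "control", "control",
                  "treated", "treated", "treated"))
pvals <- apply(norm, 1, function(x) t.test(x ~ group)$p.value)
fc    <- rowMeans(norm[, group == "treated"]) /
         rowMeans(norm[, group == "control"])
hits  <- which(pvals < 0.05 & abs(log2(fc)) > 1)    # simple illustrative cut-offs
results <- data.frame(mz = rownames(norm)[hits],
                      log2FC = log2(fc)[hits], p = pvals[hits])
```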
The listed metabolites can then be searched against four databases: AraCyc (Mueller et al., 2003), HMDB (Wishart et al., 2009), PlantCyc and KEGG (Kanehisa and Goto, 2000), along with multiple filters for better-targeted results (as sketched below). The database search includes a feature for metabolic pathway identification, and key differential metabolites are highlighted so that users can focus on selected parts of metabolic networks.
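The sketch below conveys the flavor of such a search with a mass-tolerance filter; the tiny local compound table and the tolerance value are stand-ins for the real AraCyc/HMDB/PlantCyc/KEGG back ends that MetDAT queries.

```r
# Illustrative database search with a mass-tolerance filter; the compound
# table is invented for the example (neutral monoisotopic masses).
compounds <- data.frame(name = c("citrate", "malate", "glucose"),
                        mass = c(192.0270, 134.0215, 180.0634))
tol_ppm <- 10                                       # assumed tolerance filter
match_mass <- function(mz) {
  hit <- abs(compounds$mass - mz) / compounds$mass * 1e6 <= tol_ppm
  compounds$name[hit]
}
candidates <- lapply(as.numeric(results$mz), match_mass)  # candidate IDs per feature
```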
3 CONCLUSIONS
Metabolic perturbation studies involve understanding the effects of variation in a statistically robust fashion. This software was created with biologists' needs for data analysis and interpretation in mind. The MetDAT platform simplifies and standardizes data analysis so that it can be used with minimal training. MetDAT provides data processing and analysis, database searches for metabolite identification, pathway mapping and a rich palette of visualization tools. The combination of data management and customizable workflows makes data analysis faster, saving researchers' time.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge the support and contributions of
the Singapore-Delft Water Alliance (SDWA). The research presented
in this work was carried out as part of the SDWA’s Aquatic Science
Centre at Ulu Pandan programme.
Funding: Singapore-Delft Water Alliance (WBS No: R-264-001-002-272).
Conflict of Interest: none declared.
REFERENCES
Fiehn,O. et al. (2008) Quality control for plant metabolomics: reporting MSI-compliant studies. Plant J., 53, 691–704.
Goodacre,R. et al. (2007) Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics, 3, 231–241.
Jourdan,F. et al. (2010) MetExplore: a web server to link metabolomic experiments and genome-scale metabolic networks. Nucleic Acids Res., 38, W132–W137.
Kanehisa,M. and Goto,S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28, 27–30.
Katajamaa,M. et al. (2007) Data processing for mass spectrometry-based metabolomics. J. Chromatogr. A, 1158, 318–328.
Lommen,A. (2009) MetAlign: interface-driven, versatile metabolomics tool for hyphenated full-scan mass spectrometry data preprocessing. Anal. Chem., 81, 3079–3086.
Mueller,L.A. et al. (2003) AraCyc: a biochemical pathway database for Arabidopsis. Plant Physiol., 132, 453–460.
R Development Core Team (2008) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at http://www.R-project.org.
Sansone,S.A. et al. (2007) The metabolomics standards initiative. Nat. Biotechnol., 25, 846–848.
Williams,T. et al. (1993) gnuplot. Available at http://www.gnuplot.info.
Wishart,D.S. et al. (2009) HMDB: the Human Metabolome Database. Nucleic Acids Res., 35, D521–D526.