Fig 1 - uploaded by Juho Rousu
Overview of the two-step metabolite identification framework. An example molecule, tryptophan (mass 204.2 Da), produces a characteristic MS/MS spectrum, which is used to predict the original molecule through fingerprints. The predicted fingerprints, along with the neutral mass measurement, are used to filter a molecular repository for candidates.


Source publication
Article
Full-text available
Metabolite identification from tandem mass spectra is an important problem in metabolomics, underpinning subsequent metabolic modelling and network analysis. Yet, currently this task requires matching the observed spectrum against a database of reference spectra originating from similar equipment and closely matching operating parameters, a conditi...

Contexts in source publication

Context 1
... tandem mass spectrum is generated by selecting an unknown ion band and its mass-to-charge ratio to undergo fragmentation (see Figure 1). Under mild fragmentation conditions, the peak corresponding to the molecular ion is still visible in the tandem spectrum at the same mass-to-charge ratio. During fragmentation, an ion is often cleaved into two fragments, one of which retains the charge and is visible in the tandem mass spectrum. The complementary fragment is a neutral loss that is invisible in the spectrum. The first step in compound identification is to constrain the mass and elemental composition of the compound through peak masses. Computational methods use the compound's peak and its isotope peak masses to compute a set of possible elemental compositions [11]. However, the mass measurement accuracy determines how large this set is, because every composition that falls within the measurement error is compatible. High or ultra-high resolution analyzers, such as Time-of-Flight and Fourier Transform analyzers, can achieve low mass errors in the range of 1 to 5 ppm ...
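The 1 to 5 ppm range quoted above can be made concrete with a short calculation. The following is an illustrative sketch, not code from the source publication: the function names `ppm_error` and `mass_window` are hypothetical, and the monoisotopic mass of 204.0899 Da for tryptophan (C11H12N2O2) is a standard reference value that is assumed here, not taken from the text.

```python
from typing import Tuple


def ppm_error(measured: float, theoretical: float) -> float:
    """Relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def mass_window(theoretical: float, tolerance_ppm: float) -> Tuple[float, float]:
    """Mass interval that is compatible with a given ppm tolerance."""
    delta = theoretical * tolerance_ppm * 1e-6
    return theoretical - delta, theoretical + delta


# Assumed reference value: monoisotopic mass of tryptophan, C11H12N2O2, in Da.
TRYPTOPHAN_MONOISOTOPIC = 204.0899

low, high = mass_window(TRYPTOPHAN_MONOISOTOPIC, tolerance_ppm=5.0)
print(f"5 ppm window: {low:.4f} to {high:.4f} Da")

# Hypothetical instrument reading, used only to show the error calculation.
reading = 204.0905
print(f"error of reading: {ppm_error(reading, TRYPTOPHAN_MONOISOTOPIC):.2f} ppm")
```

At this mass, a 5 ppm tolerance corresponds to roughly ±0.001 Da, which is why lower mass errors leave fewer compatible elemental compositions.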
Context 2
... consider the mass spectrum of a molecule as a collection χ = {x_1, ..., x_k} ∈ X of k two-dimensional peak tuples x_i ∈ R^2 (see Figure 1). A peak x = (mass, int)^T represents the mass-to-charge value and the intensity of the peak measurement. We normalize all intensities to the range [0, 1]. Often we have a series of spectra of a molecule measured with increasing collision energies as (χ_10eV, ..., χ_50eV ...
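A minimal sketch of this representation follows. The excerpt does not say how intensities are normalized, so dividing by the largest intensity in the spectrum is an assumption; the type alias `Peak`, the function `normalize_spectrum`, and the peak values are all hypothetical.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (mass-to-charge value, intensity)


def normalize_spectrum(peaks: List[Peak]) -> List[Peak]:
    """Scale intensities to [0, 1] by dividing by the largest intensity.

    Assumption: base-peak scaling; the source only states the target range.
    """
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        return list(peaks)  # nothing sensible to scale by
    return [(mz, intensity / max_intensity) for mz, intensity in peaks]


# Made-up peaks for illustration only.
spectrum = [(118.07, 250.0), (146.06, 1000.0), (188.07, 400.0)]
normalized = normalize_spectrum(spectrum)
for mz, intensity in normalized:
    print(f"{mz:8.2f}  {intensity:.2f}")
```

Mass-to-charge values are left untouched; only the second component of each peak tuple is rescaled.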
Context 3
... introduce a novel two-step pattern-recognition approach (see Figure 1) to the metabolite identification problem. Instead of directly learning a mapping between the spectrum and the metabolite, we first predict a set of characterizing fingerprints of the metabolite from its tandem mass spectrum using a kernel-based approach. We learn our fingerprint prediction model from a large set of tandem mass spectra obtained from the public mass spectral database MassBank [6]. In the next step, we match the predicted fingerprints against a large molecular database to obtain a list of candidate metabolites. The metabolite identification model generalizes to metabolites not present in reference spectral databases. Due to the machine learning approach, data from any type of mass spectrometer is ...
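The second step described above (filtering a molecular database by mass, then matching predicted fingerprints) could be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the scoring rule (fraction of agreeing fingerprint bits), the 5 ppm mass window, the names `Candidate`, `fingerprint_agreement` and `identify`, and all fingerprint values are invented for the example. The fingerprint prediction of step one is replaced by a hard-coded stand-in vector.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Candidate(NamedTuple):
    name: str
    mass: float                   # neutral mass in Da
    fingerprint: Tuple[int, ...]  # binary substructure indicators


def fingerprint_agreement(predicted: Sequence[int], known: Sequence[int]) -> float:
    """Fraction of fingerprint positions on which the two vectors agree."""
    if len(predicted) != len(known) or len(predicted) == 0:
        raise ValueError("fingerprints must be non-empty and of equal length")
    matches = sum(p == k for p, k in zip(predicted, known))
    return matches / len(predicted)


def identify(
    predicted: Sequence[int],
    measured_mass: float,
    repository: List[Candidate],
    tolerance_ppm: float = 5.0,
) -> List[Tuple[str, float]]:
    """Keep candidates inside the mass window, then rank them by agreement."""
    window = measured_mass * tolerance_ppm * 1e-6
    in_window = [c for c in repository if abs(c.mass - measured_mass) <= window]
    scored = [(c.name, fingerprint_agreement(predicted, c.fingerprint)) for c in in_window]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy repository with invented fingerprints; only the logic matters here.
repository = [
    Candidate("candidate_A", 204.0899, (1, 1, 0, 1, 0, 1)),
    Candidate("candidate_B", 204.0899, (1, 0, 0, 1, 1, 0)),
    Candidate("candidate_C", 180.0634, (1, 1, 0, 1, 0, 1)),
]
predicted_fingerprint = (1, 1, 0, 1, 0, 0)  # stand-in for the output of step one

ranking = identify(predicted_fingerprint, 204.0900, repository)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because candidates come from a structure database, not a spectral library, a molecule can be ranked even if no reference spectrum of it has ever been measured, which is the generalization property the excerpt points to.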

Citations

... In adjacent chemical disciplines such as metabolomics, these curated molecular datasets play a crucial role for chemical analysis. For example, they assist in compound identification either directly (Kind et al., 2013; Sud et al., 2007; HighChem LLC, Slovakia; Sawada et al., 2012; Wissenbach et al., 2011b, a; Montenegro-Burke et al., 2020; Taguchi and Ishikawa, 2010; Oberacher, 2012; Hummel et al., 2013; Watanabe et al., 2000; McLafferty and Wiley, 2020; Wang et al., 2016; Wishart et al., 2022; Wallace and Moorthy, 2023; Weber et al., 2012; MassBank consortium, 2024; of North America, 2024) or through the development of machine learning-based identification tools (Heinonen et al., 2012; Dührkop et al., 2015; Brouard et al., 2016; Nguyen et al., 2018, 2019). These datasets also form the foundation of data-driven analysis platforms (e.g., Nothias et al., 2020). ...
... nablaDFT, instead, was curated from the Molecular Sets (MOSES) dataset (Polykovskiy et al., 2020) for the purpose of training models for quantum chemical property prediction (conformational energy and Hamiltonian). On the other hand, the MassBanks provide data pairs of molecular structures and their corresponding mass spectra, and have been used to train and test machine learning models for compound identification based on mass spectra (Heinonen et al., 2012; Dührkop et al., 2015, 2019). The MassBanks primarily contain molecules with ...
Preprint
Full-text available
    The formation of aerosol particles in the atmosphere impacts air quality and climate change, but many of the organic molecules involved remain unknown. Machine learning could aid in identifying these compounds through accelerated analysis of molecular properties and detection characteristics. However, such progress is hindered by the current lack of curated datasets for atmospheric molecules and their associated properties. To tackle this challenge, we propose a similarity analysis that connects atmospheric compounds to existing large molecular datasets used for machine learning development. We find a small overlap between atmospheric and non-atmospheric molecules using standard molecular representations in machine learning applications. The identified out-of-domain character of atmospheric compounds is related to their distinct functional groups and atomic composition. Our investigation underscores the need for collaborative efforts to gather and share more molecular-level atmospheric chemistry data. The presented similarity based analysis can be used for future dataset curation for machine learning development in the atmospheric sciences.
    ... Spectral library searching compares query spectra against a spectral library based on a similarity measure, while structure database searching compares the query spectra against compounds in a structure database using an intermediate representation. Different intermediate representation methods have been proposed for the latter, including transforming MS/MS spectra into molecular fingerprints [6][7][8][9][10], generating in silico MS/MS spectra from reference compounds [11][12][13][14][15][16][17][18][19], and matching spectra and reference compound embeddings [20]. The database-free methods, such as MassGenie [21] and MSNovelist [5], require neither spectral libraries nor compound structure databases for structure prediction. ...
    Article
    Full-text available
    Small molecule identification is a crucial task in analytical chemistry and life sciences. One of the most commonly used technologies to elucidate small molecule structures is mass spectrometry. Spectral library search of product ion spectra (MS/MS) is a popular strategy to identify or find structural analogues. This approach relies on the assumption that spectral similarity and structural similarity are correlated. However, popular spectral similarity measures, usually calculated based on identical fragment matches between the MS/MS spectra, do not always accurately reflect the structural similarity. In this study, we propose TransExION, a Transformer based Explainable similarity metric for IONS. TransExION detects related fragments between MS/MS spectra through their mass difference and uses these to estimate spectral similarity. These related fragments can be nearly identical, but can also share a substructure. TransExION also provides a post-hoc explanation of its estimation, which can be used to support scientists in evaluating the spectral library search results and thus in structure elucidation of unknown molecules. Our model has a Transformer based architecture and it is trained on the data derived from GNPS MS/MS libraries. The experimental results show that it improves on existing spectral similarity measures in searching and interpreting structural analogues as well as in molecular networking. Scientific Contribution: We propose a transformer-based spectral similarity metric that improves the comparison of small molecule tandem mass spectra. We provide a post hoc explanation that can serve as a good starting point for unknown spectra annotation based on database spectra.
    ... Traditionally, compounds were identified by searching libraries or databases for matches. With the emergence of digital mass spectral databases more sophisticated approaches were developed, such as in-silico fragmentation, [123,124,125,126,127] fragmentation trees, [128,129,114,130] and machine learning approaches [131,128,132,133,134]. ...
    ... The third category of compound identification algorithms is referred to as machine learning approaches, which are emerging as powerful property and structure inference tools in spectroscopy [137]. Figure 5 illustrates the working principle of most compound identification machine learning algorithms [131,128,132,133,134]. In the first step, a mass spectrum is mapped to a feature space represented by a so-called fingerprint vector. ...
    ... Supervised machine learning algorithms are then trained to assign fingerprints to spectra. Examples include kernel methods, such as support vector machines, [131] vector valued kernel ridge regression, [132,140,141] and multiple kernel learning support vector machines, [114,128,87,133] or a combination of deep learning and multiple kernel learning [134]. In the second step, the fingerprint vector is compared to the molecular fingerprints of compounds in compound databases. ...
    Preprint
    Aerosols found in the atmosphere affect the climate and worsen air quality. To mitigate these adverse impacts, aerosol formation and aerosol chemistry in the atmosphere need to be better mapped out and understood. Currently, mass spectrometry is the single most important analytical technique in atmospheric chemistry and is used to track and identify compounds and processes. Vast amounts of data are collected in each measurement of current time-of-flight and orbitrap mass spectrometers using modern rapid data acquisition practices. However, compound identification remains a major bottleneck during data analysis due to a lack of reference libraries and analysis tools. Data-driven compound identification approaches could alleviate the problem, yet remain rare to non-existent in atmospheric science. In this perspective, we review the current state of data-driven compound identification with mass spectrometry in atmospheric science, and discuss current challenges and possible future steps towards a digital mass spectrometry era in atmospheric science.
    ... Although promising, the approach requires measuring one sample with three sets of chromatographic conditions, as well as utilizing both positive and negative ESI mode. MS2 spectra carry structurally relevant information about functional groups [45,46], which further provide information about the compounds' polarity as well as acid−base properties. A machine learning model, MS2Tox, was recently developed to predict toxicity (LC50 values) of unidentified compounds based on structural fingerprints calculated from the MS2 spectrum. ...
    Article
    Nontarget analysis by liquid chromatography-high-resolution mass spectrometry (LC-HRMS) is now widely used to detect pollutants in the environment. Shifting away from targeted methods has led to detection of previously unseen chemicals, and assessing the risk posed by these newly detected chemicals is an important challenge. Assessing exposure and toxicity of chemicals detected with nontarget HRMS is highly dependent on the knowledge of the structure of the chemical. However, the majority of features detected in nontarget screening remain unidentified and therefore the risk assessment with conventional tools is hampered. Here, we developed MS2Quant, a machine learning model that enables prediction of concentration from fragmentation (MS2) spectra of detected, but unidentified chemicals. MS2Quant is an xgbTree algorithm-based regression model developed using ionization efficiency data for 1191 unique chemicals that spans 8 orders of magnitude. The ionization efficiency values are predicted from structural fingerprints that can be computed from the SMILES notation of the identified chemicals or from MS2 spectra of unidentified chemicals using SIRIUS+CSI:FingerID software. The root mean square errors of the training and test sets were 0.55 (3.5×) and 0.80 (6.3×) log-units, respectively. In comparison, ionization efficiency prediction approaches that depend on assigning an unequivocal structure typically yield errors from 2× to 6×. The MS2Quant quantification model was validated on a set of 39 environmental pollutants and resulted in a mean prediction error of 7.4×, a geometric mean of 4.5×, and a median of 4.0×. For comparison, a model based on PaDEL descriptors that depends on unequivocal structural assignment was developed using the same dataset. The latter approach yielded a comparable mean prediction error of 9.5×, a geometric mean of 5.6×, and a median of 5.2× on the validation set chemicals when the top structural assignment was used as input. 
    This confirms that MS2Quant enables the extraction of exposure information for unidentified chemicals which, although detected, have thus far been disregarded due to a lack of accurate tools for quantification. The MS2Quant model is available as an R package on GitHub for improving discovery and monitoring of potentially hazardous environmental pollutants with nontarget screening.
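    The abstract reports errors both in log-units and as fold errors (for example 0.55 log-units alongside 3.5×). The two are linked by exponentiation if the log-units are base 10, which is an assumption here, though it is consistent with the quoted pairs. The helper name `log_units_to_fold` is hypothetical.

```python
def log_units_to_fold(error_log10: float) -> float:
    """Convert an error in log10 units into a multiplicative fold error."""
    return 10.0 ** error_log10


# Values quoted in the abstract for the training and test sets.
for label, rmse in (("training", 0.55), ("test", 0.80)):
    fold = log_units_to_fold(rmse)
    print(f"{label}: {rmse:.2f} log-units corresponds to {fold:.1f}x")
```

    Rounded to one decimal, the results reproduce the 3.5× and 6.3× figures given for the training and test sets.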
    ... Another application that takes advantage of the intersection of mass spectrometry and ML is in the understanding of metabolite chemistry [21,22]. There are also many papers utilizing ML with mass spectrometry to perform rapid screening methodologies for specific analytes of interest [23,24]. These ML methods have had powerful results and have been revolutionary in our implementation of mass spectral methods. ...
    Article
    Full-text available
    Mass spectrometry is a ubiquitous technique capable of complex chemical analysis. The fragmentation patterns that appear in mass spectrometry are an excellent target for artificial intelligence methods to automate and expedite the analysis of data to identify targets such as functional groups. To develop this approach, we trained models on electron ionization (a reproducible hard fragmentation technique) mass spectra so that not only the final model accuracies but also the reasoning behind model assignments could be evaluated. The convolutional neural network (CNN) models were trained on 2D images of the spectra using transfer learning of Inception V3, and the logistic regression models were trained using array-based data and Scikit Learn implementation in Python. Our training dataset consisted of 21,166 mass spectra from the United States' National Institute of Standards and Technology (NIST) Webbook. The data was used to train models to identify functional groups, both specific (e.g., amines, esters) and generalized classifications (aromatics, oxygen-containing functional groups, and nitrogen-containing functional groups). We found that the highest final accuracies on identifying new data were observed using logistic regression rather than transfer learning on CNN models. It was also determined that the mass range most beneficial for functional group analysis is 0-100 m/z. We also found success in correctly identifying functional groups of example molecules selected from both the NIST database and experimental data. Beyond functional group analysis, we also have developed a methodology to identify impactful fragments for the accurate detection of the models' targets. The results demonstrate a potential pathway for analyzing and screening substantial amounts of mass spectral data.
    ... Existing bioassay systems can inspect the chemical content in the sweat of finger marks; however, this approach requires the samples to be treated with costly reagent kits that destroy metabolites. Active research is exploring a contactless manner with which to acquire the sweat metabolites from fingertips via hyperspectral imaging (HSI) [31,32]. HSI has gained a lot of interest in various fields, including agriculture and medical research [33][34][35]. ...
    Article
    Full-text available
    AI-empowered sweat metabolite analysis is an emerging and open research area with great potential to add a third category to biometrics: chemical. Current biometrics use two types of information to identify humans: physical (e.g., face, eyes) and behavioral (e.g., gait, typing). Sweat offers a promising solution for enriching human identity with more discerning characteristics to overcome the limitations of current technologies (e.g., demographic differential and vulnerability to spoof attacks). The analysis of a biometric trait’s chemical properties holds potential for providing a meticulous perspective on an individual. This not only changes the taxonomy for biometrics, but also lays a foundation for more accurate and secure next-generation biometric systems. This paper discusses existing evidence about the potential held by sweat components in representing the identity of a person. We also highlight emerging methodologies and applications pertaining to sweat analysis and guide the scientific community towards transformative future research directions to design AI-empowered systems of the next generation.
    ... Over the last ten years, several new approaches have been reported for MS prediction of small molecules that rely on established computational methods such as combinatorial optimization (MetFrag [152], MetFusion [153], MAGMa [154], MIDAS [155], and FT-BLAST [156]) and machine learning (ISIS [157], FingerID [158], CFM-ID [159], and CSI:FingerID [146]) techniques. The emergence of new tools for the prediction of spectral data enabled the development of advanced MS-based dereplication methodologies that clearly translated into a significant improvement in the process of drug discovery from natural sources, including marine biosources. ...
    Article
    Full-text available
    Natural Products (NP) are essential for the discovery of novel drugs and products for numerous biotechnological applications. The NP discovery process is expensive and time-consuming, having as major hurdles dereplication (early identification of known compounds) and structure elucidation, particularly the determination of the absolute configuration of metabolites with stereogenic centers. This review comprehensively focuses on recent technological and instrumental advances, highlighting the development of methods that alleviate these obstacles, paving the way for accelerating NP discovery towards biotechnological applications. Herein, we emphasize the most innovative high-throughput tools and methods for advancing bioactivity screening, NP chemical analysis, dereplication, metabolite profiling, metabolomics, genome sequencing and/or genomics approaches, databases, bioinformatics, chemoinformatics, and three-dimensional NP structure elucidation.
    ... Tandem mass spectrometry (MS/MS) is widely used, providing mass fragmentation patterns, which are key to the structural elucidation of small molecules, annotating and identifying known and unknown ions in untargeted metabolomics. MS/MS fragment ions are relative to substructures and can be used in the dereplication of natural products (Allen et al., 2015; Dührkop et al., 2021; Heinonen et al., 2012; Yang et al., 2013). However, due to the inherent complexity and structural diversity of the plant metabolome, it remains challenging to annotate most of the chemical signatures detected by untargeted MS analyses. ...
    Article
    Full-text available
    Introduction: Molecular networking (MN) has emerged as a key strategy to organize and annotate untargeted tandem mass spectrometry (MS/MS) data generated using either data-independent or data-dependent acquisition (DIA or DDA). The latter presents a time-efficient approach where full scan (MS¹) and MS² spectra are obtained with shorter cycle times. However, there are limitations related to DDA parameters, some of which are (i) intensity threshold and (ii) collision energy. The former determines ion prioritization for fragmentation, and the latter defines the fragmentation of selected ions. These DDA parameters inevitably determine the coverage and quality of spectral data, which would affect the outputs of MN methods. Objectives: This study assessed the extent to which the quality of the tandem spectral data relates to MN topology and subsequent implications in the annotation of metabolites and chemical classification relative to the different DDA parameters employed. Methods: Herein, characterising the metabolome of Momordica cardiospermoides plants, we employ classical MN performance indicators to investigate the effects of collision energies and intensity thresholds on the topology of generated MN and propagated annotations. Results: We demonstrated that the lowest predefined intensity thresholds and collision energies result in comprehensive molecular networks. Comparatively, higher intensity thresholds and collision energies resulted in the acquisition of fewer MS² spectra, subsequently fewer nodes, and a limited exploration of the metabolome through MN. Conclusion: Contributing to ongoing efforts and conversations on improving DDA strategies, this study proposes a framework in which multiple DDA parameters are utilized to increase the coverage of ions acquired and improve the global coverage of MN, propagated annotations, and the chemical classification performed.
    ... To this end, ML has been employed in recent studies for the prediction of metabolite fingerprints. FingerID, a classical method, has been proposed to predict corresponding fingerprints from a mass spectrometry (MS) set with supervised ML [57]. A support vector machine (SVM) selects fingerprints with integral mass and probability product kernels. ...
    Article
    Full-text available
    Optimizing the metabolic pathways of microbial cell factories is essential for establishing viable biotechnological production processes. However, due to the limited understanding of the complex setup of cellular machinery, building efficient microbial cell factories remains tedious and time-consuming. Machine learning (ML), a powerful tool capable of identifying patterns within large datasets, has been used to analyze biological datasets generated using various high-throughput technologies to build data-driven models for complex bioprocesses. In addition, ML can also be integrated with Design–Build–Test–Learn to accelerate development. This review focuses on recent ML applications in genome-scale metabolic model construction, multistep pathway optimization, rate-limiting enzyme engineering, and gene regulatory element designing. In addition, we have discussed some limitations of these methods as well as potential solutions.
    ... files. The .mgf files were exported to SIRIUS 5 for de novo molecular formula annotation using SIRIUS [36][37][38][39] and ZODIAC [40], structure annotation using CSI:FingerID [33][41][42][43] and COSMIC [44], and chemical class prediction using CANOPUS [45,46]. ZODIAC is a network-based algorithm that employs Gibbs sampling to re-rank molecular formula annotations by SIRIUS, by considering shared fragment ions and losses between fragmentation trees in complete LC-MS/MS datasets [40]. ...
    Article
    Full-text available
    Morphological characteristics of Piper rubro-venosum hort. ex Rodigas bear a close resemblance to a plant identified as Piper crocatum Ruiz & Pav. in literature. Hence, this study aimed to investigate whether both names describe the same species using data-dependent acquisition (DDA) LC-MS/MS analysis of methanol leaf extracts of P. rubro-venosum in positive and negative electrospray ionization modes. The data were analyzed using two computational mass spectrometry methods: spectral libraries search implemented in Global Natural Product Social (GNPS) molecular networking and molecular structure databases search implemented in SIRIUS. Classical molecular networking implied that the metabolites giving rise to two features with the highest intensities in the positive ionization chromatograms had distinct MS/MS spectra. De novo molecular formula annotations and machine learning predictions in SIRIUS suggested that both features were sodiated precursor ions of neolignans. Based on the accurate mass of the precursor ions, the two features were annotated as crocatin A and B, which are bicyclooctanoid neolignans previously isolated in relatively large amounts from P. crocatum leaves. Based on the findings, it can be concluded that P. crocatum and P. rubro-venosum could be two names in the literature used to refer to one species of Piper.