Workflow of de novo peptide sequencing.

Source publication
Article
Tandem mass spectrometry (MS/MS)-based de novo peptide sequencing is a powerful method for high-throughput protein analysis. However, the explosive growth of MS/MS spectra datasets inevitably and exponentially raises the computational demands of existing de novo peptide sequencing methods, an issue that urgently needs to be solved in computa...

Contexts in source publication

Context 1
... novo peptide sequencing aims to deduce an amino acid sequence from an MS/MS spectrum without the use of a protein sequence database. Figure 1 shows the processing flow of MS/MS spectra analysis using de novo sequencing methods, which mainly includes three key parts: 1) Experimental spectra generation: First, the mixed proteins are digested into mixed peptides by enzymes. The peptides are then fragmented and ionized (e.g., by higher-energy collisional dissociation (HCD) [20] or collision-induced dissociation (CID) [21]) in liquid chromatography tandem mass spectrometry (LC-MS/MS). ...
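
To make the fragmentation step concrete: a de novo algorithm interprets the peaks of a spectrum as prefix (b-) and suffix (y-) fragment ions, and searches for an amino acid sequence whose ion ladders match the observed peaks. Below is a minimal C sketch, not taken from the paper, that computes the singly charged b-/y-ion ladders for an arbitrary example peptide; the residue, proton, and water masses are standard monoisotopic values.

```c
#include <stdio.h>
#include <string.h>

/* Standard monoisotopic residue masses (Da) for a few amino acids. */
static double residue_mass(char aa) {
    switch (aa) {
        case 'G': return 57.02146;
        case 'A': return 71.03711;
        case 'S': return 87.03203;
        case 'V': return 99.06841;
        case 'L': return 113.08406;
        case 'K': return 128.09496;
        case 'E': return 129.04259;
        case 'R': return 156.10111;
        default:  return 0.0; /* extend the table as needed */
    }
}

#define PROTON 1.00728  /* mass of a proton (Da) */
#define WATER  18.01056 /* mass of H2O (Da) */

int main(void) {
    const char *peptide = "LGEVK"; /* arbitrary example peptide */
    size_t n = strlen(peptide);

    double prefix = 0.0, total = 0.0;
    for (size_t i = 0; i < n; i++) total += residue_mass(peptide[i]);

    /* Singly charged ions: b_i = (first i residues) + proton,
       y_(n-i) = (remaining residues) + water + proton. */
    for (size_t i = 1; i < n; i++) {
        prefix += residue_mass(peptide[i - 1]);
        printf("b%zu = %8.4f   y%zu = %8.4f\n",
               i, prefix + PROTON,
               n - i, total - prefix + WATER + PROTON);
    }
    return 0;
}
```

A real de novo engine runs this computation in reverse, scoring candidate ladders against the observed peaks.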
Context 2
... employing the dynamic parallel mode, the task distribution among the CPEs must be handled carefully. In our implementation, we adopt the dynamic parallel programming model shown in Figure 10. The MPE and CPEs in the SW26010 serve different functions during the computation. ...
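
The excerpt does not show the dispatch code, but the dynamic model it describes can be sketched generically: workers (standing in for CPEs) pull the next task index from a shared atomic counter, so faster workers naturally claim more tasks. This is a portable pthreads/C11 sketch, not the Sunway athread implementation; the task and worker counts are arbitrary.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_TASKS   1000 /* e.g., spectra to score (arbitrary) */
#define NUM_WORKERS 8    /* stand-ins for the CPEs (arbitrary) */

static atomic_int next_task = 0;

/* Placeholder for scoring one spectrum against candidate sequences. */
static void process_spectrum(int id) { (void)id; }

static void *worker(void *arg) {
    (void)arg;
    long done = 0;
    for (;;) {
        /* Dynamic distribution: claim the next unprocessed task. */
        int t = atomic_fetch_add(&next_task, 1);
        if (t >= NUM_TASKS) break;
        process_spectrum(t);
        done++;
    }
    return (void *)(intptr_t)done;
}

int main(void) {
    pthread_t tid[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++) {
        void *done;
        pthread_join(tid[i], &done);
        printf("worker %d processed %ld tasks\n", i, (long)(intptr_t)done);
    }
    return 0;
}
```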
Context 3
... parallelization at this level is implemented through a dedicated acceleration thread library called athread. We further employ an optional asynchronous task-loading strategy, as shown in Figure 11. First, we designate a list of processes that only load reads. ...
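
The asynchronous loading strategy can be sketched as a producer/consumer pattern: dedicated loader threads read input into a bounded queue while compute threads drain it, so I/O overlaps computation. Again a generic pthreads sketch under assumed constants (queue size, record type), not the paper's code:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_CAP 64   /* arbitrary bounded-queue capacity */
#define NUM_READS 1000 /* arbitrary number of input records */

typedef struct { int id; /* ... spectrum payload ... */ } read_t;

static read_t queue[QUEUE_CAP];
static int head = 0, tail = 0, count = 0;
static bool done_loading = false;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

/* Loader: only reads input, never computes (the "load-only" role). */
static void *loader(void *arg) {
    (void)arg;
    for (int i = 0; i < NUM_READS; i++) {
        read_t r = { .id = i }; /* stand-in for reading one record */
        pthread_mutex_lock(&mtx);
        while (count == QUEUE_CAP) pthread_cond_wait(&not_full, &mtx);
        queue[tail] = r; tail = (tail + 1) % QUEUE_CAP; count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&mtx);
    }
    pthread_mutex_lock(&mtx);
    done_loading = true;
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Consumer: computes on records as they become available. */
static void *consumer(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (count == 0 && !done_loading)
            pthread_cond_wait(&not_empty, &mtx);
        if (count == 0 && done_loading) {
            pthread_mutex_unlock(&mtx);
            return NULL;
        }
        read_t r = queue[head]; head = (head + 1) % QUEUE_CAP; count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&mtx);
        (void)r; /* ... score record r here ... */
    }
}

int main(void) {
    pthread_t l, c;
    pthread_create(&l, NULL, loader, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(l, NULL); pthread_join(c, NULL);
    puts("done");
    return 0;
}
```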
Context 4
... our implementation, the memory-access overhead of the double-buffering mechanism is divided into two parts: the unsheltered part P, which includes the cost of transmitting the data in the first and last rounds, and the overlapped part P * (N - 1). Eq. (1) shows the speedup obtained with the double-buffering mechanism: ...
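
The equation itself is cut off in the excerpt, but the standard double-buffering model behind such an expression can be written down. Assume N rounds with compute cost C and transfer cost P per round; with double buffering, the transfers of the middle rounds are hidden under computation (provided P <= C), leaving only the unsheltered transfers exposed. A plausible form of the speedup under these assumptions, not necessarily identical to the paper's Eq. (1), is

$$ S \;=\; \frac{T_{\text{unbuffered}}}{T_{\text{buffered}}} \;=\; \frac{N\,(C+P)}{N\,C+P} \;\xrightarrow{\;N\to\infty\;}\; 1+\frac{P}{C}. $$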
Context 5
... in Exp.3, SWPepNovo spent 385 seconds in total, remarkably lower than PepNovo+ (8,967 seconds) and PEAKS (4,521 seconds). As Figure 13 shows, SWPepNovo achieves up to a 28 times speedup on a SW26010 over PepNovo+. This validates that the parallel PSM algorithm achieves high parallel efficiency and a high speedup ratio on a single SW26010 many-core processor. ...
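
A quick arithmetic check on the quoted Exp.3 numbers:

$$ \frac{8967\ \text{s}}{385\ \text{s}} \approx 23.3\times, \qquad \frac{4521\ \text{s}}{385\ \text{s}} \approx 11.7\times, $$

so the 28 times figure presumably refers to the best case across all experiments rather than to Exp.3 alone.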
Context 6
... order to evaluate the performance of multi-node acceleration, we implemented SWPepNovo on a SW26010 cluster. Figure 14 illustrates the impact of the number of nodes in the SW26010 cluster on the performance of SWPepNovo, where the X axis represents the number of SW26010 processors in the cluster and the Y axis represents the speedup. ...
Context 7
... impact of the number of nodes in the SW26010 cluster on the performance of SWPepNovo is illustrated in Figure 14, where the X axis represents the number of SW26010 processors in the cluster and the Y axis represents the speedup. In the three-node experiment, we obtained a 47 times speedup on Dataset.1, a 51 times speedup on Dataset.2, and a 52 times speedup on Dataset.3. ...
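
The excerpt does not show how spectra are distributed across nodes; a common realization, and a reasonable guess at the structure, is a block partition of spectrum indices across MPI ranks, with each rank then running the many-core PSM kernel locally. A minimal MPI sketch (the partition arithmetic is real; the local kernel is a placeholder):

```c
#include <mpi.h>
#include <stdio.h>

#define TOTAL_SPECTRA 2644664 /* e.g., the largest dataset above */

/* Placeholder for the per-node many-core de novo kernel. */
static void sequence_range(long lo, long hi) { (void)lo; (void)hi; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block-partition spectrum indices across nodes. */
    long per = TOTAL_SPECTRA / size;
    long rem = TOTAL_SPECTRA % size;
    long lo  = rank * per + (rank < rem ? rank : rem);
    long hi  = lo + per + (rank < rem ? 1 : 0);

    printf("rank %d/%d handles spectra [%ld, %ld)\n", rank, size, lo, hi);
    sequence_range(lo, hi); /* each node runs its local kernel here */

    MPI_Barrier(MPI_COMM_WORLD); /* synchronize before gathering results */
    MPI_Finalize();
    return 0;
}
```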
Context 8
... and Dataset.3. Figure 15 shows the execution time of SWPepNovo as the dataset size increases from 0.51 GB (120,212 spectra) up to 11.22 GB (2,644,664 spectra). With the 11.22 GB dataset, SWPepNovo took only 78.5 minutes. ...
Context 9
... with the 11.22 GB dataset, SWPepNovo took only 78.5 minutes. From Figure 15 we can also see that SWPepNovo can de novo sequence extremely large spectra datasets, with execution time increasing linearly in dataset size. Meanwhile, the validity of the results is demonstrated by comparing the SWPepNovo output with that of PepNovo+. ...
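
The quoted end-to-end figures imply a throughput of roughly

$$ \frac{2{,}644{,}664\ \text{spectra}}{78.5\ \text{min}} \approx 33{,}690\ \text{spectra/min} \approx 562\ \text{spectra/s}, $$

which is consistent with the claimed linear growth of execution time in dataset size.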

Similar publications

Article
Efficient and distributed adaptive mesh construction and editing pose several challenges, including selecting the appropriate distributed data structure, choosing strategies for distributing computational load, and managing inter-processor communication. Distributed Combinatorial Maps permit the representation and editing of distributed 3D meshes....
Article
While advances in computer vision and computing technology have facilitated advanced driver-assistance applications, multi-task systems remain highly demanding to operate at high speed on resource-constrained devices. Our study addresses this challenge by proposing a real-time driver assistance solution specifically developed for a single...
Article
This study takes a network perspective to examine the spatial spillover effects of haze pollution in the Cheng-Yu urban agglomeration, which is the fourth largest urban agglomeration and a comprehensive demonstration zone of new urbanization in China. Firstly, we use the Granger causality test to construct a haze pollution spatial association network, and the...
Conference Paper
In this paper, a new system is developed for autonomous robots to detect and track multiple objects in uncontrolled environments and in real time, with the aim of decreasing the required processing time and obtaining better error rates than current systems. To achieve this, a novel multi object tracking algorithm is introduced, implemented and enhanced...
Preprint
This paper focuses on providing a characterization of the runtime performance of state-of-the-art implementations of KGE algorithms, in terms of memory footprint and execution time. Despite the rapidly growing interest in KGE methods, little attention has so far been devoted to their comparison and evaluation; in particular, previous work m...

Citations

... For evaluation, we downloaded all the results [22,17,25,9,31,20,5,23,24,28,6,10] that have been reported to date. This information included the database size, the number of spectra, serial and parallel times, and the speedups. ...
Article
Mass spectrometry (MS)-based omics data analysis requires significant time and resources. To date, few parallel algorithms have been proposed for deducing peptides from mass spectrometry-based data. However, these parallel algorithms were designed and developed when the amount of data that needed to be processed was smaller in scale. In this paper, we prove that the communication bound reached by the existing parallel algorithms is Ω(mn + 2rq/p), where m and n are the dimensions of the theoretical database matrix, q and r are the dimensions of the spectra, and p is the number of processors. We further prove that a communication-optimal strategy with fast memory M = mn + 2qr/p can achieve Ω(2mnq/p), a bound not achieved by any existing parallel proteomics algorithm to date. To validate our claim, we performed a meta-analysis of published parallel algorithms and their performance results. We show that sub-optimal speedups with an increasing number of processors are a direct consequence of not achieving the communication lower bounds. We further validate our claim by performing experiments which demonstrate the communication bounds that are proved in this paper. Consequently, we assert that a next generation of provable, demonstrably superior parallel algorithms is urgently needed for MS-based large systems-biology studies, especially for meta-proteomics, proteogenomics, microbiome, and proteomics for non-model organisms. Our hope is that this paper will excite the parallel computing community to further investigate parallel algorithms for highly influential MS-based omics problems.
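
The abstract's core argument, that unmet communication lower bounds cap achievable speedup, follows from a simple model (the symbols here are ours, not the paper's). If each of the p processors must move at least W(p) words at per-word cost β, and the computation itself parallelizes perfectly from serial time T₁, then

$$ T(p) \;\ge\; \frac{T_1}{p} + \beta\,W(p), \qquad S(p) \;=\; \frac{T_1}{T(p)} \;\le\; \frac{T_1}{T_1/p + \beta\,W(p)}, $$

so once the communication term dominates T₁/p, adding processors yields exactly the sub-optimal speedups the authors observe.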
Article
Exponential advances in computational power have fueled advances in many disciplines, and biology is no exception. High-Performance Computing (HPC) is gaining traction as one of the essential tools in scientific research. Further advances to exascale capabilities will necessitate more energy-efficient hardware. In this article, we present our efforts to improve the efficiency of genome assembly on ARM-based HPC systems. We use vectorization to optimize the popular genome assembly pipeline of minimap2, miniasm, and Racon. We compare different implementations using the Scalable Vector Extension (SVE) instruction set architecture and evaluate their performance in different aspects. Additionally, we compare the performance of autovectorization to hand-tuned code with intrinsics. Lastly, we present the design of a CPU dispatcher included in the Racon consensus module that enables the automatic selection of the fastest instruction set supported by the utilized CPU. Our findings provide a promising direction for further optimization of genome assembly on ARM-based HPC systems.
Article
Peptides are a unique class of biomolecules for pharmaceutics and industry, given structural features that lend themselves to many approaches. Although their advantages are known, they suffer from limitations that need to be overcome, among them the flexibility of peptidic conformations and susceptibility to proteolytic degradation. Research has been in a constant endeavor to provide solutions. The discovery of cyclic peptides in plants opened the door for new insights into peptide-based applications. These peptides display high stability under physical and chemical conditions and possess a wide range of biological activities. In addition, cyclic peptides show enhanced activities compared to linear peptides. Thus, the cyclization of non-cyclic peptides can be of great use in eliminating these issues and improving peptide capabilities. Inspired by naturally occurring cyclic peptides, many approaches for synthetic cyclization have been proposed. The current review provides an overall discussion of the available methods for cyclization, applications, and characterization techniques. It offers a source for colleagues newly exposed to the subject and on the verge of entering the field of cyclic peptides, providing an initial step that covers the essential points to be considered around peptide cyclization.
Chapter
Mass spectrometry (MS)-based omics data analysis requires substantial time and resources, which has necessitated high-performance computing (HPC) methods. The few parallel algorithms that have been proposed, designed, and developed date from a time when the amount of data to be processed was smaller in scale, i.e., when only a few PTMs were of interest and a shorter theoretical database sufficed for the computations.
Article
Proteomics, the large-scale study of all proteins of an organism or system, is a powerful tool for studying biological systems. It can provide a holistic view of the physiological and biochemical states of given samples through the identification and quantification of large numbers of peptides and proteins. In forensic science, proteomics can be used as a confirmatory and orthogonal technique alongside well-established genomic analyses. Proteomics is highly valuable in cases where nucleic acids are absent or degraded, such as hair and bone samples. It can be used to identify body fluids, ethnic group, gender, and individuals, and to estimate post-mortem interval using bone, muscle, and decomposition fluid samples. Compared to genomic analysis, proteomics can provide a better global picture of a sample. It has been used in forensic science for a wide range of sample types and applications. In this review, we briefly introduce proteomic methods, including sample preparation techniques, data acquisition using liquid chromatography-tandem mass spectrometry, and data analysis using database search, spectral library search, and de novo sequencing. We also summarize applications of proteomics in forensic science over the past decade, with a special focus on human samples, including hair, bone, body fluids, fingernail, muscle, brain, and fingermark, and address the challenges, considerations, and future developments of forensic proteomics.
Article
Biomanufacturing is a key component of biotechnology that uses biological systems to produce bioproducts of commercial relevance, which are of great interest to the energy, material, pharmaceutical, food, and agriculture industries. Biotechnology-based approaches, such as synthetic biology and metabolic engineering are heavily reliant on “omics” driven systems biology to characterize and understand metabolic networks. Knowledge gained from systems biology experiments aid the development of synthetic biology tools and the advancement of metabolic engineering studies toward establishing robust industrial biomanufacturing platforms. In this review, we discuss recent advances in “omics” technologies, compare the pros and cons of the different “omics” technologies, and discuss the necessary requirements for carrying out multi-omics experiments. We highlight the influence of “omics” technologies on the production of biofuels and bioproducts by metabolic engineering. Finally, we discuss the application of “omics” technologies to agricultural and food biotechnology, and review the impact of “omics” on current COVID-19 research.
Article
Introduction: Proteins are crucial for every cellular activity, and unraveling their sequence and structure is a crucial step toward fully understanding their biology. Early methods of protein sequencing were mainly based on enzymatic or chemical degradation of peptide chains. With the completion of the Human Genome Project and the expansion of the information available for each protein, various databases containing this sequence information were formed. Areas covered: De novo protein sequencing, shotgun proteomics, and other mass-spectrometric techniques, along with the various software tools currently available for proteogenomic analysis. Emphasis is placed on methods for de novo sequencing, together with the potential and shortcomings of using databases for the interpretation of protein sequence data. Expert opinion: As mass-spectrometry sequencing performance improves with better software and hardware optimizations, combined with user-friendly interfaces, de novo protein sequencing becomes imperative in shotgun proteomic studies. Issues regarding unknown or mutated peptide sequences, as well as unexpected post-translational modifications (PTMs), and their identification through false-discovery-rate searches using the target/decoy strategy need to be addressed. Ideally, de novo sequencing should become integrated into standard proteomic workflows as an add-on to conventional database search engines, which would then be able to provide improved identification.
Article
Recent advances in mass spectrometry (MS)-based proteomics have enabled tremendous progress in the understanding of cellular mechanisms, disease progression, and the relationship between genotype and phenotype. Though many popular bioinformatics methods in proteomics are derived from other omics studies, novel analysis strategies are required to deal with the unique characteristics of proteomics data. In this review, we discuss the current developments in the bioinformatics methods used in proteomics and how they facilitate the mechanistic understanding of biological processes. We first introduce bioinformatics software and tools designed for mass spectrometry-based protein identification and quantification, and then we review the different statistical and machine learning methods that have been developed to perform comprehensive analysis in proteomics studies. We conclude with a discussion of how quantitative protein data can be used to reconstruct protein interactions and signaling networks.