Comparison methodology flowchart.

Source publication

PANNOTATOR: an automated tool for annotation of pan-genomes

Article

Aug 2013

Due to next-generation sequence technologies, sequencing of bacterial genomes is no longer one of the main bottlenecks in bacterial research and the number of new genomes deposited in public databases continues to increase at an accelerating rate. Among these new genomes, several belong to the same species and were generated for pan-genomic studies...

Context 1

... the last few years, sequencing technologies known as next-generation sequencing had a major impact on the availability of genomes in public databases (Metzker, 2010). As whole-genome sequencing became faster and less expensive, new comparative analyses became possible, such as pan-genomic studies (Tettelin et al., 2008). The study of a single genome is not enough to determine the pool of genes present in a bacterial species or to explain the variability that determines, for instance, the pathogenicity of that species. Therefore, pan-genomic studies aim to characterize the complete genetic repertoire of a species through analysis of multiple strain genomes (Medini et al., 2005).

The pan-genomic approach presents challenges associated with managing the assembly and the automatic and manual annotation of the many strain genomes related to a project. To address this issue, we developed the PANNOTATOR workbench. This tool is composed of a relational database, interactive tools, several SQL reports, and a web-based interface. The workbench was initially developed as an in-house solution to manage the Corynebacterium pseudotuberculosis pan-genome project (Santos et al., 2012). Therefore, the relational schema was named the C. pseudotuberculosis Database (CpDB). Although it was initially conceived for C. pseudotuberculosis, it was used for other bacterial species as well (Carneiro et al., 2012).

A parser to format entries for the PANNOTATOR workbench was also developed; it interprets genome annotations in EMBL and GenBank formats and converts them to our database format. Given a stored genome, the CpDB reports can export files in EMBL format, which is accepted by the Artemis program (Rutherford et al., 2000). PANNOTATOR’s main feature is its ability to produce an automatic annotation based on a manually curated genome.
This workbench was conceived to reduce the workload required to generate reports and corrections of the various annotations during a pan-genome project. The idea was to transfer the annotation of gene names and functional products from a curated genome, using protein sequence alignment results as the linkage criterion. The cut-off parameters depend on how similar the protein products of the curated genome (source) are to those of the new genome (destiny); these alignment-quality cut-offs control how much of the curated annotation is incorporated into the new genome’s annotation. The parameters are the percentage of protein identity and the total extent of the alignment between the amino acid sequences. For instance, at the start of the C. pseudotuberculosis pan-genome project (Ruiz et al., 2011), a threshold of 95% amino acid identity and sequence alignment was used, which is sufficient to correctly link most of the CDS among different strains. However, for the first automatic annotation of a C. pseudotuberculosis genome, it was necessary to use the genome of C. diphtheriae (Cerdeño-Tárraga et al., 2003), the phylogenetically closest organism available, and a threshold of 65% protein identity and sequence alignment. When using a 95% threshold level, only the annotation of 4 ribosomal units was incorporated into the first C. pseudotuberculosis genome. Another useful report created by PANNOTATOR is a putative list of frame-shifts based on possible gene fragments from the destiny genome compared to the source genome.

The genomes compared were obtained from the NCBI website under the following accession numbers: CP002251 (C. pseudotuberculosis str. I19); NC_016932 (C. pseudotuberculosis str. 316); NC_002935 (C. diphtheriae str. NCTC13129); NC_000913 (Escherichia coli str. K-12 substr. MG1655); NC_010473 (E. coli str. K-12 substr. DH10B); NC_012759 (E. coli str. BW2952); and NC_011353 (E. coli O157:H7 substr. EC4115).

The PANNOTATOR pipeline (Figure 1) was implemented on the Ubuntu 12.04 operating system. The Apache server was used to process web requests, and the web interface was developed using PHP (Hypertext Preprocessor). A number of built-in Linux tools, such as “sed”, were used together with BioPerl (for sequence file format conversions and feature extraction), BLAST, and the Postgres database management system. PANNOTATOR mainly automates the process of annotation. The tool performs all file conversions and modifications required by the different software components. The entire process starts with 3 inputs from the user: a DNA strand, its gene prediction (destiny), and the curated genome (source). All predicted genes are compared to those of the curated genome. The gene name and product are assigned to the new genome based on BLAST similarity with the genes in the curated genome (Figure 1).

Source and destiny genomes are evaluated using our in-house tool called parseEMBLtoCpDB. This parser is part of our annotation workbench (sourceforge.net/projects/cpdb) and is responsible for formatting data to feed the PANNOTATOR relational database schema. The destiny gene prediction is kept in a table named ‘gene’, which uses the locus tag and organism fields as discriminants. The source genome, in turn, is kept in the ‘curated’ table, which is similar to the ‘gene’ table. After insertion of all annotation versions from each genome, several comparisons are performed between the source and the destiny genomes, using the cut-off parameters selected by the user. The features considered in the analyses were gene names and products. The PANNOTATOR final result, an automated annotation of the destiny genome mediated by the source genome, exists only in computer memory during an outer join SQL command, whose output is written to an EMBL file.
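The outer join with user-selected cut-offs can be illustrated with a minimal sketch. It is not the CpDB schema: the column names, the `hit` table holding alignment results, and the function `transfer_annotation` are assumptions made for illustration, SQLite stands in for Postgres, and the sketch assumes at most one (best) hit per destiny gene.

```python
import sqlite3

def transfer_annotation(genes, curated, hits, min_identity, min_coverage):
    """Left outer join destiny genes to curated genes through alignment hits.

    genes:   (locus_tag, organism) rows for the destiny genome
    curated: (locus_tag, organism, name, product) rows for the source genome
    hits:    (destiny_tag, source_tag, identity, coverage) rows, one best hit per gene
    """
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE gene    (locus_tag TEXT, organism TEXT);
        CREATE TABLE curated (locus_tag TEXT, organism TEXT, name TEXT, product TEXT);
        CREATE TABLE hit     (destiny_tag TEXT, source_tag TEXT, identity REAL, coverage REAL);
    """)
    con.executemany("INSERT INTO gene VALUES (?, ?)", genes)
    con.executemany("INSERT INTO curated VALUES (?, ?, ?, ?)", curated)
    con.executemany("INSERT INTO hit VALUES (?, ?, ?, ?)", hits)
    # The cut-offs sit in the ON clause, not in WHERE, so destiny genes
    # without an acceptable hit are kept with NULL name and product.
    rows = con.execute("""
        SELECT g.locus_tag, c.name, c.product, h.identity
        FROM gene AS g
        LEFT OUTER JOIN hit AS h
               ON h.destiny_tag = g.locus_tag
              AND h.identity >= ? AND h.coverage >= ?
        LEFT OUTER JOIN curated AS c
               ON c.locus_tag = h.source_tag
        ORDER BY g.locus_tag
    """, (min_identity, min_coverage)).fetchall()
    con.close()
    return rows

# Hypothetical example data (locus tags and hit values are invented).
GENES = [("D1", "destiny"), ("D2", "destiny"), ("D3", "destiny")]
CURATED = [("S1", "source", "dnaA", "chromosomal replication initiator protein DnaA"),
           ("S2", "source", "recA", "recombinase RecA")]
HITS = [("D1", "S1", 100.0, 100.0), ("D2", "S2", 72.0, 90.0)]
```

With these data, `transfer_annotation(GENES, CURATED, HITS, 95, 95)` annotates only `D1`, while a 65% cut-off also annotates `D2`; `D3` appears in both results with empty name and product.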
The outer join is essential in this situation because, even when no gene from the source reaches an acceptable similarity level with a given destiny gene, all predicted genes from the destiny genome must still be represented. PANNOTATOR uses the following color code for gene annotation: green for genes with a strong match (100%) to the source genome, yellow for matches between the user-specified cut-off value and 100%, and red otherwise. Genes colored red have no gene name or function linked. Furthermore, two kinds of RNA predictions (tRNA and rRNA) are automatically incorporated into the output file. Genes overlapping RNA predictions are removed from the destiny genome, anticipating GenBank requirements in case the genome is later deposited.

The DNA strands of the C. pseudotuberculosis str. I19 and E. coli str. K-12 substr. MG1655 genomes were used to evaluate the tool and were submitted to the functional automated annotation tools BASys (Van Domselaar et al., 2005), RAST (Aziz et al., 2008), and PANNOTATOR (Figure 2). PANNOTATOR annotation transfer was performed using different strains and species, under different cut-off thresholds. For the C. pseudotuberculosis comparison, strain 316 (Ramos et al., 2012) and the closely related species C. diphtheriae strain NCTC13129 (Cerdeño-Tárraga et al., 2003) were used as curated genomes; for the E. coli comparison, strains BW2952, K-12 (substr. DH10B), and O157:H7 (substr. EC4115) were used as curated genomes.

The main challenge in performing such comparisons is that there is no common locus tag between a new gene prediction and the previous one present in the deposited version of a genome, which is taken as the correct one or gold standard. In such a situation, it is not possible to compare newly predicted features (gene name and functional product) using the locus tag alone.
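The color code and the removal of genes that overlap RNA predictions, both described earlier in this passage, can be sketched as follows. This is only an illustration under stated assumptions: the "match" is taken to be percent identity, an overlap is taken to mean any shared base regardless of strand, and the function names are hypothetical rather than taken from PANNOTATOR.

```python
def annotation_color(identity, cutoff):
    """Return the color for a destiny gene.

    identity: percent identity of the best hit, or None when there is no hit.
    cutoff:   user-specified minimum percent identity.
    """
    if identity is None or identity < cutoff:
        return "red"      # no gene name or product is transferred
    if identity >= 100.0:
        return "green"    # strong (100%) match to the source genome
    return "yellow"       # between the cut-off and 100%

def overlaps(a_start, a_end, b_start, b_end):
    """True when two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end

def drop_genes_overlapping_rna(genes, rnas):
    """Keep only the (start, end) genes that overlap no (start, end) RNA prediction."""
    return [g for g in genes
            if not any(overlaps(g[0], g[1], r[0], r[1]) for r in rnas)]
```

For example, with a 95% cut-off a hit of 97.5% identity would be colored yellow, and a gene spanning positions 150 to 300 would be dropped if a tRNA were predicted at positions 90 to 160.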
To work around the lack of a common locus tag, we took advantage of the relative conservation of stop codon predictions, using the stop codon as a unique gene identifier between the genomes tested, even when different gene predictors were used. Annotation was transferred from both closely and more distantly related curated genomes in order to evaluate the performance of the tool.

The best results were achieved for C. pseudotuberculosis (Figures 3 and 4). Transferring the annotation from a different strain (316) resulted in 98% of gene names and 76% of products correctly assigned. When using a cut-off parameter of 60% similarity, considerable incorrect information was introduced into the new genome, since this less restrictive value is more permissive to errors. Therefore, we advise users to be careful when using a cut-off parameter below 70%. The worst results compared to RAST and BASys were obtained when using C. diphtheriae to transfer the annotation. Therefore, it is not just a matter of choosing a closely related organism for transfer; it is also important for the annotation to be reliable (Richardson and Watson, 2013). It was suggested that the C. diphtheriae annotation was outdated, since it was deposited in 2003 (D’Afonseca et al., 2012).

The only challenge in which PANNOTATOR was surpassed by another tool was gene name assignment in E. coli (Figure 5). BASys had 92% of gene names correctly assigned while PANNOTATOR had, at best, 85%. For correct product assignment, PANNOTATOR outperformed the other tools, as shown in Figure 6. When transferring the annotation from a different species or distantly related strains, PANNOTATOR showed fairly comparable results compared to the other ...
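The stop-codon identifier used in this passage to match genes between a new prediction and the deposited annotation can be sketched as below. The excerpt does not say how the stop position was encoded, so the key (strand plus the coordinate at which the CDS ends), the dictionary layout, and the function names are assumptions for illustration only.

```python
def stop_key(start, end, strand):
    """Identify a CDS by its strand and the coordinate where it stops.

    Coordinates are assumed to satisfy start < end. On the forward strand
    the stop lies at the highest coordinate, on the reverse strand at the lowest.
    """
    return (strand, end if strand == "+" else start)

def name_agreement(reference, predicted):
    """Count reference genes matched by stop position and those with the same name.

    Both arguments are lists of dicts with keys start, end, strand and name.
    Returns (matched, correct).
    """
    predicted_by_stop = {
        stop_key(g["start"], g["end"], g["strand"]): g for g in predicted
    }
    matched = correct = 0
    for ref in reference:
        hit = predicted_by_stop.get(stop_key(ref["start"], ref["end"], ref["strand"]))
        if hit is None:
            continue  # a different stop was predicted, so the gene cannot be compared
        matched += 1
        if hit["name"] == ref["name"]:
            correct += 1
    return matched, correct

# Hypothetical example: the same genes predicted with different start sites.
REFERENCE = [
    {"start": 1, "end": 300, "strand": "+", "name": "dnaA"},
    {"start": 500, "end": 900, "strand": "-", "name": "recA"},
    {"start": 1000, "end": 1200, "strand": "+", "name": "gyrB"},
]
PREDICTED = [
    {"start": 31, "end": 300, "strand": "+", "name": "dnaA"},
    {"start": 500, "end": 870, "strand": "-", "name": "recX"},
    {"start": 1000, "end": 1230, "strand": "+", "name": "gyrB"},
]
```

Keying on the stop rather than on the full interval tolerates disagreement between gene predictors about the start codon, which is the more common source of variation; genes whose predicted stop differs simply drop out of the comparison.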

Improving continuous integration with similarity-based test case selection

Conference Paper

May 2018

Automated testing is an essential component of Continuous Integration (CI) and Delivery (CD), such as scheduling automated test sessions on overnight builds. That allows stakeholders to execute entire test suites and achieve exhaustive test coverage, since running all tests is often infeasible during work hours, i.e., in parallel to development act...

Halide: Decoupling algorithms from schedules for high-performance image processing

Article

Dec 2017

Writing high-performance code on modern machines requires not just locally optimizing inner loops, but globally reorganizing computations to exploit parallelism and locality---doing things such as tiling and blocking whole pipelines to fit in cache. This is especially true for image processing pipelines, where individual stages do much too little w...

Pannotator integrated with Medpipe provides immunological and subcellular location features using a microservice

Preprint

May 2024

Genome sequencing and assembly are trivial tasks nowadays. After assembling contigs and scaffolds from a genome, the subsequent step is annotation. An annotation evidencing the expected features, like rRNA, tRNA, and CDS, is a signal of the quality of our sequencing and assembly. Different techniques to obtain and reproduce DNA samples, as well as sequencing and assembly of genomes, can impact the quality of a genome's expected features. The Pannotator tool was conceived as an annotation aid focusing on the differences between an assembly and its reference genome. Some of the key features for bacterial genome annotations are the subcellular location and immunological potential of a CDS. Instead of reimplementing the prediction of these features in Pannotator, we leveraged the capabilities of our microservice to provide them. In the end, the Medpipe software was not modified, and Pannotator underwent minor changes to incorporate the subcellular location and immunological potential of all exported proteins annotated by the tool. Moreover, the Medpipe microservice is open to anyone and can be incorporated into other software, not only our Pannotator tool. The successful integration of Medpipe into Pannotator, powered by the Medpipe microservice, offers a powerful approach to advanced genomic analysis. The Medpipe microservice, built on Kotlin with the Spring Boot framework, is instrumental in the automation of Medpipe processing. It achieves this using REST endpoints for asynchronous execution of Medpipe, status retrieval, and prediction generation, which enhance the modularity and scalability of the microservice. The availability of endpoint documentation, detailed request examples, and logs makes our microservice user-friendly. 
The results of this integration demonstrate the value of the information provided by Medpipe, enriching genomic annotation with additional details, such as the density of mature epitopes (MED) and protein subcellular location classification. The Pannotator has evolved beyond basic function annotation and now provides data on immunological potential, structure, and subcellular location after being integrated with our microservice. The Medpipe microservice is available at https://github.com/santosardr/medpipe-ms.git.

A comprehensive evaluation of the potential of three next-generation short-read-based plant pan-genome construction strategies for the identification of novel non-reference sequence

Article

Mar 2024

Pan-genome studies are important for understanding plant evolution and guiding crop breeding because they capture all the genomic diversity of a species. Three short-read-based strategies for plant pan-genome construction are iterative individual, iterative pooling, and map-to-pan. Their performance differs considerably under various conditions, but comprehensive evaluations have yet to be conducted. Here, we evaluate the performance of these three pan-genome construction strategies for plants under different sequencing depths and sample sizes. We also show the influence of the length and repeat content percentage of novel sequences on the three strategies, and we compare their computational resource consumption. Our findings indicate that map-to-pan has the greatest recall but the lowest precision. In contrast, both iterative strategies have superior precision but lower recall. Sample number, novel sequence length, and the percentage of repeat content in novel sequences adversely affect the performance of all three strategies. Increased sequencing depth improves map-to-pan’s performance but does not affect the two iterative strategies. Map-to-pan demands considerably more computational resources than the two iterative strategies. Overall, the iterative strategy, especially the iterative pooling strategy, is optimal when the sequencing depth is less than 20X. Map-to-pan is preferable when the sequencing depth exceeds 20X despite its higher computational resource consumption.

Unveiling the Brazilian Kefir Microbiome: Discovery of a novel Lactobacillus kefiranofaciens (LkefirU) genome and in silico prospection of bioactive peptides with potential anti-Alzheimer properties

Preprint

Mar 2024

Background: Kefir is a complex microbial community that plays a critical role in the fermentation and production of bioactive peptides, and has health-improving properties. In this study, we employed shotgun metagenomics and peptidomics approaches to further characterize Brazilian kefir. Results: We successfully assembled the novel genome of Lactobacillus kefiranofaciens (LkefirU) and conducted a comprehensive pangenome analysis to compare it with other strains. Furthermore, we performed a peptidome analysis, revealing the presence of bioactive peptides encrypted by L. kefiranofaciens in the Brazilian kefir sample, and utilized in silico prospecting and molecular docking techniques to identify potential anti-Alzheimer peptides, targeting β-amyloid (fibril and plaque), BACE, and acetylcholinesterase. Through this analysis, we identified two peptides that show promise as compounds with anti-Alzheimer properties. Conclusions: These findings not only provide insights into the genome of L. kefiranofaciens but also serve as a promising prototype for the development of novel anti-Alzheimer compounds derived from Brazilian kefir.

A pangenome analysis of ESKAPE bacteriophages: the underrepresentation may impact machine learning models

Preprint

Feb 2024

Bacteriophages are the most prevalent biological entities in the biosphere. However, limitations in both medical relevance and sequencing technologies have led to a systematic underestimation of the genetic diversity within phages. This underrepresentation not only creates a significant gap in our understanding of phage roles across diverse biosystems but also introduces biases in computational models reliant on these data for training and testing. In this study, we focused on publicly available genomes of bacteriophages infecting high-priority ESKAPE pathogens to show the extent and impact of this underrepresentation. First, we demonstrate a stark underrepresentation of ESKAPE phage genomes within the public genome and protein databases. Next, a pangenome analysis of these ESKAPE phages reveals extensive sharing of core genes among phages infecting the same host. Furthermore, genome analyses and clustering highlight close nucleotide-level relationships among the ESKAPE phages, raising concerns about the limited diversity within current public databases. Lastly, we uncover a scarcity of unique lytic phages and phage proteins with antimicrobial activities against ESKAPE pathogens. This comprehensive analysis of the ESKAPE phages underscores the severity of underrepresentation and its potential implications. This lack of diversity in phage genomes may restrict the resurgence of phage therapy and cause biased outcomes in data-driven computational models due to incomplete and unbalanced biological datasets.

IPGA: A handy integrated prokaryotes genome and pan‐genome analysis web service

Article

Sep 2022

Pan-genomics is one of the most powerful means to study genomic variation and obtain a sketch of genes within a defined clade of species. Though there are a lot of computational tools to achieve this, an integrated framework to evaluate their performance and offer the best choice to users has never been achieved. To ease the process of large-scale prokaryotic genome analysis, we introduce Integrated Prokaryotes Genome and pan-genome Analysis (IPGA), a one-stop web service to analyze, compare, and visualize pan-genome as well as individual genomes, that rids users of installing any specific tools. IPGA features a scoring system that helps users to evaluate the reliability of pan-genome profiles generated by different packages. Thus, IPGA can help users ascertain the profiling method that is most suitable for their data set for the following analysis. In addition, IPGA integrates several downstream comparative analysis and genome analysis modules to make users achieve diverse targets.

Highlights
• IPGA serves as a free and easy-to-use web-based system that could provide up-to-date pan-genome analysis service for non-bioinformaticians.
• IPGA offers users the most reliable pan-genome profile which enables users to perform additional comparative genomic analysis.
• IPGA provides a series of downstream analysis modules such as phylogenetic inference, synteny inference, and target genome annotation.

Étude du pangénome d’une population bactérienne structurée : vers une nouvelle compréhension de l’origine des variations intra-génomiques

Thesis

Dec 2021

Hélène Gardon

Evolutionary models associated with species concepts emphasize selective sweep processes at the scale of genes or at the scale of genomes. In this thesis, the processes underlying the differentiation of free-living environmental bacterial populations were reconsidered, using co-occurring subpopulations of the Prochlorococcus HLII ecotype as a model. The objective was to understand the evolutionary forces behind the formation and maintenance of the pangenome in free-living environmental bacterial populations.

The Prochlorococcus pangenome appears open at the population scale. Its core and flexible genes outline a genomic landscape characterized by conserved and variable regions. This genomic organization is accompanied by a non-random distribution of the functions carried by flexible genes and by differential evolutionary dynamics along the genome, illustrated by variation in selective constraints and the identification of recombination hotspots.

The results obtained in this work highlight distinct evolutionary trajectories for sets of genes specific to particular genomic compartments in a structured population. This is consistent with evolution of the drift-barrier type. In addition, the structuring of genetic information along the genome could depend on the dynamics of gene flow between subpopulations, particularly for flexible genes. Rather than a non-random acquisition of genes according to their genomic location, a differential probability of retention of transferred genes, resulting from fluctuation of the effective population size along the genome, can be envisaged.

Bacterial Pan-Genomics

Chapter

Nov 2019

Due to their tendency to have a high recombination rate, bacterial genomes are highly diverse across different strains. This diversity may even be in the form of the presence or absence of entire genes; therefore, each strain might have its own combination of genes. The pan-genome represents the complete gene pool of a species. It is made up of the core genome (genes shared by all strains) and the accessory genome (genes shared by some strains and not all). The pan-genome can be considered to be a comprehensive reference genome for computational biology, and several tools have been developed for pan-genomics applications. The tools enable scientists to explore bacterial genomes with more flexibility considering all types of genetic variations. Pan-genomics has many applications in medicine such as the development of vaccines and drugs against pathogenic bacteria. In this chapter, we discuss the fundamental principles and algorithms for pan-genome analysis and introduce and compare the most recent computational tools.

Bioinformatics as a Tool for the Structural and Evolutionary Analysis of Proteins

Chapter

Oct 2019

This chapter deals with bioinformatics: computational, mathematical, and statistical tools applied to biology that are essential for the analysis and characterization of biological molecules, in particular proteins, which play an important role in all cellular and evolutionary processes of organisms. In recent decades, next-generation sequencing technologies and bioinformatics have facilitated the collection and analysis of large amounts of genomic, transcriptomic, proteomic, and metabolomic data from different organisms. These data have allowed predictions on the regulation of expression, transcription, translation, structure, and mechanisms of action of proteins, as well as on homology, mutations, and evolutionary processes that generate structural and functional changes over time. Although the information in databases grows every day, bioinformatics tools continue to be modified to improve performance and yield more accurate predictions of protein functionality, which is why bioinformatics research remains a great challenge.

PGAweb: A Web Server for Bacterial Pan-Genome Analysis

Article

Aug 2018

An astronomical increase in microbial genome data in recent years has led to strong demand for bioinformatic tools for pan-genome analysis within and across species. Here, we present PGAweb, a user-friendly, web-based tool for bacterial pan-genome analysis, which is composed of two main pan-genome analysis modules, PGAP and PGAP-X. PGAweb provides key interactive and customizable functions that include orthologous clustering, pan-genome profiling, sequence variation and evolution analysis, and functional classification. PGAweb presents features of genomic structural dynamics and sequence diversity with different visualization methods that are helpful for intuitively understanding the dynamics and evolution of bacterial genomes. PGAweb has an intuitive interface with one-click setting of parameters and is freely available at http://PGAweb.vlcc.cn/.

PanGeneHome : A Web Interface to Analyze Microbial Pangenomes

Preprint

May 2018

PanGeneHome is a web server dedicated to the analysis of available microbial pangenomes. For any prokaryotic taxon with at least three sequenced genomes, PanGeneHome provides (i) conservation level of genes, (ii) pangenome and core-genome curves, estimated pangenome size and other metrics, (iii) dendrograms based on gene content and average amino acid identity (AAI) for these genomes, and (iv) functional categories and metabolic pathways represented in the core, accessory and unique gene pools of the selected taxon. In addition, the results for these different analyses can be compared for any set of taxa. With the availability of 615 taxa, covering 182 species and 49 orders, PanGeneHome provides an easy way to get a glimpse on the pangenome of a microbial group of interest. The server and its documentation are available at http://pangenehome.lmge.uca.fr.
