Figure 3 - uploaded by Lars Snipen
Gene family distribution. The distribution of the number of domain sequence families found in 1, 2, ..., 347 genomes. There are 909 ORFan families (leftmost bar) and 479 core families (rightmost bar); in total there are 5724 unique domain sequence families (sum of all bars).
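The quantities named in the caption (ORFans, core families, total families) can all be read off a presence/absence table. The following is a minimal sketch of that bookkeeping, not the authors' code: the function name `family_distribution`, the dictionary layout and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def family_distribution(presence, n_genomes):
    """Count how many families occur in exactly 1, 2, ..., n_genomes genomes.

    `presence` (hypothetical layout) maps a family id to the set of genome
    ids in which that family was found. The returned list has one entry per
    bar of the histogram: index 0 is the ORFan bar, index -1 the core bar.
    """
    tally = Counter(len(genomes) for genomes in presence.values() if genomes)
    return [tally.get(k, 0) for k in range(1, n_genomes + 1)]


# Toy data: five families over three genomes (illustrative only).
presence = {
    "famA": {"g1", "g2", "g3"},
    "famB": {"g2"},
    "famC": {"g1", "g3"},
    "famD": {"g1", "g2", "g3"},
    "famE": {"g3"},
}

counts = family_distribution(presence, n_genomes=3)
orfans = counts[0]    # families found in a single genome
core = counts[-1]     # families found in every genome
total = sum(counts)   # all unique families (sum of all bars)
```

With 347 genomes the same function would return 347 counts, one per bar of the figure.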

Source publication
Article
The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt...

Contexts in source publication

Context 1
... y g is also a random variable. The probability density of this variable, e.g. the one depicted in Figure 3, can be described by a K component binomial mixture model ...
Context 2
... lower triangle (below the diagonal) shows the mean Jaccard distance over the 54 genomes in this case. In Figure 3 we show the distribution (presence/absence) of domain sequences over the E. coli population. The leftmost bar is the number of ORFans, domain sequence families found in one single genome only. ...
Context 3
... allowing the mixture model to have many more components, we found 12 was optimal here, we can cope with the fact that genomes are not uniformly distributed within the population. The distribution in Figure 3 is affected by this, e.g. since we have 31 genomes of serotype O157:H7 in the sample we expect there will be a small 'bump' in the distribution at 31 genomes, reflecting the domain sequence families common to these closely related genomes. ...
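The contexts above mention a K component binomial mixture model for the number of genomes in which a family is found. The excerpts do not show the exact parameterization, so the following is only a sketch of a plain binomial mixture. The function name, the parameter names and the three-component example values are hypothetical, and any special handling of families seen in zero genomes (which are never observed) is deliberately left out.

```python
from math import comb


def binomial_mixture_pmf(y, n_genomes, weights, detect_probs):
    """P(Y = y) under a K-component binomial mixture.

    weights[k] is the mixing proportion of component k (assumed to sum to 1)
    and detect_probs[k] is the probability that a family from component k is
    present in any one genome. Both names are illustrative assumptions.
    """
    if len(weights) != len(detect_probs):
        raise ValueError("weights and detect_probs must have equal length")
    return sum(
        w * comb(n_genomes, y) * p**y * (1.0 - p) ** (n_genomes - y)
        for w, p in zip(weights, detect_probs)
    )


# Hypothetical example: rare, intermediate and near-core components.
weights = [0.5, 0.3, 0.2]
detect_probs = [0.02, 0.5, 0.99]
n_genomes = 347

# The probabilities over y = 0, ..., n_genomes should sum to about 1.
total = sum(
    binomial_mixture_pmf(y, n_genomes, weights, detect_probs)
    for y in range(n_genomes + 1)
)
```

A component with a detection probability near 1 puts its mass on the rightmost bars (core families), while one near 0 puts it on the leftmost bars (ORFans). Allowing more components, such as the 12 reported as optimal, gives the model room for intermediate bumps like the one expected at 31 genomes.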

Citations

... Here we performed an original pan-genomic analysis of 53 C. pseudotuberculosis strains focusing on their functional domains. Domain comparisons play an important role in comparative genomics but have been poorly explored in pan-genomics studies (Snipen & Ussery, 2012). We analysed protein/domain properties of strains in both biovars (equi and ovis) by searching for some divergence and biovar-specific domains. ...
Article
Corynebacterium pseudotuberculosis is a pathogenic bacterium with great veterinary and economic importance. It is classified into two biovars: ovis, nitrate-negative, that causes lymphadenitis in small ruminants and equi, nitrate-positive, causing ulcerative lymphangitis in equines. With the explosive growth of available genomes of several strains, pan-genome analysis has opened new opportunities for understanding the dynamics and evolution of C. pseudotuberculosis. However, few pan-genomic studies have compared biovars equi and ovis. Such studies have considered a reduced number of strains and compared entire genomes. Here we conducted an original pan-genome analysis based on protein sequences and their functional domains. We considered 53 C. pseudotuberculosis strains from both biovars isolated from different hosts and countries. We have analysed conserved domains, common domains more frequently found in each biovar and biovar-specific (unique) domains. Our results demonstrated that biovar equi is more variable; there is a significant difference in the number of proteins per strain, probably indicating the occurrence of more gene loss/gain events. Moreover, strains of biovar equi presented a higher number of biovar-specific domains, 77 against only eight in biovar ovis, most of them associated with virulence mechanisms. With this domain analysis, we have identified functional differences among strains of biovars ovis and equi that could be related to niche adaptation and probably help to better understand mechanisms of virulence and pathogenesis. The distribution patterns of functional domains identified in this work might have impacts on bacterial physiology and lifestyle, encouraging the development of new diagnoses, vaccines, and treatments for C. pseudotuberculosis diseases. Communicated by Ramaswamy H. Sarma
... An alternative approach, already employed in a comparative genomics study of Escherichia coli [467], consists of grouping proteins on the basis of domain architectures with a fixed N-C terminal order []. Clustering based on domain order is highly scalable and moreover, most protein domains represent structural folds that can be directly linked to function. ...
... The first published pangenome covered eight strains of Streptococcus agalactiae (Tettelin et al., 2005), reflecting the number of available genome sequences for that species at the time. The number of genomes included in pangenome analyses has since increased along with the increased availability of sequenced bacterial genomes and now contains 100s or 1000s of genomes (Jun et al., 2014; Kaas et al., 2011; Land et al., 2015; Leekitcharoenphon et al., 2016; Méric et al., 2013; Snipen and Ussery, 2012) resulting in >10 000 gene groups. A main concern when evaluating the result of pangenome analyses is how the pangenome and core size change as genomes are added to the pangenome. ...
Article
Motivation: The increase in available microbial genome sequences has resulted in an increase in the size of the pangenomes being analyzed. Current pangenome visualizations are not intended for the pangenome sizes possible today and new approaches are necessary in order to convert the increase in available information to increase in knowledge. As the pangenome data structure is essentially a collection of sets we explore the potential for scalable set visualization as a tool for pangenome analysis. Results: We present a new hierarchical clustering algorithm based on set arithmetics that optimizes the intersection sizes along the branches. The intersection and union sizes along the hierarchy are visualized using a composite dendrogram and icicle plot, which, in pangenome context, shows the evolution of pangenome and core size along the evolutionary hierarchy. Outlying elements, i.e. elements whose presence patterns do not correspond with the hierarchy, can be visualized using hierarchical edge bundles. When applied to pangenome data this plot shows putative horizontal gene transfers between the genomes and can highlight relationships between genomes that are not represented by the hierarchy. We illustrate the utility of hierarchical sets by applying it to a pangenome based on 113 Escherichia and Shigella genomes and find it provides a powerful addition to pangenome analysis. Availability: The described clustering algorithm and visualizations are implemented in the hierarchicalSets R package available from CRAN (https://cran.r-project.org/web/packages/hierarchicalSets). Contact: Thomas Lin Pedersen (thomasp85@gmail.com). Supplementary information: Supplementary data are available at Bioinformatics online.
... An alternative approach, already employed in a comparative genomics study of Escherichia coli [13], consists of grouping proteins on the basis of domain architectures with a fixed N-C terminal order [14]. Clustering based on domain order is highly scalable and moreover, most protein domains represent structural folds that can be directly linked to function. ...
... The majority of the strains cluster in accord with their taxonomic classification. Many of the unclassified strains could be classified either in P. aeruginosa (4) or P. putida (13). Exploring the pan- and core-genome of Pseudomonas at protein domain level. ...
Article
Pseudomonas is a highly versatile genus containing species that can be harmful to humans and plants while others are widely used for bioengineering and bioremediation. We analysed 432 sequenced Pseudomonas strains by integrating results from a large scale functional comparison using protein domains with data from six metabolic models, nearly a thousand transcriptome measurements and four large scale transposon mutagenesis experiments. Through heterogeneous data integration we linked gene essentiality, persistence and expression variability. The pan-genome of Pseudomonas is closed indicating a limited role of horizontal gene transfer in the evolutionary history of this genus. A large fraction of essential genes are highly persistent, still non essential genes represent a considerable fraction of the core-genome. Our results emphasize the power of integrating large scale comparative functional genomics with heterogeneous data for exploring bacterial diversity and versatility.
... To overcome these bottlenecks, protein domains have been suggested as an alternative for defining groups of functionally equivalent proteins [8][9][10] and have been used to perform comparative analyses of Escherichia coli [9], Pseudomonas [10], Streptococcus [11] and for protein functional annotation [12,13]. A protein domain architecture describes the arrangement of domains contained in a protein and is exemplified in Figure 1. ...
Article
A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need for defining arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition and the high computational cost of finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to methods based on sequence similarity to identify groups of functionally equivalent proteins within and across taxonomic boundaries. As the computational cost scales linearly, and not quadratically, with the number of genomes, it is suitable for large scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.
... Producing the more than 5 million (2267²) BLAST result files would take weeks given ordinary computing resources, and in the meantime the number of E. coli strains has increased further! An alternative to clustering based on orthologs is to scan all genes for protein domains using HMMER3, as suggested by [24], and then cluster them by their ordered sequence of non-overlapping domains. The whole idea of this indirect comparison is to first compare all proteins to some reference set of sequences, e.g. a database of protein domains, and then cluster them based on how they look in this sequence subspace. ...
Article
A pan-genome is defined as the set of all unique gene families found in one or more strains of a prokaryotic species. Due to the extensive within-species diversity in the microbial world, the pan-genome is often many times larger than a single genome. Studies of pan-genomes have become popular due to the easy access to whole-genome sequence data for prokaryotes. A pan-genome study reveals species diversity and gene families that may be of special interest, e.g. because of their role in bacterial survival or their ability to discriminate strains. We present an R package for the study of prokaryotic pan-genomes. The R computing environment harbors endless possibilities with respect to statistical analyses and graphics. External free software is used for the heavy computations involved, and the R package provides functions for building a computational pipeline. We demonstrate parts of the package on a data set for the Gram-positive bacterium Enterococcus faecalis. The package is free to download and install from The Comprehensive R Archive Network.
... Here, we explore a bit the stunning diversity of E. coli by using a functional domain approach (Snipen and Ussery, 2013) to identify sigma factors in the 983 Escherichia and Shigella genomes with good-enough quality scores for analysis. Further, this method is used to predict novel sigma factors. ...
... The genomes with extremely low and high numbers of proteins have low genome quality scores. We have found previously that PfamA run on a set of 347 E. coli genomes will retrieve 95% of the domains found in the E. coli pangenome (Snipen and Ussery, 2013), so we expect that we should have good coverage of this set of genomes. ...
Article
Everyone working with bacterial genomics is familiar with the phrase ‘too much data’. In this Genome Update, we discuss two methods for helping to deal with this explosion of genomic information. First, we introduce the concept of calculating a quality score for each sequenced genome, and second, we describe a method to quickly sort through genomes for a particular set of protein families. We apply these two methods to all of the current Escherichia coli genomes available in the National Center for Biotechnology Information database. Out of the 2074 E. coli/Shigella genomes listed (June, 2013), fewer than half (983) are of sufficient quality to use in comparative genomic work. Unfortunately, even some of the ‘complete’ E. coli genomes are in pieces, and a few ‘draft’ genomes are good quality. Six of the seven known sigma factors in E. coli strain K‐12 are extremely well conserved; the iron‐regulating sigma factor FecI (σ19) is missing in most genomes. Surprisingly, the E. coli strain CFT073 genome does not encode a functional RpoD (σ70), which is obviously essential, and this is likely due to poor genome assembly/annotation. We find a possible novel sigma factor present in more than a hundred E. coli genomes.
Thesis
Introduced in microbiology in 2005, pangenomic approaches aim to compile the full genomic diversity of a species. Within the pangenome, these studies generally distinguish the core genome, i.e. the set of gene families whose members are present in all organisms, from the accessory genome, which comprises the genes specific to only some organisms. However, the core genome concept becomes limiting when the number of organisms is large, because genes that are functionally indispensable may still be absent from some genomes. To limit this effect, almost all studies use an arbitrary presence threshold (generally 95%) to define a relaxed core genome. Moreover, this dichotomy between the core and accessory genome does not account for the many ranges of gene occurrence frequency in a pangenome. The aim of this thesis is to propose a statistical approach based on a multivariate Bernoulli mixture model coupled with a hidden Markov random field to partition the pangenome, so as to be resilient to missing genes and to better distinguish the different gene presence/absence patterns. In parallel, several data structures based on pangenome graphs have been developed in recent years. Exploiting all the information available in a genome, rather than only the presence of isolated genes, is now crucial for correctly capturing the organization of genomes, and in particular the regions of genomic plasticity within species. This approach is intended as the missing link between these new sequence-level graph approaches and the original approaches based on isolated gene families.
To this end, the thesis addresses the definition, statistical partitioning and exploitation of a pangenome graph as a compact representation of the diversity of the genomic repertoire of prokaryotic species. Finally, this graph is used to analyse the pangenomic diversity of 439 prokaryotic species.
Article
Photosynthetic microbes are considered promising biofactories for transforming inorganic carbon from the atmosphere into a renewable source of chemicals and precursors of industrial interest; however, there continues to be a need for strains that demonstrate high productivity, environmental robustness, and the potential to be genetically manipulated. Genome sequencing and biochemical characterization of promising culture collection microalgae strains, as well as the isolation of previously unidentified strains from the environment or mixed cultures, bring us closer to the goal of decreasing the cost-per-gallon of algal biofuels by identifying new and promising potential production strains. The halotolerant alga Picochlorum soloecismus was isolated from the culture collection strain, Nannochloropsis salina CCMP 1776. Here, we show that P. soloecismus accumulates moderate levels of fatty acids and high levels of total carbohydrates and that it can effectively grow in a range of salinities. In addition, we make use of its sequenced genome to compare it to other biofuel production platforms and to validate the capacity for engineering this strain's genome. Our work shows that Picochlorum soloecismus is a candidate production strain for the generation of renewable bioproducts.
Conference Paper
Uropathogenic Escherichia coli (UPEC) is the most important bacterial agent causing urinary tract infections (UTIs) in patients around the world. UTIs rank second among the different types of infectious diseases, so there is an urgent need for a rapid and accurate diagnostic method for detecting them. DNA microarray is an advanced pan-genomic technique that can be used as a rapid and accurate diagnostic method with high specificity and sensitivity. Designing a collection of DNA microarray probes provides a precise methodology for the detection and identification of UPEC pathotypes. For this reason, the authors of the present literature review have tried to present the wide range of options available for UPEC recognition through DNA microarray probe design.