A complete domain-to-species taxonomy for Bacteria and Archaea

Abstract

The Genome Taxonomy Database is a phylogenetically consistent, genome-based taxonomy that provides rank-normalized classifications for ~150,000 bacterial and archaeal genomes from domain to genus. However, almost 40% of the genomes in the Genome Taxonomy Database lack a species name. We address this limitation by using commonly accepted average nucleotide identity criteria to set bounds on species and propose species clusters that encompass all publicly available bacterial and archaeal genomes. Unlike previous average nucleotide identity studies, we chose a single representative genome to serve as the effective nomenclatural ‘type’ defining each species. Of the 24,706 proposed species clusters, 8,792 are based on published names. We assigned placeholder names to the remaining 15,914 species clusters to provide names to the growing number of genomes from uncultivated species. This resource provides a complete domain-to-species taxonomic framework for bacterial and archaeal genomes, which will facilitate research on uncultivated species and improve communication of scientific results. A full species classification is built for all publicly available bacterial and archaeal genomes.
Resource
https://doi.org/10.1038/s41587-020-0501-8
Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia.
e-mail: donovan.parks@gmail.com
Sequencing and computational advances have enabled the
expedient recovery of genomes from both cultivated and
uncultivated microorganisms, and spurred initiatives such
as the Genomic Encyclopedia of Bacteria and Archaea, which
produced thousands of isolate genomes assembled from type
strains1,2. Multiple culture-independent studies report thousands
of metagenome-assembled genomes (MAGs) recovered from a
diverse range of environments3–5. Genome sequences inform taxonomic
classifications6–9 and have been used to establish the Genome
Taxonomy Database (GTDB), a comprehensive genome-based
taxonomy with bacterial and archaeal taxa circumscribed on the
basis of monophyly and relative evolutionary divergence10. Here we
expand on this earlier work by organizing all genomes encompassed
by the GTDB into quantitatively defined species clusters based on
the average nucleotide identity (ANI) to selected species representa-
tive genomes.
Many species definitions have been proposed for Bacteria and
Archaea that consider different biological, ecological and genomic
aspects of these organisms11–14. Here we are interested in an operational
species definition that facilitates the automated assignment of
genomes to species and that scales to large datasets to allow all avail-
able and forthcoming genomes to be organized into species clusters.
This can be achieved using whole-genome ANI, which has emerged
as a robust and widely accepted method for circumscribing species15–17,
with 95% ANI found to recapitulate the majority of existing
species18–20. ANI is determined from the similarity of orthologous
regions shared between two genomes6 and a number of methods
have been proposed for calculating this statistic18,19,21–23. Here we
make use of two recent advances in calculating ANI that allow tens
of thousands of genomes to be organized into species clusters: a fast
heuristic for approximating ANI24 and a computationally efficient
approach highly correlated with the results of traditional meth-
ods19. We also use the alignment fraction (AF; that is, percentage of
orthologous regions shared between two genomes) as an additional
threshold for circumscribing species18,22,25 to ensure that ANI values
are not based on a small set of conserved genes.
Species clusters can be formed in a number of ways based on
the ANI between genomes. A common approach is to represent
genomes as nodes in a graph with edges between genome pairs having
an ANI ≥ 95%. A graph-based clustering method can then be
applied to divide the graph into putative species clusters without
a priori consideration of existing nomenclature or taxonomy22,24,26.
By contrast, we have taken an approach that explicitly accounts for
validly or effectively published species names, directly ties species
clusters to nomenclatural type material where possible and results
in a single representative for each species cluster. Specifically, we
identified genomes assembled from the type strain of the species
(subsequently referred to as type strain genomes) and used these
as the representatives of species clusters circumscribed using ANI.
We believe the use of type strain genomes is a pragmatic choice for
circumscribing species given systematic efforts to sequence such
strains1,2 and the nomenclatural and taxonomic importance of type
material17,27. The National Center for Biotechnology Information
(NCBI) currently uses the ANI to type strain genomes to identify
misclassified genome assemblies15,28, and recently a combined ANI/
AF metric has been proposed for delineating genera around type
strain genomes29. We organize genomes that were not assigned to a
named species cluster into de novo species clusters with representative
genomes selected based on genome quality and acting as effective
nomenclatural type material. This follows the recent, currently
unratified, proposal that gene sequences are suitable type material
for Bacteria and Archaea to the extent that they allow for the
unambiguous circumscription of taxa17,30–32. The proposed species
clusters encompass all public genomes within the NCBI Assembly
database33 and have been incorporated into the GTDB to provide a
taxonomic framework where genomes have assignments at all ranks
from domain to species10.
Results
Identification of genomes assembled from type material. The
proposed species clusters were determined from a dataset comprising
153,849 genomes obtained from the NCBI Assembly database
Donovan H. Parks ✉, Maria Chuvochina, Pierre-Alain Chaumeil, Christian Rinke,
Aaron J. Mussig and Philip Hugenholtz
NATURE BIOTECHNOLOGY | www.nature.com/naturebiotechnology
and 153 previously recovered archaeal MAGs3 (Fig. 1a). Genomes
assembled from type material were identified by cross-referencing
strain identifiers at NCBI with the co-identical strain identifiers
at the List of Prokaryotic Names with Standing in Nomenclature
(LPSN)34, BacDive35 and StrainInfo36. Unfortunately, associating
genome assemblies with nomenclatural types remains a chal-
lenge with currently available resources37. Type material is refer-
enced by the unique strain identifiers at each culture collection
resulting in a list of co-identical strain identifiers (for example,
Escherichia coli ATCC 11775 = CCUG 24 = … = NCTC 9001).
Nomenclatural resources such as LPSN and BacDive have no easy
mechanism for maintaining a complete list of these co-identical
identifiers and the strain identifiers associated with genomes
at NCBI are largely the responsibility of individual submitters.
Consequently, genomes may be identified as being assembled
from type material at only a subset of nomenclatural resources
and do not always agree with the nomenclatural status of genomes
at NCBI (Fig. 1b). This latter situation is compounded by genomes
at NCBI being annotated as assembled from type material if a
genome has been ‘effectively published’ (for example, Clostridium
autoethanogenum DSM 10061), but the species name has not
been validated27. We considered a genome to be assembled from
type material if any of its strain identifiers at NCBI could be
matched with a co-identical strain identifier at LPSN, BacDive or
StrainInfo. This results in 8,665 genomes spanning 7,104 species
being identified as assemblies from the type strain of a species
(Supplementary Table 1).
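The matching rule described above amounts to a set intersection over strain identifiers. The following is a minimal sketch under stated assumptions: the function names, the normalization step and the assignment of identifiers to resources are hypothetical and are not taken from the GTDB workflow; only the example identifiers themselves come from the text.

```python
import re

def normalize_strain_id(strain_id):
    """Uppercase and strip non-alphanumerics so 'ATCC 11775' matches 'atcc-11775'.
    This normalization is an assumption made for illustration."""
    return re.sub(r"[^A-Z0-9]", "", strain_id.upper())

def is_type_strain_genome(ncbi_strain_ids, resource_type_ids):
    """Return True if any NCBI strain identifier matches a co-identical
    type strain identifier listed by any nomenclatural resource.

    resource_type_ids maps a resource name (for example 'LPSN') to the
    strain identifiers it lists for the type strain of the species.
    """
    ncbi = {normalize_strain_id(s) for s in ncbi_strain_ids}
    for ids in resource_type_ids.values():
        if ncbi & {normalize_strain_id(s) for s in ids}:
            return True
    return False

# Hypothetical split of the co-identical identifiers across resources.
resources = {
    "LPSN": ["ATCC 11775", "NCTC 9001"],
    "BacDive": ["CCUG 24"],
    "StrainInfo": [],
}
print(is_type_strain_genome(["NCTC 9001"], resources))    # True
print(is_type_strain_genome(["K-12 MG1655"], resources))  # False
```

Because a match at any one resource is sufficient, a genome can be flagged as a type strain genome even when the other resources carry an incomplete identifier list.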
Representative genomes for named species. Species clusters
were formed by selecting a single representative genome for each
of the 9,162 validly or effectively published species names associ-
ated with one or more of the 145,904 quality-controlled genomes
(Supplementary Fig. 1 and Fig. 1a). Of these species, 5,942 (64.9%)
consisted of a single genome that was selected as the species repre-
sentative. The remaining 3,220 species comprised multiple genomes
and the representative was selected by giving preference to (1) type
strain genomes (2,632 species), (2) genomes annotated as being
assembled from type material at NCBI37 (123 species), (3) genomes
designated as a reference or representative genome at NCBI38 (220
species) or (4) genomes assembled from the type strain of a sub-
species (8 species). In 1,506 cases, multiple potential representa-
tive genomes were still available within a genome category (that is,
multiple type strain genomes) and the representative was selected
by considering NCBI metadata and the ANI between genomes,
and in a small number of cases by manual investigation (219 spe-
cies; see Methods). Overall, 7,104 of the 9,162 (77.5%) species were
represented by a type strain genome (Fig. 2a and Supplementary
Table 2), demonstrating the success of initiatives such as the Genomic
Encyclopedia of Bacteria and Archaea that aim to sequence all avail-
able type strains1.
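The preference order used to pick a representative can be pictured as a ranked filter. This is only an illustrative sketch: the category labels, the `Genome` record and the accessions are hypothetical, the "other" category is an assumption for genomes matching none of the four listed categories, and the subsequent tie-breaking by NCBI metadata, ANI and manual inspection is not modeled.

```python
from dataclasses import dataclass

# Priority order taken from the text; a lower value is a stronger preference.
CATEGORY_PRIORITY = {
    "type_strain_of_species": 1,
    "ncbi_type_material": 2,
    "ncbi_reference_or_representative": 3,
    "type_strain_of_subspecies": 4,
    "other": 5,  # assumed fallback category
}

@dataclass
class Genome:
    accession: str
    category: str

def candidate_representatives(genomes):
    """Return all genomes in the best-ranked category present for a species."""
    best = min(CATEGORY_PRIORITY[g.category] for g in genomes)
    return [g for g in genomes if CATEGORY_PRIORITY[g.category] == best]

# Hypothetical genomes for one named species.
species_genomes = [
    Genome("G1", "ncbi_type_material"),
    Genome("G2", "type_strain_of_species"),
    Genome("G3", "other"),
    Genome("G4", "type_strain_of_species"),
]
print([g.accession for g in candidate_representatives(species_genomes)])
# ['G2', 'G4']
```

When more than one candidate remains, as with the two type strain genomes here, a further selection step is still needed, which corresponds to the 1,506 cases mentioned above.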
Formation of named species clusters. Species clusters are cir-
cumscribed based on the ANI and AF between genomes. The ANI
circumscription radius for each named species representative was
set to 95% ANI except if two representatives had an ANI > 95%
[Figure 1 graphic omitted. Panel a: workflow from the NCBI Assembly database plus 153 MAGs (154,002 genomes), through filtering of poor-quality genome assemblies (145,904 quality-controlled genomes), identification of genomes assembled from type material using LPSN, BacDive and StrainInfo (7,104 type strains of species), selection of a representative genome for each of 9,162 named species, and formation of 8,792 named species clusters (370 heterotypic synonyms) and 15,914 de novo species clusters, to the 24,706 species clusters in the Genome Taxonomy Database. Panel b: Venn diagram of type strain genomes identified at NCBI, LPSN, BacDive and StrainInfo. Panels c and d: legend entries are de novo species representative, named species representative and nonrepresentative genome.]
Fig. 1 | Overview of workflow for organizing genome assemblies into species clusters. a, A dataset of 151,188 bacterial and 2,661 archaeal genomes
was obtained from the NCBI Assembly database and supplemented with 153 archaeal MAGs. These genomes were filtered to remove 8,098 low-quality
genomes. LPSN, BacDive and StrainInfo were cross-referenced with the species and strain information at NCBI to identify type strain genomes. A single
representative genome was selected for each of the 9,162 validly or effectively published species names associated with one or more genomes in the
dataset, with preference given to type strain genomes. Clusters were formed for named species based on the ANIs between species representatives and all
other genomes. This resulted in 8,792 species clusters due to the formation of 370 synonyms between closely related species representatives. Genomes
not assigned to a named species cluster were formed into 15,914 de novo clusters using a greedy clustering algorithm that prioritized high-quality genome
assemblies. The resulting 24,706 species clusters encompass all 145,904 quality-controlled genomes and have been incorporated in the taxonomy of
GTDB release R04-RS89. b, Overlap between genomes determined to be type strain genomes at LPSN, BacDive, StrainInfo and NCBI, highlighting the
incomplete nature of co-identical strain identifiers between these different nomenclatural resources. c, Conceptual illustration of genomes with the ANIs
between genomes depicted by their Euclidean distance. Three selected representative genomes for validly or effectively published species names are
shown as circles with their circumscription radii depicted by larger circles. Genomes assigned to each named species representative are shown by squares
of the same color. Genomes not assigned to a representative are shown in gray. d, As per c, with the additional selection of three de novo representative
genomes that result in all genomes being assigned to a species cluster.
(Fig. 3a,b). In such cases, the ANI radius of a representative was set
to the value of the closest representative up to a maximum of 97%,
with species representatives having an ANI > 97% considered syn-
onyms (Fig. 3c and see “Synonyms in the GTDB”).
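The radius rule for named species representatives can be written as a small function. This is a sketch under stated assumptions: the function name and return convention are hypothetical, and returning the 97% cap for a synonym is an illustrative choice, since a synonymous representative is not retained.

```python
def circumscription_radius(ani_to_closest_rep, default=95.0, synonym_cutoff=97.0):
    """Return (radius, is_synonym) for a named species representative.

    ani_to_closest_rep is the ANI (%) to the closest other representative,
    or None if no ANI value could be computed.
    """
    # No representative closer than 95% ANI: use the default radius.
    if ani_to_closest_rep is None or ani_to_closest_rep <= default:
        return default, False
    # Representatives above 97% ANI are considered synonyms.
    if ani_to_closest_rep > synonym_cutoff:
        return synonym_cutoff, True
    # Between 95% and 97%: the radius shrinks to the ANI of the closest representative.
    return ani_to_closest_rep, False

print(circumscription_radius(94.0))  # (95.0, False)
print(circumscription_radius(96.0))  # (96.0, False)
print(circumscription_radius(98.0))  # (97.0, True)
```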
The resulting ANI circumscription radii were then used to
form species clusters for the 8,792 nonsynonymous representative
genomes (Fig. 1c). For each of the 137,112 remaining genomes
in the dataset, the closest representative with an AF > 65%
(refs. 18,22,25) was determined and the genome assigned to this rep-
resentative if it was within its ANI circumscription radius. This
resulted in 104,763 (77.8%) genomes being assigned to named
species clusters, with the most abundant species reflecting highly
sequenced human-associated microorganisms (Supplementary
Table 3). The majority (87.3%) of assigned genomes only satis-
fied the ANI circumscription radius and 65% AF criterion for a
single species representative. Genomes meeting the assignment
criteria of multiple representatives were primarily classified as
Escherichia flexneri (68.1%), Escherichia dysenteriae (9.0%) or
Neisseria meningitidis_B (8.0%) (Supplementary Table 4). In a small
number of instances (461 genomes; 0.44%), a transitive situation
arose whereby a genome was not within the ANI radius of the clos-
est representative while being in the ANI radius of one or more
other species representatives (Fig. 3d). These genomes were left
unassigned to reflect the reduced ANI radius of species in their local
phylogenetic neighborhoods, which were almost exclusively within
the genera Escherichia (95.2%) and Serratia (4.3%).
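The assignment rule, including the transitive case in which a genome is left unassigned, can be sketched as follows. The function name, data layout and representative labels are hypothetical; treating the AF threshold as strict and the radius as inclusive are assumptions about boundary handling that the text does not settle.

```python
def assign_genome(hits, radii, min_af=65.0):
    """Assign one query genome to a species representative, or to none.

    hits  : list of (rep_id, ani, af) tuples, values in percent
    radii : dict mapping rep_id to its ANI circumscription radius
    """
    # Keep only representatives that pass the alignment fraction threshold.
    eligible = [h for h in hits if h[2] > min_af]
    if not eligible:
        return None
    # The closest representative is the one with the highest ANI.
    rep_id, ani, _af = max(eligible, key=lambda h: h[1])
    # Assign only if the genome lies within that representative's radius;
    # more distant representatives are deliberately not consulted.
    return rep_id if ani >= radii[rep_id] else None

radii = {"rep_A": 96.0, "rep_B": 95.0}  # hypothetical representatives
print(assign_genome([("rep_A", 97.1, 88.0), ("rep_B", 95.3, 80.0)], radii))  # rep_A
print(assign_genome([("rep_A", 95.6, 88.0), ("rep_B", 95.2, 80.0)], radii))  # None
```

In the second call the genome falls inside the 95% radius of `rep_B` but outside the 96% radius of its closest representative `rep_A`, so it stays unassigned, mirroring the transitive situation described above.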
Synonyms in the GTDB. The 370 species reclassified as synonyms
because they have an ANI > 97% to another species with naming pri-
ority represent a practical compromise between having a quantitative
species definition and retaining the majority (8,792 of 9,162; 96%) of
species with validly or effectively published names (Supplementary
Tables 5 and 6). The necessity for ANI-defined synonyms was great-
est in taxa of medical importance such as Brucella39, which comprises
nine species as defined under the NCBI taxonomy (Brucella melitensis,
Brucella vulpis, Brucella ovis, Brucella canis, Brucella neotomae, Brucella
suis, Brucella ceti, Brucella abortus and Brucella microti). These are
reclassified as a single species, B. melitensis, in GTDB as the ANI is
>99.5% and the AF is >93% for the representatives of these synony-
mous species, with the exception of the B. vulpis genome at 97.5% ANI
and 90% AF (Supplementary Table 6). The high genomic similarity of
these genomes suggests that they should be classified as monospecific
subspecies or biovars as previously proposed40. Similarly, several other
instances of proposed synonyms are supported by our ANI-defined
approach and have been incorporated into the GTDB taxonomy
(Supplementary Table 6). These include Mycobacterium africanum,
Mycobacterium bovis, Mycobacterium caprae, Mycobacterium canettii,
Mycobacterium microti, Mycobacterium mungi, Mycobacterium orygis
[Figure 2 graphic omitted. Panel a, with Bacteria and Archaea insets: representatives of named species by category, namely type strain of species (7,104; 77.5%), NCBI type material (544; 5.9%), NCBI representative (553; 6.0%) and other (961; 10.5%); single genome (64.9%) versus multiple genomes (35.1%); type strain status supported by 1 nomenclature source (4.3%) or >1 nomenclature sources (73.3%). Panel b: named clusters (8,792; 35.6%) versus de novo clusters (15,914; 64.4%); representative type, namely isolate (58.3%), HQ-MAG (15.2%), MQ-MAG (24.7%) and SAG (1.8%); single genome (65.3%) versus multiple genomes (34.7%).]
Fig. 2 | Properties of genomes selected as species representatives. a, Representative genomes selected for the 9,162 validly or effectively published
species names. Outer ring indicates the proportion of genomes selected from different metadata categories. Middle ring indicates the proportion of
species within each metadata category that consist of either a single genome or multiple genomes. Inner ring indicates the proportion of genomes
designated as being type strain genomes based on co-identical strain information from either single or multiple nomenclature resources (that is, LPSN,
BacDive, StrainInfo). b, Representative genomes for the 24,706 species clusters circumscribing the 145,904 quality-controlled genomes. Outer ring
indicates the proportion of named and denovo species clusters. Middle ring indicates the proportion of species with an isolate, high-quality MAG
(HQ-MAG; completeness > 90%, contamination < 5%), medium-quality MAG (MQ-MAG; completeness  50%; contamination < 10%) or single
amplified genome (SAG) as the representative genome. Inner ring indicates the proportion of species clusters consisting of either a single genome or
multiple genomes. Inset charts show results for bacterial and archaeal species using the same color scheme and layout.
and Mycobacterium pinnipedii as synonyms of Mycobacterium
tuberculosis41; Bacillus plakortidis and Bacillus lehensis as synonyms
of Bacillus oshimensis42; Burkholderia pseudomallei as a synonym
of Burkholderia mallei22; and Halomonas sinaiensis as a synonym of
Halomonas caseinilytica43. Note that 192 (52%) of the 370 ANI-defined
synonyms are not based on type strains because genome assemblies
are not available (Supplementary Table 6). The status of these synony-
mous species will need to be reassessed once type strain sequences
become available.
Establishing denovo species clusters. The 32,349 genomes not
assigned to named species were organized into de novo clus-
ters using a greedy clustering approach favoring the selection of
high-quality genomes to represent each cluster (Fig. 1d). Genome
quality is established using estimates of completeness and con-
tamination, assembly quality (for example N50, number of con-
tigs) and preference for isolate genomes over MAGs or single
amplified genomes (SAGs) (see Methods). Selection of representa-
tive genomes consists of four steps: (1) sorting genomes without
a species assignment by estimated genome quality, (2) selecting
the highest-quality genome as a representative of a new species
cluster, (3) determining the species-specific ANI circumscription
radius for the new species cluster and (4) temporarily assigning
genomes to the new cluster using the same ANI and AF criteria
used for named species clusters. These steps were repeated until
all genomes were assigned to a species cluster. Finally, nonrepre-
sentative genomes were re-clustered to ensure that they had been
assigned to the closest denovo species representative. This resulted
in 15,914 denovo species clusters with the majority being repre-
sented by a MAG (61.6%) and comprising a single genome (68.8%;
Fig. 2b and Supplementary Table 7).
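The four-step greedy procedure and the final re-clustering pass can be sketched as below. This is a simplified illustration rather than the GTDB implementation: it assumes a single fixed 95% radius, ignores the AF criterion and species-specific radii, and takes a precomputed quality score and an ANI lookup function as inputs. All names and values are hypothetical.

```python
def greedy_denovo_clusters(genomes, quality, ani, radius=95.0):
    """Greedy clustering: the best remaining genome seeds each new cluster.

    genomes : list of genome ids without a species assignment
    quality : dict mapping genome id to a quality score (higher is better)
    ani     : function (a, b) -> ANI in percent
    """
    # Step 1: sort genomes by estimated quality, best first.
    unassigned = sorted(genomes, key=lambda g: quality[g], reverse=True)
    reps = []
    while unassigned:
        # Step 2: the highest-quality remaining genome becomes a representative.
        rep = unassigned.pop(0)
        reps.append(rep)
        # Steps 3 and 4 (simplified): temporarily claim genomes within the radius.
        unassigned = [g for g in unassigned if ani(rep, g) < radius]
    # Final pass: place every nonrepresentative genome with its closest representative.
    clusters = {rep: [rep] for rep in reps}
    for g in genomes:
        if g in clusters:
            continue
        closest = max(reps, key=lambda r: ani(r, g))
        clusters[closest].append(g)
    return clusters

# Hypothetical data: g1 is first claimed by g2 but is closer to g4.
table = {
    frozenset(("g1", "g2")): 98.0, frozenset(("g1", "g3")): 80.0,
    frozenset(("g1", "g4")): 99.0, frozenset(("g2", "g3")): 80.0,
    frozenset(("g2", "g4")): 94.0, frozenset(("g3", "g4")): 80.0,
}
quality = {"g1": 0.90, "g2": 0.99, "g3": 0.50, "g4": 0.80}
result = greedy_denovo_clusters(["g1", "g2", "g3", "g4"], quality,
                                lambda a, b: table[frozenset((a, b))])
print(result)  # {'g2': ['g2'], 'g4': ['g4', 'g1'], 'g3': ['g3']}
```

The example shows why the re-clustering pass matters: `g1` is removed from the candidate pool by the first representative `g2`, yet ends up in the cluster of `g4`, which is its closest representative.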
Suitability of GTDB representatives as type material. The 8,792
GTDB representatives for named species are generally of high quality
with 96.4% meeting the high-quality completeness (≥90%) and
contamination (≤5%) MIMAG criteria44 and 85.1% containing a
near-complete 16S ribosomal RNA sequence. The majority (58.4%)
of de novo GTDB representatives are also estimated to be ≥90%
complete with ≤5% contamination, which has been suggested as
suitable for defining sequence-based type material32. The number of
representatives suitable for use as type material increases to 73.5%
under the more lenient proposal of 80% completeness31. However, if
the presence of a near-complete 16S rRNA gene (≥1,200 base pairs)
is required as generally recommended31,32,44, the number of potential
type material representatives decreases substantially (37.0% at 90%
completeness; 39.8% at 80% completeness) owing to this gene often
being absent in MAGs3. A list of the 90,149 genomes in the GTDB
satisfying the high-quality MIMAG criteria, which includes 12,322
GTDB species representatives, is available on the GTDB website.
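The alternative proposals for sequence-based type material discussed above differ only in their thresholds, which a single parameterized check makes explicit. This is an illustrative sketch: the function and parameter names are hypothetical, and treating the bounds as inclusive is an assumption.

```python
def suitable_as_type_material(completeness, contamination, len_16s=0,
                              min_completeness=90.0, max_contamination=5.0,
                              require_16s=False, min_16s_len=1200):
    """Check a genome against completeness, contamination and optional 16S criteria.

    completeness, contamination : estimates in percent
    len_16s                     : length in base pairs of the longest 16S rRNA gene found
    """
    ok = completeness >= min_completeness and contamination <= max_contamination
    if require_16s:
        # A near-complete 16S rRNA gene is often absent from MAGs.
        ok = ok and len_16s >= min_16s_len
    return ok

# Hypothetical MAG: 85% complete, 2% contaminated, no 16S rRNA gene recovered.
print(suitable_as_type_material(85.0, 2.0))                         # False
print(suitable_as_type_material(85.0, 2.0, min_completeness=80.0))  # True
print(suitable_as_type_material(85.0, 2.0, min_completeness=80.0,
                                require_16s=True))                  # False
```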
Landscape of proposed species clusters. The majority of the 24,706
species clusters are composed of a single genome (65.3%) and only
919 clusters (3.7%) consist of ≥10 genomes (Fig. 4a). A substantial
number of clusters are composed exclusively of MAGs (39.8%;
9,839 species) with the majority of these being singletons (67.5%;
6,646 species). While ANI circumscription radii between 95% and
97% were used to retain published species names, 8,407 (95.6%) of
the 8,792 named species clusters have a radius of 95% (Fig. 3a and
Supplementary Table 8). Notably, 97.3% of the 4,660 species clusters
with ≥3 genomes form a clique at 95% ANI (that is, all pairwise
combinations of genomes have an ANI ≥ 95%) with all species
forming a clique at an ANI of 93.5% (Fig. 4b and Supplementary
Table 8). Interspecies ANI values between representatives indicate
[Figure 3 graphic omitted. Panels a to d show representative genomes with circumscription radii between 95% and 96.5% ANI; count labels in the graphic are 8,406 species, 386 species, 370 species and 461 genomes.]
Fig. 3 | Illustrative examples of circumscribing species for varying ANI values between species representatives. The ANI between genomes is depicted by
their Euclidean distance. Representative genomes are shown as circles with their circumscription radii depicted by larger circles. Genomes assigned to the
same species as a representative are shown by squares of the same color. a, Representative genomes with <95% ANI have a circumscription radius of 95%.
b, Representative genomes with an ANI between 95% and 97% have a circumscription radius equal to the ANI between the representatives, that is, 96% in
this example. c, Representative genomes with an ANI > 97% are considered synonyms and only the representative of the species with priority retained. In
this example, the ANI between the representative genomes is 98% and the species shown in orange has priority. d, Transitive situation illustrating a genome
(shown in red) that is not within the ANI circumscription radius of the closest representative (shown in orange; ANI radius of 96%) and is therefore not
assigned to any representative even if it is within the circumscription radius of another representative (shown in green; ANI radius of 95%).
that the majority of species within a genus are highly divergent
from each other with >96.2% of the 927,064 comparisons having
an ANI < 90% (Fig. 4c). In contrast, the ANI to the closest intrage-
nus species for the 19,898 species representatives within genera with
multiple species shows a fairly even distribution between 78% and
95% ANI (Fig. 4d).
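The clique property used in this section (every pair of genomes in a cluster meeting the ANI threshold) is straightforward to test. A minimal sketch, with hypothetical function names and ANI values:

```python
from itertools import combinations

def forms_clique(genomes, ani, threshold=95.0):
    """True if all pairwise combinations of genomes have ANI >= threshold."""
    return all(ani(a, b) >= threshold for a, b in combinations(genomes, 2))

def max_clique_threshold(genomes, ani):
    """Highest threshold at which the cluster is still a clique (minimum pairwise ANI)."""
    return min(ani(a, b) for a, b in combinations(genomes, 2))

# Hypothetical three-genome cluster.
pairwise = {
    frozenset(("a", "b")): 98.2,
    frozenset(("a", "c")): 96.0,
    frozenset(("b", "c")): 94.1,
}
lookup = lambda x, y: pairwise[frozenset((x, y))]
cluster = ["a", "b", "c"]
print(forms_clique(cluster, lookup, 95.0))    # False
print(forms_clique(cluster, lookup, 93.5))    # True
print(max_clique_threshold(cluster, lookup))  # 94.1
```

A cluster can fail the clique test at 95% even though every member is within 95% ANI of the representative, because nonrepresentative genomes are never compared with each other during assignment.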
The tight intraspecies ANI clustering combined with relatively
small interspecies ANI values suggest that the genome used to
represent a species cluster is not critical. This can be illustrated by
considering the consequences of using the medoid genome within
each of the 4,660 species clusters with ≥3 genomes as the representative,
analogously to the approach taken by the Microbial Genomes
Atlas26. For 1,574 clusters, the medoid genome is the same as the
representative genome proposed in this study. For the remaining
3,086 species clusters, we calculated the mean ANI from all genomes
in the cluster to the medoid and proposed representative genomes.
The difference between these mean ANI values was 0.35 ± 0.51%
on average with the 90th percentile being 0.96% and the maxi-
mum difference being 3.76% (Fig. 4e and Supplementary Table 9).
Reassigning genomes within species clusters with ≥3 genomes, with
the medoid genomes acting as the species representative, resulted
in 121,445 (99.6%) of 121,939 genomes being assigned to the same
species, with 362 (0.3%) assigned to a different species and 132
(0.1%) failing the criteria for species assignment.
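The medoid comparison can be illustrated with a short sketch. It assumes the medoid is the genome with the highest mean ANI to the other members of its cluster, which is the usual definition but is not spelled out in the text; function names and ANI values are hypothetical.

```python
def mean_ani_to(ref, genomes, ani):
    """Mean ANI from all other genomes in the cluster to the genome ref."""
    others = [g for g in genomes if g != ref]
    return sum(ani(ref, g) for g in others) / len(others)

def medoid(genomes, ani):
    """Genome with the highest mean ANI to the rest of the cluster."""
    return max(genomes, key=lambda g: mean_ani_to(g, genomes, ani))

# Hypothetical cluster in which 'z' was chosen as the representative.
ani_table = {
    frozenset(("x", "y")): 99.0,
    frozenset(("x", "z")): 97.0,
    frozenset(("y", "z")): 96.0,
}
get_ani = lambda a, b: ani_table[frozenset((a, b))]
members = ["x", "y", "z"]
m = medoid(members, get_ani)
print(m)  # x
print(mean_ani_to(m, members, get_ani) - mean_ani_to("z", members, get_ani))  # 1.5
```

The printed difference is the quantity summarized in Fig. 4e: how much closer, on average, cluster members are to the medoid than to the selected representative.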
We examined the uniqueness of species assignments by con-
sidering the number of species containing genomes within the
ANI circumscription radius of other species. We found that 456
(5.3%) of the 8,579 species containing ≥2 genomes (that is, at least
one genome other than the representative genome) had genomes
within the ANI radius of ≥2 species (0.90% ≥3 species; 0.23% ≥4
species). The 456 species are from both named (202) and de novo
(254) species clusters, with over half (278) of these species having
an ANI circumscription radius of 95%. While this indicates there
can be ambiguity in species assignments, the average difference in
ANI between the closest and second-closest representative genome
was relatively high at 2.0 ± 1.6 (Fig. 4f), demonstrating that assign-
ing genomes to the closest representative is robust. Notably, 26,129
(21.5%) of the 121,198 nonrepresentative genomes are within the
ANI circumscription radii of multiple species, with 21,915 (83.9%)
being from species in just four intensively classified clinically
important genera: Escherichia with 13,131 genomes, Salmonella
with 5,808 genomes, Listeria with 1,775 genomes and Neisseria with
1,201 genomes (Supplementary Table 10).
Evaluating robustness of proposed species clusters. We further
assessed the robustness of the proposed species clusters by forming
new clusters that did not take into account type material or genome
quality. The genome dataset was first simplified by randomly sub-
sampling GTDB species clusters comprising >10 genomes to 10
genomes to reduce computational requirements and more evenly
weight individual species. The resulting 49,902 genomes were orga-
nized into species clusters by randomly selecting genomes to act as
species representatives until all genomes were assigned to a cluster
as determined using the same clustering criteria as described for
denovo species clusters (see Methods). Five independent trials were
conducted to explore the impact of using randomly selected rep-
resentatives to form species clusters (Supplementary Table 11). In
all trials, randomly seeded clusters were highly congruent with the
proposed GTDB clusters, with 98.3 ± 0.05% of the 24,706 clusters
being identical and 99.4 ± 0.06% of the 49,902 genomes retaining
the same species assignment. There were 129 proposed species clus-
ters that were incongruent across all five random trials, with the
majority of these (85 species; 65.9%) having an ANI circumscription
radius >95% as a result of being from highly sampled genera such as
Neisseria (4 species), Paenibacillus (4 species), Streptococcus (4 spe-
cies) and Streptomyces (16 species; Supplementary Table 12).
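One simple way to quantify congruence between the proposed clusters and a randomly seeded trial is the fraction of reference clusters reproduced exactly. The sketch below is an assumption about how such a comparison could be made, not a description of the method used in the study; the function name and genome identifiers are hypothetical.

```python
def identical_cluster_fraction(reference, trial):
    """Fraction of reference clusters that appear with identical membership in trial.

    reference, trial : iterables of clusters, each an iterable of genome ids
    """
    ref = {frozenset(c) for c in reference}
    tri = {frozenset(c) for c in trial}
    return len(ref & tri) / len(ref)

# Hypothetical clusterings of five genomes.
proposed = [["g1", "g2"], ["g3"], ["g4", "g5"]]
random_trial = [["g2", "g1"], ["g3", "g4"], ["g5"]]
print(identical_cluster_fraction(proposed, proposed))  # 1.0
print(round(identical_cluster_fraction(proposed, random_trial), 3))  # 0.333
```

Membership is compared as unordered sets, so the choice of representative within a cluster does not affect the score.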
Monophyly of ANI-based species clusters. From a selection of
7,293 species (see Methods), 6,854 (94.0%) were recovered as mono-
phyletic in maximum-likelihood trees and 29,115 of 29,564 (98.5%)
genomes had a species assignment congruent with the topology of
the inferred trees (Supplementary Table 13). Of the 2,592 genomes
belonging to the 439 polyphyletic species, 2,143 (81.8%) had species
assignments congruent with the topology of the tree. This indicates
[Figure 4 graphic omitted. Panel a: species (%) versus cluster size (1 to 10). Panel b: species (%) versus clique ANI threshold (94 to 100). Panel c: comparisons (%) versus pairwise interspecies ANI (77 to 97). Panel d: species (%) versus closest interspecies ANI (77 to 97). Panel e: species (%) versus medoid minus representative ANI (0 to 3). Panel f: genomes (%) versus change in ANI (0 to 5).]
Fig. 4 | Key properties of GTDB species clusters circumscribed by ANI to a representative genome. a, Number of genomes within each of the 24,706
species clusters. b, Percentage of species forming a clique at varying ANI thresholds for the 4,660 species clusters with ≥3 genomes. c, ANI values
for 881,840 pairs of species representatives within the same genus. ANI values could not be calculated for 45,224 genome pairs due to insufficient
homologous regions between the genomes. d, ANI values between 19,466 species representatives and the closest representative within the same genus.
For 432 species the closest representative had insufficient sequence similarity for a reliable ANI value to be calculated. e, Difference between the mean
ANI to the medoid and the mean ANI to the selected representative for 4,660 species with ≥3 genomes. f, Difference in ANI between the closest and
second-closest representative genomes for the 26,129 genomes within the ANI circumscription radius of multiple species.
that polyphyly is the result of a small number of incongruent
genomes as evidenced by 263 (59.9%) species having only one incor-
rectly placed genome in the genus tree (Supplementary Table 13).
For comparison, across the same set of trees there were 2,894 spe-
cies defined by the NCBI taxonomy consisting of >2 genomes with
2,027 (70.0%) forming monophyletic groups. Of the 6,493 genomes
assigned to one of the 867 polyphyletic NCBI species, 5,594 (86.2%)
had species assignments congruent with the topology of the tree.
Comparison with NCBI species assignments. Of the 143,566
genomes with an NCBI taxonomic assignment, 79.6% had a species
assignment as a consequence of deeply sequenced species having
recognized species names within the NCBI taxonomy (for example
Staphylococcus aureus with 9,444 genome assemblies). Of these,
30.8% have a different species assignment under the proposed
GTDB species clusters (Fig. 5a). This is largely the result of reas-
signment of E. coli genomes (26.7% of changes) to E. flexneri and
E. dysenteriae along with changes to the generic name of a few deeply
sampled species such as Bacillus (7.5%), Campylobacter (6.4%) and
Shigella (5.1%; Supplementary Tables 14 and 15). Arguably, the
more relevant metric is only comparing species representatives to
remove the distorting effect of highly sampled species. Using this
criterion, less than half (10,323; 43%) of the 24,080 proposed species
representatives with an NCBI taxonomic assignment were classified
to the species level (Fig. 5b). Of the NCBI-classified representatives,
over a third (3,620; 35.1%) have a different species assignment in
the GTDB (Supplementary Table 16). The majority of these changes
(2,473; 68.3%) were due to modifying the generic name of the spe-
cies as a result of resolving polyphyletic genera or normalizing gen-
era by relative evolutionary divergence10. The ten most commonly
reassigned genus names account for >30% of the total differences
between the two taxonomies (Supplementary Table 17) and include
commonly recognized polyphyletic groups such as Pseudomonas45
(8.5%), Bacillus46 (5.5%) and Clostridium47 (3.3%).
Discussion
Type material is the cornerstone of modern bacterial and archaeal
nomenclature and forms the basis for taxonomic opinion17,27. Ideally,
genome-based taxonomy should use type material to provide refer-
ence points for both naming and classification. We combined this
concept with ANI to produce a fully articulated species classifica-
tion for publicly available bacterial and archaeal genomes. In so
doing, we redefine a bacterial species cluster to be every genome
within a fixed ANI distance (typically, 95%) to a designated type
strain, and build a taxonomy from that starting point.
Circumscribing species using ANI to representative genomes
provides a quantitative and convenient operational species defini-
tion. However, this definition is not always congruent with species
whose names are validly or effectively published, as evident from
370 species becoming synonyms under the proposed species clus-
ters (Supplementary Table 6). This is exemplified by the nine spe-
cies in Brucella being reclassified into a single species, B. melitensis.
While the genomic evidence supporting this reassignment has been
recognized for over three decades40, concerns have been raised in
regard to the challenges this change would cause clinicians and reg-
ulatory agencies39,48,49. We appreciate these concerns but have opted
for the proposed quantitative species definition as we believe it is of
the greatest use to the majority of the scientific community and best
reflects current opinions regarding the circumscription of species.
The term ‘variant’ or ‘subspecies’ will be incorporated into future
GTDB releases to accommodate preserving historical nomencla-
ture associations through reclassification of specific names as infra-
subspecific epithets41, for example, B. vulpis will be classified as
B. melitensis subsp. vulpis.
Arguably the most contentious reclassifications resulting from
the proposed species clusters are within the Escherichia/Shigella
genus. GTDB10 classifies Shigella species as belonging to the genus
Escherichia (Supplementary Table 15), and Escherichia sonnei and
Escherichia boydii are later heterotypic synonyms of E. flexneri
under the proposed species clusters (Supplementary Table 6).
Furthermore, circumscription of species based on ANI to the type
strains of E. coli, E. flexneri and E. dysenteriae resulted in 7,212
and 1,183 genomes classified as E. coli within the NCBI taxonomy
being reassigned to E. flexneri and E. dysenteriae, respectively
(Supplementary Table 14). This represents a reassignment of nearly
80% of the genomes classified as E. coli at NCBI. The scale of these
reassignments results in traditional properties of these species no
longer holding true, for example, E. dysenteriae and E. flexneri being
composed of human pathogenic strains50. Consequently, it may be
prudent to reassign E. flexneri and E. dysenteriae as synonyms of
E. coli (96.4% and 96.2% ANI, respectively) to avoid confusion and
to better reflect the high genomic similarity and evolutionary rela-
tionships of these species50,51.
Nearly all proposed species clusters (98.2% of species) have an
ANI circumscription radius of 95%. Consequently, we might expect
intraspecies ANI values to approach 90%, reflecting the ‘10% diam-
eter’ around representative genomes. In practice, all species clusters
form a clique at 93.5% ANI and the vast majority (97.6%) form a
clique at 95% ANI. This tight intraspecies clustering of genomes
may reflect evolutionary forces shaping speciation since it is trivial
to bioinformatically produce species clusters with a 10% diam-
eter. Tight intraspecies clustering has the practical benefit that the
selection of species representatives is not critical as demonstrated
by medoid and random representative testing (Supplementary
Table 11). Therefore, selection of type strains to represent species
clusters is the most pragmatic choice since they are directly tied to
nomenclature. The interspecies ANIs between closest representa-
tives within a genus were nearly evenly distributed between 78%
and 95% ANI (Fig. 4d). This result is in contrast to reports of a
genetic discontinuum between 83% and 95% ANI19. These apparently
contradictory results may reflect differences in how species were
defined, but are primarily the result of changing the perspective from
all pairwise comparisons in a large genomic dataset19 to the intragenus
ANI values between closest species representatives. This latter
perspective suggests a fairly smooth continuum of interspecies genomic
diversity within genera which, despite the results presented here, may
ultimately challenge efforts to unambiguously define species boundaries
using ANI as additional species are discovered52. In this event, other
means for defining species may ultimately be more appropriate, such as
recombination boundaries14.
Fig. 5 | Comparison of proposed species assignments with the NCBI
taxonomy. a,b, Results are shown for the 143,566 genomes (a) and 24,080
species representatives (b) with an NCBI taxonomic assignment. A genome
was categorized as unchanged if its binomial species name was identical in
the NCBI taxonomy, passively changed if the NCBI taxonomy did not have
a species assignment or actively changed if the proposed species name
differed from the NCBI classification. In the lower bars, active changes were
divided into those with a change in only the generic name of the species, in
only the specific name of the species or due to changes in both the generic
and specific names. The top three most commonly changed NCBI genera
and species are listed. In a, these are the genera Bacillus (7.5%),
Campylobacter (6.4%) and Shigella (5.1%), and the species Escherichia coli
(26.7%), Neisseria meningitidis (3.9%) and Shigella sonnei (3.5%). In b,
these are the genera Pseudomonas (8.5%), Bacillus (5.5%) and Lactobacillus
(4.4%), and the species Streptococcus mitis (1.0%), Streptococcus oralis
(0.9%) and Pseudomonas fluorescens (0.9%).
Incorporation of the proposed species clusters into the GTDB
requires that they be updated biannually with each GTDB release. An
emphasis will be placed on maintaining the selected representative
genome for each species cluster between GTDB releases so they can
serve as effective nomenclatural type material30–32. However, this must
be balanced with the desire to use high-quality type strain genomes as
representatives and the incorporation of changing taxonomic opin-
ion. It is also likely that the genome assemblies of some selected rep-
resentatives will ultimately be found to be in error, either in terms of
the assembly itself or the associated metadata (that is, incorrect spe-
cies or strain assignment). We anticipate that the selection of a single
representative genome acting as the nomenclatural type for each spe-
cies will place these genomes under heavier community scrutiny that
will help uncover such issues. Ultimately, these issues mean that some
changes to species clusters will occur with each update. Nonetheless,
organization of genome assemblies into species clusters defined by a
single representative provides a community resource that addresses
the computationally demanding requirement of many applications to
dereplicate the large number of available genome assemblies53. It also
provides a common set of species representatives on which to com-
pare alternative approaches to open problems such as the inference
of large-scale phylogenies54,55. This is particularly relevant to efforts
such as the GTDB that rely on large-scale reference trees to establish
the monophyly and stability of taxa10.
Our proposal for a quantitative species definition allows for the
scalable and automated assignment of genomes to species clus-
ters. We incorporated these species clusters into the GTDB and
GTDB Toolkit56, an open source tool for the taxonomic classifica-
tion of genome assemblies. These clusters encompass the nearly
150,000 public genomes within the NCBI Assembly database and
will be updated with each GTDB release. This provides a complete
genome-based taxonomy in which all genomes have an assignment
from domain to species and establishes representative genomes for
circumscribing species. We anticipate that the availability of quan-
titatively defined and regularly updated species clusters will greatly
improve the taxonomic resolution of microbial studies and conse-
quently the communication of scientific results.
Online content
Any methods, additional references, Nature Research reporting
summaries, source data, extended data, supplementary informa-
tion, acknowledgements, peer review information; details of author
contributions and competing interests; and statements of data and
code availability are available at https://doi.org/10.1038/s41587-
020-0501-8.
Received: 16 September 2019; Accepted: 26 March 2020;
Published: xx xx xxxx
References
1. Kyrpides, N. C. et al. Genomic Encyclopedia of Bacteria and Archaea:
sequencing a myriad of type strains. PLoS Biol. 12, e1001920 (2014).
2. Mukherjee, S. et al. 1,003 reference genomes of bacterial and archaeal
isolates expand coverage of the tree of life. Nat. Biotechnol. 35,
676–683 (2017).
3. Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes
substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).
4. Chen, I. A. et al. IMG/M v.5.0: an integrated data management and
comparative analysis system for microbial genomes and microbiomes.
Nucleic Acids Res. 47, D666–D677 (2019).
5. Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed
by over 150,000 genomes from metagenomes spanning age, geography, and
lifestyle. Cell 176, 649–662 (2019).
6. Konstantinidis, K. T. & Tiedje, J. M. Towards a genome-based taxonomy for
prokaryotes. J. Bacteriol. 187, 6258–6264 (2005).
7. Thompson, C. C. et al. Microbial taxonomy in the post-genomic era:
rebuilding from scratch? Arch. Microbiol. 197, 359–370 (2015).
8. Garrity, G. M. A new genomics-driven taxonomy of Bacteria and Archaea:
are we there yet? J. Clin. Microbiol. 54, 1956–1963 (2016).
9. Hugenholtz, P., Sharshewski, A. & Parks, D. H. in Microbial Evolution
(ed. Ochman, H.) 55–65 (Cold Spring Harbor Laboratory Press, 2016).
10. Parks, D. H. et al. A standardized bacterial taxonomy based on genome
phylogeny substantially revises the tree of life. Nat. Biotechnol. 36,
996–1004 (2018).
11. Cohan, F. M. What are bacterial species? Annu. Rev. Microbiol. 56,
457–487 (2002).
12. Konstantinidis, K. T., Ramette, A. & Tiedje, J. M. The bacterial species
definition in the genomic era. Phil. Trans. R. Soc. Lond. B Biol. Sci. 361,
1929–1940 (2006).
13. Fraser, C., Alm, E. J., Polz, M. F., Spratt, B. G. & Hanage, W. P. The bacterial
species challenge: making sense of genetic and ecological diversity. Science
323, 741–746 (2009).
14. Bobay, L. M. & Ochman, H. Biological species are universal across life’s
domains. Genome Biol. Evol. 9, 491–501 (2017).
15. Ciufo, S. et al. Using average nucleotide identity to improve taxonomic
assignments in prokaryotic genomes at the NCBI. Int. J. Syst. Evol. Microbiol.
68, 2386–2392 (2018).
16. Chun, J. et al. Proposed minimal standards for the use of genome data for the
taxonomy of prokaryotes. Int. J. Syst. Evol. Microbiol. 68, 461–466 (2018).
17. Whitman, W. B. Genome sequences as the type material for taxonomic
descriptions of prokaryotes. Syst. Appl. Microbiol. 38, 217–222 (2015).
18. Konstantinidis, K. T. & Tiedje, J. M. Genomic insights that advance the
species definition for prokaryotes. Proc. Natl Acad. Sci. USA 102,
2567–2572 (2005).
19. Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru,
S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear
species boundaries. Nat. Commun. 9, 5114 (2018).
20. Olm, M. R. et al. Consistent metagenome-derived metrics verify and define
bacterial species boundaries. mSystems 5, e00731-19 (2020).
21. Richter, M. & Rosselló-Móra, R. Shifting the genomic gold standard
for the prokaryotic species definition. Proc. Natl Acad. Sci. USA 106,
19126–19131 (2009).
22. Varghese, N. J. et al. Microbial species delineation using whole genome
sequences. Nucleic Acids Res. 43, 6761–6771 (2015).
23. Yoon, S. H., Ha, S. M., Lim, J., Kwon, S. & Chun, J. A large-scale evaluation
of algorithms to calculate average nucleotide identity. Antonie Van
Leeuwenhoek 110, 1281–1286 (2017).
24. Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation
using MinHash. Genome Biol. 17, 132 (2016).
25. Goris, J. et al. DNA–DNA hybridization values and their relationship to
whole-genome sequence similarities. Int. J. Syst. Evol. Microbiol. 57,
81–91 (2007).
26. Rodriguez-R, L. M. et al. The Microbial Genomes Atlas (MiGA) webserver:
taxonomic and gene diversity analysis of Archaea and Bacteria at the whole
genome level. Nucleic Acids Res. 46, W282–W288 (2018).
27. Parker, C. T., Tindall, B. J. & Garrity, G. M. International Code of
Nomenclature of Prokaryotes. Int. J. Syst. Evol. Microbiol. 69, S1–S111 (2019).
28. Federhen, S. et al. Meeting report: GenBank microbial genomic taxonomy
workshop (12–13 May, 2015). Stand. Genomic Sci. 11, 15 (2016).
29. Barco, R. A. et al. A genus definition for Bacteria and Archaea based on a
standard genome relatedness index. mBio 11, e02475-19 (2020).
30. Whitman, W. B. Modest proposals to expand the type material for naming of
prokaryotes. Int. J. Syst. Evol. Microbiol. 66, 2108–2112 (2016).
31. Konstantinidis, K. T., Rosselló-Móra, R. & Amann, R. Uncultivated microbes
in need of their own taxonomy. ISME J. 11, 2399–2406 (2017).
32. Chuvochina, M. et al. The importance of designating type material for
uncultured taxa. Syst. Appl. Microbiol. 42, 15–21 (2019).
33. Kitts, P. A. et al. Assembly: a resource for assembled genomes at NCBI.
Nucleic Acids Res. 44, D73–D80 (2016).
34. Parte, A. C. LPSN—List of Prokaryotic names with Standing in Nomenclature
(bacterio.net), 20 years on. Int. J. Syst. Evol. Microbiol. 68, 1825–1829 (2018).
35. Reimer, L. C. BacDive in 2019: bacterial phenotypic data for high-throughput
biodiversity analysis. Nucleic Acids Res. 47, D631–D636 (2019).
36. Verslyppe, B., De Smet, W., De Baets, B., De Vos, P. & Dawyndt, P. StrainInfo
introduces electronic passports for microorganisms. Syst. Appl. Microbiol. 37,
42–50 (2014).
37. Federhen, S. Type material in the NCBI Taxonomy Database. Nucleic Acids
Res. 43, D1086–D1098 (2015).
38. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current
status, taxonomic expansion, and functional annotation. Nucleic Acids Res.
44, D733–D745 (2016).
39. Ficht, T. Brucella taxonomy and evolution. Future Microbiol. 5, 859–866 (2010).
40. Verger, J. M., Grimont, F., Grimont, P. A. D. & Grayon, M. Brucella, a
monospecific genus as shown by deoxyribonucleic acid hybridization.
Int. J. Syst. Evol. Microbiol. 35, 292–295 (1985).
41. Riojas, M. A., McGough, K. J., Rider-Riojas, C. J., Rastogi, N. & Hazbón, M. H.
Phylogenomic analysis of the species of the Mycobacterium tuberculosis
complex demonstrates that Mycobacterium africanum, Mycobacterium bovis,
Mycobacterium caprae, Mycobacterium microti and Mycobacterium pinnipedii
are later heterotypic synonyms of Mycobacterium tuberculosis. Int. J. Syst.
Evol. Microbiol. 68, 324–332 (2018).
42. Liu, G. H. et al. Genome-based reclassification of Bacillus plakortidis Borchert
et al. 2007 and Bacillus lehensis Ghosh et al. 2007 as a later heterotypic
synonym of Bacillus oshimensis Yumoto et al. 2005; Bacillus rhizosphaerae
Madhaiyan et al. 2011 as a later heterotypic synonym of Bacillus clausii
Nielsen et al. 1995. Antonie Van Leeuwenhoek 112, 1725–1730 (2019).
43. Oren, A. Reclassification of Halomonas caseinilytica Wu et al. 2008 as a later
synonym of Halomonas sinaiensis—comments on the proposal by Hwang
et al., Antonie Van Leeuwenhoek 109:1345–1352, 2016. Antonie Van
Leeuwenhoek 110, 171 (2017).
44. Bowers, R. M. et al. Minimum information about a single amplified genome
(MISAG) and a metagenome-assembled genome (MIMAG) of Bacteria and
Archaea. Nat. Biotechnol. 35, 725–731 (2017).
45. Peix, A., Ramírez-Bahena, M. H. & Velázquez, E. Historical evolution and
current status of the taxonomy of genus Pseudomonas. Infect. Genet. Evol. 9,
1132–1147 (2009).
46. Bhandari, V., Ahmod, N. Z., Shah, H. N. & Gupta, R. S. Molecular signatures
for Bacillus species: demarcation of the Bacillus subtilis and Bacillus cereus
clades in molecular terms and proposal to limit the placement of new
species into the genus Bacillus. Int. J. Syst. Evol. Microbiol. 63,
2712–2726 (2013).
47. Beiko, R. G. Microbial malaise: how can we classify the microbiome?
Trends Microbiol. 23, 671–679 (2015).
48. Osterman, B. & Moriyon, I. International committee on systematics of
prokaryotes; subcommittee on the taxonomy of Brucella: minutes of the
meeting, 17 September 2003, Pamplona, Spain. Int. J. Syst. Evol. Microbiol. 56,
1173–1175 (2006).
49. Fenwick, A. J. & Carroll, K. C. Practical problems when incorporating rapidly
changing microbial taxonomy into clinical practice. Clin. Chem. Lab. Med. 57,
e238–e240 (2019).
50. Lan, R. & Reeves, P. R. Escherichia coli in disguise: molecular origins of
Shigella. Microbes Infect. 4, 1125–1132 (2002).
51. Pettengill, E. A., Pettengill, J. B. & Binet, R. Phylogenetic analyses
of Shigella and enteroinvasive Escherichia coli for the identification of
molecular epidemiological markers: whole-genome comparative
analysis does not support distinct genera designation. Front. Microbiol. 6,
1573 (2016).
52. Hanage, W. P. Fuzzy species revisited. BMC Biol. 11, 41 (2013).
53. Evans, J. T. & Denef, V. J. To dereplicate or not to dereplicate? Preprint at
bioRxiv https://doi.org/10.1101/848176 (2019).
54. Zhu, Q. et al. Phylogenomics of 10,575 genomes reveals evolutionary
proximity between domains Bacteria and Archaea. Nat. Commun. 10,
5477 (2019).
55. Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1,
16048 (2016).
56. Chaumeil, P. A., Mussig, A., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a
toolkit to classify genomes with the Genome Taxonomy Database.
Bioinformatics 36, 1925–1927 (2020).
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2020
Methods
Genome dataset. A dataset of 151,188 bacterial and 2,661 archaeal genomes was
obtained from the NCBI Assembly database33 on 16 July 2018, and supplemented
with 153 archaeal MAGs3 recovered from Sequence Read Archive57 metagenomes.
These genomes comprise the basis of GTDB R04-RS89. Genomes were flagged for
exclusion if they failed any of the following quality control criteria: (1) estimated
completeness < 50%, (2) estimated contamination > 10%, (3) completeness –
5 × contamination < 50, (4) composed of >1,000 contigs, (5) N50 (defined as
the minimum contig length needed to cover 50% of the genome) <5 kbp, (6)
contained >100,000 ambiguous bases or (7) contained <40% of the 120 bacterial
or 122 archaeal proteins used for phylogenetic inference. CheckM58 v.1.0.13 was
used to estimate genome quality and determine assembly statistics. This filtering
resulted in 8,185 genomes being flagged for removal, with 87 being retained
after manual inspection as they represent genomes of high nomenclatural or
taxonomic importance (Supplementary Table 18); for example, the isolate genome
Ktedonobacter racemifer representing the class Ktedonobacteria was retained
despite having a contamination estimate of 11%.
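The seven quality control criteria listed above can be expressed as a simple filter. The following Python sketch is illustrative only and is not the GTDB implementation: the GenomeStats record and its field names are hypothetical, and completeness, contamination and marker coverage are assumed to be given as percentages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenomeStats:
    """Hypothetical per-genome summary; field names are illustrative only."""
    completeness: float     # estimated completeness (%)
    contamination: float    # estimated contamination (%)
    contig_count: int       # number of contigs in the assembly
    n50_bp: int             # N50 in base pairs
    ambiguous_bases: int    # number of ambiguous bases
    marker_percent: float   # % of the 120 bacterial or 122 archaeal markers found


def failed_qc_criteria(g: GenomeStats) -> List[int]:
    """Return the numbers (1-7) of the quality control criteria a genome fails."""
    checks = [
        g.completeness < 50,                         # criterion 1
        g.contamination > 10,                        # criterion 2
        g.completeness - 5 * g.contamination < 50,   # criterion 3
        g.contig_count > 1000,                       # criterion 4
        g.n50_bp < 5000,                             # criterion 5 (5 kbp)
        g.ambiguous_bases > 100_000,                 # criterion 6
        g.marker_percent < 40,                       # criterion 7
    ]
    return [number for number, failed in enumerate(checks, start=1) if failed]
```

Under these criteria, any genome with a contamination estimate of 11% fails criterion 2 and also criterion 3, since 5 × 11 = 55 leaves a quality score below 50 even at 100% completeness. Such genomes are therefore only retained through manual inspection.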
Genome metadata and standing in nomenclature. The NCBI taxonomy37
associated with each genome assembly was obtained from the NCBI FTP site on
16 July 2018. NCBI Assembly summary files for Bacteria and Archaea (that is,
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/assembly_summary.txt)
were also downloaded on 16 July 2018 and used to establish whether a genome
was annotated as being assembled from type material at NCBI (relation_to_type_
material field), whether it was considered a representative or reference genome
at NCBI (refseq_category field), and whether the assembly is a MAG or SAG
(indicated by ‘derived from metagenome/environmental source’ or ‘derived from
single cell’ designations in the excluded_from_refseq field). NCBI strain identifiers
were obtained from the following fields: the infraspecific name and isolate fields
in the assembly report, the strain and isolate fields in the genomic GenBank flat
file, and the strain and isolate fields in the WGS master GenBank flat file. LPSN34
(consulted 16 April 2019), BacDive35 (consulted 16 April 2019) and StrainInfo36
(consulted 19 November 2018) were cross-referenced with the strain identifiers
at NCBI to establish which genomes were assembled from the type strain of a
species or type strain of a subspecies. A genome was considered assembled from
type material if any of its co-identical strain identifiers could be matched with a
co-identical strain identifier at LPSN, BacDive or StrainInfo. The NCBI metadata
and nomenclatural status of each genome are provided in the GTDB R04-RS89
metadata file available from the GTDB website (https://gtdb.ecogenomic.org).
Selecting representative genome for species clusters. A single representative
genome was selected for each of the 9,162 validly or effectively published species
names in the quality-controlled genome dataset (Supplementary Fig. 1). The majority
of species (5,942 species; 64.9%) had a single genome in the dataset and this genome
was selected as the representative of the species. Selection of representative genomes
for the remaining 3,220 species was performed by prioritizing genomes in the
following order: (1) assembled from the type strain of the species (2,632 species),
(2) annotated as being assembled from type material at NCBI (123 species), (3)
designated as a reference or representative genome at NCBI (220 species) or (4)
assembled from the type strain of a subspecies (8 species). Filtering by these metadata
categories resulted in a further 1,714 species being reduced to a single genome that
was selected as the representative of the species. Further selection criteria were used
to establish a representative genome for the remaining 1,506 species that contained
multiple genomes within a specific metadata category (for example, multiple genomes
from the type strain of the species). Of those, 196 species were resolved by selecting
the single genome annotated as being assembled from type material at NCBI. A
further 1,073 species were resolved by determining that all genomes formed a clique
at 99% ANI and selecting the highest-quality genome. For this purpose, genome
quality was defined with regard to multiple genome assembly statistics and metadata
fields indicative of assembly quality (Supplementary Table 19). From the remaining
237 species that had at least one genome pair with an ANI < 99%, 18 were resolved
by selecting the single genome annotated as a representative or reference genome
at NCBI. The final 219 species were resolved manually through investigation of the
literature, inspection of 16S rRNA sequence identity results and consideration of all
pairwise ANI results within a species.
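The first stage of this selection, narrowing the candidate genomes of a species by metadata category in priority order, can be sketched as follows. This is a minimal illustration under stated assumptions: the category labels and the narrow_by_metadata function are hypothetical names, and the later tie-breaking steps (99% ANI cliques, genome quality and manual curation) are not modeled.

```python
from typing import Dict, List, Set

# Hypothetical labels for the four metadata categories, in priority order.
PRIORITY = [
    "type_strain_of_species",
    "ncbi_type_material",
    "ncbi_reference_or_representative",
    "type_strain_of_subspecies",
]


def narrow_by_metadata(genomes: Dict[str, Set[str]]) -> List[str]:
    """Return the genomes in the highest-priority category that is present.

    `genomes` maps a genome identifier to the set of category labels that
    apply to it. If no genome falls in any category, all genomes remain
    candidates.
    """
    for category in PRIORITY:
        hits = sorted(g for g, labels in genomes.items() if category in labels)
        if hits:
            return hits
    return sorted(genomes)
```

A species is resolved at this stage when the returned list contains a single genome; otherwise the further selection criteria described above apply.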
Calculating ANI, AF and cliques. The ANI and AF between genomes were
calculated using FastANI19 v.1.1 with default parameters. FastANI requires at least
150 kb of homologous genome sequence between two genomes to make a reliable ANI
estimate. ANI and AF values are not symmetric and were defined as the maximum
of the two possible values. Using the maximum AF accommodates MAGs, SAGs and
lower-quality isolate genomes that may be incomplete or contaminated. Use of the
maximum ANI value is for convenience and is not a critical decision as ANI values
are nearly symmetrical in practice19. A clique is a complete graph in which each
vertex represents a genome and every vertex is linked to every other vertex. In this
study, links are formed between genomes at a specific ANI threshold such as 95%.
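The symmetrization and clique definitions can be illustrated with a short Python sketch. It assumes the directional ANI values have already been computed and stored in a dictionary keyed by (query, reference) pairs; that layout, the function names and the use of an inclusive threshold are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple

# Directional ANI results keyed by (query, reference); values are ANI (%).
DirectionalAni = Dict[Tuple[str, str], float]


def symmetric_ani(ani: DirectionalAni, a: str, b: str) -> float:
    """Maximum of the two directional values; a missing result counts as 0."""
    return max(ani.get((a, b), 0.0), ani.get((b, a), 0.0))


def is_clique(genomes: Iterable[str], ani: DirectionalAni, threshold: float) -> bool:
    """True if every pair of genomes is linked at the given ANI threshold."""
    members = list(genomes)
    return all(
        symmetric_ani(ani, a, b) >= threshold
        for a, b in combinations(members, 2)
    )
```

The same maximum-of-two-directions rule would apply to AF values stored in the same way.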
Circumscribing named species clusters. Clusters for validly or effectively
named species were established by (1) determining the species-specific ANI
circumscription radius for each representative genome, and (2) assigning
nonrepresentative genomes to species clusters based on these ANI thresholds.
The ANI circumscription radius for a representative genome was determined
by calculating the ANI to all other representatives and setting the ANI
circumscription radius as follows: (1) 95% if the closest representative had an
ANI < 95% or (2) to the ANI value of the closest representative if this was between
95% and 97%. Any representative with an ANI > 97% was considered a synonym
and disregarded for the purposes of establishing the ANI circumscription radius.
To reduce computational requirements, Mash24 v.2.1.1 was used to produce a
subset of representative genome pairs with a Mash distance of ≤0.1 for processing
by FastANI. The ANIs and AFs between nonrepresentative and representative
genomes were calculated using FastANI, again using Mash with a distance
threshold of 0.1 as a prefilter. A nonrepresentative genome was assigned to the
closest representative genome where the AF was >65% only if it was within the
ANI circumscription radius of the representative.
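The radius and assignment rules can be summarized in a short Python sketch. It is a reading of the text above rather than the GTDB code: the function names and input layout are hypothetical, and treating "within the radius" as ANI greater than or equal to the radius is an assumption.

```python
from typing import Dict, Iterable, Optional, Tuple


def circumscription_radius(
    ani_to_other_reps: Iterable[float],
    default_radius: float = 95.0,
    synonym_threshold: float = 97.0,
) -> float:
    """ANI radius for one representative, given its ANI to all other representatives."""
    # Representatives above the synonym threshold are disregarded.
    non_synonyms = [a for a in ani_to_other_reps if a <= synonym_threshold]
    closest = max(non_synonyms, default=0.0)
    # 95% unless the closest non-synonym representative is nearer than that.
    return max(default_radius, closest)


def assign_genome(
    hits: Dict[str, Tuple[float, float]],
    radii: Dict[str, float],
    min_af: float = 65.0,
) -> Optional[str]:
    """Assign one nonrepresentative genome, or return None if it stays unassigned.

    `hits` maps each representative to the (ANI, AF) of the genome against it.
    """
    qualifying = {rep: ani for rep, (ani, af) in hits.items() if af > min_af}
    if not qualifying:
        return None
    closest = max(qualifying, key=lambda rep: qualifying[rep])
    return closest if qualifying[closest] >= radii[closest] else None
```

For example, a representative whose nearest non-synonym neighbor is at 96.2% ANI receives a radius of 96.2%, so a genome at 95.5% ANI to it is left for de novo clustering.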
Establishing denovo species clusters. Genomes not assigned to a named species
cluster were formed into denovo species clusters using a greedy clustering
approach that favored the selection of high-quality representative genomes.
Genome quality was defined in the same manner used to resolve the selection of
representative genomes for validly or effectively published species with multiple
genome assemblies (Supplementary Table 19). Greedy clustering consists of four
steps: (1) sort genomes without a species assignment by genome quality, (2)
select the highest-quality genome to form a new species cluster, (3) determine
the species-specific ANI circumscription radius for the new species cluster and
(4) assign genomes without a species assignment to this species cluster using
the same criteria as for named species clusters. These steps are repeated until all
genomes have been assigned to a species. Determining the species-specific ANI
circumscription radius (step 3) requires calculating the ANIs to all existing species
representatives and setting the ANI radius to 95% if the closest representative had
an ANI < 95% or to the ANI value of the closest representative. The ANI to the
closest representative will always be ≤97% by definition and thus synonyms need
not be considered during the formation of de novo species clusters. This procedure
ensures that all genomes can be assigned to at least one species representative,
but does not guarantee that all genomes are assigned to the closest representative.
Consequently, all nonrepresentative genomes were reassigned to the closest
representative where the AF is >65% and where the genome is within the ANI
circumscription radius of the representative.
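The four greedy steps can be sketched in Python as below. This is a simplified illustration, not the GTDB Species Cluster Toolkit: the function name and arguments are hypothetical, pairwise ANI and AF are supplied as caller-provided functions, an inclusive radius is assumed, and the final reassignment of genomes to their closest representative is omitted.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pairwise = Callable[[str, str], float]


def greedy_de_novo_clusters(
    quality: Dict[str, float],
    ani: Pairwise,
    af: Pairwise,
    existing_reps: Iterable[str] = (),
    min_af: float = 65.0,
    default_radius: float = 95.0,
) -> Tuple[Dict[str, float], Dict[str, List[str]]]:
    """Return (radius per new representative, members per new representative)."""
    known_reps = list(existing_reps)
    # Step 1: sort genomes without a species assignment by genome quality.
    unassigned = sorted(quality, key=lambda g: quality[g], reverse=True)
    radii: Dict[str, float] = {}
    clusters: Dict[str, List[str]] = {}
    while unassigned:
        # Step 2: the highest-quality remaining genome seeds a new cluster.
        rep = unassigned.pop(0)
        # Step 3: radius is 95% or the ANI to the closest existing representative.
        closest = max((ani(rep, other) for other in known_reps), default=0.0)
        radius = max(default_radius, closest)
        # Step 4: pull in remaining genomes that satisfy the AF and ANI criteria.
        members = [
            g for g in unassigned
            if af(rep, g) > min_af and ani(rep, g) >= radius
        ]
        unassigned = [g for g in unassigned if g not in members]
        known_reps.append(rep)
        radii[rep] = radius
        clusters[rep] = [rep] + members
    return radii, clusters
```

Because each genome joins the first cluster it qualifies for, a genome can end up closer to a representative chosen later, which is why the reassignment pass described above is needed in practice.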
Naming denovo species clusters. Species clusters consisting of genomes without
a validly or effectively published name were assigned a placeholder name. The
generic name follows the genus name within the GTDB which is established as
described previously10. The specific name was derived from the NCBI accession
of the representative genome of the species cluster. Specifically, the numerical
portion of the accession is prefixed with ‘sp’. For example, GCF_000192635.1 is the
representative genome of a species cluster within the genus Agrobacterium so was
assigned the species name Agrobacterium sp000192635.
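The placeholder naming rule is easy to express in code. The sketch below assumes NCBI assembly accessions of the form GCA_ or GCF_ followed by nine digits and a version number; the function name is hypothetical.

```python
import re


def placeholder_species_name(genus: str, accession: str) -> str:
    """Build a placeholder species name from the genus and an assembly accession."""
    match = re.fullmatch(r"GC[AF]_(\d{9})\.\d+", accession)
    if match is None:
        raise ValueError(f"unexpected assembly accession: {accession!r}")
    # The numerical portion of the accession is prefixed with 'sp'.
    return f"{genus} sp{match.group(1)}"
```

Calling placeholder_species_name("Agrobacterium", "GCF_000192635.1") reproduces the example given in the text.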
Genus and species names with an alphabetic suffix indicate genera and
species that are polyphyletic or needed to be subdivided based on taxonomic
rank normalization according to the current GTDB reference tree. The lineage
containing the type strain retains the unsuffixed name and all other lineages are
given alphabetic suffixes, indicating that they are placeholder names that need to
be replaced in due course. For example, the proposed species clusters define both
Lactobacillus gasseri and Lactobacillus gasseri_A, the latter being a de novo GTDB
species cluster that does not contain the L. gasseri type strain and circumscribes
one or more genomes classified as L. gasseri in the NCBI taxonomy.
Properties of species clusters. Species clusters containing >100 genomes were
randomly sampled to 100 genomes. Interspecies properties were calculated
by limiting the calculation to species within the same genus and discarding
comparisons with an AF < 25%. The medoid genome of a species cluster is defined
as the genome maximizing the mean ANI to all genomes in the cluster and was
determined using a brute-force implementation of this definition. For species
clusters with two genomes, the medoid was taken as the selected representative
genome as this genome is of higher quality.
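A brute-force medoid search can be sketched as follows. A medoid is the member with the smallest average dissimilarity to the other members, which for an identity measure such as ANI corresponds to the largest mean ANI. The function name is hypothetical, ANI is supplied as a caller-provided function, and at least two genomes are assumed.

```python
from typing import Callable, Sequence


def medoid(genomes: Sequence[str], ani: Callable[[str, str], float]) -> str:
    """Return the genome with the highest mean ANI to the rest of the cluster."""
    if len(genomes) < 2:
        raise ValueError("a medoid search needs at least two genomes")

    def mean_ani(genome: str) -> float:
        others = [other for other in genomes if other != genome]
        return sum(ani(genome, other) for other in others) / len(others)

    # Brute force: evaluate every genome against every other genome.
    return max(genomes, key=mean_ani)
```

The search is quadratic in cluster size, which is consistent with sampling large clusters down to 100 genomes before computing cluster properties.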
Assessing robustness of species clusters using randomly selected representatives.
Species clusters consisting of >10 genomes were randomly sampled to 9 genomes
along with the proposed representative genome. Random species representatives
were selected in a manner similar to that described for the de novo species
clusters: (1) randomly select a genome without a species assignment, (2) set the
species-specific ANI circumscription radius to the radius of its proposed species
cluster and (3) assign genomes without a species assignment to this species cluster
using the same criteria as for named species clusters. These steps are repeated until
all genomes have been assigned to a species. All nonrepresentative genomes were
then reassigned to the closest representative where the AF is >65% and where
the genome is within the ANI circumscription radius of the representative. This
procedure was independently repeated five times. The proposed and random
species clusters were compared to determine (1) the number of clusters in perfect
congruence (that is, there exists a proposed and random species cluster composed
of the exact same set of genomes) and (2) the number of genomes that would
be assigned the same species name (that is, the number of genomes in common
between each proposed species cluster and the random species cluster containing
the representative genome of the proposed species cluster).
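The two comparison measures can be sketched in Python as below. The data layout is an assumption made for illustration: proposed clusters are keyed by their representative genome and each cluster, proposed or random, is a set of genome identifiers that includes its representative.

```python
from typing import Dict, List, Set


def perfectly_congruent_clusters(
    proposed: Dict[str, Set[str]], randomized: List[Set[str]]
) -> int:
    """Count proposed clusters that reappear with exactly the same genomes."""
    randomized_sets = {frozenset(cluster) for cluster in randomized}
    return sum(
        1 for members in proposed.values() if frozenset(members) in randomized_sets
    )


def genomes_with_same_name(
    proposed: Dict[str, Set[str]], randomized: List[Set[str]]
) -> int:
    """Count genomes that would keep the name of their proposed species."""
    total = 0
    for representative, members in proposed.items():
        for cluster in randomized:
            # The random cluster holding the proposed representative carries the name.
            if representative in cluster:
                total += len(members & cluster)
                break
    return total
```

Measure (2) is the more forgiving of the two, since a proposed cluster that is split in the random clustering still contributes the genomes that remain with its representative.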
Assessing monophyly of species clusters. The monophyly of the proposed species
clusters was assessed by inferring trees for each genus where polyphyly can arise
(that is, genera composed of ≥2 species with at least one species containing ≥2
genomes). Species clusters consisting of >10 genomes were randomly sampled
to 10 genomes. Trees were rooted by randomly selecting a genome from the
sister lineage to the genus as determined from the topology of the bacterial and
archaeal GTDB R04-RS89 reference trees. The GTDB-Tk56 v.0.3.2 ‘de novo’
workflow with default parameters was used to infer the trees. Briefly, genes were
called using Prodigal59 v.2.6.3 and 120 bacterial or 122 archaeal marker genes
identified and aligned using HMMER60 v.3.1b2. The resulting multiple sequence
alignment was trimmed to approximately 5,000 columns using the bacterial or
archaeal GTDB R04-RS89 mask. Trees were inferred with FastTree61 v.2.1.10 with
the WAG + GAMMA models and rooted on the selected outgroup. PhyloRank
v.0.0.37 (https://github.com/dparks1134/phyloRank) was used to decorate the tree
with the proposed species assignments and determine the F-measure, defined as
the harmonic mean of the precision and recall, for each species62. A genome was
considered to be congruent with the topology of the tree if it was contained in the
lineage with the highest F-measure for its corresponding species assignment.
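The F-measure criterion can be illustrated with a short Python sketch. It is an illustration of the definition given above rather than PhyloRank's implementation: it assumes the tree has already been reduced to a list of clades, each represented as the set of genome IDs beneath one node, and all function names and genome IDs are hypothetical.

```python
from typing import List, Set


def f_measure(clade: Set[str], species: Set[str]) -> float:
    """Harmonic mean of precision and recall of a clade for one species."""
    shared = len(clade & species)
    if shared == 0:
        return 0.0
    precision = shared / len(clade)   # fraction of the clade that is the species
    recall = shared / len(species)    # fraction of the species inside the clade
    return 2 * precision * recall / (precision + recall)


def best_clade(clades: List[Set[str]], species: Set[str]) -> Set[str]:
    """Clade with the highest F-measure for the species (first one wins ties)."""
    return max(clades, key=lambda clade: f_measure(clade, species))


def congruent_genomes(clades: List[Set[str]], species: Set[str]) -> Set[str]:
    """Genomes of the species that fall inside its highest-scoring clade."""
    return species & best_clade(clades, species)


# Toy tree ((a1,a2,a3),(a4,b1,b2)) written out as the leaf sets of its nodes.
clades_example = [
    {"a1"}, {"a2"}, {"a3"}, {"a4"}, {"b1"}, {"b2"},
    {"a1", "a2", "a3"},
    {"a4", "b1", "b2"},
    {"a1", "a2", "a3", "a4", "b1", "b2"},
]
species_a = {"a1", "a2", "a3", "a4"}
print(sorted(congruent_genomes(clades_example, species_a)))
```

In this toy tree the clade {a1, a2, a3} has precision 1 and recall 0.75, which gives a higher F-measure than the whole tree (precision 2/3, recall 1), so a4 would be counted as incongruent with the topology.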
Reporting Summary. Further information on research design is available in the
Nature Research Reporting Summary linked to this article.
Data availability
Genome metadata used to establish the proposed species clusters are available
on the GTDB website in the files ar122_metadata.tsv and bac120_metadata.tsv.
Metadata for the 24,706 GTDB species representatives are in the file sp_clusters.tsv.
Genomes in the GTDB satisfying the high-quality MIMAG criteria47 are indicated
in the file hq_mimag_genomes.tsv. Genome sequences are available from the NCBI
Assembly database, including the 153 archaeal MAGs in BioProject PRJNA593905.
Code availability
The methodology used to establish the GTDB species clusters is implemented
in version GTDB-R89 of the GTDB Species Cluster Toolkit
(https://github.com/Ecogenomics/gtdb-species-clusters), a Python program available under the GNU
General Public License v.3.0.
References
57. Leinonen, R., Sugawara, H. & Shumway, M. The Sequence Read Archive.
Nucleic Acids Res. 39, D19–D21 (2011).
58. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W.
CheckM: assessing the quality of microbial genomes recovered from isolates,
single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
59. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation
initiation site identification. BMC Bioinformatics 11, 119 (2010).
60. Eddy, S. R. Accelerated profile HMM searches. PLoS Comp. Biol. 7,
e1002195 (2011).
61. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: computing large minimum
evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26,
1641–1650 (2009).
62. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks
for ecological and evolutionary analyses of Bacteria and Archaea. ISME J. 6,
610–618 (2012).
Acknowledgements
We thank the members of an NSF-sponsored Microbial Taxonomy Workshop
(NSF no. 1841658) for helpful discussions relating to establishing species clusters. This
project was supported by an Australian Research Council Laureate Fellowship (grant no.
FL150100038) awarded to P.H. and an Australian Research Council Future Fellowship
(grant no. FT170100213) awarded to C.R.
Author contributions
D.H.P. and P.H. wrote the paper with constructive suggestions from all other authors.
D.H.P. designed the initial study. M.C., C.R. and P.H. provided nomenclatural advice and
manual curation of species representatives where necessary. D.H.P., P.-A.C. and A.J.M.
performed the bioinformatic analyses.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information is available for this paper at
https://doi.org/10.1038/s41587-020-0501-8.
Correspondence and requests for materials should be addressed to D.H.P.
Reprints and permissions information is available at www.nature.com/reprints.
nature research | reporting summary October 2018
Corresponding author(s): Donovan Parks
Last updated by author(s): Sep 24, 2019
Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.
Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested
A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient)
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Give P values as exact values whenever suitable.
For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.
Software and code
Policy information about availability of computer code
Data collection: Data was obtained from NCBI, LPSN, StrainInfo, BacDive, and GTDB. Data from NCBI was downloaded from their Assembly and Taxonomy
FTP sites on 2018 July 16. Data from BacDive was obtained using their PNU web service on 16 April 2019. Data from LPSN and StrainInfo
was obtained by consulting these web resources on 16 April 2019 and 19 November 2018, respectively. Data used from GTDB R04-RS89
is available from the download section of the GTDB website.
Data analysis: CheckM v1.0.13, FastANI v1.1, Mash v2.1.1, FastTree v2.1.10, GTDB-Tk v0.3.2, Prodigal v2.6.3, HMMER v3.1b, PhyloRank v0.0.37, GTDB
Species Cluster Toolkit version GTDB-R89
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers.
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.
Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability
The metadata associated with each genome and used to establish the GTDB species clusters is available on the GTDB website (http://gtdb.ecogenomic.org) in the
files ar122_metadata_r89.tsv and bac120_metadata_r89.tsv. The genomic files for genomes are available from the NCBI Assembly database, including the 153
archaeal MAGs which are under BioProject PRJNA593905.
Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf
Ecological, evolutionary & environmental sciences study design
All studies must disclose on these points even when the disclosure is negative.
Study description: Public genome assemblies were organized into species clusters based on the average nucleotide identity (ANI) to selected representative genomes.
Research sample: The study considers 152,288 bacterial and 2,661 archaeal genomes obtained from the NCBI Assembly database, along with an additional 153 archaeal MAGs recovered from Sequence Read Archive metagenomes, which have been submitted to NCBI. This represents all bacterial and archaeal genomes in the International Nucleotide Sequence Database Collaboration, the open access repository used to archive genomic information in our field.
Sampling strategy: No sampling was performed. All public bacterial and archaeal genomes in the NCBI Assembly database on 2018 July 16 were considered.
Data collection: Data was obtained from the NCBI Assembly FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/).
Timing and spatial scale: Genomes were downloaded on 2018 July 16.
Data exclusions: Genomes were quality filtered based on estimates of their completeness and contamination, and on assembly statistics including number of contigs and N50. These criteria were established before conducting the study.
Reproducibility: The code used to generate the species clusters is deterministic. The impact of the specific choices used to generate the species clusters was investigated using random sampling. All replicates support the assertion that the proposed species clusters are robust.
Randomization: Randomization in the context of groups for statistical tests is not relevant to this study, as all genomes in the NCBI Assembly database on 2018 July 16 were considered and no statistical tests were performed. The focus of this paper is a detailed explanation of organizing genomes into clusters using a deterministic procedure.
Blinding: Blinding is not relevant to this study, as all genomes in the NCBI Assembly database on 2018 July 16 were considered.
Did the study involve field work? Yes No
Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material,
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.
Materials & experimental systems
n/a Involved in the study
Antibodies
Eukaryotic cell lines
Palaeontology
Animals and other organisms
Human research participants
Clinical data
Methods
n/a Involved in the study
ChIP-seq
Flow cytometry
MRI-based neuroimaging
... In the essence of the biological species concept, reproductive isolation or, more generally speaking, independent evolution is considered to be virtually synonymous with the process of speciation 1 . For prokaryotes, the recombination rate declines with increased sequence divergence, and the number of documented species has been significantly enlarged by computing average nucleotide identity (ANI) in the scenario of alpha taxonomy in the past decades [2][3][4][5] . However, a ubiquitous biological species concept for prokaryotes has been questioned 4,5 , largely due to the fact that genetic differences between populations or species could be eroded by promiscuous lateral gene transfer events 4,6 . ...
... For prokaryotes, the recombination rate declines with increased sequence divergence, and the number of documented species has been significantly enlarged by computing average nucleotide identity (ANI) in the scenario of alpha taxonomy in the past decades [2][3][4][5] . However, a ubiquitous biological species concept for prokaryotes has been questioned 4,5 , largely due to the fact that genetic differences between populations or species could be eroded by promiscuous lateral gene transfer events 4,6 . This evolutionary dilemma regarding prokaryote species is tentatively solved by the split of pangenome into core genes shared by all relevant strains and accessory genes present in a subset of strains, which manage essential and nonessential cellular processes, respectively [7][8][9] . ...
Article
Most in silico evolutionary studies commonly assumed that core genes are essential for cellular function, while accessory genes are dispensable, particularly in nutrient‐rich environments. However, this assumption is seldom tested genetically within the pangenome context. In this study, we conducted a robust pangenomic Tn‐seq analysis of fitness genes in a nutrient‐rich medium for Sinorhizobium strains with a canonical open pangenome. To evaluate the robustness of fitness category assignment, Tn‐seq data for three independent mutant libraries per strain were analyzed by three methods, which indicates that the Hidden Markov Model (HMM)‐based method is most robust to variations between mutant libraries and not sensitive to data size, outperforming the Bayesian and Monte Carlo simulation‐based methods. Consequently, the HMM method was used to classify the fitness category. Fitness genes, categorized as essential (ES), advantage (GA), and disadvantage (GD) genes for growth, are enriched in core genes, while nonessential genes (NE) are over‐represented in accessory genes. Accessory ES/GA genes showed a lower fitness effect than core ES/GA genes. Connectivity degrees in the cofitness network decrease in the order of ES, GD, and GA/NE. In addition to accessory genes, 1599 out of 3284 core genes display differential essentiality across test strains. Within the pangenome core, both shared quasi‐essential (ES and GA) and strain‐dependent fitness genes are enriched in similar functional categories. Our analysis demonstrates a considerable fuzzy essential zone determined by cofitness connectivity degrees in Sinorhizobium pangenome and highlights the power of the cofitness network in understanding the genetic basis of ever‐increasing prokaryotic pangenome data.
... Specifically, ninety percent of the species identified by MetaKSSD (44 out of 49 species) have a counterpart identified by MetaPhlAn4 for the same phenotype that is from the same class, while vice versa, 96% (27 out of 28 species) also hold true (Fig. 3d), indicating that the results of the two methods support each other. This observation is aligned with the fact that the GTDB taxonomy [27][28][29][30] , used by MetaKSSD, is much more consistent at the class rank than at the species rank, compared to the NCBI taxonomy used by MetaPhlAn4. ...
Preprint
Metagenomic taxonomic profiling, which involves mapping microbiome sample reads or k-mers to a pre-built reference taxonomic marker database (MarkerDB) to estimate taxon composition and abundances, is a fundamental step for microbiome-related studies. The explosive growth of reference genomes and metagenomic data presents significant scalability and efficiency challenges for existing metagenomic profilers. Additionally, many microbiome researchers prefer online analysis platforms like Galaxy and One Codex due to unfamiliarity with command line tools, but uploading large-scale data can be frustrating, especially with low bandwidth. To meet these challenges, we introduce MetaKSSD, a highly scalable method for MarkerDB construction and metagenomic taxonomic profiling. MetaKSSD can profile approximately 10 GB of metagenome data within seconds while utilizing only 0.5 GB of memory and its MarkerDB encompasses 85,202 species using only 0.17 GB of storage. Extensive benchmarking demonstrated that MetaKSSD's accuracy was comparable to the state-of-the-art metagenomic profilers. In microbiome-phenotype association study, MetaKSSD identified more phenotype-associated bacterial species than the predominant profiler MetaPhlAn4. Utilizing MetaKSSD, we profiled 382,016 metagenomic runs from NCBI SRA and developed efficient methods for searching similar profiles among them. Notably, MetaKSSD's client-server architecture performs local sketching of metagenomic data to a compact sketch, allowing swift transmission to remote server for real-time online metagenomic analyses and profile searching, thus facilitating use by non-expert users.
... [26]. The latest GTDB release r202 16S rRNA database (trimmed to only retain the V3 hypervariable region), which consists of 254,090 bacterial and 4316 archaeal genomes organised into 45,555 bacterial and 2339 archaeal species clusters [27], was utilised to train the q2-feature-classifier [28] used for taxonomic assignment of the ASV. Both the ASV table and the taxonomic classification table were exported into the tab-separated values (tsv) format using QIIME2 tools, and then manually prepared to produce input that is compatible with MicrobiomeAnalyst [29]. ...
Article
A variety of bacterial communities can be found in heavy metal leachate-contaminated soil. These bacteria might have molecular defenses that allow them to endure a high heavy metal concentration. However, there aren't enough datasets on bacterial populations to be exploited for bioremediation, especially the biotreatment of leachate contaminated with heavy metals. This research examined the diversity of the bacterial population in heavy metal leachate-contaminated soil from the Jalan Lipis Sanitary Landfill in Pahang, Malaysia. The soil samples were taken from three different locations. pH analysis showed that the heavy metal leachate-contaminated soil has an alkaline pH. ICPMS analysis showed a high concentration of nickel (112.96 mg/kg), followed by manganese (89.83 mg/kg), arsenic (43.84 mg/kg), and lead (3.62 mg/kg). The three most prevalent bacteria in the heavy metal leachate-contaminated soil from the site were Pseudomonas C (Proteobacteria), Flavobacterium (Bacteroidota), and Proteiniclasticum (Firmicutes), according to metagenomic sequencing of the 16S rRNA gene. The alpha and beta bacterial diversity analyses indicate that locations with different heavy metal concentrations differ significantly in their bacterial diversity, providing valuable information to be applied in heavy metal bioremediation, particularly of landfill leachate.
... (129). Previously published data were used for this work (130). All other data are included in the manuscript and/or SI Appendix. ...
Article
Life harnessing light energy transformed the relationship between biology and Earth—bringing a massive flux of organic carbon and oxidants to Earth’s surface that gave way to today’s organotrophy- and respiration-dominated biosphere. However, our understanding of how life drove this transition has largely relied on the geological record; much remains unresolved due to the complexity and paucity of the genetic record tied to photosynthesis. Here, through holistic phylogenetic comparison of the bacterial domain and all photosynthetic machinery (totally spanning >10,000 genomes), we identify evolutionary congruence between three independent biological systems—bacteria, (bacterio)chlorophyll-mediated light metabolism (chlorophototrophy), and carbon fixation—and uncover their intertwined history. Our analyses uniformly mapped progenitors of extant light-metabolizing machinery (reaction centers, [bacterio]chlorophyll synthases, and magnesium-chelatases) and enzymes facilitating the Calvin–Benson–Bassham cycle (form I RuBisCO and phosphoribulokinase) to the same ancient Terrabacteria organism near the base of the bacterial domain. These phylogenies consistently showed that extant phototrophs ultimately derived light metabolism from this bacterium, the last phototroph common ancestor (LPCA). LPCA was a non-oxygen-generating (anoxygenic) phototroph that already possessed carbon fixation and two reaction centers, a type I analogous to extant forms and a primitive type II. Analyses also indicate chlorophototrophy originated before LPCA. We further reconstructed evolution of chlorophototrophs/chlorophototrophy post-LPCA, including vertical inheritance in Terrabacteria, the rise of oxygen-generating chlorophototrophy in one descendant branch near the Great Oxidation Event, and subsequent emergence of Cyanobacteria. These collectively unveil a detailed view of the coevolution of light metabolism and Bacteria having clear congruence with the geological record.
... Only high-quality MAGs (completeness-5*contamination ≥ 50%) were retained for downstream analysis. The recovered MAGs were dereplicated using dRep (v3.4.0) [47], and the MAGs with average nucleotide identity (ANI) ≥ 95% were considered to be the same species [48,49]. The Quant_bin module (default parameters) from MetaWRAP was used to calculate the relative abundance of non-redundant MAGs, and genome copies per million reads (GPMR) was used as the abundance unit. ...
Article
Background Aquaculture is an important food source worldwide. The extensive use of antibiotics in intensive large-scale farms has resulted in resistance development. Non-intensive aquaculture is another aquatic feeding model that is conducive to ecological protection and closely related to the natural environment. However, the transmission of resistomes in non-intensive aquaculture has not been well characterized. Moreover, the influence of aquaculture resistomes on human health needs to be further understood. Here, metagenomic approach was employed to identify the mobility of aquaculture resistomes and estimate the potential risks to human health. Results The results demonstrated that antibiotic resistance genes (ARGs) were widely present in non-intensive aquaculture systems and the multidrug type was most abundant accounting for 34%. ARGs of non-intensive aquaculture environments were mainly shaped by microbial communities accounting for 51%. Seventy-seven genera and 36 mobile genetic elements (MGEs) were significantly associated with 23 ARG types (p < 0.05) according to network analysis. Six ARGs were defined as core ARGs (top 3% most abundant with occurrence frequency > 80%) which occupied 40% of ARG abundance in fish gut samples. Seventy-one ARG-carrying contigs were identified and 75% of them carried MGEs simultaneously. The qacEdelta1 and sul1 formed a stable combination and were detected simultaneously in aquaculture environments and humans. Additionally, 475 high-quality metagenomic-assembled genomes (MAGs) were recovered and 81 MAGs carried ARGs. The multidrug and bacitracin resistance genes were the most abundant ARG types carried by MAGs. Strikingly, Fusobacterium_A (opportunistic human pathogen) carrying ARGs and MGEs were identified in both the aquaculture system and human guts, which indicated the potential risks of ARG transfer. Conclusions The mobility and pathogenicity of aquaculture resistomes were explored by a metagenomic approach. 
Given the observed co-occurrence of resistomes between the aquaculture environment and human, more stringent regulation of resistomes in non-intensive aquaculture systems may be required.
... In the DADA2 pipeline, forward and reverse sequences were cut to 159 and 191 bp, respectively, with a quality threshold of truncQ = 20, maxEE = (1,2), with cut lengths informed by calculations using the open-source software Figaro (Weinstein et al., 2019). For taxonomic assignment, a DADA2 formatted version of Genome taxonomy database (GTDB) release 207 was used as a reference rRNA database (Parks et al., 2018, 2020; Alishum, 2021). The R package phyloseq (v 1.40) (McMurdie and Holmes, 2013) was used for visualization of microbial community structure. ...
Article
Microbial inhibition by high ammonia concentrations is a recurring problem that significantly restricts methane formation from intermediate acids, i.e., propionate and acetate, during anaerobic digestion of protein-rich waste material. Studying the syntrophic communities that perform acid conversion is challenging, due to their relatively low abundance within the microbial communities typically found in biogas processes and disruption of their cooperative behavior in pure cultures. To overcome these limitations, this study examined growth parameters and microbial community dynamics of highly enriched mesophilic and ammonia-tolerant syntrophic propionate and acetate-oxidizing communities and analyzed their metabolic activity and cooperative behavior using metagenomic and metatranscriptomic approaches. Cultivation in batch set-up demonstrated biphasic utilization of propionate, wherein acetate accumulated and underwent oxidation before complete degradation of propionate. Three key species for syntrophic acid degradation were inferred from genomic sequence information and gene expression: a syntrophic propionate-oxidizing bacterium (SPOB) “Candidatus Syntrophopropionicum ammoniitolerans”, a syntrophic acetate-oxidizing bacterium (SAOB) Syntrophaceticus schinkii and a novel hydrogenotrophic methanogen, for which we propose the provisional name “Candidatus Methanoculleus ammoniitolerans”. The results revealed consistent transcriptional profiles of the SAOB and the methanogen both during propionate and acetate oxidation, regardless of the presence of an active propionate oxidizer. Gene expression indicated versatile capabilities of the two syntrophic bacteria, utilizing both molecular hydrogen and formate as an outlet for reducing equivalents formed during acid oxidation, while conserving energy through build-up of sodium/proton motive force. The methanogen used hydrogen and formate as electron sources. 
Furthermore, results of the present study provided a framework for future research into ammonia tolerance, mobility, aggregate formation and interspecies cooperation.
Article
The microbiome is a complex community of microorganisms, encompassing prokaryotic (bacterial and archaeal), eukaryotic, and viral entities. This microbial ensemble plays a pivotal role in influencing the health and productivity of diverse ecosystems while shaping the web of life. However, many software suites developed to study microbiomes analyze only the prokaryotic community and provide limited to no support for viruses and microeukaryotes. Previously, we introduced the Viral Eukaryotic Bacterial Archaeal (VEBA) open-source software suite to address this critical gap in microbiome research by extending genome-resolved analysis beyond prokaryotes to encompass the understudied realms of eukaryotes and viruses. Here we present VEBA 2.0 with key updates including a comprehensive clustered microeukaryotic protein database, rapid genome/protein-level clustering, bioprospecting, non-coding/organelle gene modeling, genome-resolved taxonomic/pathway profiling, long-read support, and containerization. We demonstrate VEBA’s versatile application through the analysis of diverse case studies including marine water, Siberian permafrost, and white-tailed deer lung tissues with the latter showcasing how to identify integrated viruses. VEBA represents a crucial advancement in microbiome research, offering a powerful and accessible software suite that bridges the gap between genomics and biotechnological solutions.
Preprint
The rhizosphere microbiome contributes to crop health in the face of disease pressures. Increased diversity and production of antimicrobial metabolites are characteristics of the microbiome that underpin microbial-mediated pathogen resistance. A goal of sustainable agriculture is to unravel the mechanisms by which crops assemble beneficial microbiomes, but precise understanding of the ability of the plant to manipulate intragenus microdiversity is unclear. Through an integrative approach combining culture-dependent methods and long-read amplicon sequencing, we demonstrate cultivar-dependent taxonomic and functional microdiversity of the rhizocompetent and bioactive Pseudomonas genus associated with Fusarium-resistant versus susceptible winter wheat cultivars. The resistant cultivar demonstrated increased Pseudomonas taxonomic but not biosynthetic diversity when compared to the susceptible cultivar, correlating with a thinner root diameter of the resistant cultivar. We found enrichment of antifungal Pseudomonas isolates, genes (chitinase), and biosynthetic gene clusters (pyoverdine) in the resistant cultivar. Overall, we highlight cultivar-dependent microdiversity of Pseudomonas taxonomy and functional potential in the rhizosphere, which may link to root morphology and play a role in crop susceptibility to disease.
Article
There is controversy about whether bacterial diversity is clustered into distinct species groups or exists as a continuum. To address this issue, we analyzed bacterial genome databases and reports from several previous large-scale environment studies and identified clear discrete groups of species-level bacterial diversity in all cases. Genetic analysis further revealed that quasi-sexual reproduction via horizontal gene transfer is likely a key evolutionary force that maintains bacterial species integrity. We next benchmarked over 100 metrics to distinguish these bacterial species from each other and identified several genes encoding ribosomal proteins with high species discrimination power. Overall, the results from this study provide best practices for bacterial species delineation based on genome content and insight into the nature of bacterial species population genetics.
Article
In recent decades, the taxonomy of Bacteria and Archaea, and therefore genus designation, has been largely based on the use of a single ribosomal gene, the 16S rRNA gene, as a taxonomic marker. We propose an approach to delineate genera that excludes the direct use of the 16S rRNA gene and focuses on a standard genome relatedness index, the average nucleotide identity. Our findings are of importance to the microbiology community because the emergent properties of Bacteria and Archaea that are identified in this study will help assign genera with higher taxonomic resolution.
Article
Rapid growth of genome data provides opportunities for updating microbial evolutionary relationships, but this is challenged by the discordant evolution of individual genes. Here we build a reference phylogeny of 10,575 evenly-sampled bacterial and archaeal genomes, based on a comprehensive set of 381 markers, using multiple strategies. Our trees indicate remarkably closer evolutionary proximity between Archaea and Bacteria than previous estimates that were limited to fewer “core” genes, such as the ribosomal proteins. The robustness of the results was tested with respect to several variables, including taxon and site sampling, amino acid substitution heterogeneity and saturation, non-vertical evolution, and the impact of exclusion of candidate phyla radiation (CPR) taxa. Our results provide an updated view of domain-level relationships.
Preprint
Our ability to reconstruct genomes from metagenomic datasets has rapidly evolved over the past decade, leading to publications presenting 1,000s, and even more than 100,000 metagenome-assembled genomes (MAGs) from 1,000s of samples. While this wealth of genomic data is critical to expand our understanding of microbial diversity, evolution, and ecology, various issues have been observed in some of these datasets that risk obfuscating scientific inquiry. In this perspective we focus on the issue of identical or highly similar genomes assembled from independent datasets. While obtaining multiple genomic representatives for a species is highly valuable, multiple copies of the same or highly similar genomes complicates downstream analysis. We analyzed data from recent studies to show the levels of redundancy within these datasets, the highly variable performance of commonly used dereplication tools, and to point to existing approaches to account and leverage repeated sampling of the same/similar populations.
Article
The GTDB Toolkit (GTDB-Tk) provides objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB). GTDB-Tk is computationally efficient and able to classify thousands of draft genomes in parallel. Here we demonstrate the accuracy of the GTDB-Tk taxonomic assignments by evaluating its performance on a phylogenetically diverse set of 10,156 bacterial and archaeal metagenome-assembled genomes. Availability: GTDB-Tk is implemented in Python and licensed under the GNU General Public License v3.0. Source code and documentation are available at: https://github.com/ecogenomics/gtdbtk. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
In the present study, phylogenetic and genome-based comparisons were carried out to clarify the taxonomic positions of the alkaliphilic Bacillus species Bacillus plakortidis, Bacillus lehensis, Bacillus oshimensis, Bacillus rhizosphaerae and Bacillus clausii. Phylogenetic trees based on 16S rRNA gene sequences and concatenated protein marker genes were constructed. Average nucleotide identity (ANI) values were calculated to compare genetic relatedness. In the phylogenetic trees, B. plakortidis DSM 19153T, B. lehensis DSM 19099T and B. oshimensis DSM 18940T clustered together, as did B. rhizosphaerae DSM 21911T and B. clausii DSM 8716T. The ANI values between B. oshimensis DSM 18940T, B. plakortidis DSM 19153T and B. lehensis DSM 19099T ranged from 98.7% to 98.8%, while the ANI values between B. rhizosphaerae DSM 21911T and B. clausii DSM 8716T were 95.2–95.5%. These ANI values are higher than the recognized threshold for bacterial species delineation. Based on the phylogenetic and genome comparisons, we propose the reclassification of B. plakortidis and B. lehensis as later heterotypic synonyms of B. oshimensis, and of B. rhizosphaerae as a later heterotypic synonym of B. clausii.
Article
The body-wide human microbiome plays a role in health, but its full diversity remains uncharacterized, particularly outside of the gut and in international populations. We leveraged 9,428 metagenomes to reconstruct 154,723 microbial genomes (45% of high quality) spanning body sites, ages, countries, and lifestyles. We recapitulated 4,930 species-level genome bins (SGBs), 77% without genomes in public repositories (unknown SGBs [uSGBs]). uSGBs are prevalent (in 93% of well-assembled samples), expand underrepresented phyla, and are enriched in non-Westernized populations (40% of the total SGBs). We annotated 2.85 M genes in SGBs, many associated with conditions including infant development (94,000) or Westernization (106,000). SGBs and uSGBs permit deeper microbiome analyses and increase the average mappability of metagenomic reads from 67.76% to 87.51% in the gut (median 94.26%) and 65.14% to 82.34% in the mouth. We thus identify thousands of microbial genomes from yet-to-be-named species, expand the pangenomes of human-associated microbes, and allow better exploitation of metagenomic technologies.
Article
A fundamental question in microbiology is whether there is a continuum of genetic diversity among genomes, or whether clear species boundaries prevail instead. Whole-genome similarity metrics such as Average Nucleotide Identity (ANI) help address this question by facilitating high-resolution taxonomic analysis of thousands of genomes from diverse phylogenetic lineages. To scale to available genomes and beyond, we present FastANI, a new method to estimate ANI using alignment-free approximate sequence mapping. FastANI is accurate for both finished and draft genomes, and is up to three orders of magnitude faster than alignment-based approaches. We leverage FastANI to compute pairwise ANI values among all prokaryotic genomes available in the NCBI database. Our results reveal clear genetic discontinuity, with 99.8% of the total 8 billion genome pairs analyzed conforming to >95% intra-species and <83% inter-species ANI values. This discontinuity is manifested with or without the most frequently sequenced species, and is robust to historic additions in the genome databases.
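The discontinuity described above can be illustrated with a minimal Python sketch that bins pairwise ANI values using the two cut-offs quoted in the abstract (>95% and <83%). The function names, the "intermediate" label, and the example values are hypothetical and chosen only for illustration; this is not the FastANI implementation and it does not compute ANI itself.

```python
# Illustrative sketch only: bins precomputed ANI percentages using the
# >95% (intra-species) and <83% (inter-species) cut-offs quoted above.
# Function names and the "intermediate" label are assumptions.

def classify_ani_pair(ani_percent, intra_threshold=95.0, inter_threshold=83.0):
    """Return a coarse label for one genome pair given its ANI (in percent)."""
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError("ANI must be a percentage between 0 and 100")
    if ani_percent > intra_threshold:
        return "intra-species"
    if ani_percent < inter_threshold:
        return "inter-species"
    # Values between the two cut-offs fall in the sparsely populated gap.
    return "intermediate"


def fraction_conforming(ani_values):
    """Fraction of pairs that fall outside the 83-95% gap."""
    if not ani_values:
        raise ValueError("at least one ANI value is required")
    labels = [classify_ani_pair(v) for v in ani_values]
    conforming = sum(1 for label in labels if label != "intermediate")
    return conforming / len(labels)


# Hypothetical ANI values for four genome pairs.
example_pairs = [98.7, 96.1, 80.0, 90.0]
labels = [classify_ani_pair(v) for v in example_pairs]
share = fraction_conforming(example_pairs)
```

With these made-up inputs, three of the four pairs fall outside the 83–95% gap. Applied to real pairwise ANI output, the same binning would give the kind of conformity fraction the abstract reports, under the assumption that the two cut-offs are applied as strict inequalities.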
Article
The Integrated Microbial Genomes & Microbiomes system v.5.0 (IMG/M: https://img.jgi.doe.gov/m/) contains annotated datasets categorized into: archaea, bacteria, eukarya, plasmids, viruses, genome fragments, metagenomes, cell enrichments, single particle sorts, and metatranscriptomes. Source datasets include those generated by the DOE's Joint Genome Institute (JGI), submitted by external scientists, or collected from public sequence data archives such as NCBI. All submissions are typically processed through the IMG annotation pipeline and then loaded into the IMG data warehouse. IMG's web user interface provides a variety of analytical and visualization tools for comparative analysis of isolate genomes and metagenomes in IMG. IMG/M allows open access to all public genomes in the IMG data warehouse, while its expert review (ER) system (IMG/MER: https://img.jgi.doe.gov/mer/) allows registered users to access their private genomes and to store their private datasets in workspace for sharing and for further analysis. IMG/M data content has grown by 60% since the last report published in the 2017 NAR Database Issue. IMG/M v.5.0 has a new and more powerful genome search feature, new statistical tools, and supports metagenome binning.