Classification and genetic features of CRISPR-targeted TR sequences. (A) Length distribution of TR sequences. We used the HK97 capsid and portal proteins as tailed-phage signature genes. The dotted line at 20 kb represents an arbitrary cut-off between small and large sequences. Sequences longer than 100 kb are shown in the inset. (B) Results of the classification of TR sequences. Sequences encoding a detectable capsid gene were classified to a viral taxon according to capsid type, as follows. Caudovirales: HK97 fold capsid; Inoviridae: Inoviridae MCP; and Microviridae: Microviridae MCP. The capsid-less TR sequences with ParA, ParB, ParM, and/or MoBM were classified as Plasmid-like. The remaining sequences were labeled as "Unclassified." (C) Distribution of singleton coverage and coding ratio. Selected k values were higher in large TR sequences, to avoid doubletons by chance. https://doi.org/10.1371/journal.pcbi.1009428.g001
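
The classification rule in panel (B) and the 20 kb size cut-off can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the gene labels, the function names, the precedence among capsid types when more than one is detected, and the treatment of a sequence of exactly 20 kb as "large" are all hypothetical choices that the caption does not specify.

```python
from typing import Set

# Arbitrary cut-off between small and large sequences (from the caption).
SMALL_LARGE_CUTOFF_BP = 20_000

# Capsid type -> viral taxon, as listed in panel (B).
# The string labels are hypothetical; dict order sets an assumed precedence.
CAPSID_TO_TAXON = {
    "HK97_fold_capsid": "Caudovirales",
    "Inoviridae_MCP": "Inoviridae",
    "Microviridae_MCP": "Microviridae",
}

# Marker genes used for capsid-less sequences (labels as in the caption).
PLASMID_MARKERS = {"ParA", "ParB", "ParM", "MoBM"}


def classify_tr_sequence(gene_hits: Set[str]) -> str:
    """Assign a class from the set of genes detected on one TR sequence."""
    # A detectable capsid gene takes priority over plasmid markers.
    for capsid, taxon in CAPSID_TO_TAXON.items():
        if capsid in gene_hits:
            return taxon
    # Capsid-less sequences carrying any partition/mobilization marker.
    if gene_hits & PLASMID_MARKERS:
        return "Plasmid-like"
    return "Unclassified"


def size_class(length_bp: int) -> str:
    """Label a sequence small or large (exactly 20 kb is assumed large)."""
    return "small" if length_bp < SMALL_LARGE_CUTOFF_BP else "large"
```

For example, `classify_tr_sequence({"ParA", "ParB"})` yields "Plasmid-like", whereas adding "HK97_fold_capsid" to the same set yields "Caudovirales", because capsid detection is checked first.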

Source publication
Article
Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse, and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either referenc...

Contexts in source publication

Context 1
... evaluation of TR sequence length revealed a multimodal distribution with a distinct trough at 20 kb (Fig 1A). For reference, we termed the 8837 TR sequences that were shorter than 20 kb as "small" and the 2554 TR sequences that were longer than 20 kb as "large." ...
Context 2
... the large TR sequences, 2047 (80.1%) encoded HK97 fold capsid proteins, a definitive gene of Duplodnaviria [29]. Phage portal proteins were encoded by 2163 large TR sequences (84.7%), indicating that most large TR sequences are from Caudovirales, also known as tailed phages (Fig 1B). Among the small TR sequences, 766 (8.7%) encoded Microviridae major capsid proteins (MCPs) [30], and 56 (0.6%) encoded Inoviridae major coat proteins. ...
Context 3
... the small TR sequences, 766 (8.7%) encoded Microviridae major capsid proteins (MCPs) [30], and 56 (0.6%) encoded Inoviridae major coat proteins. We propose that this portion of small TR sequences likely represents viruses with a non-tailed morphology (Fig 1B) [31,32]. Finally, 107 (1.2%) small TR sequences encoded HK97 fold capsid proteins. ...
Context 4
... scrutinize other genomic features, such as repeats and noncoding regions, the k-mer singleton coverage and coding ratio for each classified and unclassified TR sequence were investigated (Fig 1C). Singleton coverage is the number of k-mer singletons from a given contig divided by its length; the value approaches 1 if the sequence does not contain repeats. ...
Context 5
... analysis intended to discover a viral genome that could not be discovered using the conventional homology-based method. Although most of the discovered genomes with detectable capsid genes were previously recognized viral lineages, substantial portions of particularly small TR sequences remained unclassified (Fig 1B). The coding ratio of these unclassified sequences exhibited a broad distribution, and some were exceptionally low; thus, we speculated that these sequences might have unknown genetic features that differ from the conventional protein-coding genes. ...

Similar publications

Article
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 ne...

Citations

... We collected viral sequences, including bacteriophages, from the NCBI viral RefSeq (released September 2020) [17,18] and the Millard labs PHAge REference Database (INPHARED; July 2019) [19]. We also utilized gut viral sequences from the human Gut Virome Database (GVD) [20] and an in-house data set of CRISPR targeted viral sequences constructed from human gut metagenomic data [21]. All pre-processed reads were aligned to these viral reference genomes using the BWA. ...
... We utilized the genomes of 19 host bacterial species of the viruses detected in the previous study [21]. Genomic assembly data of Acanthamoeba castellanii (GenBank assembly accession was GCA_000313135.1) was downloaded from the NCBI database. ...
... We attempted to detect both gut phages and their bacterial host genomes, in order to examine their cohabitation. Based on our previous works, we detected 11,391 terminally redundant sequences targeted by host clustered regularly interspaced short palindromic repeats (CRISPR) immunological memory [21]. As prokaryotic cells can memorize previously infected phage sequences in their CRISPR system, we were able to determine the candidate hosts, based on the connection between the CRISPR-targeted sequences, or protospacers, and the associated CRISPR direct repeats. ...
Article
Coprolites contain various kinds of ancient DNAs derived from gut micro-organisms, viruses, and foods, which can help to determine the gut environment of ancient peoples. Their genomic information should be helpful in elucidating the interaction between hosts and microbes for thousands of years, as well as characterizing the dietary behaviors of ancient people. We performed shotgun metagenomic sequencing on four coprolites excavated from the Torihama shell-mound site in the Japanese archipelago. The coprolites were found in the layers of the Early Jomon period, corresponding stratigraphically to 7000 to 5500 years ago. After shotgun sequencing, we found that a significant number of reads showed homology with known gut microbe, virus, and food genomes typically found in the feces of modern humans. We detected reads derived from several types of phages and their host bacteria simultaneously, suggesting the coexistence of viruses and their hosts. The food genomes provide biological evidence for the dietary behavior of the Jomon people, consistent with previous archaeological findings. These results indicate that ancient genomic analysis of coprolites is useful for understanding the gut environment and lifestyle of ancient peoples.
... Using this approach, Bacteroides vulgatus from human gut samples was found to be significantly associated with the p-crAssphage and was therefore predicted to be its host. This result is consistent with other studies that propose the host of the crAssphage to be from the Bacteroides genus [63,64]. ...
... CRISPR spacers analysis resolved the crAss-like phage hosts at the order level (Bacteroidales), while more precise host identification by this method is problematic due to horizontal gene transfer leading to distant bacterial species sharing the same CRISPR direct repeats, which hinders taxonomic assignment of CRISPR regions themselves [64]. The correlation between bacterial and viral relative abundance pointed towards Bacteroides vulgatus and Ruminococcus spp. as possible hosts for crAss-like phages [65]. ...
... The coevolution of phages and their hosts leaves specific signatures in the genomes of both. The signals for host-phage identification could be abundance profiles, codon usage similarity, genetic similarity, and CRISPR spacer matches [36,64]. Examples of the tools are HoPhage, HostFinder, PHIAF, and VirHostMatcher. ...
Article
The order Crassvirales comprises dsDNA bacteriophages infecting bacteria in the phylum Bacteroidetes that are found in a variety of environments but are especially prevalent in the mammalian gut. This review summarises available information on the genomics, diversity, taxonomy, and ecology of this largely uncultured viral taxon. With experimental data available from a handful of cultured representatives, the review highlights key properties of virion morphology, infection, gene expression and replication processes, and phage-host dynamics.
... In some cases, these characteristics may also suggest the likely host range [82][83][84]. Additional indications of the potential host range for these viruses can also be derived from their co-occurrence with specific groups of potential host organisms, as well as matches to CRISPR spacers in the case of viruses infecting bacteria and archaea [85][86][87][88]. As a follow-up to metagenomics, the properties of individual proteins or whole viruses can be experimentally determined through reverse genetics and characterization in vitro and in vivo, when possible (e.g., [89]). ...
Article
A universal taxonomy of viruses is essential for a comprehensive view of the virus world and for communicating the complicated evolutionary relationships among viruses. However, there are major differences in the conceptualisation and approaches to virus classification and nomenclature among virologists, clinicians, agronomists, and other interested parties. Here, we provide recommendations to guide the construction of a coherent and comprehensive virus taxonomy, based on expert scientific consensus. Firstly, assignments of viruses should be congruent with the best attainable reconstruction of their evolutionary histories, i.e., taxa should be monophyletic. This fundamental principle for classification of viruses is currently included in the International Committee on Taxonomy of Viruses (ICTV) code only for the rank of species. Secondly, phenotypic and ecological properties of viruses may inform, but not override, evolutionary relatedness in the placement of ranks. Thirdly, alternative classifications that consider phenotypic attributes, such as being vector-borne (e.g., "arboviruses"), infecting a certain type of host (e.g., "mycoviruses," "bacteriophages") or displaying specific pathogenicity (e.g., "human immunodeficiency viruses"), may serve important clinical and regulatory purposes but often create polyphyletic categories that do not reflect evolutionary relationships. Nevertheless, such classifications ought to be maintained if they serve the needs of specific communities or play a practical clinical or regulatory role. However, they should not be considered or called taxonomies. Finally, while an evolution-based framework enables viruses discovered by metagenomics to be incorporated into the ICTV taxonomy, there are essential requirements for quality control of the sequence data used for these assignments. Combined, these four principles will enable future development and expansion of virus taxonomy as the true evolutionary diversity of viruses becomes apparent.
... This protocol could extract numerous characterized/uncharacterized MGEs. For complete details on the use and execution of this protocol, please refer to Sugimoto et al. (2021). ...
... Dataset from the previous study (Sugimoto et al., 2021) ...
... Conversely, ssDNA viruses, plasmid-like elements, and many unclassified sequences were common among the small (<20 kb) sequences. This result was obtained from our previous study (Sugimoto et al., 2021). ...
Article
Homology-based search is commonly used to uncover mobile genetic elements (MGEs) from metagenomes, but it heavily relies on reference genomes in the database. Here we introduce a protocol to extract CRISPR-targeted sequences from the assembled human gut metagenomic sequences without using a reference database. We describe the assembling of metagenome contigs, the extraction of CRISPR direct repeats and spacers, the discovery of protospacers, and the extraction of protospacer-enriched regions using the graph-based approach. This protocol could extract numerous characterized/uncharacterized MGEs. For complete details on the use and execution of this protocol, please refer to Sugimoto et al. (2021).
... Another method is using frequencies of nucleic acids or kmer-based machine-learning methods with known viral sequences, such as VirFinder [201]. The clustered regularly interspaced short palindromic repeats (CRISPR) system and prokaryotic adaptive immunological memory are also employed as nonreference-based approaches [202,203]. Bacteria can memorize the partial genomes of previously infected phages, and there are almost identical sequences between bacterial CRISPR spacers and phage protospacers [204,205]. Therefore, we can identify viral sequences utilizing bacterial CRISPR spacer sequences. ...
Article
The COVID-19 outbreak has reminded us of the importance of viral evolutionary studies as regards comprehending complex viral evolution and preventing future pandemics. A unique approach to understanding viral evolution is the use of ancient viral genomes. Ancient viruses are detectable in various archaeological remains, including ancient people’s skeletons and mummified tissues. Those specimens have preserved ancient viral DNA and RNA, which have been vigorously analyzed in the last few decades thanks to the development of sequencing technologies. Reconstructed ancient pathogenic viral genomes have been utilized to estimate the past pandemics of pathogenic viruses within the ancient human population and long-term evolutionary events. Recent studies revealed the existence of non-pathogenic viral genomes in ancient people’s bodies. These ancient non-pathogenic viruses might be informative for inferring their relationships with ancient people’s diets and lifestyles. Here, we reviewed the past and ongoing studies on ancient pathogenic and non-pathogenic viruses and the usage of ancient viral genomes to understand their long-term viral evolution.
... Next, we tried to infer whether these putative phages were predicted to be active or not by comparing them with databases of CRISPR spacers derived from human gut metagenomes. To this end, we used a publicly available collection of spacers extracted from 11 817 human gut metagenome datasets (64) and an in-house spacer database that was built by running CRISPRCasFinder (65) on the Integrative Human Microbiome Project - Inflammatory Bowel Disease metagenomic dataset (66) (Methods). By doing this, we found that 1346 out of 1447 dereplicated viral genomes harboring UG27 (93.02%) were targeted by at least 1 spacer, with 1197 (82%) being targeted by five or more spacers, suggesting a recent active role of these viral genomes in their natural environment. ...
Article
Reverse transcriptases (RTs) are enzymes capable of synthesizing DNA using RNA as a template. Within the last few years, a burst of research has led to the discovery of novel prokaryotic RTs with diverse antiviral properties, such as DRTs (Defense-associated RTs), which belong to the so-called group of unknown RTs (UG) and are closely related to the Abortive Infection system (Abi) RTs. In this work, we performed a systematic analysis of UG and Abi RTs, increasing the number of UG/Abi members up to 42 highly diverse groups, most of which are predicted to be functionally associated with other gene(s) or domain(s). Based on this information, we classified these systems into three major classes. In addition, we reveal that most of these groups are associated with defense functions and/or mobile genetic elements, and demonstrate the antiphage role of four novel groups. Besides, we highlight the presence of one of these systems in novel families of human gut viruses infecting members of the Bacteroidetes and Firmicutes phyla. This work lays the foundation for a comprehensive and unified understanding of these highly diverse RTs with enormous biotechnological potential.
Article
Two decades of metagenomic analyses have revealed that in many environments, small (∼5 kb), single-stranded DNA phages of the family Microviridae dominate the virome. Although the emblematic microvirus phiX174 is ubiquitous in the laboratory, most other microviruses, particularly those of the gokushovirus and amoyvirus lineages, have proven to be much more elusive. This puzzling lack of representative isolates has hindered insights into microviral biology. Furthermore, the idiosyncratic size and nature of their genomes have resulted in considerable misjudgments of their actual abundance in nature. Fortunately, recent successes in microvirus isolation and improved metagenomic methodologies can now provide us with more accurate appraisals of their abundance, their hosts, and their interactions. The emerging picture is that phiX174 and its relatives are rather rare and atypical microviruses, and that a tremendous diversity of other microviruses is ready for exploration.