Figure 2 - available via license: Creative Commons Attribution 2.0 Generic
Dispersity score. Illustration of the dispersity measurement. Read pairs linking contigs c1 and c2, of lengths n and m respectively, are transformed into data tested with the KS test. (a) Observations from contig c1 are translated and reflected on the x-axis, while observations from contig c2 are translated. The two-sample KS statistic will indicate high similarity in read distribution. (b) Strange placement of linked reads occurs. Several explanations are possible. One is that contig c2 is misassembled (chimeric); another is that c2 is a correctly assembled contig with small repeated regions resolved at the assembly level. The repeat might not be present in other contigs from the assembly, and therefore the alignments to these regions are reported as unique. Contig c2 is, however, not close to contig c1 on the genome, and linked reads fail to place at the non-repeated regions on c2. The KS test will indicate low similarity.
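The comparison described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation. It assumes that read positions are given as coordinates on each contig, that c1 lies to the left of the gap, and that "translated and reflected" means converting each position into its distance from the contig end facing the gap. The function names `distances_to_gap` and `ks_statistic` and all positions are hypothetical.

```python
from bisect import bisect_right


def distances_to_gap(positions_c1, positions_c2, n):
    """Map read positions to distances from the gap-facing contig ends.

    Assumption: c1 (length n) ends at the gap, so its positions are
    reflected and translated (n - p); c2 starts at the gap, so its
    positions are already distances from the gap.
    """
    d1 = [n - p for p in positions_c1]
    d2 = list(positions_c2)
    return d1, d2


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The supremum of |F_a - F_b| is attained at one of the sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


n = 1000  # hypothetical length of c1

# Case (a): reads on both contigs cluster near the gap.
d1, d2 = distances_to_gap([900, 950, 980, 990], [10, 20, 50, 100], n)
similar = ks_statistic(d1, d2)

# Case (b): reads on c2 land far from the gap-facing end.
d1, d2_far = distances_to_gap([900, 950, 980, 990], [400, 420, 800, 820], n)
dissimilar = ks_statistic(d1, d2_far)
```

A statistic near 0 corresponds to panel (a), where the two read distributions look alike; a statistic near 1 corresponds to panel (b). In practice a library routine such as `scipy.stats.ks_2samp` would also return a p-value.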

Source publication
Article
The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed but many of them use the number of read pairs supporting a l...

Similar publications

Article
Background: The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome as...

Citations

... Pilon software (Walker et al. 2014) was used to correct the assembled contigs by mapping the paired-end reads, thus generating a corrected version for contigs greater than 10,000 bp. To assemble the scaffolds, we used the trimmed mate-pair reads together with the corrected contigs in the BESST program (Sahlin et al. 2014). ...
Article
The genetic basis underlying adaptive physiological mechanisms has been extensively explored in mammals after colonizing the seas. However, independent lineages of aquatic mammals exhibit complex patterns of secondary colonization in freshwater environments. This change in habitat represents new osmotic challenges, and additional changes in key systems, such as the osmoregulatory system, are expected. Here, we studied the selective regime on coding and regulatory regions of 20 genes related to the osmoregulation system in strict aquatic mammals from independent evolutionary lineages, cetaceans, and sirenians, with representatives in marine and freshwater aquatic environments. We identified positive selection signals in genes encoding the protein vasopressin (AVP) in mammalian lineages with secondary colonization in the fluvial environment and in aquaporins for lineages inhabiting the marine and fluvial environments. A greater number of sites with positive selection signals were found for the dolphin species compared to the Amazonian manatee. Only the AQP5 and AVP genes showed selection signals in more than one independent lineage of these mammals. Furthermore, the vasopressin gene tree indicates greater similarity in river dolphin sequences despite the independence of their lineages based on the species tree. Patterns of distribution and enrichment of Transcription Factors in the promoter regions of target genes were analyzed and appear to be phylogenetically conserved among sister species. We found accelerated evolution signs in genes ACE, AQP1, AQP5, AQP7, AVP, NPP4, and NPR1 for the fluvial mammals. Together, these results allow a greater understanding of the molecular bases of the evolution of genes responsible for osmotic control in aquatic mammals.
... The contigs representing high heterozygosity were identified and deleted by Redundans v0.13c (Pryszcz and Gabaldón, 2016). Contig scaffolding and gap filling were performed with BESST v2.2.8 (Sahlin et al., 2014) and GapCloser v1.12 in the SOAPdenovo2 suite (Luo et al., 2012), respectively. Sequences shorter than 500 bp were deleted by reformat.sh ...
Article
Spiders are important models for evolutionary studies of web building, sexual selection and adaptive radiation. The recent development of probes for UCE (ultra-conserved element)-based phylogenomic studies has shed light on the phylogeny and evolution of spiders. However, the two available UCE probe sets for spider phylogenomics (Spider and Arachnida probe sets) have relatively low capture efficiency within spiders, and are not optimized for the retrolateral tibial apophysis (RTA) clade, a hyperdiverse lineage that is key to understanding the evolution and diversification of spiders. In this study, we sequenced 15 genomes of species in the RTA clade, and using eight reference genomes, we developed a new UCE probe set (41 845 probes targeting 3802 loci, labelled as the RTA probe set). The performance of the RTA probes in resolving the phylogeny of the RTA clade was compared with the Spider and Arachnida probes through an in-silico test on 19 genomes. We also tested the new probe set empirically on 28 spider species of major spider lineages. The results showed that the RTA probes recovered twice and four times as many loci as the other two probe sets, and the phylogeny from the RTA UCEs provided higher support for certain relationships. This newly developed UCE probe set shows higher capture efficiency empirically and is particularly advantageous for phylogenomic and evolutionary studies of RTA clade and jumping spiders.
... Clean sequenced reads were assembled according to the following steps: quality control (qtrim = rl; trimq = 15; minlen = 15) and normalization with BBTools v38.49 (Bushnell, 2014); genome contigs assemblies with Minia v3.2.4 (Chikhi and Rizk, 2013) with multiple k-mer (21,41,61,81,101,121); reduction of heterozygous contigs with Redundans v0.14a (Pryszcz and Gabaldón, 2016); scaffolding with BESST v2.0 (Sahlin et al., 2014); and gap filling with Gapcloser v1.0.1 in the SOAPdenovo2 suite (Luo et al., 2012). For the final assemblies, only scaffolds larger than 1000 bp were retained for further analyses. ...
Article
Entomobryoidea has been the focus of phylogenetic studies in recent years owing to a divergence between morphological and genetic data. Recent phylogenies have converged on the sister relationship of Orchesellidae with the remaining Entomobryoidea, and on the non‐monophyly of the traditional Paronellidae and Entomobryidae, but still lack resolution. Known molecular phylogenies of the superfamily differ greatly between mitogenomic and multilocus markers. For this reason, we designed universal single‐copy orthologue (USCO) and ultraconserved element (UCE) marker sets specific for Entomobryoidea, based on 11 genome assemblies. Upon the newly designed 3406 USCOs and 4030 UCEs, we analysed 34 species covering all Entomobryoidea families and major subfamilies. New data for 26 species were mined from whole‐genome sequencing. Phylogenetic inference confirmed the Orchesellidae as an independent family and the Entomobryinae remained the most puzzling taxon gathering scaled and unscaled lineages of both traditional Entomobryidae and Paronellidae. To accommodate Paronellides, Zhuqinia and related genera, Paronellidinae subfam. nov. is proposed within Entomobryidae. The sampled representatives of Paronellinae were recovered as the sister group of (Seirinae+Lepidocyrtinae), suggesting that reduction on the dorsal macrochaetotaxy and trunk sensillar pattern may have occurred independently within the Lepidocyrtinae and Paronellinae or represent their symplesiomorphy posteriorly modified in the Seirinae. The current systematics of the superfamily are revised here, with Entomobryidae now comprising six subfamilies, including all taxa with smooth dens. Our data also point out that all the main events of cladogenesis of the families and subfamilies of Entomobryoidea occurred during the Jurassic. Our genome‐scale phylogenomics provides a complete, reliable example for systematics of Entomobryoidea, as well as other invertebrates in the big data era.
... To further see the influence of misassembly correction on isolate genomes, we scaffolded original and corrected contigs separately with popular scaffolders including BESST [33] and ScaffMatch [34], and then used QUAST to evaluate the scaffolding results. As seen in Table 2 and Table S4, the number of misassemblies in the scaffolding results based on metaMIC's corrected contigs was much lower than that based on the original uncorrected contigs, and metaMIC significantly outperforms MEC in terms of misassembled contig length. ...
... ru/en/content/spades-30-gage-b-data-sets. These contigs are further scaffolded with BESST [33] and ScaffMatch [34]. The accuracy of both contigs and scaffolds is evaluated with QUAST [9] as the gold standard. ...
Article
Evaluating the quality of metagenomic assemblies is important for constructing reliable metagenome-assembled genomes and downstream analyses. Here, we present metaMIC (https://github.com/ZhaoXM-Lab/metaMIC), a machine learning-based tool for identifying and correcting misassemblies in metagenomic assemblies. Benchmarking results on both simulated and real datasets demonstrate that metaMIC outperforms existing tools when identifying misassembled contigs. Furthermore, metaMIC is able to localize the misassembly breakpoints, and the correction of misassemblies by splitting at misassembly breakpoints can improve downstream scaffolding and binning results.
... Assembly optimization using ABySS was then performed by modifying the k-mer size (k) and k-mer minimum coverage multiplicity cutoff (kc) to obtain the largest N50 and BUSCO (Benchmarking Universal Single-Copy Orthologs) completeness metrics, evaluated against the Eudicotyledoneae-lineage database (eudicots_odb10), downloaded from BUSCO v.3.0.2 [34]. We then proceeded to close the gaps between the contigs for each of the selected assemblies (Fasta format) using the ABySS-sealer module and BESST [35]. Genome features, such as genome size and single-copy regions, were estimated using the k-mer method [36,37], which relies on counting the total nonerroneous k-mers in the filtered raw reads. ...
Article
Bursera comprises ~100 tropical shrub and tree species, with the center of the species diversification in Mexico. The genomic resources developed for the genus are scarce, and this has limited the study of the gene flow, local adaptation, and hybridization dynamics. In this study, based on ~155 million Illumina paired-end reads per species, we performed a de novo genome assembly and annotation of three Bursera species of the Bullockia section: Bursera bipinnata, Bursera cuneata, and Bursera palmeri. The total lengths of the genome assemblies were 253, 237, and 229 Mb for B. cuneata, B. palmeri, and B. bipinnata, respectively. The assembly of B. palmeri retrieved the most complete and single-copy BUSCOs (87.3%) relative to B. cuneata (86.5%) and B. bipinnata (76.6%). The ab initio gene prediction recognized between 21,000 and 32,000 protein-coding genes. Other genomic features, such as simple sequence repeats (SSRs), were also detected. Using the de novo genome assemblies as a reference, we identified single-nucleotide polymorphisms (SNPs) for a set of 43 Bursera individuals. Moreover, we mapped the filtered reads of each Bursera species against the chloroplast genomes of five Burseraceae species, obtaining consensus sequences ranging from 156 to 160 kb in length. Our work contributes to the generation of genomic resources for an important but understudied genus of tropical-dry-forest species.
... Megaira' significantly underestimates their diversity. We also evaluate the metabolic potential and diversity of 'Ca. Megaira' from this new genomic data and find no clear evidence of nutritional symbiosis. ...
Preprint
Symbiotic microbes from the genus ‘Candidatus Megaira’ (Rickettsiales) are known to be common associates of algae and ciliates. However, genomic resources for these bacteria are scarce, limiting our understanding of their diversity and biology. We therefore utilized SRA and metagenomic assemblies to explore the diversity of this genus. We successfully extracted four draft ‘Ca. Megaira’ genomes, including one complete scaffold for a ‘Ca. Megaira’, and identified an additional 14 draft genomes from uncategorised environmental Metagenome-Assembled Genomes. We use this information to resolve the phylogeny for the hyper-diverse ‘Ca. Megaira’, with hosts broadly spanning ciliates, micro- and macro-algae, and find that the current single genus designation ‘Ca. Megaira’ significantly underestimates their diversity. We also evaluate the metabolic potential and diversity of ‘Ca. Megaira’ from this new genomic data and find no clear evidence of nutritional symbiosis. In contrast, we hypothesize a potential for defensive symbiosis in ‘Ca. Megaira’. Intriguingly, one symbiont genome revealed a proliferation of ORFs with ankyrin, tetratricopeptide and leucine-rich repeats like those observed in the genus Wolbachia, where they are considered important for host-symbiont protein-protein interactions. Onward research should investigate the phenotypic interactions between ‘Ca. Megaira’ and their various potential hosts, including the economically important Nemacystus decipiens, and target acquisition of genomic information to reflect the diversity of this massively variable group.
Data Summary: Genomes assembled in this project have been deposited in bioproject PRJNA867165.
Impact statement: Bacteria that live inside larger organisms commonly form symbiotic relationships that impact the host’s biology in fundamental ways, such as improving defences against natural enemies or altering host reproduction. Certain groups like ciliates and algae are known to host symbiotic bacteria commonly, but our knowledge of their symbionts’ evolution and function is limited. One such bacterium is ‘Candidatus Megaira’, a Rickettsiales that was first identified in ciliates, then later in algae. To improve the available data for this common but understudied group, we searched the genomes of potential hosts on online databases for Rickettsiales and assembled their genomes. We found 4 ‘Ca. Megaira’ this way and then used these to find a further 14 genomes in environmental metagenomic data. Overall, we increased the number of known ‘Ca. Megaira’ draft genomes from 2 to 20. These new genomes show us that ‘Ca. Megaira’ is far more diverse than previously thought and that it is potentially involved in defensive symbioses. In addition, one genome shows a striking resemblance to the well-characterized symbiont Wolbachia in encoding many proteins predicted to interact directly with host proteins. The genomes we have identified and examined here provide baseline resources for future work investigating the real-world interactions between the hyper-diverse ‘Ca. Megaira’ and its various potential hosts, like the economically important Nemacystus decipiens.
... Raw sequences were demultiplexed, cleaned, and normalized to 20X coverage using BBTOOLS (Bushnell 2014) and then error corrected using Lighter (Song et al. 2014) and assembled using the multi-kmer assembly program Minia3 (Chikhi et al. 2016). Contig scaffolding and gap filling were then performed with BESST v2.2.8 (Sahlin et al., 2014) and GapCloser v.1.12 in the SOAPdenovo2 suite (Luo et al. 2012). We then extracted UCE loci from the assembled genomes using Phyluce and a published bioinformatics pipeline (Faircloth 2017). ...
Article
Despite recent advances in phylogenomics, the early evolution of the largest bee family, Apidae, remains uncertain, hindering efforts to understand the history of Apidae and establish a robust comparative framework. Confirming the position of Anthophorinae—a diverse, globally distributed lineage of apid bees—has been particularly problematic, with the subfamily recovered in various conflicting positions, including as sister to all other Apidae or to the cleptoparasitic Nomadinae. We aimed to resolve relationships in Apidae and Anthophorinae by combining dense taxon sampling, with rigorous phylogenomic analysis of a dataset consisting of ultraconserved elements (UCEs) acquired from multiple sources, including low-coverage genomes. Across a diverse set of analyses, including both concatenation and species tree approaches, and numerous permutations designed to account for systematic biases, Anthophorinae was consistently recovered as the sister group to all remaining Apidae, with Nomadinae sister to (Apinae, [Xylocopinae, Eucerinae]). However, several alternative support metrics (concordance factors, quartet sampling, and gene genealogy interrogation) indicate that this result should be treated with caution. Within Anthophorinae, all genera were recovered as monophyletic, following synonymization of Varthemapistra with Habrophorula. Our results demonstrate the value of dense taxon sampling in bee phylogenomics research and how implementing diverse analytical strategies is important for fully evaluating results at difficult nodes.
... identity and coverage against mitochondrial or chloroplast RefSeq sequences of Ulva species by BLASTn v2.12.0 were removed as organellar contigs. BESST v2.2.8 (13,14) was used for scaffolding of the contigs using the Illumina reads mapped with BWA-MEM v0.7.17, and the final assembly was obtained (Table 1). Only two contigs were scaffolded with a 1-bp gap by BESST. ...
Article
We report the genome sequence of Ulva prolifera, which originated from the Yoshinogawa River in Japan, using Oxford Nanopore Technologies MinION and Illumina sequencing reads. The genome assembly size is 103.8 Mbp, consisting of 142 scaffolds with an N50 value of 4.11 Mbp.
... The assembler uses the overlap graph approach to assemble raw reads [55]. This primary assembly was scaffolded with the 250 bp insert paired-end reads using the BESST (Bias Estimating Stepwise Scaffolding Tool) package [56] and SSPACE (SSAKE-based Scaffolding of Pre-Assembled Contigs after Extension) [57]. Reference-guided scaffolding was performed with the Dignea simplex genome using the RaGOO tool [58]. ...
Preprint
Crustose coralline algae are some of the most ecologically significant species of red algae occurring in tropical reef ecosystems. They contribute efficiently to the primary productivity of the reef and help in carbon sequestration. They act as one of the most suitable substrates for coral larval settlement and metamorphosis. Because of their encrusting nature, they also cement the reefs together, preventing their subsequent degradation. Though numerous studies on taxonomy and barcodes are available for this red algal group, genomic studies are mostly restricted to a few transcriptomes. The group also lacks a reference genome. Porolithon onkodes (Heydrisch) Foslie 1909 is one of the most abundant and ecologically important species found on tropical reefs. The authors herewith announce the first draft genome of crustose coralline algae. A draft genome of P. onkodes, 215 Mb in size, was assembled from 22 Gb of paired-end raw data. BUSCO (Benchmarking Universal Single-Copy Orthologs) genome completeness analysis indicated 71% complete BUSCOs against the Eukaryota_odb10 dataset. The assembled genome projected 8150 protein-coding genes. Further comparative and functional genomic studies of the coralline algae group will be possible due to the P. onkodes draft genome.