Figure - available from: Scientific Reports
Origin of phasing effects. Depiction of the sequencing-by-synthesis approach. The black dots represent the sequencing primers. The terminator (black star) on the deoxynucleoside triphosphates (dNTPs) prevents the addition of the subsequent nucleotide to the growing DNA strand. The left strand depicts a post-phased sequence, the right strand a pre-phased one. The middle strand represents the state without phasing effects of any kind. If non-incorporated nucleotides remain after incorporation of the next nucleotide (upper right) and washes (middle left), removal of the terminator allows their addition to the growing strand (middle right, right strand). The resulting strand will subsequently be pre-phased. If the removal of the terminator is not complete (middle right, left strand), no nucleotide can be incorporated during the next sequencing cycle (lower left, left strand). The resulting strand will subsequently be post-phased.
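The mechanism described in the caption can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the method of the source publication: the function names, the per-cycle probabilities `p_post` (terminator not removed, strand does not extend) and `p_pre` (a leftover nucleotide is also incorporated, strand extends by two), and the independence of events between cycles are all hypothetical simplifications.

```python
import random


def simulate_phasing(n_strands, n_cycles, p_post, p_pre, seed=0):
    """Return the final position of every strand after n_cycles.

    Assumed toy model: in each cycle a strand normally extends by one base.
    With probability p_post it does not extend (post-phasing); with
    probability p_pre it extends by two bases (pre-phasing).
    """
    if p_post < 0 or p_pre < 0 or p_post + p_pre > 1:
        raise ValueError("p_post and p_pre must be non-negative and sum to at most 1")
    rng = random.Random(seed)
    positions = [0] * n_strands
    for _ in range(n_cycles):
        for i in range(n_strands):
            r = rng.random()
            if r < p_post:
                step = 0  # terminator still attached: no incorporation
            elif r < p_post + p_pre:
                step = 2  # leftover nucleotide added: strand runs ahead
            else:
                step = 1  # regular cycle: exactly one base added
            positions[i] += step
    return positions


def in_phase_fraction(positions, n_cycles):
    """Fraction of strands whose position equals the cycle count."""
    return sum(1 for p in positions if p == n_cycles) / len(positions)


if __name__ == "__main__":
    cycles = 100
    pos = simulate_phasing(n_strands=1000, n_cycles=cycles, p_post=0.01, p_pre=0.01)
    print("in-phase fraction:", in_phase_fraction(pos, cycles))
```

In this toy model the in-phase fraction shrinks as the number of cycles grows, which is one way to picture why signal purity tends to degrade towards the end of a read.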

Source publication
Article
Full-text available
Next-generation sequencing (NGS) is the method of choice when large numbers of sequences have to be obtained. While the technique is widely applied, varying error rates have been observed. We analysed millions of reads obtained after sequencing of one single sequence on an Illumina sequencer. According to our analysis, the index-PCR for sample prep...

Citations

... We conducted a simulation study to estimate the effect of marker number and read depth on genome-wide error rate (experiment-wide error rate). Two simulations were carried out, one obtained using calculated error rates based on probabilities in Sedcole (1977) and assuming 0% sequencing error, and a second one in which a sequencing error rate of 1% was simulated based on the highest sequencing error rate reported for different Illumina platforms (Glenn, 2011;Pfeiffer et al., 2018;Stoler & Nekrutenko, 2021). A Perl script (Supplementary file S4) was written to simulate genotyping for data sets with 1, 3, 5, 10, 100, 1000, 2000, 3000, 5000, and 10,000 markers, and 5, 7, 10, 14, 17, 20, 30, and 40× average depth (reads/marker); permutation tests were performed to obtain estimates of genome-wide error rates with different depth filters. ...
Article
Full-text available
Genotyping‐by‐sequencing (GBS) is a widely used strategy for obtaining large numbers of genetic markers in model and non‐model organisms. In crop plants, GBS‐derived marker datasets are frequently used to perform quantitative trait locus (QTL) mapping. In some plant species, however, high heterozygosity and complex genome structure mean that researchers must use care in handling GBS data to conduct QTL mapping most effectively. Such outbred crops include most of the perennial grass and tree species used for bioenergy. To identify strategies for increasing accuracy and precision of QTL mapping using GBS data in outbred crops, we conducted an empirical study of SNP‐calling and genetic map‐building pipeline parameters in a Miscanthus sinensis population, and a complementary simulation study to estimate the relationship between genome‐wide error rate, read depth, and marker number. The bioenergy grass Miscanthus is an obligate outcrossing species with a recent (diploidized) whole‐genome duplication. For the study of empirical M. sinensis data, we compared two SNP‐calling methods (one non‐reference‐based and one reference‐based), a series of depth filters (12×, 20×, 30×, and 40×) and two map‐construction methods (i.e., marker ordering: linkage‐only and order‐corrected based on a reference genome). We found that correcting the order of markers on a linkage map by using a high‐quality reference genome improved QTL precision (shorter confidence intervals). For typical GBS datasets of between 1000 and 5000 markers to build a genetic map for biparental populations, a depth filter set at 30× to 40× applied to outbred populations provided a genome‐wide genotype‐calling error rate of less than 1%, improved accuracy of QTL point estimates and minimized type I errors for identifying QTL. Based on these results, we recommend using a reference genome to correct the marker order of genetic maps and a robust genotype depth filter to improve QTL mapping for outbred crops.
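The relationship between read depth, marker number, and an experiment-wide error rate mentioned above can be sketched with two short formulas. This is not the authors' Perl script and does not reproduce the Sedcole (1977) probabilities; it is a simplified illustration that assumes no sequencing error, equal sampling of both alleles at a heterozygous site, independent markers, and it defines the experiment-wide error as the chance that at least one marker is miscalled. Both function names are hypothetical.

```python
def het_miscall_prob(depth):
    """Probability that all reads at a heterozygous site show the same allele.

    Assumes each read samples either allele with probability 0.5 and that
    there is no sequencing error, so the site would look homozygous.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 2 * 0.5 ** depth


def at_least_one_error(per_marker_error, n_markers):
    """Probability that at least one of n independent markers is miscalled."""
    return 1 - (1 - per_marker_error) ** n_markers


if __name__ == "__main__":
    for depth in (5, 10, 20, 30, 40):
        p = het_miscall_prob(depth)
        print(depth, p, at_least_one_error(p, 1000))
```

Under these assumptions the per-marker miscall probability halves with every additional read, while the chance of at least one error grows with the number of markers, which gives some intuition for why a stricter depth filter matters more in larger marker sets.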
... Moreover, so-called allelic dropout (ADO) events remain even after the amplification. In addition, the subsequent sequencing of the amplified materials introduces sequencing errors [22][23][24][25]. ...
... Read sequencing is also erroneous and depends on sequencing technology [22][23][24][25]. We use the Phred quality scores (ρ) to compute the base-calling error probabilities, Q, [27,28]; $Q = 10^{-0.1\rho}$. ...
Article
Full-text available
Cell lineage tree reconstruction methods are developed for various tasks, such as investigating the development, differentiation, and cancer progression. Single-cell sequencing technologies enable more thorough analysis with higher resolution. We present Scuphr, a distance-based cell lineage tree reconstruction method using bulk and single-cell DNA sequencing data from healthy tissues. Common challenges of single-cell DNA sequencing, such as allelic dropouts and amplification errors, are included in Scuphr. Scuphr computes the distance between cell pairs and reconstructs the lineage tree using the neighbor-joining algorithm. With its embarrassingly parallel design, Scuphr can do faster analysis than the state-of-the-art methods while obtaining better accuracy. The method’s robustness is investigated using various synthetic datasets and a biological dataset of 18 cells.
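The Phred conversion quoted in the excerpt above (error probability Q from quality score ρ, Q = 10^(−0.1·ρ)) is straightforward to compute. The sketch below is illustrative and not taken from Scuphr; the function names are hypothetical, and the FASTQ helper assumes the common Phred+33 ASCII encoding.

```python
def phred_to_error_prob(rho):
    """Convert a Phred quality score rho into a base-calling error probability."""
    return 10 ** (-0.1 * rho)


def fastq_quality_to_error_probs(quality_string, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into error probabilities."""
    return [phred_to_error_prob(ord(ch) - offset) for ch in quality_string]


if __name__ == "__main__":
    # '!' encodes score 0, '5' encodes score 20, 'I' encodes score 40
    print(fastq_quality_to_error_probs("!5I"))
```

A score of 20 therefore corresponds to an error probability of about 1 in 100, and a score of 40 to about 1 in 10,000.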
... Note that the latter makes Sanger sequencing a preferred option over next-generation sequencing (i.e., Illumina), for which random copying errors would have been indiscernible from erroneous base calls. 33 From the obtained sequencing chromatograms, we then calculated the Q-values 34 of each NNK stretch, which is a quantitative analysis of library degeneracy, which revealed a high degree of genetic diversity for both libraries (Figures 3B,C and Table S2). Specifically, Lib_2N displays Q-values of 0.85 and 0.82, indicating a nearly equal distribution of the desired nucleotides at both NNK stretches. ...
Article
Full-text available
Growth-based selections evaluate the fitness of individual organisms at a population level. In enzyme engineering, such growth selections allow for the rapid and straightforward identification of highly efficient biocatalysts from extensive libraries. However, selection-based improvement of (synthetically useful) biocatalysts is challenging, as they require highly dependable strategies that artificially link their activities to host survival. Here, we showcase a robust and scalable growth-based selection platform centered around the complementation of noncanonical amino acid-dependent bacteria. Specifically, we demonstrate how serial passaging of populations featuring millions of carbamoylase variants autonomously selects biocatalysts with up to 90,000-fold higher initial rates. Notably, selection of replicate populations enriched diverse biocatalysts, which feature distinct amino acid motifs that drastically boost carbamoylase activity. As beneficial substitutions also originated from unintended copying errors during library preparation or cell division, we anticipate that our growth-based selection platform will be applicable to the continuous, autonomous evolution of diverse biocatalysts in the future.
... One pragmatic solution, therefore, is for the user to pre-specify the sequencing error based on previous experience with HTS data. Previous HTS studies have found the overall sequencing error rate to be between 0.1% and 0.3% (Bilton et al. 2018b; Clark et al. 2019; Pfeiffer et al. 2018). ...
Article
Full-text available
Key message An improved estimator of genomic relatedness using low-depth high-throughput sequencing data for autopolyploids is developed. Its outputs strongly correlate with SNP array-based estimates and are available in the package GUSrelate. Abstract High-throughput sequencing (HTS) methods have reduced sequencing costs and resources compared to array-based tools, facilitating the investigation of many non-model polyploid species. One important quantity that can be computed from HTS data is the genetic relatedness between all individuals in a population. However, HTS data are often messy, with multiple sources of errors (i.e. sequencing errors or missing parental alleles) which, if not accounted for, can lead to bias in genomic relatedness estimates. We derive a new estimator for constructing a genomic relationship matrix (GRM) from HTS data for autopolyploid species that accounts for errors associated with low sequencing depths, implemented in the R package GUSrelate. Simulations revealed that GUSrelate performed similarly to existing GRM methods at high depth but reduced bias in self-relatedness estimates when the sequencing depth was low. Using a panel consisting of 351 tetraploid potato genotypes, we found that GUSrelate produced GRMs from genotyping-by-sequencing (GBS) data that were highly correlated with a GRM computed from SNP array data, and less biased than existing methods when benchmarking against the array-based GRM estimates. GUSrelate provides researchers with a tool to reliably construct GRMs from low-depth HTS data.
... It cannot be discarded that a small percentage of IPMAVs in the current dataset might have arisen from sample contamination, amplification (commercial high fidelity polymerases range from $2.4 \times 10^{-6}$ to $3 \times 10^{-5}$ [34]) or sequencing errors (the error rate of SBS sequencing has been reported to be 0.24 ± 0.06% [35]). Regarding contamination, no access was granted to most sequencing schemes (several laboratories were involved). ...
Article
Full-text available
Coinfections occurring within a single sample can be detected by exploring all mutations in a genomic set as these have full profiles commonly observed in lower prevalence.
... An a priori advantage of optical mapping compared to 16S rRNA sequencing and shotgun sequencing is the fact that no polymerase chain reaction (PCR) amplification is used, avoiding bias toward certain genomic regions introduced by both local polymerase enzyme affinity and primer design. 38 The evaluation of the coverage of the cultured ATCC strains corroborated this expectation. The global coverage achieved by Fluorocode was satisfactory, covering the full genome several times. ...
Article
Full-text available
Methicillin-resistant Staphylococcus aureus (MRSA) is a multidrug-resistant bacterium with a global presence in healthcare facilities as well as community settings. The resistance of MRSA to beta-lactam antibiotics can be attributed to a mobile genetic element called the staphylococcal cassette chromosome mec (SCCmec), ranging from 23 to 68 kilobase pairs in length. The mec gene complex contained in SCCmec allows MRSA to survive in the presence of penicillin and other beta-lactam antibiotics. We demonstrate that optical mapping (OM) is able to identify the bacterium as S. aureus, followed by an investigation of the presence of kilobase pair range SCCmec elements by examining the associated OM-generated barcode patterns. By employing OM as an alternative to traditional DNA sequencing, we showcase its potential for the detection of complex genetic elements such as SCCmec in MRSA. This approach holds promise for enhancing our understanding of antibiotic resistance mechanisms and facilitating the development of targeted interventions against MRSA infections.
... The new Nanopore R10.4 flow cells are more accurate in calling homopolymers in the 4-9 bp range than the R9.4.1 flow cells [69], which we have used in this current study. The majority of the indels reported in this study are associated with homopolymers, and it is difficult to determine their veracity by other means, as most sequencing technologies have difficulties with homopolymers [70][71][72][73]. ...
Article
Full-text available
African swine fever virus (ASFV) is the causative agent of African swine fever, an economically important disease of pigs, often with a high case fatality rate. ASFV has demonstrated low genetic diversity among isolates collected within Eurasia. To explore the influence of viral variants on clinical outcomes and infection dynamics in pigs experimentally infected with ASFV, we have designed a deep sequencing strategy. The variant analysis revealed unique SNPs at <10% frequency in several infected pigs as well as some SNPs that were found in more than one pig. In addition, a deletion of 10,487 bp (resulting in the complete loss of 21 genes) was present at a nearly 100% frequency in the ASFV DNA from one pig at position 6362-16849. This deletion was also found to be present at low levels in the virus inoculum and in two other infected pigs. The current methodology can be used for the currently circulating Eurasian ASFVs and also adapted to other ASFV strains and genotypes. Comprehensive deep sequencing is critical for following ASFV molecular evolution, especially for the identification of modifications that affect virus virulence.
... The whole genome of B. breve BS2-PB3 had a total length of 2,268,931 bp with a G-C content of 58.9%. The Pomoxis tool calculated an error rate of 0.309%, ensuring the quality of its assembly [24,25]. The complete genome sequence has been submitted to GenBank (Accession No. CP138211). ...
Article
Full-text available
Our group had isolated Bifidobacterium breve strain BS2-PB3 from human breast milk. In this study, we sequenced the whole genome of B. breve BS2-PB3, and with a focus on its safety profile, various probiotic characteristics (presence of antibiotic resistance genes, virulence factors, and mobile elements) were then determined through bioinformatic analyses. The antibiotic resistance profile of B. breve BS2-PB3 was also evaluated. The whole genome of B. breve BS2-PB3 consisted of 2,268,931 base pairs with a G-C content of 58.89% and 2,108 coding regions. The average nucleotide identity and whole-genome phylogenetic analyses supported the classification of B. breve BS2-PB3. According to our in silico assessment, B. breve BS2-PB3 possesses antioxidant and immunomodulation properties in addition to various genes related to the probiotic properties of heat, cold, and acid stress, bile tolerance, and adhesion. Antibiotic susceptibility was evaluated using the Kirby-Bauer disk-diffusion test, in which the minimum inhibitory concentrations for selected antibiotics were subsequently tested using the Epsilometer test. B. breve BS2-PB3 only exhibited selected resistance phenotypes, i.e., to mupirocin (minimum inhibitory concentration/MIC >1,024 μg/ml), sulfamethoxazole (MIC >1,024 μg/ml), and oxacillin (MIC >3 μg/ml). The resistance genes against those antibiotics, i.e., ileS, mupB, sul4, mecC and ramA, were detected within its genome as well. While no virulence factor was detected, four insertion sequences were identified within the genome but were located away from the identified antibiotic resistance genes. In conclusion, B. breve BS2-PB3 demonstrated a sufficient safety profile, making it a promising candidate for further development as a potential functional food.
... To build the random-mer sequence at the 5' and/or 3' ends, a number ranging from 1-5 was randomly chosen to determine its length (n), and then [A, T, C, G] was randomly sampled n times to generate the random-mer sequence. For scenarios 1-4, the 3' adapter sequence was randomly chosen from an adapter pool compiled from real datasets with 0.25% sequencing error [28]. For scenarios 5-8, 12-nt poly(A) was used as the 3' adapter. ...
Article
Full-text available
Adapter trimming is an essential step for analyzing small RNA sequencing data, where reads are generally longer than target RNAs ranging from 18 to 30 bp. Most adapter trimming tools require adapter information as input. However, adapter information is hard to access, specified incorrectly, or not provided with publicly available datasets, hampering their reproducibility and reusability. Manual identification of adapter patterns from raw reads is labor-intensive and error-prone. Moreover, the use of randomized adapters to reduce ligation biases during library preparation makes adapter detection even more challenging. Here, we present FindAdapt, a Python package for fast and accurate detection of adapter patterns without relying on prior information. We demonstrated that FindAdapt was far superior to existing approaches. It identified adapters successfully in 180 simulation datasets with diverse read structures and 3,184 real datasets covering a variety of commercial and customized small RNA library preparation kits. FindAdapt is stand-alone software that can be easily integrated into small RNA sequencing analysis pipelines.
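The random-mer construction described in the excerpt above (pick a length n between 1 and 5, then sample from [A, T, C, G] n times) can be sketched as follows. This is an illustration only, not FindAdapt code; the function name and its parameters are hypothetical.

```python
import random


def random_mer(rng, min_len=1, max_len=5, alphabet="ATCG"):
    """Build a random-mer: draw a length n, then sample n bases uniformly."""
    n = rng.randint(min_len, max_len)  # inclusive on both ends
    return "".join(rng.choice(alphabet) for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the simulation is repeatable
    print([random_mer(rng) for _ in range(5)])
```

Passing an explicit `random.Random` instance keeps the simulated reads reproducible across runs, which is usually desirable when benchmarking trimming tools.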
... In addition, the error profiles of the two technologies differ. Errors in short reads are mostly nucleotide substitutions, while errors in long reads mostly involve incorrect insertions and deletions 7,8. This difference makes long-read errors more complex to resolve, requiring an error correction step prior to genome assembly. ...
Preprint
Full-text available
The red-legged partridge, Alectoris rufa (n=38 chromosomes), plays a crucial role in the ecosystem of southwestern Europe, and understanding its genetics is vital for conservation and management. Here we sequence, assemble, and annotate a highly contiguous and nearly complete version of its genome (115 scaffolds, L90=23). This assembly contains 96.9% (8078 out of 8332) of the orthologous genes from the BUSCO aves_odb10 dataset of single-copy orthologous genes. We identify RNA and protein genes, 95% of which have functional annotation. This near-chromosome-level assembly revealed significant chromosome rearrangements compared to quail (Coturnix japonica) and chicken (Gallus gallus), suggesting that A. rufa and C. japonica diverged 21 million years ago and that their common ancestor diverged from G. gallus 37 million years ago. The reported assembly is a significant step towards a complete reference genome for A. rufa, facilitating comparative avian genomics and providing a valuable resource for future research and conservation efforts for the red-legged partridge.