Performance statistics for nanopore native RNA and cDNA sequencing. (a) Alignment identity vs. read length for native RNA reads. (b) Substitution matrix for native RNA reads. (c) Observed vs. expected read length for native RNA reads. (d) Alignment identity vs. read length for cDNA reads. (e) Substitution matrix for cDNA reads. (f) Observed vs. expected read length for cDNA reads. (g) Observed vs. expected kmers (k = 5) for native RNA reads. (h) Observed vs. expected kmers (k = 5) for cDNA reads.


Source publication
Preprint
High throughput cDNA sequencing technologies have dramatically advanced our understanding of transcriptome complexity and regulation. However, these methods lose information contained in biological RNA because the copied reads are often short and because modifications are not carried forward in cDNA. We address these limitations using a native poly...

Contexts in source publication

Context 1
... 26 was employed to calculate the number of matches, mismatches, and indels per aligned read in this population. We found a median identity of 86% (Figure 2a), with mismatch, insertion, and deletion errors of 2.4%, 4.3%, and 4.4% respectively. Percent identity was consistent across institutions and among flow cells (median average identity of pass reads equal to 85.5%, with a standard deviation of 0.65). The basecaller seldom confused G-for-C or C-for-G (0.38% and 0.47% errors respectively); by comparison, C-to-T and T-to-C errors were substantially higher (3.62% and 2.23% respectively) (Figure 2b). ...
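The per-read tally described in Context 1 can be illustrated with a minimal Python sketch. The excerpt does not say how identity was defined, so the denominator used here (matches + mismatches + insertions + deletions) is an assumption. The function name and the example counts are hypothetical and are not taken from the source publication.

```python
from statistics import median


def alignment_error_profile(matches, mismatches, insertions, deletions):
    """Return percent identity and per-class error rates for one aligned read.

    Assumption: every aligned column (match, mismatch, inserted base,
    deleted base) counts toward the denominator. Other definitions exist.
    """
    aligned_columns = matches + mismatches + insertions + deletions
    if aligned_columns <= 0:
        raise ValueError("read has no aligned columns")
    scale = 100.0 / aligned_columns
    return {
        "identity": matches * scale,
        "mismatch": mismatches * scale,
        "insertion": insertions * scale,
        "deletion": deletions * scale,
    }


# Hypothetical (matches, mismatches, insertions, deletions) counts for three reads.
reads = [(889, 24, 43, 44), (860, 40, 50, 50), (900, 30, 35, 35)]

# Summarise the population by its median identity, as the excerpt reports a median.
median_identity = median(alignment_error_profile(*r)["identity"] for r in reads)
print(f"median identity: {median_identity:.1f}%")
```

Because the four rates share one denominator, they sum to 100% for any single read. Population medians of each rate need not sum to 100%.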
Context 2
... found a median identity of 86% (Figure 2a), with mismatch, insertion, and deletion errors of 2.4%, 4.3%, and 4.4% respectively. Percent identity was consistent across institutions and among flow cells (median average identity of pass reads equal to 85.5%, with a standard deviation of 0.65). The basecaller seldom confused G-for-C or C-for-G (0.38% and 0.47% errors respectively); by comparison, C-to-T and T-to-C errors were substantially higher (3.62% and 2.23% respectively) (Figure 2b). ...
Context 3
... reads were also aligned to a set of high-confidence isoform sequences that were curated using a pipeline termed 'FLAIR' (see below). Using these alignments, we compared observed vs expected length and found general agreement (Figure 2c). ...
Context 4
... nanopore cDNA data, we observed a median identity of 85% (Figure 2d), which is comparable to other recent nanopore DNA studies [27]. The substitution error patterns for cDNA data were similar to those for native RNA data (Figure 2e). ...
Context 5
... substitution error patterns for cDNA data were similar to those for native RNA data (Figure 2e). However, while there was agreement in observed vs expected read lengths for cDNA and RNA, there were substantially fewer cDNA reads above 4 kb in length compared to the RNA reads (Figure 2c,f). ...
Context 6
... reads included all possible kmers which were represented in sufficient numbers to permit a statistically valid analysis. Figure 2g and 2h show normalized counts for all 1024 kmers determined from native RNA and cDNA data respectively. Kmer frequencies exhibit an approximate one-to-one trend between observed and expected frequencies for both native RNA and cDNA. ...
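The 1024 kmers in Context 6 are all 4^5 sequences of length 5 over a four-letter alphabet. The following Python sketch shows one way such normalized counts could be tabulated for an observed and an expected sequence set. The excerpt does not state the normalization, so dividing each count by the total number of kmers is an assumption. The function name and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequences, k=5, alphabet="ACGT"):
    """Return the normalized frequency of every possible k-mer.

    Assumption: frequency = count / total valid k-mers in the input.
    K-mers containing characters outside the alphabet (e.g. N) are skipped.
    """
    allowed = set(alphabet)
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= allowed:
                counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {km: (counts[km] / total if total else 0.0) for km in all_kmers}


# Toy inputs: "expected" from reference sequences, "observed" from reads.
expected = kmer_frequencies(["ACGTACGTAC"])
observed = kmer_frequencies(["ACGTACGAAC"])

# Pair the two frequencies per k-mer; points near the diagonal indicate agreement.
pairs = [(km, expected[km], observed[km]) for km in expected]
print(len(pairs))
```

Plotting the second element of each tuple against the third would give a scatter comparable in spirit to panels (g) and (h).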

Similar publications

Article
High-throughput complementary DNA sequencing technologies have advanced our understanding of transcriptome complexity and regulation. However, these methods lose information contained in biological RNA because the copied reads are often short and modifications are not retained. We address these limitations using a native poly(A) RNA sequencing stra...

Citations

... Unlike short-read and long-read cDNA sequencing, long-read RNA sequencing, also known as dRNA-seq (DRS), does not require cDNA generation and therefore can eliminate the errors that occur during cDNA amplification and avoid RNA-RNA chimaeras produced by cDNA [58]. Although the limitation of reading length is not the challenge with the technique, the fragmentation of the input read is still challenging [59,60]. ...
Article
With the invention of RNA sequencing over a decade ago, diagnosis and identification of the gene-related diseases entered a new phase that enabled more accurate analysis of the diseases that are difficult to approach and analyze. RNA sequencing has availed in-depth study of transcriptomes in different species and provided better understanding of rare diseases and taxonomical classifications of various eukaryotic organisms. Development of single-cell, short-read, long-read and direct RNA sequencing using both blood and biopsy specimens of the organism together with recent advancement in computational analysis programs has made the medical professional’s ability in identifying the origin and cause of genetic disorders indispensable. Altogether, such advantages have evolved the treatment design since RNA sequencing can detect the resistant genes against the existing therapies and help medical professions to take a further step in improving methods of treatments towards higher effectiveness and less side effects. Therefore, it is of essence to all researchers and scientists to have deeper insight in all available methods of RNA sequencing while taking a step-in therapy design.
... The recent advances in Nanopore direct RNA sequencing (DRS) have allowed, for the first time, direct sequencing of full-length native RNA molecules without the need for RT or amplification. Importantly, a number of studies have shown that DRS data intrinsically contain information about RNA modifications [10–12]. In Nanopore DRS, a single RNA molecule is ratcheted by a molecular motor through a protein pore embedded in a synthetic membrane. ...
Article
RNA molecules undergo a vast array of chemical post-transcriptional modifications (PTMs) that can affect their structure and interaction properties. In recent years, a growing number of PTMs have been successfully mapped to the transcriptome using experimental approaches relying on high-throughput sequencing. Oxford Nanopore direct-RNA sequencing has been shown to be sensitive to RNA modifications. We developed and validated Nanocompore, a robust analytical framework that identifies modifications from these data. Our strategy compares an RNA sample of interest against a non-modified control sample, not requiring a training set and allowing the use of replicates. We show that Nanocompore can detect different RNA modifications with position accuracy in vitro, and we apply it to profile m6A in vivo in yeast and human RNAs, as well as in targeted non-coding RNAs. We confirm our results with orthogonal methods and provide novel insights on the co-occurrence of multiple modified residues on individual RNA molecules. Nanopore direct RNA Sequencing data contain information about the presence of RNA modifications, but their detection poses substantial challenges. Here the authors introduce Nanocompore, a new methodology for modification detection from Nanopore data.
... These annotations were produced using RNA-seq evidence from a greater diversity of tissue types, which likely explains this discrepancy. The Lake Trout annotation, as well as annotations for other salmonids, could also be further improved by directly sequencing full length transcripts using long-read sequencing technologies (Workman et al. 2018). We predict that the completeness of the Lake Trout genome annotation will be improved as more gene expression data from a greater diversity of tissue types becomes available for the species (Salzberg 2019). ...
Preprint
Here we present an annotated, chromosome-anchored, genome assembly for Lake Trout (Salvelinus namaycush) – a highly diverse salmonid species of notable conservation concern and an excellent model for research on adaptation and speciation. We leveraged Pacific Biosciences long-read sequencing, paired-end Illumina sequencing, proximity ligation (Hi-C), and a previously published linkage map to produce a highly contiguous assembly composed of 7,378 contigs (contig N50 = 1.8 Mb) assigned to 4,120 scaffolds (scaffold N50 = 44.975 Mb). 84.7% of the genome was assigned to 42 chromosome-sized scaffolds and 93.2% of Benchmarking Universal Single Copy Orthologs were recovered, putting this assembly on par with the best currently available salmonid genomes. Estimates of genome size based on k-mer frequency analysis were highly similar to the total size of the finished genome, suggesting that the entirety of the genome was recovered. A mitome assembly was also produced. Self-vs-self synteny analysis allowed us to identify homeologs resulting from the Salmonid specific autotetraploid event (Ss4R) and alignment with three other salmonid species allowed us to identify homologous chromosomes in other species. We also generated multiple resources useful for future genomic research on Lake Trout including a repeat library and a sex averaged recombination map. A novel RNA sequencing dataset was also used to produce a publicly available set of gene annotations using the National Center for Biotechnology Information Eukaryotic Genome Annotation Pipeline. Potential applications of these resources to population genetics and the conservation of native populations are discussed.
... Similar to cDNA sequencing methodologies on the ONT platforms, the direct RNA sequencing methodology can identify and quantify splice variants (Workman et al., 2018). The ability to directly sequence RNA skips three main problems: First, there is no necessity to reverse transcribe the RNA into cDNA, a process that can introduce errors or biases in the resulting sequencing data (Lahens et al., 2014). ...
... Second, it permits the identification of RNA modifications as well as the poly-A tail length (currently for poly-A tails > 10bp). Third, and most important, for every different isoform detected, its fully processed (no introns present) or unprocessed (some introns present) status can be recorded along with its modifications and poly-A tail length (Workman et al., 2018). ...
... It has already been shown that the ONT direct RNA sequencing approach can sequence some long and very long RNA molecules that are not efficiently synthesized into cDNAs (Workman et al., 2018). The identification of RNA modifications has biological implications and is based on the ability of the platform to sense the RNA modifications directly [see review from Novoa et al. (2017)]. ...
Article
RNA sequencing using next-generation sequencing technologies (NGS) is currently the standard approach for gene expression profiling, particularly for large-scale high-throughput studies. NGS technologies comprise high throughput, cost efficient short-read RNA-Seq, while emerging single molecule, long-read RNA-Seq technologies have enabled new approaches to study the transcriptome and its function. The emerging single molecule, long-read technologies are currently commercially available by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), while new methodologies based on short-read sequencing approaches are also being developed in order to provide long range single molecule level information—for example, the ones represented by the 10x Genomics linked read methodology. The shift toward long-read sequencing technologies for transcriptome characterization is based on current increases in throughput and decreases in cost, making these attractive for de novo transcriptome assembly, isoform expression quantification, and in-depth RNA species analysis. These types of analyses were challenging with standard short sequencing approaches, due to the complex nature of the transcriptome, which consists of variable lengths of transcripts and multiple alternatively spliced isoforms for most genes, as well as the high sequence similarity of highly abundant species of RNA, such as rRNAs. Here we aim to focus on single molecule level sequencing technologies and single-cell technologies that, combined with perturbation tools, allow the analysis of complete RNA species, whether short or long, at high resolution. In parallel, these tools have opened new ways in understanding gene functions at the tissue, network, and pathway levels, as well as their detailed functional characterization. 
Analysis of the epi-transcriptome, including RNA methylation and modification and the effects of such modifications on biological systems is now enabled through direct RNA sequencing instead of classical indirect approaches. However, many difficulties and challenges remain, such as methodologies to generate full-length RNA or cDNA libraries from all different species of RNAs, not only poly-A containing transcripts, and the identification of allele-specific transcripts due to current error rates of single molecule technologies, while the bioinformatics analysis on long-read data for accurate identification of 5′ and 3′ UTRs is still in development.
... Pore-based sequencers measure changes in an ionic current as nucleic acids pass through a nanopore: information about changes in current and dwell time in the pore is used to identify the nucleotide in question. Several publications demonstrated that RNA modifications produce specific current and dwell time signals, suggesting nanopore-based methods could identify modified nucleotides in a high throughput manner (Figure 1D; Garalde et al., 2018; Workman et al., 2018; Smith et al., 2019). The potential benefits of this approach for mapping RNA modifications are huge, as stoichiometric and positional information of multiple modifications could be interpreted simultaneously. ...
Article
A flurry of methods has been developed in recent years to identify N6-methyladenosine (m⁶A) sites across transcriptomes at high resolution. This raises the need to understand both the common features and those that are unique to each method. Here, we complement the analyses presented in the original papers by reviewing their various technical aspects and comparing the overlap between m⁶A-methylated messenger RNAs (mRNAs) identified by each. Specifically, we examine eight different methods that identify m⁶A sites in human cells with high resolution: two antibody-based crosslinking and immunoprecipitation (CLIP) approaches, two using endoribonuclease MazF, one based on deamination, two using Nanopore direct RNA sequencing, and finally, one based on computational predictions. We contrast the respective datasets and discuss the challenges in interpreting the overlap between them, including a prominent expression bias in detected genes. This overview will help guide researchers in making informed choices about using the available data and assist with the design of future experiments to expand our understanding of m⁶A and its regulation.
... lrRNAseq approaches were successfully used to study transcriptional and post-transcriptional regulation in various physiological and disease conditions (De Roeck et al., 2017; Aneichyk et al., 2018; Anvar et al., 2018; Nattestad et al., 2018), including single-cells (Byrne et al., 2017). Focusing on RNAs, these techniques can produce single reads of up to 10⁴ bases, with an average length of almost 1 Kb for ONT (Workman et al., 2018). Hence, in a number of cases, this allows the profiling of full-length RNA molecules, and the fine characterization of their alternative isoforms. ...
... Recent and growing literature is available about the footprints left by RNA modifications on dRNA-seq data, and how to exploit them to detect RNA marks (Xu and Seki, 2019). Differences in current levels between native bases and their modified counterparts were reported for m⁶A, m⁵C, m⁷G, and pseudouridine (Garalde et al., 2018; Workman et al., 2018; Smith et al., 2019). Moreover, an increased frequency of base miscalls was observed next to "A-to-I", 7-methylguanosine, and pseudouridine sites (Workman et al., 2018; Smith et al., 2019). ...
... Differences in current levels between native bases and their modified counterparts were reported for m⁶A, m⁵C, m⁷G, and pseudouridine (Garalde et al., 2018; Workman et al., 2018; Smith et al., 2019). Moreover, an increased frequency of base miscalls was observed next to "A-to-I", 7-methylguanosine, and pseudouridine sites (Workman et al., 2018; Smith et al., 2019). These observations led to the development of specific computational tools for the detection of RNA modifications. ...
Article
It has been known for a few decades that transcripts can be marked by dozens of different modifications. Yet, we are just at the beginning of charting these marks and understanding their functional impact. High-quality methods were developed for the profiling of some of these marks, and approaches to finely study their impact on specific phases of the RNA life-cycle are available, including RNA metabolic labeling. Thanks to these improvements, the most abundant marks, including N⁶-methyladenosine, are emerging as important determinants of the fate of marked RNAs. However, we still lack approaches to directly study how the set of marks for a given RNA molecule shape its fate. In this perspective, we first review current leading approaches in the field. Then, we propose an experimental and computational setup, based on direct RNA sequencing and mathematical modeling, to decipher the functional consequences of RNA modifications on the fate of individual RNA molecules and isoforms.
... Nanopore transcriptome analysis also allows estimation of the poly(A) tail length (Workman et al., 2018), which is important for RNA stability and translation and is difficult to study with standard short-read sequencing methods (Seki et al., 2019). Overall, NS currently offers three different approaches for RNA-seq: cDNA-PCR sequencing, direct cDNA sequencing and direct RNA sequencing; the last is the first direct RNA sequencing method (Garalde et al., 2018). ...
... Moreover, direct RNA-seq allows epitranscriptome analysis, through the detection of transcriptional modifications inferred from the signal as the RNA molecule passes through the nanopore (Schwartz and Motorin, 2017; Workman et al., 2018; Wongsurawat et al., 2018; Liu et al., 2019), as subsequently described. However, the main drawback of nanopore transcriptome analysis is a lower throughput compared to short-read sequencing; the throughput further decreases when using the direct RNA approach, which is the method with the highest input requirement, due to a slower transition through the pore compared to the DNA strand. ...
... In some papers, nanopore cDNA and direct RNA sequencing approaches have been compared, illustrating the strengths and limitations of both (Workman et al., 2018; Seki et al., 2019; Soneson et al., 2019). Although direct sequencing of full-length cDNA/RNA molecules is promising, it has been reported that a certain percentage of the raw nanopore reads are unlikely to be full-length reference transcripts, both with direct cDNA and RNA sequencing, thus interfering with the true identification/quantification of transcripts; this aspect needs to be improved (Soneson et al., 2019). ...
Article
The molecular pathogenesis of hematological diseases is often driven by genetic and epigenetic alterations. Next-generation sequencing has considerably increased our genomic knowledge of these disorders becoming ever more widespread in clinical practice. In 2012 Oxford Nanopore Technologies (ONT) released the MinION, the first long-read nanopore-based sequencer, overcoming the main limits of short-reads sequences generation. In the last years, several nanopore sequencing approaches have been performed in various “-omic” sciences; this review focuses on the challenge to introduce ONT devices in the hematological field, showing advantages, disadvantages and future perspectives of this technology in the precision medicine era.
... To date, the Iso-Seq pipeline has been used to build catalogues of transcripts in a range of species [128,169,170]. Nanopore reads-based transcriptomes are more recent [10, 171–173], and work is still needed to understand the characteristics of these data (e.g. coverage bias, sequence biases, reproducibility). ...
Article
Long-read technologies are overcoming early limitations in accuracy and throughput, broadening their application domains in genomics. Dedicated analysis tools that take into account the characteristics of long-read data are thus required, but the fast pace of development of such tools can be overwhelming. To assist in the design and analysis of long-read sequencing projects, we review the current landscape of available tools and present an online interactive database, long-read-tools.org, to facilitate their browsing. We further focus on the principles of error correction, base modification detection, and long-read transcriptomics analysis and highlight the challenges that remain.
... Instead of sequencing transcript fragments, long-read sequencing methods in the form of Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are now capable of sequencing comprehensive full-length transcriptomes [16–19]. These methods have now been used to analyze single cell cDNA pools generated by different methods, both well- [20,21] and droplet-based [22,23], enriching the information we can extract from single-cell experiments. ...
Preprint
Single cell transcriptome analysis elucidates facets of cell biology that have been previously out of reach. However, the high-throughput analysis of thousands of single cell transcriptomes has been limited by sample preparation and sequencing technology. High-throughput single cell analysis today is facilitated by protocols like the 10X Genomics platform or Drop-Seq which generate cDNA pools in which the origin of a transcript is encoded at its 5′ or 3′ end. These cDNA pools are currently analyzed by short read Illumina sequencing which can identify the cellular origin of a transcript and what gene it was transcribed from. However, these methods fail to retrieve isoform information. In principle, cDNA pools prepared using these approaches can be analyzed with Pacific Biosciences and Oxford Nanopore long-read sequencers to retrieve isoform information but all current implementations rely heavily on Illumina short-reads for the analysis in addition to long reads. Here, we used R2C2 to sequence and demultiplex 9 million full-length cDNA molecules generated by the 10X Chromium platform from ~3000 peripheral blood mononuclear cells (PBMCs). We used these reads to - independent from Illumina data - cluster cells into B cells, T cells, and Monocytes and generate isoform-level transcriptomes for these cell-types. We also generated isoform-level transcriptomes for all single cells and used this information to identify a wide range of isoform diversity between genes. Finally, we also designed a computational workflow to extract paired adaptive immune receptor - T cell receptor and B cell receptor (TCR and BCR) - sequences unique to each T and B cell. This work represents a new, simple, and powerful approach that - using a single sequencing method - can extract an unprecedented amount of information from thousands of single cells.
... Coming soon: direct RNA sequencing. Finally, a newly emerging technology, direct sequencing of RNA [10], offers the possibility of dramatically improving gene annotation in the future. Although still in early development, nanopore sequencing technology can be used to sequence RNA without first converting it to DNA, unlike RNA-seq and other methods. ...
Article
While the genome sequencing revolution has led to the sequencing and assembly of many thousands of new genomes, genome annotation still uses very nearly the same technology that we have used for the past two decades. The sheer number of genomes necessitates the use of fully automated procedures for annotation, but errors in annotation are just as prevalent as they were in the past, if not more so. How are we to solve this growing problem?