Figure 1 - available from: Nature Communications
A single gene may be transcribed into several distinct mRNA variants called isoforms through alternative splicing mechanisms. This figure shows six common types of splicing events (top to bottom): simple transcript; alternative transcription start site; alternative 5' splice site; alternative 3' splice site; skipped exon; and alternative polyadenylation. 

Source publication
Article
Most human protein-coding genes can be transcribed into multiple distinct mRNA isoforms. These alternative splicing patterns encourage molecular diversity, and dysregulation of isoform expression plays an important role in disease etiology. However, isoforms are difficult to characterize from short-read RNA-seq data because they share ident...

Contexts in source publication

Context 1
... splicing is the process by which a single protein-coding gene produces distinct mRNA transcripts, which vary in usage of component exons [1]. Isoforms can differ by alternative transcription initiation sites, alternative usage of splice sites (either 5' donor or 3' acceptor sites), alternative polyadenylation sites, or variable inclusion of entire exons or introns (Figure 1). Altogether, alternative splicing enables the large diversity of mRNA expression levels or proteome composition observed in eukaryotic cells, which is particularly important for regulating the context-specific needs of the cell [2]. ...
Context 2
... precision and recall were calculated by defining imperfect matchings between each estimated transcript and the true transcripts (see Methods and Supplementary Fig. 1). We controlled for issues regarding exon identification by counting an exon as successfully inferred if any subsequence of the inferred isoform overlapped an exon in the gene annotation. ...
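The overlap criterion in this excerpt can be illustrated with a minimal sketch. It is not the authors' implementation: the half-open coordinate convention, the `Interval` type, and the function names `overlaps` and `recovered_exons` are all assumptions introduced here for illustration.

```python
from typing import List, Tuple

# Assumption: half-open [start, end) genomic coordinates on a single chromosome.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def recovered_exons(inferred: List[Interval], annotated: List[Interval]) -> List[Interval]:
    """Annotated exons overlapped by any segment of the inferred isoform.

    Under the criterion described above, each such exon counts as
    successfully inferred even if its boundaries are not matched exactly.
    """
    return [exon for exon in annotated if any(overlaps(exon, seg) for seg in inferred)]


# Hypothetical toy gene with three annotated exons.
annotated = [(100, 200), (300, 400), (500, 600)]
# Hypothetical inferred isoform: two segments with imprecise boundaries.
inferred = [(150, 210), (520, 580)]
hits = recovered_exons(inferred, annotated)
```

In this toy case the first and third annotated exons are counted as inferred, while the middle exon, which no inferred segment touches, is not.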
Context 3
... and Cufflinks also showed the highest agreement between any pair of distinct methods (r = 0.567, Supplementary Fig. 8). We also found that BIISQ inferred more transcripts across all exon compositions (Supplementary Fig. 9) and all spans (Supplementary Fig. 10) in the Iso-Seq data. ...
Context 4
... junctions. However, isoform transcripts often share many splice junctions, making them difficult to deconvolve in short-read paired-end data, whereas dissimilar transcripts are easier to differentiate using unique splice junctions. The Iso-Seq simulated data included transcripts with a lower average normalized distance between isoforms Figs. ...
Context 5
... SLIDE had the longest run times. However, isoform reconstruction can be parallelized at the level of reference transcript, so difficulties associated with running multiple iterations of BIISQ on large data may be reduced by having many compute nodes to process distinct genes in parallel. Gene and read lengths had marginal effects on run time (Supplementary Figs. 11 and 12). However, Supplementary Fig. 15) 12 ...
Context 6
... and 12). However, Supplementary Fig. 15) 12 . ...
Context 7
... investigate population- and sex-specific splicing patterns, we first analyzed transcript ratio quantification patterns across all genes in the GEUVADIS data. We considered global signatures of differential transcript ratio usage and did not find a significant difference in the average isoform transcript counts across sex (χ² test, p ≤ 0.99) or population (χ² test, p ≤ 1) when counts were aggregated across protein-coding genes (Supplementary Fig. 15). We then computed population- and sex-specific transcript ratio distributions for each protein-coding gene independently using likelihood ratio (LR) tests (see Online Methods). ...
Context 8
... found significant correlation among BIISQ cis-eQTLs and the GEUVADIS study cis-eQTLs (EUR: Supplementary Fig. 17; ρ = 0.76 and r_s = 0.63, t-test, p ≤ 2.2 × 10⁻¹⁶ and YRI: ...
Context 9
... Fig. 18; ρ = 0.56 and r_s = 0.52, t-test, p ≤ 2.2 × 10⁻¹⁶). There were not enough overlapping transcripts to compare BIISQ cis-trQTLs and the GEUVADIS study cis-trQTLs, but comparing the most significant cis-trQTLs for each gene between the BIISQ and GEUVADIS study (EUR population) results did not show significant correlation (Supplementary ...
Context 10
... Fig. 18; ρ = 0.56 and r_s = 0.52, t-test, p ≤ 2.2 × 10⁻¹⁶). There were not enough overlapping transcripts to compare BIISQ cis-trQTLs and the GEUVADIS study cis-trQTLs, but comparing the most significant cis-trQTLs for each gene between the BIISQ and GEUVADIS study (EUR population) results did not show significant correlation (Supplementary Fig. 19; t-test, p_ρ = 0.49 and p_{r_s} = 0.12). Interestingly, BIISQ cis-eQTLs were more highly correlated with BIISQ cis-trQTLs (Supplementary Fig. 20; ρ = 0.74 and r_s = 0.51) than GEUVADIS study cis-trQTLs and cis-eQTLs (EUR: Supplementary Fig. 21; ρ = 0.3 and r_s = 0.23 and YRI: Supplementary Fig. 22; ρ = 0.18 and r_s = 0.03). ...
Context 11
... were not enough overlapping transcripts to compare BIISQ cis-trQTLs and the GEUVADIS study cis-trQTLs, but comparing the most significant cis-trQTLs for each gene between the BIISQ and GEUVADIS study (EUR population) results did not show significant correlation (Supplementary Fig. 19; t-test, p_ρ = 0.49 and p_{r_s} = 0.12). Interestingly, BIISQ cis-eQTLs were more highly correlated with BIISQ cis-trQTLs (Supplementary Fig. 20; ρ = 0.74 and r_s = 0.51) than GEUVADIS study cis-trQTLs and cis-eQTLs (EUR: Supplementary Fig. 21; ρ = 0.3 and r_s = 0.23 and YRI: Supplementary Fig. 22; ρ = 0.18 and r_s = 0.03). ...
Context 12
... define the cis region of a gene as the genetic variants falling within 100 kb of a gene's transcription start or end site. Sex, population, the first three genotype principal components, and 15 PEER factors estimated from the isoform ratio matrix were included as covariates using a standard processing pipeline for RNA-seq data to control for population structure and latent confounders [61] (Supplementary Fig. 16 and Supplementary Table 5). The expression of cis-eQTLs was also quantile normalized, and we removed genes with a single transcript or fewer than three exons in the computation of cis-trQTLs. ...
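As an illustration of the cis-region definition in this excerpt, the following sketch selects variants lying within 100 kb of either end of a gene. It takes the definition literally (distance to the transcription start site or the end site), which is an assumption: the excerpt does not say how variants deep inside very long genes are handled. The `Gene` class, the function `in_cis_region`, and the toy coordinates are hypothetical.

```python
from dataclasses import dataclass

CIS_WINDOW = 100_000  # 100 kb, the window size stated in the excerpt


@dataclass(frozen=True)
class Gene:
    chrom: str
    start: int  # coordinate of one transcript boundary (start or end site)
    end: int    # coordinate of the other boundary


def in_cis_region(gene: Gene, chrom: str, pos: int, window: int = CIS_WINDOW) -> bool:
    """True if the variant lies within `window` bases of either gene boundary.

    Strand is ignored because both boundaries are treated symmetrically.
    """
    if chrom != gene.chrom:
        return False
    return min(abs(pos - gene.start), abs(pos - gene.end)) <= window


# Hypothetical gene and variant positions.
gene = Gene("chr1", 1_000_000, 1_050_000)
variants = [("chr1", 900_000), ("chr1", 899_999), ("chr1", 1_025_000), ("chr2", 1_000_000)]
cis_variants = [(c, p) for c, p in variants if in_cis_region(gene, c, p)]
```

A pipeline that also wants every variant in the gene body could instead test `gene.start - window <= pos <= gene.end + window`; the two readings differ only for genes longer than twice the window.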
Context 13
... evaluation criterion that requires the true and inferred exon sets to be identical is often conservative due to variable read coverage of exons. Therefore, isoform reconstruction was evaluated by considering both perfect and imperfect matchings to determine precision and recall (Supplementary Fig. 1). For exact matches, precision and recall were calculated based on exact full-length isoform matches between true (simulated) and estimated isoforms: let true positives, false positives, and false negatives be denoted TP, FP, and FN respectively. ...
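The excerpt breaks off before giving the formulas, so the sketch below assumes the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), applied to exact full-length matches. Representing an isoform as a frozen set of exon identifiers, and the names `Isoform` and `exact_match_scores`, are assumptions made for illustration only.

```python
from typing import FrozenSet, Set, Tuple

# Assumption: an isoform is identified by the set of exon IDs it contains.
Isoform = FrozenSet[str]


def exact_match_scores(true_isoforms: Set[Isoform], estimated: Set[Isoform]) -> Tuple[float, float]:
    """Precision and recall when only identical exon sets count as matches."""
    tp = len(true_isoforms & estimated)   # estimated isoforms that are true
    fp = len(estimated - true_isoforms)   # estimated isoforms that are not true
    fn = len(true_isoforms - estimated)   # true isoforms that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical simulated truth and estimates for one gene.
truth = {frozenset({"e1", "e2", "e3"}), frozenset({"e1", "e3"})}
estimates = {frozenset({"e1", "e2", "e3"}), frozenset({"e2", "e3"}), frozenset({"e1"})}
precision, recall = exact_match_scores(truth, estimates)
```

Here one of three estimated isoforms is correct and one of two true isoforms is recovered, giving a precision of 1/3 and a recall of 1/2.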

Similar publications

Article
Background: Sugarcane is an important global food crop and energy resource. To facilitate the sugarcane improvement program, genome and gene information are important for studying traits at the molecular level. Most currently available transcriptome data for sugarcane were generated using second-generation sequencing platforms, which provide short r...

Citations

... It has the ability to detect isoforms and novel transcripts. It also has a bigger dynamic range [5,6]. The DEG analysis is essential in cancer research to assess the biological variation in genes and identify gene ...
Article
Gene expression analysis of transcriptomic data enables us to identify changes in gene expression under some biological conditions. Ribonucleic acid (RNA) sequencing (RNA-seq) data can show genetic mutations and intricate biological process connections, which are useful in the diagnosis and treatment of cancer. The existing classical differential gene expression analysis techniques are prone to false negatives and false positives with smaller datasets. With the improvements in the field of machine learning (ML), we want to build an ensemble learning model for the classification of differentially expressed genes (DEGs) from RNA-seq data for pancreatic cancer. The gene expression data was obtained from the Cancer Genome Atlas-Pancreatic Adenocarcinoma Project database. In this paper, we are proposing a stacking classifier with cross-validation called the stacking CV classifier, which is an ensemble of K-nearest neighbor, random forest, gradient boosting, and logistic regression classifiers for effective classification of DEGs. We also made a comparative analysis between the results of our ensemble model and existing models in the literature. The results of our model were competitive (accuracy 96% and area under the curve 0.99) against the stand-alone and existing gene classification models. Our ML-based model is a promising tool for classifying DEGs based on gene expression patterns.
... The experimental data labels are noisy and prior work suggests that less than 10% of non-B DNA motif windows for Z-DNA, G4, and H-DNA form non-B DNA structures at any given time (Kouzine et al. 2017). Therefore, we consider an approach commonly employed in eQTL studies (Aguiar et al. 2018); after model training, we compute an empirical distribution of scores on X^B_te, which forms an empirical null distribution. We generate scores for each x ∈ X^N_te and compute a P-value analogously to Equation (5). ...
Article
Motivation: Non-canonical (or non-B) DNA are genomic regions whose three-dimensional conformation deviates from the canonical double helix. Non-B DNA play an important role in basic cellular processes and are associated with genomic instability, gene regulation, and oncogenesis. Experimental methods are low-throughput and can detect only a limited set of non-B DNA structures, while computational methods rely on non-B DNA base motifs, which are necessary but not sufficient indicators of non-B structures. Oxford Nanopore sequencing is an efficient and low-cost platform, but it is currently unknown whether nanopore reads can be used for identifying non-B structures. Results: We build the first computational pipeline to predict non-B DNA structures from nanopore sequencing. We formalize non-B detection as a novelty detection problem and develop the GoFAE-DND, an autoencoder that uses goodness-of-fit (GoF) tests as a regularizer. A discriminative loss encourages non-B DNA to be poorly reconstructed and optimizing Gaussian GoF tests allows for the computation of P-values that indicate non-B structures. Based on whole genome nanopore sequencing of NA12878, we show that there exist significant differences between the timing of DNA translocation for non-B DNA bases compared with B-DNA. We demonstrate the efficacy of our approach through comparisons with novelty detection methods using experimental data and data synthesized from a new translocation time simulator. Experimental validations suggest that reliable detection of non-B DNA from nanopore sequencing is achievable. Availability and implementation: Source code is available at https://github.com/bayesomicslab/ONT-nonb-GoFAE-DND.
... Despite these challenges, a significant number of isoform reconstruction and quantification methods have been developed with different modelling assumptions and reconstruction goals (Aguiar et al. 2018; Trapnell et al. 2013; Vaquero-Garcia et al. 2016). ...
... Among these annotation-based methods, the majority reconstruct full-length transcripts defined by their composite exons. The Bayesian isoform discovery and individual specific quantification (BIISQ) method models transcript reconstruction with a nonparametric Bayesian hierarchical model, where samples are mixtures of transcripts sampled from a population transcript distribution (Aguiar et al. 2018). While BIISQ was shown to have high accuracy on low abundance isoforms, it requires both the genes and the composite exon coordinates, and is unable to construct isoform transcripts that deviate from this reference annotation. ...
... We model the kth SME, β_k, as a degenerate Dirichlet distribution whose dimension is controlled by the beta-Bernoulli prior (Aguiar et al. 2018; Wang et al. 2009). Intuitively, we discourage excisions from occupying the same SME based on the structure of the excision interval graph G. ...
Preprint
Characterizing the differential excision of mRNA is critical for understanding the functional complexity of a cell or tissue, from normal developmental processes to disease pathogenesis. Most transcript reconstruction methods infer full-length transcripts from high-throughput sequencing data. However, this is a challenging task due to incomplete annotations and the differential expression of transcripts across cell-types, tissues, and experimental conditions. Several recent methods circumvent these difficulties by considering local splicing events, but these methods lose transcript-level splicing information and may conflate transcripts. We develop the first probabilistic model that reconciles the transcript and local splicing perspectives. First, we formalize the sequence of mRNA excisions (SME) reconstruction problem, which aims to assemble variable-length sequences of mRNA excisions from RNA-sequencing data. We then present a novel hierarchical Bayesian admixture model for the Reconstruction of Excised mRNA (BREM). BREM interpolates between local splicing events and full-length transcripts and thus focuses only on SMEs that have high posterior probability. We develop posterior inference algorithms based on Gibbs sampling and local search of independent sets and characterize differential SME usage using generalized linear models based on converged BREM model parameters. We show that BREM achieves higher F1 score for reconstruction tasks and improved accuracy and sensitivity in differential splicing when compared with four state-of-the-art transcript and local splicing methods on simulated data. Lastly, we evaluate BREM on both bulk and scRNA sequencing data based on transcript reconstruction, novelty of transcripts produced, model sensitivity to hyperparameters, and a functional analysis of differentially expressed SMEs, demonstrating that BREM captures relevant biological signal.
... The observed data likelihood remains a concave function under this adjustment (see next section), provided we precompute the extent of migration errors. We can, thus, extend the EM algorithm implemented in kallisto to find the values of α that maximize likelihood (3). The EM algorithm alternates between fractionally assigning fragments to transcripts in different bands based on current parameter estimates and recalculating parameters from these fragment assignments. ...
Article
The accuracy of methods for assembling transcripts from short-read RNA sequencing data is limited by the lack of long-range information. Here we introduce Ladder-seq, an approach that separates transcripts according to their lengths before sequencing and uses the additional information to improve the quantification and assembly of transcripts. Using simulated data, we show that a kallisto algorithm extended to process Ladder-seq data quantifies transcripts of complex genes with substantially higher accuracy than conventional kallisto. For reference-based assembly, a tailored scheme based on the StringTie2 algorithm reconstructs a single transcript with 30.8% higher precision than its conventional counterpart and is more than 30% more sensitive for complex genes. For de novo assembly, a similar scheme based on the Trinity algorithm correctly assembles 78% more transcripts than conventional Trinity while improving precision by 78%. In experimental data, Ladder-seq reveals 40% more genes harboring isoform switches compared to conventional RNA sequencing and unveils widespread changes in isoform usage upon m⁶A depletion by Mettl14 knockout.
... Based on sampling frequencies, IntAPT reports, for each isoform, a confidence measure of its presence and abundance. Another advantage of Gibbs sampling is improved identification of low-abundance isoforms, which is also addressed by another Bayesian assembler (Aguiar et al., 2018). IntAPT demonstrated improved sensitivity for lowly expressed isoforms in both our simulation and real data studies. ...
Article
Motivation: High-throughput RNA sequencing has revolutionized the scope and depth of transcriptome analysis. Accurate reconstruction of a phenotype-specific transcriptome is challenging due to the noise and variability of RNA-seq data. This requires computational identification of transcripts from multiple samples of the same phenotype, given the underlying consensus transcript structure. Results: We present a Bayesian method, Integrated Assembly of Phenotype-specific Transcripts (IntAPT), that identifies phenotype-specific isoforms from multiple RNA-seq profiles. IntAPT features a novel two-layer Bayesian model to capture the presence of isoforms at the group layer and to quantify the abundance of isoforms at the sample layer. A spike-and-slab prior is used to model the isoform expression and to enforce the sparsity of expressed isoforms. Dependencies between the existence of isoforms and their expression are modeled explicitly to facilitate parameter estimation. Model parameters are estimated iteratively using Gibbs sampling to infer the joint posterior distribution, from which the presence and abundance of isoforms can reliably be determined. Studies using both simulations and real data sets show that IntAPT consistently outperforms existing methods for the integrated assembly of phenotype-specific transcripts. Experimental results demonstrate that, despite sequencing errors, IntAPT exhibits a robust performance among multiple samples, resulting in notably improved identification of expressed isoforms of low abundance. Availability: The IntAPT package is available at http://github.com/henryxushi/IntAPT. Supplementary information: Supplementary data are available at Bioinformatics online.
... Transcript assembly methods such as StringTie [37], CIDANE [7], and CLASS [45] aim to identify the set of expressed full-length transcripts, which in principle provides a complete picture of all splicing variations; see, e.g., transcripts t_1-t_5 in Fig. 1. Since the number of possible transcripts that can be stitched together from short reads is larger than the number of truly expressed transcripts and each read carries little information about its originating transcript, the transcript assembly problem is ill-posed [26] and error-prone, especially for complex genes expressing multiple transcript isoforms [18,1]. ...
Preprint
Alternative splicing removes intronic sequences from transcripts in alternative ways to produce different forms (isoforms) of mature mRNA. The composition of expressed transcripts and their alternative forms give specific functionalities to cells in a particular condition or developmental stage. In addition, a large fraction of human disease mutations affect splicing and lead to aberrant mRNA and protein products. Current methods that interrogate the transcriptome based on RNA-seq either suffer from short read length when trying to infer full-length transcripts, or are restricted to predefined units of alternative splicing that they quantify from local read evidence. Instead of attempting to quantify individual outcomes of the splicing process such as local splicing events or full-length transcripts, we propose to quantify alternative splicing using a simplified probabilistic model of the underlying splicing process. Our model is based on the usage of individual splice sites and can generate arbitrarily complex types of splicing patterns. In our method, McSplicer, we estimate the parameters of our model using all read data at once and we demonstrate in our experiments that this yields more accurate estimates compared to competing methods. Our model is able to describe multiple effects of splicing mutations using few, easy to interpret parameters, as we illustrate in an experiment on RNA-seq data from autism spectrum disorder patients. McSplicer is implemented in Python and available as open-source at https://github.com/canzarlab/McSplicer.
... Graphical models are also quite relevant in Data-Bio-sciences. Their ability to discover patterns even when the training size is small makes them particularly suited to approach many biological questions (Aguiar et al., 2018). Many other related matters can or could be tackled by advanced computational learning methods (Beam et al., 2018; De Cao and Kipf, 2018). ...
Preprint
The codon usage bias is the DNA sequence pattern problem in protein coding genes, where sets of synonymous codons, blocks of 3 nucleotides which encode for the same amino acid in the transcription process, have a non random distribution. As codons are very low level features in the genome, many factors might explain their particular distributions. Amongst the stated ones, the GC content of the genome, the tRNA pool, the replication speed of the cell, as well as the size, the evolutionary distance and pressure over the gene might all have a degree of influence over it. How much does each factor help to explain the bias registered in the codon usage of genes? How much of the codon usage bias are they able to predict? We set out to study such questions using advanced statistical tools, over a set of related species.
Article
Machine learning (ML) models are used in the interdisciplinary field of bio-ML to solve biological challenges. The diagnosis and treatment of cancer can benefit from the display of genetic mutations and complex biological process relationships in Ribonucleic acid sequencing (RNA-seq) data. In this paper, we are proposing a bio-ML approach to find gene biomarkers in pancreatic cancer (PC). The pancreatic adenocarcinoma (PAAD) gene expression data was obtained from The Cancer Genome Atlas (TCGA) project database. In our work, we used two methods: one is an ensemble stacking classifier with cross-validation (SCV), which is an ensemble of K-nearest neighbour (KNN), random forest (RF), gradient boosting (GB), and logistic regression (LR) classifiers for effective classification of differentially expressed genes (DEGs); and the second is weighted gene co-expression network analysis (WGCNA) to find the hub gene module. The genes reported from the first and second methods were intersected to find common DEGs. These DEGs were analysed using the PPI network, gene ontology, and pathways to identify the eight hub genes. These hub genes were further evaluated using Gene expression profiling interactive analysis version 2 (GEPIA2), resulting in four novel biomarkers (BUB1, BUB1B, KIF11, and TTK). We believe the integration of the ML approach in biological research is producing encouraging results and aiding in the resolution of challenging issues.