Figure - available from: BMC Bioinformatics
Overview of the regional splicing constraint model. A The per-site splicing substitution rate by reference allele and Sum SpliceAI score bin across autosomal protein-coding genes. The rate of no substitutions across all SpliceAI score bins for each reference allele is A > A = 0.9003, C > C = 0.8565, G > G = 0.8433, and T > T = 0.9347. B Calculating an Observed over Expected (O/E) ratio for a genomic region by counting the number of variants in that region from gnomAD and the number of expected variants with a given SpliceAI score. C The O/E score distribution. Smaller O/E scores indicate higher constraint against splicing, while larger O/E scores indicate lower constraint against splicing (O/E plot truncated at -2000 to 2000 for visibility). D Representation of regional splicing constraint O/E scores across a hypothetical gene. The presence of gnomAD variants (gray) and the SpliceAI prediction for each position in the gene (shades of red) influence the splicing-specific observed and expected counts in a region. gnomAD variants with higher SpliceAI scores provide evidence of greater tolerance to splicing variation. In contrast, sites with a higher SpliceAI score and no gnomAD variant provide evidence of less tolerance to splicing variation. Pathogenic splicing variants (black) are commonly absent from gnomAD and are predicted by SpliceAI to alter splicing. In this example, the regional constraint model identifies constraint signals at regions that harbor pathogenic splicing variants, such as canonical splice regions (^) and cryptic splice regions (^^). All genomic positions in D without a SpliceAI score should be recognized as sites with a SpliceAI score < 0.1.
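
For readers who want a concrete picture of the panel B counting, a minimal sketch is shown below. It assumes per-site SpliceAI scores, gnomAD variant presence, and per-site substitution rates are already in hand, and it reduces the published model to a plain observed-over-expected ratio; the 0.1 score cutoff used as a filter, the dictionary layout, and the function name are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the panel B counting, assuming per-site annotations are
# already available as plain Python dicts. The real model bins sites by
# SpliceAI score and reference allele (panel A); the substitution rates and
# the 0.1 score cutoff used here are illustrative assumptions only.

def regional_oe(sites):
    """Compute a simple observed/expected ratio for one region.

    Each site is a dict with:
      'spliceai'           - per-site SpliceAI score (float)
      'has_gnomad_variant' - whether gnomAD contains a variant at the site (bool)
      'p_substitution'     - expected substitution rate for the site's
                             reference allele / score bin (float)
    """
    observed = 0.0
    expected = 0.0
    for site in sites:
        if site["spliceai"] < 0.1:
            continue  # sites below the (assumed) score cutoff are ignored
        if site["has_gnomad_variant"]:
            observed += 1.0  # splice-relevant variation observed in gnomAD
        expected += site["p_substitution"]  # expected variation under the mutation model
    return observed / expected if expected > 0 else float("nan")


# Toy region: high-SpliceAI sites with no gnomAD variants give a small O/E,
# i.e., evidence of constraint against splicing variation.
toy_region = [
    {"spliceai": 0.80, "has_gnomad_variant": False, "p_substitution": 0.10},
    {"spliceai": 0.60, "has_gnomad_variant": False, "p_substitution": 0.12},
    {"spliceai": 0.05, "has_gnomad_variant": True,  "p_substitution": 0.09},
]
print(regional_oe(toy_region))  # 0.0 in this toy case
```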

Source publication
Article
Full-text available
Background Despite numerous molecular and computational advances, roughly half of patients with a rare disease remain undiagnosed after exome or genome sequencing. A particularly challenging barrier to diagnosis is identifying variants that cause deleterious alternative splicing at intronic or exonic loci outside of canonical donor or acceptor spli...

Citations

... Notably, our exclusion of mutations that generate cryptic splice sites further reduces the utility of SpliceAI for ESM evaluation in our database. This is consistent with recent findings where SpliceAI or similar tools were used to evaluate mutations located outside of the splice sites (69). ...
Article
Full-text available
It is now widely accepted that aberrant splicing of constitutive exons is often caused by mutations affecting cis-acting splicing regulatory elements (SREs), but there is a misconception that all exons have an equal dependency on SREs and thus a similar vulnerability to aberrant splicing. We demonstrate that some exons are more likely to be affected by exonic splicing mutations (ESMs) due to an inherent vulnerability, which is context dependent and influenced by the strength of exon definition. We have developed VulExMap, a tool which is based on empirical data that can designate whether a constitutive exon is vulnerable. Using VulExMap, we find that only 25% of all exons can be categorized as vulnerable, whereas two-thirds of 359 previously reported ESMs in 75 disease genes are located in vulnerable exons. Because VulExMap analysis is based on empirical data on splicing of exons in their endogenous context, it includes all features important in determining the vulnerability. We believe that VulExMap will be an important tool when assessing the effect of exonic mutations by pinpointing whether they are located in exons vulnerable to ESMs.
... HAL [36] takes a distinct approach by training on a library of randomized sequences and their experimentally observed splicing patterns, while MMSplice [37] combines the training data from HAL with features derived from primary sequence and additional modules trained on annotated splice sites and clinical variants. Finally, ConSpliceML [38] is a metaclassifier that combines SQUIRLS and SpliceAI scores with a population-based constraint metric measuring the regional depletion of predicted splice-disruptive variants among apparently healthy adults in population databases. While these are, for the most part, general-purpose short-variant predictors, other tools have been purpose-built for more specialized contexts, e.g., synonymous variants and deep intronic variants [39][40][41]. ...
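The metaclassifier idea in the snippet above can be sketched as follows, assuming a random-forest ensemble over three per-variant features (a SpliceAI score, a SQUIRLS score, and a regional O/E constraint value); the synthetic data, feature ordering, and model settings are placeholders for illustration and are not taken from the cited implementation.

```python
# Hedged sketch of a splicing metaclassifier: per-variant scores from two
# sequence-based predictors are combined with a regional constraint value and
# fed to a simple ensemble model. All data below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Placeholder feature matrix: [SpliceAI score, SQUIRLS score, regional O/E]
X = rng.random((200, 3))
# Synthetic labels: 1 = splice-disruptive, 0 = neutral. Smaller O/E (more
# constrained region) is assumed to make disruption more likely.
y = (0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * (1 - X[:, 2])
     + 0.1 * rng.random(200) > 0.6).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Score a hypothetical variant with high predictor scores in a constrained region.
new_variant = np.array([[0.9, 0.8, 0.1]])
print(clf.predict_proba(new_variant)[:, 1])  # estimated probability of splice disruption
```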
... Given the proliferation of splicing predictors and their utility in variant interpretation, it is important to understand their performance characteristics. Previous comparisons have suggested that, overall, SpliceAI represents the state of the art, with several other algorithms including MMSplice, SQUIRLS, and ConSpliceML showing competitive or in some cases better performance [32,34,38,[42][43][44][45][46][47]. However, benchmarking efforts to date primarily relied upon curated sets of clinical variants [32,38,42,[44][45][46][47], which are strongly enriched for canonical splice site mutations [35,42,[47][48][49][50], likely due to the relative ease of their classification. ...
... Previous comparisons have suggested that, overall, SpliceAI represents the state of the art, with several other algorithms including MMSplice, SQUIRLS, and ConSpliceML showing competitive or in some cases better performance [32,34,38,[42][43][44][45][46][47]. However, benchmarking efforts to date primarily relied upon curated sets of clinical variants [32,38,42,[44][45][46][47], which are strongly enriched for canonical splice site mutations [35,42,[47][48][49][50], likely due to the relative ease of their classification. This leaves open the question of how well these tools' performance may generalize, and whether certain tools may excel in particular contexts (e.g., for exonic cryptic splice activating mutations). ...
Article
Full-text available
Background Variants that disrupt mRNA splicing account for a sizable fraction of the pathogenic burden in many genetic disorders, but identifying splice-disruptive variants (SDVs) beyond the essential splice site dinucleotides remains difficult. Computational predictors are often discordant, compounding the challenge of variant interpretation. Because they are primarily validated using clinical variant sets heavily biased to known canonical splice site mutations, it remains unclear how well their performance generalizes. Results We benchmark eight widely used splicing effect prediction algorithms, leveraging massively parallel splicing assays (MPSAs) as a source of experimentally determined ground-truth. MPSAs simultaneously assay many variants to nominate candidate SDVs. We compare experimentally measured splicing outcomes with bioinformatic predictions for 3,616 variants in five genes. Algorithms’ concordance with MPSA measurements, and with each other, is lower for exonic than intronic variants, underscoring the difficulty of identifying missense or synonymous SDVs. Deep learning-based predictors trained on gene model annotations achieve the best overall performance at distinguishing disruptive and neutral variants, and controlling for overall call rate genome-wide, SpliceAI and Pangolin have superior sensitivity. Finally, our results highlight two practical considerations when scoring variants genome-wide: finding an optimal score cutoff, and the substantial variability introduced by differences in gene model annotation, and we suggest strategies for optimal splice effect prediction in the face of these issues. Conclusion SpliceAI and Pangolin show the best overall performance among predictors tested, however, improvements in splice effect prediction are still needed especially within exons.
... SpliceAI [10] is widely recognized as the most successful method of this kind, although its performance has been shown to vary across studies and datasets considered [5]. Recently, new models have been developed based on SpliceAI, either combining its predictions with other sources of information (such as genetic constraint for ConSpliceML [25] and PDIVAS [26] or tissue-specific splice site usage for AbSplice-DNA [27]) or creating an entirely new model based on the SpliceAI architecture. For example, Pangolin [28] uses splicing quantifications from multiple species and tissues to not only predict whether a position is a splice site (as SpliceAI does) but also to predict splice site usage (e.g., how much a splice site is being used in a given tissue). ...
Article
Full-text available
Background The adoption of whole-genome sequencing in genetic screens has facilitated the detection of genetic variation in the intronic regions of genes, far from annotated splice sites. However, selecting an appropriate computational tool to discriminate functionally relevant genetic variants from those with no effect is challenging, particularly for deep intronic regions where independent benchmarks are scarce. Results In this study, we have provided an overview of the computational methods available and the extent to which they can be used to analyze deep intronic variation. We leveraged diverse datasets to extensively evaluate tool performance across different intronic regions, distinguishing between variants that are expected to disrupt splicing through different molecular mechanisms. Notably, we compared the performance of SpliceAI, a widely used sequence-based deep learning model, with that of more recent methods that extend its original implementation. We observed considerable differences in tool performance depending on the region considered, with variants generating cryptic splice sites being better predicted than those that potentially affect splicing regulatory elements. Finally, we devised a novel quantitative assessment of tool interpretability and found that tools providing mechanistic explanations of their predictions are often correct with respect to the ground truth information, but the use of these tools results in decreased predictive power when compared to black box methods. Conclusions Our findings translate into practical recommendations for tool usage and provide a reference framework for applying prediction tools in deep intronic regions, enabling more informed decision-making by practitioners.
... The ConSplice feature was obtained from the score-precomputed bed file of the best_splicing_constraint_model provided by the authors [20]. MaxEntScan prediction of the variant's effect on splicing was performed using the plugin module of VEP [21][22][23]. ...
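As a hedged sketch of how such a precomputed, coordinate-sorted score file might be queried per variant, the snippet below assumes the file has been bgzip-compressed and tabix-indexed with a chrom/start/end/score column layout; the file name, column order, and example coordinates are assumptions for illustration, not the authors' released format.

```python
# Hedged sketch: look up a regional splicing-constraint score for a variant
# position in a bgzipped, tabix-indexed BED-like file. The file name, column
# layout (chrom, start, end, score), and coordinates are illustrative only.
import pysam

def lookup_constraint(tbx, chrom, pos):
    """Return the score of the first interval covering a 1-based position."""
    for row in tbx.fetch(chrom, pos - 1, pos):  # tabix uses 0-based, half-open coordinates
        fields = row.rstrip("\n").split("\t")
        return float(fields[3])  # assumed score column
    return None  # no interval covers this position

tbx = pysam.TabixFile("constraint_scores.bed.gz")  # hypothetical precomputed file
print(lookup_constraint(tbx, "chr1", 925952))      # hypothetical variant position
tbx.close()
```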
... Predictors to be compared to PDIVAS were SpliceAI, Pangolin [24], ConSpliceML [20], MaxEntScan, and CADD-Splice [25], which were selected based on the following criteria: (1) the program or the precomputed score file is freely available, (2) the program can assess deep-intronic variants, and (3) the program is operated in a Linux environment and can be applied to large-scale variant analysis. SQUIRLS and SPiP also matched these criteria, but their developers recognized their lower performance on deep-intronic SAVs because of the limited number of training datasets [26,27]. ...
... Therefore, a more specific prediction was achieved by combining a deleterious prediction with a human splicing constraint metric from ConSplice, which models mutational constraints on splice-altering variants within the human population. The effectiveness of this approach in predicting deleterious splicing was demonstrated in the ConSpliceML predictor [20]. Likewise, we employed ConSplice for more specialized usage for deep-intronic SAVs. ...
Article
Full-text available
Background Deep-intronic variants that alter RNA splicing were ineffectively evaluated in the search for the cause of genetic diseases. Determination of such pathogenic variants from a vast number of deep-intronic variants (approximately 1,500,000 variants per individual) represents a technical challenge to researchers. Thus, we developed a Pathogenicity predictor for Deep-Intronic Variants causing Aberrant Splicing (PDIVAS) to easily detect pathogenic deep-intronic variants. Results PDIVAS was trained on an ensemble machine-learning algorithm to classify pathogenic and benign variants in a curated dataset. The dataset consists of manually curated pathogenic splice-altering variants (SAVs) and commonly observed benign variants within deep introns. Splicing features and a splicing constraint metric were used to maximize the predictive sensitivity and specificity, respectively. PDIVAS showed an average precision of 0.92 and a maximum MCC of 0.88 in classifying these variants, which were the best of the previous predictors. When PDIVAS was applied to genome sequencing analysis on a threshold with 95% sensitivity for reported pathogenic SAVs, an average of 27 pathogenic candidates were extracted per individual. Furthermore, the causative variants in simulated patient genomes were more efficiently prioritized than the previous predictors. Conclusion Incorporating PDIVAS into variant interpretation pipelines will enable efficient detection of disease-causing deep-intronic SAVs and contribute to improving the diagnostic yield. PDIVAS is publicly available at https://github.com/shiro-kur/PDIVAS. Graphical abstract
... Given the proliferation of splicing predictors and their utility in variant interpretation, it is important to understand their performance characteristics. Previous comparisons have suggested that, overall, SpliceAI represents the state of the art, with several other algorithms including MMSplice, SQUIRLS, and ConSpliceML showing competitive or in some cases better performance 30,32,[36][37][38][39][40][41][42]. However, benchmarking efforts to date primarily relied upon curated sets of clinical variants 30,36,37,[39][40][41][42], which are strongly enriched for canonical splice site mutations 33,37,[42][43][44][45], likely due to the relative ease of their classification. ...
... Previous comparisons have suggested that, overall, SpliceAI represents the state of the art, with several other algorithms including MMSplice, SQUIRLS, and ConSpliceML showing competitive or in some cases better performance 30,32,[36][37][38][39][40][41][42]. However, benchmarking efforts to date primarily relied upon curated sets of clinical variants 30,36,37,[39][40][41][42], which are strongly enriched for canonical splice site mutations 33,37,[42][43][44][45], likely due to the relative ease of their classification. This leaves open the question of how well these tools' performance may generalize, and whether certain tools may excel in particular contexts (e.g., for exonic cryptic splice activating mutations). ...
... As we found, SpliceAI was often but not always the top performer in these past comparisons [37][38][39][40][41][42],93. Together, these results suggest opportunities for metaclassifiers to better calibrate existing predictors and to leverage each within its strongest domain 36,38. ...
Preprint
Full-text available
Background Variants that disrupt mRNA splicing account for a sizable fraction of the pathogenic burden in many genetic disorders, but identifying splice-disruptive variants (SDVs) beyond the essential splice site dinucleotides remains difficult. Computational predictors are often discordant, compounding the challenge of variant interpretation. Because they are primarily validated using clinical variant sets heavily biased to known canonical splice site mutations, it remains unclear how well their performance generalizes. Results We benchmarked eight widely used splicing effect prediction algorithms, leveraging massively parallel splicing assays (MPSAs) as a source of experimentally determined ground-truth. MPSAs simultaneously assay many variants to nominate candidate SDVs. We compared experimentally measured splicing outcomes with bioinformatic predictions for 3,616 variants in five genes. Algorithms' concordance with MPSA measurements, and with each other, was lower for exonic than intronic variants, underscoring the difficulty of identifying missense or synonymous SDVs. Deep learning-based predictors trained on gene model annotations achieved the best overall performance at distinguishing disruptive and neutral variants. Controlling for overall call rate genome-wide, SpliceAI and Pangolin also showed superior overall sensitivity for identifying SDVs. Finally, our results highlight two practical considerations when scoring variants genome-wide: finding an optimal score cutoff, and the substantial variability introduced by differences in gene model annotation, and we suggest strategies for optimal splice effect prediction in the face of these issues. Conclusion SpliceAI and Pangolin showed the best overall performance among predictors tested, however, improvements in splice effect prediction are still needed especially within exons.
... The ConSplice feature was obtained from the score-precomputed bed file of the best_splicing_constraint_model provided by the authors [20]. MaxEntScan prediction of the variant's effect on splicing was performed using the plugin module of VEP [21][22][23]. ...
... Predictors to be compared to PDIVAS were SpliceAI, Pangolin [24], ConSpliceML [20], MaxEntScan, and CADD-Splice [25], which were selected based on the following criteria: 1) the program or the precomputed score file is freely available, 2) the program ...
... Therefore, a more specific prediction was achieved by combining a deleterious prediction with a human splicing constraint metric from ConSplice, which models mutational constraints on splice-altering variants within the human population. The effectiveness of this approach in predicting deleterious splicing was demonstrated in the ConSpliceML predictor [20]. Likewise, we employed ConSplice for more specialized usage for deep-intronic SAVs (Fig. 2h). ...
Preprint
Full-text available
Deep-intronic variants often cause genetic diseases by altering RNA splicing. However, these pathogenic variants are overlooked in whole-genome sequencing analyses, because they are quite difficult to segregate from a vast number of benign variants (approximately 1,500,000 deep-intronic variants per individual). Therefore, we developed the Pathogenicity predictor for Deep-intronic Variants causing Aberrant Splicing (PDIVAS), an ensemble machine-learning model combining multiple splicing features and regional splicing constraint metrics. Using PDIVAS, around 27 pathogenic candidates were identified per individual with 95% sensitivity, and causative variants were more efficiently prioritized than previous predictors in simulated patient genome sequences. PDIVAS is available at https://github.com/shiro-kur/PDIVAS.
... SpliceAI [10] is widely recognized as the most successful method of this kind, although its performance has been shown to vary across studies and datasets considered [5]. Recently, new models have been developed based on SpliceAI, either combining its predictions with other sources of information (such as genetic constraint for ConSpliceML [24] or tissue-specific splice site usage for AbSplice-DNA [25]) or creating an entirely new model based on SpliceAI architecture. For example, Pangolin [26] uses splicing quantifications from multiple species and tissues to not only predict whether a position is a splice site (as SpliceAI does) but also to predict splice site usage (e.g., how much a splice site is being used in a given tissue). ...
Preprint
Full-text available
The adoption of whole genome sequencing in genetic screens has facilitated the detection of genetic variation in the intronic regions of genes, far from annotated splice sites. However, selecting an appropriate computational tool to differentiate functionally relevant genetic variants from those with no effect is challenging, particularly for deep intronic regions where independent benchmarks are scarce. In this study, we have provided an overview of the computational methods available and the extent to which they can be used to analyze deep intronic variation. We leveraged diverse datasets to extensively evaluate tool performance across different intronic regions, distinguishing between variants that are expected to disrupt splicing through different molecular mechanisms. Notably, we compared the performance of SpliceAI, a widely used sequence-based deep learning model, with that of more recent methods that extend its original implementation. We observed considerable differences in tool performance depending on the region considered, with variants generating cryptic splice sites being better predicted than those that affect splicing regulatory elements or the branchpoint region. Finally, we devised a novel quantitative assessment of tool interpretability and found that tools providing mechanistic explanations of their predictions are often correct with respect to the ground truth information, but the use of these tools results in decreased predictive power when compared to black box methods. Our findings translate into practical recommendations for tool usage and provide a reference framework for applying prediction tools in deep intronic regions, enabling more informed decision-making by practitioners.