
Alu sequences in the coding regions of mRNA: A source of protein variability

Abstract and Figures

Dispersion of repetitive sequence elements is a source of genetic variability that contributes to genome evolution. Alu elements, the most common dispersed repeats in the human genome, can cause genetic diseases by several mechanisms, including de novo Alu insertions and splicing of intragenic Alu elements into mRNA. Such mutations might contribute positively to protein evolution if they are advantageous or neutral. To test this hypothesis, we searched the literature and sequence databases for examples of protein-coding regions that contain Alu sequences: 17 Alu 'cassettes' inserted within 15 different coding sequences were found. In three instances, these events caused genetic diseases; the possible functional significance of the other Alu-containing mRNAs is discussed. Our analysis suggests that splice-mediated insertion of intronic elements is the major mechanism by which Alu segments are introduced into mRNAs.
... In a process called exonization, TEs can integrate into genomic regions and offer recognition by the splicing machinery as a newly recruited exon [100]. Approximately 4% of human genes contain TE motifs in their coding regions, indicating that exons may have been derived from the exonization of TEs [101][102][103][104][105][106]. Some studies have identified that exonized LINEs in the human genome provide an additional domain and produce abnormal transcripts through diverse alternative splicing mechanisms in cancers. ...
... TEs change the expression of cancer-related gene variants through the suggestion of alternative 5′ or 3′ SSs. In particular, inserted Alu retroelements, which contain multiple sites with sequences similar to those of SSs, could be considered a real exon by offering pseudo-SSs [101,102,121,122]. One study identified that PDZK1, which plays a crucial role in ion-channel organization, upregulates gene expression by providing alternative 5′ sites via the inserted Alu [123,124]. ...
Article
Alternative splicing of messenger RNA (mRNA) precursors contributes to genetic diversity by generating structurally and functionally distinct transcripts. In a disease state, alternative splicing promotes incidence and development of several cancer types through regulation of cancer-related biological processes. Transposable elements (TEs), having the genetic ability to jump to other regions of the genome, can bring about alternative splicing events in cancer. TEs can integrate into the genome, mostly in the intronic regions, and induce cancer-specific alternative splicing by adjusting various mechanisms, such as exonization, providing splicing donor/acceptor sites, alternative regulatory sequences or stop codons, and driving exon disruption or epigenetic regulation. Moreover, TEs can produce microRNAs (miRNAs) that control the proportion of transcripts by repressing translation or stimulating the degradation of transcripts at the post-transcriptional level. Notably, TE insertion creates a cancer-friendly environment by controlling the overall process of gene expression before and after transcription in cancer cells. This review emphasizes the correlative interaction between alternative splicing by TE integration and cancer-associated biological processes, suggesting a macroscopic mechanism controlling alternative splicing by TE insertion in cancer.
... A significant fraction of them (72%, 44 out of 61) have not been annotated before, which is possibly owing to their lower expression levels (Supplemental Fig. S11). Compared with conserved exons, these species-specific exons are more frequently located in repetitive regions (79% of them overlap repeat elements, 48 out of 61), particularly SINEs (mainly Alu elements) and LINEs (Supplemental Fig. S12), transposable elements that have been previously associated with the birth of new exons (Makałowski et al. 1994; Sorek et al. 2002; Cordaux and Batzer 2009; Schwartz et al. 2009; Avgan et al. 2019). We then wondered if these exonizations might impact the protein function by inspecting the combinations of protein domains encoded in the transcript (Supplemental Methods). ...
... Particularly, SE events are enriched in species-specific transcript gains, posing this type of AS as a prevalent mechanism in the recent evolution of primate transcriptomes. Taking advantage of the most recent NHP genome assemblies, we have also characterized the emergence of new exons, many of them absent in reference annotations, and confirmed the association between the exonization process and the insertion of transposable elements such as SINEs and LINEs (Makałowski et al. 1994; Sorek et al. 2002; Cordaux and Batzer 2009; Schwartz et al. 2009; Avgan et al. 2019). ...
Article
Transcriptomic diversity greatly contributes to the fundamentals of disease, lineage-specific biology, and environmental adaptation. However, much of the actual isoform repertoire contributing to shaping primate evolution remains unknown. Here, we combined deep long- and short-read sequencing complemented with mass spectrometry proteomics in a panel of lymphoblastoid cell lines (LCLs) from human, three other great apes, and rhesus macaque, producing the largest full-length isoform catalog in primates to date. Around half of the captured isoforms are not annotated in their reference genomes, significantly expanding the gene models in primates. Furthermore, our comparative analyses unveil hundreds of transcriptomic innovations and isoform usage changes related to immune function and immunological disorders. The confluence of these evolutionary innovations with signals of positive selection and their limited impact in the proteome points to changes in alternative splicing in genes involved in immune response as an important target of recent regulatory divergence in primates.
... Alu retrotransposons are very common in primate genomes, being found in more than 1,000,000 copies, covering ~13% of the genome size and present in almost every protein-coding gene intron (International Human Genome Sequencing Consortium 2001). In dozens of reported cases, an Alu sequence was found to be spliced with an upstream exon, resulting in a chimeric peptide (Makałowski et al. 1994). These hybrid proteins are a source of genetic novelty, although their total number in the human genome has not been precisely determined yet. ...
Chapter
Homology has been a contentious topic for discussion for almost 200 years and the debate is ongoing. In its simplest definition, homology means “descended from a common ancestor.” Because of genetic recombination, or the replacement of one kind of character or trait with a different kind that can fulfil the same role, identifying homologs and indeed defining homology in detail is fraught with difficulty. In this chapter, I detail some of the history of the concept and link the concept to its uses in phylogeny reconstruction, developmental biology and networks of gene sharing.
... De novo gene birth is a process that consists of at least two steps (Van Oss & Carvunis, 2019). First, a non-coding sequence obtains an open reading frame (ORF) by overprinting (Grassé, 1977) or exonization (Grassé, 1977; Makalowski & Mitchell, 1994). Second, ORFs are transcribed or translated into mRNAs or proteins (Gubala et al., 2017; Zhang et al., 2019). ...
Article
De novo genes are derived from non‐coding sequences, and they can play essential roles in organisms. Cultivated peanut (Arachis hypogaea) is a major oil and protein crop derived from a cross between A. duranensis and A. ipaensis. However, few de novo genes have been documented in Arachis. Here, we identified 381 de novo genes in A. hypogaea cv. Tifrunner based on comparison with five closely related Arachis species. There are distinct differences in gene expression patterns and gene structures between conserved and de novo genes. The identified de novo genes originated from ancestral sequence regions associated with metabolic and biosynthetic processes, and they were subsequently integrated into existing regulatory networks. De novo paralogs and homoeologs were identified in A. hypogaea cv. Tifrunner. De novo paralogs and homoeologs with conserved expression have mismatching cis‐acting elements under normal growth conditions. De novo genes potentially have pluripotent functions in responses to biotic stresses as well as in growth and development based on quantitative trait loci data. This work provides a foundation for future research examining gene birth process and gene function in Arachis and related taxa.
... In some cases, pseudoexons arise from intronic transposable elements; long interspersed nuclear elements (LINE) and short interspersed nuclear elements (SINE), and from antisense Alu elements in particular (Corvelo & Eyras, 2008; Vořechovský, 2010). The consensus sequence of antisense Alu carries putative pseudo splice sites and poly(U)-tracts deriving from a sense poly(A)-tail and an A-rich linker (Makałowski et al., 1994; Sorek, 2007), which can function as polypyrimidine tracts at 3′ss. ...
Article
It is now widely accepted that aberrant splicing of constitutive exons is often caused by mutations affecting cis-acting splicing regulatory elements, but there is a misconception that all exons have an equal dependency on SREs and thus a similar susceptibility to aberrant splicing. We investigated exonic mutations in ACADM exon 5 to experimentally examine their effect on splicing and found that seven out of eleven tested mutations affected exon inclusion, demonstrating that this constitutive exon is particularly vulnerable to exonic splicing mutations. Employing ACADM exon 5 and 6 as models, we demonstrate that the balance between splicing enhancers and silencers, flanking intron length, and flanking splice site strength are important factors that determine exon definition and splicing efficiency of the exon in question. Our study shows that two constitutive exons in ACADM have different inherent vulnerabilities to exonic splicing mutations. This suggests that in silico prediction of potential pathogenic effects on splicing from exonic mutations may be improved by also considering the inherent vulnerability of the exon. Moreover, we show that SNPs that affect either of two different exonic splicing silencers, located far apart in exon 5, all protect against both immediately flanking and more distant exonic splicing mutations.
... In some cases, pseudoexons arise from intronic transposable elements; long interspersed nuclear elements (LINE) and short interspersed nuclear elements (SINE), and from antisense Alu elements in particular (Corvelo & Eyras, 2008; Vořechovský, 2010). The consensus sequence of antisense Alu carries putative pseudo splice sites and poly(U)-tracts deriving from a sense poly(A)-tail and an A-rich linker (Makałowski et al., 1994; Sorek, 2007), which can function as polypyrimidine tracts at 3′ss. ...
Article
Accuracy of pre-mRNA splicing is crucial for normal gene expression. Complex regulation supports the spliceosomal distinction between authentic exons and the many seemingly functional splice sites delimiting pseudoexons. Pseudoexons are nonfunctional intronic sequences that can be activated for aberrant inclusion in mRNA, which may cause disease. Pseudoexon activation is very challenging to predict, in particular when activation occurs by sequence variants that alter the splicing regulatory environment without directly affecting splice sites. Because pseudoexon inclusion often evades detection due to activation of nonsense-mediated mRNA decay, and because conventional diagnostic procedures miss deep intronic sequence variation, pseudoexon activation is a heavily underreported disease mechanism. Pseudoexon characteristics have mainly been studied based on in silico predicted sequences. Moreover, because recognition of sequence variants that create or strengthen splice sites is possible by comparison with well-established consensus sequences, this type of pseudoexon activation is by far the most frequently reported. Here we review all known human disease-associated pseudoexons that carry functional splice sites and are activated by deep intronic sequence variants located outside splice site sequences. We delineate common characteristics that make this type of wild-type pseudoexons distinct high-risk sites in the human genome.
Article
Growing evidence indicates that transposable elements (TEs) play important roles in evolution by providing genomes with coding and non-coding sequences. Identification of TE-derived functional elements, however, has relied on TE annotations in individual species, which limits its scope to relatively intact TE sequences. Here, we report a novel approach to uncover previously unannotated degenerate TEs (degTEs) by probing multiple ancestral genomes reconstructed from hundreds of species. We applied this method to the human genome and achieved a 10.8% increase in coverage over the most recent annotation. Further, we discovered that degTEs contribute to various cis-regulatory elements and transcription factor binding sites, including those of a known TE-controlling family, the KRAB zinc-finger proteins. We also report unannotated chimeric transcripts between degTEs and human genes expressed in embryos. This study provides a novel methodology and a freely available resource that will facilitate the investigation of TE co-option events on a full scale.
Article
Eukaryotic mRNAs and lncRNA exons are often small compared to introns. The exon definition model predicts that exons splice autonomously, dependent on proximal exon sequence features, explaining their delineation within large introns. This model has not been examined on a genome-wide scale, however, leaving open the question of how often mRNA and lncRNA exons are autonomous. It is also unknown how frequently such exons can arise by chance. Here, we directly assayed large fragments (500-1000 bp) of the human genome by exon trapping, which detects exons spliced into a heterologous transgene, here designed with a large intron context. We define these exons as "autonomous". We obtained ~1.25 million exons, including most known mRNA and well-annotated lncRNA internal exons, demonstrating that human exons are predominantly autonomous. mRNA exons are trapped with highest efficiency. Nearly a million of the trapped exons are unannotated, most located in intergenic regions and antisense to mRNA, with depletion from the forward strand of introns. These exons are not conserved, indicating they are nonfunctional and likely arose from random mutations. They are nonetheless highly enriched with known splicing promoting sequence features delineating known exons. Novel autonomous exons are more abundant than annotated lncRNA exons, and computational models also indicate they will occur with similar frequency in any randomly generated sequence. These results show that most human coding exons splice autonomously, and provide an explanation for the existence of many unconserved lncRNAs, as well as a new annotation and inclusion levels of spliceable loci in the human genome.
Article
Since the discovery of the first transposon by Dr. Barbara McClintock, the prevalence and diversity of transposable elements (TEs) have been gradually recognized. As fundamental genetic components, TEs drive organismal evolution not only by contributing functional sequences (e.g., regulatory elements or "controllers" as phrased by Dr. McClintock) but also by shuffling genomic sequences. In the latter respect, TE-mediated gene duplications have contributed to the origination of new genes and attracted extensive interest. In response to the development of this field, we herein attempt to provide an overview of TE-mediated duplication by focusing on common rules emerging across duplications generated by different TE types. Specifically, despite the huge divergence of transposition machinery across TEs, we identify three common features of various TE-mediated duplication mechanisms, including end bypass, template switching, and recurrent transposition. These three features lead to one common functional outcome, namely, TE-mediated duplicates tend to be subjected to exon shuffling and neofunctionalization. Therefore, the intrinsic properties of the mutational mechanism constrain the evolutionary trajectories of these duplicates. We also discuss the future of this field including an in-depth characterization of both the duplication mechanisms and functions of TE-mediated duplicates.
Article
Since formation of the first proto-eukaryotes, gene repertoire and genome complexity have significantly increased. Among genetic elements responsible for this increase are tandem repeats. Here we describe a genome-wide analysis of large tandem repeats, called megasatellites, in 58 vertebrate genomes. Two bursts occurred, one after the radiation between Agnatha and Gnathostomata fishes and the second one in therian mammals. Megasatellites are enriched in subtelomeric regions and frequently encoded in genes involved in transcription regulation, intracellular trafficking, and cell membrane metabolism, reminiscent of what is observed in fungus genomes. The presence of many introns within young megasatellites suggests that an exon-intron DNA segment is first duplicated and amplified before accumulation of mutations in intronic parts partially erases the megasatellite in such a way that it becomes detectable only in exons. Our results suggest that megasatellite formation and evolution is a dynamic and still ongoing process in vertebrate genomes.
Article
We have identified an alternatively spliced form of the human beta 1 integrin subunit, beta 1S. The beta 1S mRNA is expressed in human platelets, HEL and K562 erythroleukemia cell lines, and THP1 monocytic and HL60 promyelocytic cell lines. It is undetectable in peripheral blood lymphocytes and cultured umbilical vein endothelial cells at early passages. The beta 1S cDNA encodes a new cytoplasmic domain distinct from the previously reported alternative cytoplasmic domain of the beta 1 subunit. The sequence reveals the presence of an insert of 116 nucleotides which produces a frame shift in the previously reported 3' end of the beta 1 integrin subunit and codes for a unique 48-amino acid COOH-terminal sequence. An antiserum prepared against a synthetic peptide generated from the deduced sequence of the beta 1S cytoplasmic domain immunoprecipitated an HEL cell surface molecule that comigrated with the usual beta 1 subunit in sodium dodecyl sulfate electrophoresis. The immunoprecipitation indicated that beta 1S constitutes a minor portion of total beta 1 subunit in these cells. This variant beta 1 cytoplasmic tail may modulate integrin affinity and may provide additional modes for the transduction of extracellular signals and modulation of cytoskeletal organization by beta 1 integrins.
Article
The θ1 globin gene is an α globin-like gene, and started to diverge from the other members of the α globin family 260 million years ago. DNA sequencing and transcriptional analysis indicated that it is functional in erythroid cells of the higher primates, but not in prosimians and rabbit. The θ1 promoter region of higher primates including man consists of GC-rich sequences characteristic of housekeeping gene promoters, and CCAAT and TATA boxes located further upstream. It is shown here that the housekeeping gene promoter-like region of human θ1 contains two tandemly arranged, GC-rich motifs (GC-I and GC-II). Of these, GC-II interacts with nuclear factor(s) present in the globin-expressing, erythroleukemia cell line K562, before and after hemin induction. GC-I, however, interacts with nuclear factor(s) only present in hemin-induced K562 cells. These factors are different from previously reported erythroid cell-specific factors, and are not detectable in non-erythroid HeLa cells. Furthermore, the sequence of the motif GC-I and its location relative to ATG codon have been conserved among all known mammalian θ1 globin genes. Finally, and most interestingly, the CCAAT box of θ1 is contained within a 38 bp internal segment of Alu repeat sequence. Immediately upstream from this CCAAT box-containing Alu repeat segment is a 241 bp Alu repeat pointing in the opposite direction. The conservation of this novel arrangement among the higher primates suggests that an inserted Alu family repeat and its flanking genomic sequence have co-evolved, for at least 30 million years, to provide the canonical CCAAT and TATA promoter elements of the θ1 globin genes in higher primates.
Article
The family of protein kinases includes many oncogenes and growth-factor receptors, as well as genes that are involved in cell-cycle regulation. We have identified protein kinases expressed in a human breast-cancer cell line, 600PEI, and a primary human breast carcinoma, using PCR cloning techniques based on consensus sequences in the kinase domain. Twenty-five different protein kinases were isolated, including 3 novel putative tyrosine kinases (designated TK1, TK2, and TK5), and 2 novel putative cell-cycle-associated serine/threonine kinases (designated STK1 and STK2). TK1 is a new member of the src family of kinases that is expressed predominantly in epithelial cells. TK2 is homologous to the receptor kinase, HEK, and TK5 appears to be another member of the JAK family of kinases. The novel serine/threonine kinases, designated STK1 and STK2, were homologous to the human cdc2 and the Aspergillus nimA genes. We subsequently analyzed the levels of expression of all of these protein kinases in a panel of human breast carcinomas, using PCR-based methods. This analysis revealed different expression profiles in different primary breast carcinomas and, therefore, may determine new molecular sub-sets of human.
Article
Alu repeats are short interspersed elements whose transposition has led to genetic variability and heritable disorders in humans. A select subset of the nearly one million Alu sequences in human DNA actually produces new transpositions. The evolution of newly inserted Alu repeats is currently a key subject for study. Mechanisms of RNA polymerase III activity and the sequence environment into which an Alu inserts might select for transcriptional and posttranscriptional determinants of Alu transposition.
Article
The analysis of species-specific subfamilies of both the LINE and SINE mammalian repetitive DNA families suggests that such subfamilies have arisen by amplification of an extremely small group of 'master' genes. In contrast to the master genes, the vast majority of both SINEs and LINEs appear to behave like pseudogenes in their inability to undergo extensive amplification.
Article
The search for significant local similarities with known protein sequences is a powerful method for interpreting anonymous cDNA sequences or locating coding exons within genomic DNA sequences at a stage where the average contig size is still very small. The BLASTx program, implemented on the National Center for Biotechnology Information server, allows a sensitive search of all putative translations of a nucleotide query sequence against all known proteins in a matter of seconds. From an analysis of the current databases, I report a set of protein sequences exhibiting high local similarity to Alu repeat or vector sequences. These entries can lead to misleading interpretations of similarity searches. During the course of this study, the protease of a human spumaretrovirus was found to have integrated the 3' end half of the U2 snRNA.
Article
The human cholinesterase (ChE) gene from a patient with acholinesterasemia was cloned and analyzed. By using ChE cDNA as a probe, four independent clones were isolated from a genomic library constructed from the patient's DNA. Sequencing analysis of all four clones revealed that exon 2 of the ChE gene was disrupted by a 342-base-pair (bp) insertion of an Alu element, including a poly(A) tract of 38 bp, which showed 93% sequence homology with a current type of human Alu consensus sequence. Southern blot analysis showed that the Alu insertion occurred in both alleles of the patient and was inherited in the patient's family. This Alu insertion was flanked by a 15-bp target site duplication in exon 2 corresponding to positions 1062-1076 of ChE cDNA, indicating that an Alu element could have been integrated by retrotransposition. Thus, this case provides an important clue to the mechanism of inactivation of a gene by integration of a retrotransposon.
Article
The existing classification of human Alu sequences is revised and expanded using a novel methodology and a larger set of sequence data. Our study confirms that there are two major Alu subfamilies, Alu-J and Alu-S. The Alu-S subfamily consists of at least five distinct subfamilies referred to as Alu-Sx, Alu-Sq, Alu-Sp, Alu-Sc, and Alu-Sb. The Alu-Sp and Alu-Sq subfamilies have been revealed by this study. Alu subfamilies differ from one another in a number of positions called diagnostic. In this paper the diagnostic positions are defined in quantitative terms and are used to evaluate statistical significance of the observed subfamilies. Each Alu subfamily most likely represents pseudogenes retroposed from evolving functional source Alu genes. Evidence presented in this paper indicates that Alu-Sp and Alu-Sc pseudogenes were retroposed from different source genes, during overlapping periods of time, and at different rates. Our analysis also indicates that the previously identified Alu-type transcript BC200 comes from an active Alu gene that might have existed even before the origin of dimeric Alu sequences. The source genes for Alu pseudogene families are reconstructed. It is assumed that diagnostic differences between reconstructed source genes reflect mutations that have occurred in true source Alu genes under natural selection. Some of these mutations are compensatory and are used to reconstruct a common secondary structure of Alu RNAs transcribed from the source genes. The biological function of Alu RNA is discussed in the context of its homology to the elongation-arresting domain of 7SL RNA. (Abstract truncated at 250 words.)
Article
Two truncated human C5 clones, pHC5A and pHC5B, were isolated from an adult human liver cDNA library, and contained inserts of 2930 and 2181 bp, respectively. Both clones were polyadenylated and encoded the 5'-end of the C5 pro-molecule, thereby completing the human pro-C5 cDNA sequence. However, near the 3'-ends, at exon/intron boundaries, the nucleotide sequences of pHC5A and pHC5B diverged from each other and from the full-length 6.0-kb C5 cDNA sequence. Clone pHC5A, which overlapped the first human C5 clone described (J-16), encoded most of the C5 signal peptide, the complete beta-chain, the linker peptide, 177 amino acids of the alpha-chain, and contained 144 bp of Alu family consensus sequence encoding 48 amino acids of divergent protein sequence in an open reading frame. Clone pHC5B encoded the entire C5 signal peptide, the beta-chain, the linker peptide, nine amino acids of the alpha-chain, and six amino acids of divergent protein sequence in an open reading frame. Northern blot experiments demonstrated the presence of a 3.0-kb truncated C5 mRNA in adult human liver and a 4.8-kb truncated C5 mRNA in HepG2 cells in addition to the 6.0-kb full-length transcript. Truncated C5 mRNAs were not detected in Raji, MOLT-4, human fibroblast or U937 cells, although the full-length 6.0-kb transcript was seen in MOLT-4 cells. Southern blot analyses indicated that the human C5 structural gene is large, complex, and is present in the human genome in a single copy, thereby demonstrating that the truncated C5 clones and mRNAs are derived from a single C5 gene by alternative processing events.