Evidence relevant to the annotation of different types of genes

Source publication
Article
Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship o...

Context in source publication

Context 1
... in transcriptome sequencing. The emergence of new transcript sequencing technologies has supported new approaches for detecting genes and transcripts along the human genome (see Table 1). The first of these next-generation sequencing technologies, RNA-seq (163), was based on Solexa (12) (now Illumina) sequencing and provided significantly higher depth (i.e., more sequenced molecules) than Sanger cDNA reads but with much shorter reads. ...

Citations

... These data suggest that most human genes should produce multiple protein isoforms, roughly three or four proteins per gene. However, some RNA pundits consider that a significant number of AS products may be noise or be precarious [9,15–18] and that alternative initiation of transcription is largely nonadaptive [19]. Results of some proteomics studies suggest that each gene has a principal protein form [7,9,15], which in our opinion should be the wild type (Wt). ...
... Why there are so many AS products is still a question debated in molecular biology, and some RNA pundits consider most AS products noisy or precarious [9,15–18]. Indeed, in our long-term studies of AS using traditional approaches, we have experienced that some RT-PCR products shown as weak bands in agarose gels have been confirmed with sequencing as unreported RNA variants, but they were irreproducible in iterated experiments. ...
Article
We determined the RNA spectrum of the human RSK4 (hRSK4) gene (also called RPS6KA6) and identified 29 novel mRNA variants derived from alternative splicing, which, plus the NCBI-documented ones and the five we reported previously, totaled 50 hRSK4 RNAs that, by our bioinformatics analyses, encode 35 hRSK4 protein isoforms of 35–762 amino acids. Many of the mRNAs are bicistronic or tricistronic for hRSK4. The NCBI-normalized NM_014496.5 and the protein it encodes are designated herein as the Wt-1 mRNA and protein, respectively, whereas the NM_001330512.1 and the long protein it encodes are designated as the Wt-2 mRNA and protein, respectively. Many of the mRNA variants responded differently to different situations of stress, including serum starvation, a febrile temperature, treatment with ethanol or ethanol-extracted clove buds (an herbal medicine), whereas the same stressed situation often caused quite different alterations among different mRNA variants in different cell lines. Moxifloxacin, an antibiotic and also a functional inhibitor of hRSK4, could inhibit the expression of certain hRSK4 mRNA variants. The hRSK4 gene likely uses alternative splicing as a handy tool to adapt to different stressed situations, and the mRNA and protein multiplicities may partly explain the incongruous literature on its expression and comports.
... Advances in genomic technology have made it possible to analyze genetic data at a faster rate and with greater accuracy than ever before, being implemented in human anatomic studies, among others. However, the sheer amount of data generated by genomics studies can be overwhelming, and analysis and interpretation can be challenging [71–74]. In addition, in the era of data exchange and strict laws about privacy, there may be ethical concerns around the collection and use of genetic data, particularly when it comes to human subjects without informed consent [75–77]. ...
Article
Anatomic studies have traditionally relied on macroscopic, microscopic, and histological techniques to investigate the structure of tissues and organs. Anatomic studies are essential in many fields, including medicine, biology, and veterinary science. Advances in technology, such as imaging techniques and molecular biology, continue to provide new insights into the anatomy of living organisms. Therefore, anatomy remains an active and important area in the scientific field. The consolidation in recent years of some omics technologies such as genomics, transcriptomics, proteomics, and metabolomics allows for a more complete and detailed understanding of the structure and function of cells, tissues, and organs. These have been joined more recently by "omics" such as radiomics, pathomics, and connectomics, supported by computer-assisted technologies such as neural networks, 3D bioprinting, and artificial intelligence. All these new tools, although some are still in the early stages of development, have the potential to strongly contribute to the macroscopic and microscopic characterization in medicine. For anatomists, it is time to hitch a ride and get on board omics technologies to sail to new frontiers and to explore novel scenarios in anatomy.
... Additionally, high-dimensional issues are easily introduced by the general representation method. The challenge of study is how to efficiently represent sequence attributes and analyze high dimensional data [19]. ...
Article
Hashimoto’s thyroiditis is an autoimmune disorder characterized by the destruction of thyroid cells through immune-mediated mechanisms involving cells and antibodies. The condition can trigger disturbances in metabolism, leading to the development of other autoimmune diseases, known as concomitant diseases. Multiple concomitant diseases may coexist in a single individual, making it challenging to diagnose and manage them effectively. This study aims to propose a novel hybrid algorithm that classifies concomitant diseases associated with Hashimoto’s thyroiditis based on sequences. The approach involves building distinct prediction models for each class and using the output of one model as input for the subsequent one, resulting in a dynamic decision-making process. Genes associated with concomitant diseases were collected alongside those related to Hashimoto’s thyroiditis, and their sequences were obtained from the NCBI site in fasta format. The hybrid algorithm was evaluated against common machine learning algorithms and their various combinations. The experimental results demonstrate that the proposed hybrid model outperforms existing classification methods in terms of performance metrics. The significance of this study lies in its two distinctive aspects. Firstly, it presents a new benchmarking dataset that has not been previously developed in this field, using diverse methods. Secondly, it proposes a more effective and efficient solution that accounts for the dynamic nature of the dataset. The hybrid approach holds promise in investigating the genetic heterogeneity of complex diseases such as Hashimoto’s thyroiditis and identifying new autoimmune disease genes. Additionally, the results of this study may aid in the development of genetic screening tools and laboratory experiments targeting Hashimoto’s thyroiditis genetic risk factors. 
New software, models, and techniques for computing, including systems biology, machine learning, and artificial intelligence, are used in our study.
... Venus [57] or PhyreRisk [58]. These annotations provide valuable information to gain a deeper understanding of the relationships between protein structure and function [59], which can be used to link structural and functional data on a genome-wide scale [60]. By integrating these annotations in PDBx/mmCIF files, it becomes easier to ...
(Fig. 4: Category relationship diagram including new SIFTS specific PDBx/mmCIF categories.)
Article
More than 61,000 proteins have up-to-date correspondence between their amino acid sequence (UniProtKB) and their 3D structures (PDB), enabled by the Structure Integration with Function, Taxonomy and Sequences (SIFTS) resource. SIFTS incorporates residue-level annotations from many other biological resources. SIFTS data are available in formats such as XML, CSV, and TSV, and are also accessible via the PDBe REST API, but they have always been maintained separately from the structure data (PDBx/mmCIF file) in the PDB archive. Here, we extended the wwPDB PDBx/mmCIF data dictionary with additional categories to accommodate SIFTS data and added the UniProtKB, Pfam, SCOP2, and CATH residue-level annotations directly into the PDBx/mmCIF files from the PDB archive. With the integrated UniProtKB annotations, these files now provide consistent numbering of residues in different PDB entries, allowing easy comparison of structure models. The extended dictionary yields a more consistent, standardised metadata description without altering the core PDB information. This development enables up-to-date cross-reference information at the residue level, resulting in better data interoperability and supporting improved data analysis and visualisation.
... The notion of errors in public protein databases is a recurrent problem [30–32] and substantial efforts have been invested to identify and correct genome annotation errors [33–35]. Some important causes of erroneous protein sequences have been identified, including the genome sequence quality and gene structure complexity [36], as well as redundant or conflicting information in different resources or in the literature [32,37]. ...
Article
In fungi, the most abundant transcription factor (TF) class contains a fungal-specific ‘GAL4-like’ Zn2C6 DNA binding domain (DBD), while the second class contains another fungal-specific domain, known as ‘fungal_trans’ or middle homology domain (MHD), whose function remains largely uncharacterized. Remarkably, almost a third of MHD-containing TFs in public sequence databases apparently lack DNA binding activity, since they are not predicted to contain a DBD. Here, we reassess the domain organization of these ‘MHD-only’ proteins using an in silico error-tracking approach. In a large-scale analysis of ~17,000 MHD-only TF sequences present in all fungal phyla except Microsporidia and Cryptomycota, we show that the vast majority (>90%) result from genome annotation errors and we are able to predict a new DBD sequence for 14,261 of them. Most of these sequences correspond to a Zn2C6 domain (82%), with a small proportion of C2H2 domains (4%) found only in Dikarya. Our results contradict previous findings that the MHD-only TF are widespread in fungi. In contrast, we show that they are exceptional cases, and that the fungal-specific Zn2C6–MHD domain pair represents the canonical domain signature defining the most predominant fungal TF family. We call this family CeGAL, after the highly characterized members: Cep3, whose 3D structure is determined, and GAL4, a eukaryotic TF archetype. We believe that this will not only improve the annotation and classification of the Zn2C6 TF but will also provide critical guidance for future fungal gene regulatory network analyses.
... The completion of the genome of the multicellular organism Caenorhabditis elegans (Nematoda) [1] marked the beginning of a revolution in genomics [2], associated with major advances in sequencing, informatic and functional genomic technologies [3]. The massive expansion in the numbers of genomic, transcriptomic and proteomic data sets resulting from these advances has demanded efficient and reliable in silico (bioinformatic) methods for the assembly of genomes, prediction of genes and the annotation of genes and their products [4–6]. ...
Article
Major advances in genomic and associated technologies have demanded reliable bioinformatic tools and workflows for the annotation of genes and their products via comparative analyses using well-curated reference data sets, accessible in public repositories. However, the accurate in silico annotation of molecules (proteins) encoded in organisms (e.g., multicellular parasites) which are evolutionarily distant from those for which these extensive reference data sets are available, including invertebrate model organisms (e.g., Caenorhabditis elegans – free-living nematode, and Drosophila melanogaster – the vinegar fly) and vertebrate species (e.g., Homo sapiens and Mus musculus), remains a major challenge. Here, we constructed an informatic workflow for the enhanced annotation of biologically-important, excretory/secretory (ES) proteins (“secretome”) encoded in the genome of a parasitic roundworm, called Haemonchus contortus (commonly known as the barber’s pole worm). We critically evaluated the performance of five distinct methods, refined some of them, and then combined the use of all five methods to comprehensively annotate ES proteins, according to gene ontology, biological pathways and/or metabolic (enzymatic) processes. Then, using optimised parameter settings, we applied this workflow to comprehensively annotate 2591 of all 3353 proteins (77.3%) in the secretome of H. contortus. This result is a substantial improvement (10–25%) over previous annotations using individual, “off-the-shelf” algorithms and default settings, indicating the ready applicability of the present, refined workflow to gene/protein sequence data sets from a wide range of organisms in the Tree-of-Life.
... The most recent research projects in scientific field of eutherian comparative genomics included intentions to sequence every extant eutherian species genome in foreseeable future, so that future revisions and updates of eutherian gene data sets were expected [1–13]. For example, the human protein coding gene census remained unfinished: contemporary estimates included about 20,000-21,000 protein coding genes in human genome [14–27]. In addition, the proven utility of public eutherian reference genomic sequences could become compromised by potential genomic sequence errors, including analytical and bioinformatical errors, as well as Sanger DNA sequencing method errors [28–33]. ...
Article
Objectives: The most recent research projects in the scientific field of eutherian comparative genomics included intentions to sequence every extant eutherian species genome in the foreseeable future, so that future revisions and updates of eutherian gene data sets were expected. Data description: Using 35 public eutherian reference genomic sequence assemblies and freely available software, the eutherian comparative genomic analysis protocol RRID:SCR_014401 was published as guidance against potential genomic sequence errors. The protocol curated 14 eutherian third-party data gene data sets, including, in aggregate, 2615 complete coding sequences that were deposited in the European Nucleotide Archive. The published eutherian gene collections were used in revisions and updates of eutherian gene data set classifications and nomenclatures that included gene annotations, phylogenetic analyses and protein molecular evolution analyses.
... The progression of most tumors is typically orchestrated by a widespread reprogramming of transcription in the genome. A key to the functional annotation of the human genome is to understand the regulation of transcription (81–83), of which the epigenetic activity of chromatin is of fundamental importance (84–86). Dysregulated RNA expression can serve as a prognostic marker and a therapeutic target (87). ...
Article
Despite being a member of the chromodomain helicase DNA-binding protein family, little is known about the exact role of CHD6 in chromatin remodeling or cancer disease. Here we show that CHD6 binds to chromatin to promote broad nucleosome eviction for transcriptional activation of many cancer pathways. By integrating multiple patient cohorts for bioinformatics analysis of over a thousand prostate cancer datasets, we found CHD6 expression elevated in prostate cancer and associated with poor prognosis. Further comprehensive experiments demonstrated that CHD6 regulates oncogenicity of prostate cancer cells and tumor development in a murine xenograft model. ChIP-Seq for CHD6, along with MNase-Seq and RNA-Seq, revealed that CHD6 binds on chromatin to evict nucleosomes from promoters and gene bodies for transcriptional activation of oncogenic pathways. These results demonstrated a key function of CHD6 in evicting nucleosomes from chromatin for transcriptional activation of prostate cancer pathways.
... Particularly in intronic regions, current human annotation files have many overlapping genes. This presented a problem for properly annotating reads in these regions (61,62). If an RNA-seq read maps to a genomic location where multiple genes are annotated and overlapping, that read will be thrown out because it is unable to be singularly assigned. ...
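The discard rule described in this excerpt can be illustrated with a minimal sketch. It is not the cited authors' pipeline: the `Gene` record, the `assign_read` and `count_reads` helpers, the half-open coordinates, and the toy gene names are all hypothetical, and strand and splicing are ignored. A read is counted only when it overlaps exactly one annotated gene; a read overlapping several genes is dropped, as is a read overlapping none.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    """Hypothetical annotation record; coordinates are half-open [start, end)."""
    name: str
    start: int
    end: int


def assign_read(read_start: int, read_end: int, genes: List[Gene]) -> Optional[str]:
    """Return the single gene a read overlaps, or None if zero or several genes match."""
    hits = [g.name for g in genes if g.start < read_end and read_start < g.end]
    # A read that cannot be assigned to exactly one gene is discarded.
    return hits[0] if len(hits) == 1 else None


def count_reads(reads: Iterable[Tuple[int, int]], genes: List[Gene]) -> Counter:
    """Count uniquely assigned reads per gene; unassigned reads go under 'discarded'."""
    counts: Counter = Counter()
    for start, end in reads:
        gene = assign_read(start, end, genes)
        counts[gene if gene is not None else "discarded"] += 1
    return counts


# Toy annotation: GENE_B lies entirely inside GENE_A (e.g. within one of its introns).
GENES = [Gene("GENE_A", 100, 500), Gene("GENE_B", 300, 400)]
READS = [(120, 170), (320, 370), (450, 480), (600, 650)]
COUNTS = count_reads(READS, GENES)
```

In this toy case the read at 320 to 370 falls where both genes are annotated and is lost to both, which is the effect the excerpt describes for overlapping intronic annotations.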
Article
The poly(A)-tail appended to the 3′-end of most eukaryotic transcripts plays a key role in their stability, nuclear transport, and translation. These roles are largely mediated by Poly(A) Binding Proteins (PABPs) that coat poly(A)-tails and interact with various proteins involved in the biogenesis and function of RNA. While it is well-established that the nuclear PABP (PABPN) binds newly synthesized poly(A)-tails and is replaced by the cytoplasmic PABP (PABPC) on transcripts exported to the cytoplasm, the distribution of transcripts for different genes or isoforms of the same gene on these PABPs has not been investigated on a genome-wide scale. Here, we analyzed the identity, splicing status, poly(A)-tail size, and translation status of RNAs co-immunoprecipitated with endogenous PABPN or PABPC in human cells. At steady state, many protein-coding and non-coding RNAs exhibit strong bias for association with PABPN or PABPC. While PABPN-enriched transcripts more often were incompletely spliced and harbored longer poly(A)-tails and PABPC-enriched RNAs had longer half-lives and higher translation efficiency, there are curious outliers. Overall, our study reveals the landscape of RNAs bound by PABPN and PABPC, providing new details that support and advance the current understanding of the roles these proteins play in poly(A)-tail synthesis, maintenance, and function.
... Moreover, GRCh38 was assembled from multiple donors with clone-based sequencing, which creates an excess of artificial haplotype structures that can subtly bias analyses (1,24). Over the years, there have been attempts to replace certain rare alleles with more common alleles, but hundreds of thousands of artificial haplotypes and rare alleles remain to this day (3,25,26). Increasing the continuity, quality, and representativeness of the reference genome is therefore crucial for improving genetic diagnosis, as well as for understanding the complex relationship between genetic and phenotypic variation. ...
Article
Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.