Data

The Universal Protein Resource (UniProt) 2009


Abstract

The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information that is essential for modern biological research. UniProt is produced by the UniProt Consortium, which consists of groups from the European Bioinformatics Institute, the Protein Information Resource and the Swiss Institute of Bioinformatics. The core activities include manual curation of protein sequences assisted by computational analysis, sequence archiving, a user-friendly UniProt website and the provision of additional value-added information through cross-references to other databases. UniProt comprises four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. One of the key achievements of the UniProt Consortium in 2008 is the completion of the first draft of the complete human proteome in UniProtKB/Swiss-Prot. This manually annotated representation of all currently known human protein-coding genes was made available in UniProt release 14.0 with 20 325 entries. UniProt is updated and distributed every three weeks and can be accessed online for searches or downloaded at www.uniprot.org.


... DasR and DasR-EBD were overexpressed in E. coli BL21 (DE3) pREP4::groESL cells (Novagen, EMD Biosciences, Darmstadt, Germany) utilizing pET15b vectors (Novagen, EMD Biosciences, Darmstadt, Germany) that contain either the dasR (residues 1–254; UniProtKB O34817, [19]) or the dasR-ebd gene (residues 88–254). Both constructs display an additional N-terminal hexahistidine tag and a thrombin cleavage site. ...
... (PDF) S5 Fig. Sequence alignment of DasR-EBD from S. coelicolor and NagR-EBD from B. subtilis. The sequence alignment was performed with CLUSTAL OMEGA [59] using the canonical protein sequences of entries Q9K492 and O34817 from the UniProt database [19]. Secondary structure elements refer to the topology of DasR and are marked with (h) or (s) for α-helices and β-strands, respectively. ...
... The listed regulators were identified by a protein structure database search (Dali Lite v.3, [60]) via the Dali server using the crystal structure of full-length DasR (PDB-ID 4ZS8) as a search model. From the resulting structures only those containing a GntR-family-specific wHTH domain as well as a HutC-subfamily-specific UTRA domain (as described in the respective entry in the UniProt database [19]) were used for a subsequent multiple sequence alignment via CLUSTAL OMEGA [59] that is shown in S4 Fig. For a better discrimination, regulators without an individual gene or protein name and mostly of unknown function were given unambiguous acronyms, e.g. ...
Article
Full-text available
Small molecule effectors regulate gene transcription in bacteria by altering the DNA-binding affinities of specific repressor proteins. Although the GntR proteins represent a large family of bacterial repressors, only little is known about the allosteric mechanism that enables their function. DasR from Streptomyces coelicolor belongs to the GntR/HutC subfamily and specifically recognises operators termed DasR-responsive elements (dre-sites). Its DNA-binding properties are modulated by phosphorylated sugars. Here, we present several crystal structures of DasR, namely of dimeric full-length DasR in the absence of any effector and of only the effector-binding domain (EBD) of DasR without effector or in complex with glucosamine-6-phosphate (GlcN-6-P) and N-acetylglucosamine-6-phosphate (GlcNAc-6-P). Together with molecular dynamics (MD) simulations and a comparison with other GntR/HutC family members these data allowed for a structural characterisation of the different functional states of DasR. Allostery in DasR and possibly in many other GntR/HutC family members is best described by a conformational selection model. In ligand-free DasR, an increased flexibility in the EBDs enables the attached DNA-binding domains (DBD) to sample a variety of different orientations and among these also a DNA-binding competent conformation. Effector binding to the EBDs of DasR significantly reorganises the atomic structure of the latter. However, rather than locking the orientation of the DBDs, the effector-induced formation of β-strand β* in the DBD-EBD-linker segment merely appears to take the DBDs 'on a shorter leash' thereby impeding the 'downwards' positioning of the DBDs that is necessary for a concerted binding of two DBDs of DasR to operator DNA.
... A supporting factor for the possibility of drug repositioning is the concept of "polypharmacology", i.e. individual drugs interacting with multiple targets rather than a single target. For example, the drug nicotine (DrugBank ID DB00184) interacts with 10 targets (UniProt IDs P17787, P30532, P30926, P32297, P43681, Q05901, Q15822, Q15825, Q9G226 and Q9UGM1) [2]. Reducing side effects is a challenge because one drug interacts with more than a single target. ...
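Accession lists like the one in the excerpt above are easy to mistype, so a format check can be a useful first filter. The sketch below is illustrative only and is not taken from the cited work: it applies the accession pattern that UniProt documents for UniProtKB accession numbers to the ten identifiers quoted in the excerpt. The helper name `is_uniprot_accession` is hypothetical, and a well-formed accession is not proof that the entry exists.

```python
import re

# Pattern documented by UniProt for the shape of UniProtKB accession numbers.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(text: str) -> bool:
    """Return True if text has the shape of a UniProtKB accession.

    This is a format check only; it does not query UniProt.
    """
    return UNIPROT_ACCESSION.fullmatch(text) is not None


# Target accessions exactly as listed in the excerpt above.
nicotine_targets = [
    "P17787", "P30532", "P30926", "P32297", "P43681",
    "Q05901", "Q15822", "Q15825", "Q9G226", "Q9UGM1",
]

# Keep only the identifiers that pass the format check.
well_formed = [acc for acc in nicotine_targets if is_uniprot_accession(acc)]
print(f"{len(well_formed)} of {len(nicotine_targets)} identifiers are well formed")
```

A DrugBank identifier such as DB00184 does not match the pattern, so the same check can help separate drug and target identifiers in a mixed list.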
... Similar to SPOT-1D-base and MUFOLD-SS, our base model contains 57 features from PSSM profiles, HHM profiles and physicochemical properties. To generate PSSM, PSI-BLAST [56] was run against Uniref90 database [57] with inclusion threshold 0.001 and three iterations. The HHM profiles were generated using HHblits [58] using default parameters against uniprot20 2013 03 sequence database, which can be downloaded from http://wwwuser.gwdg.de/~compbiol/data/hhsuite/ ...
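The PSI-BLAST settings quoted in the excerpt above (three iterations, inclusion threshold 0.001) could be expressed as a command line along the following lines. This is a minimal sketch, assuming the BLAST+ `psiblast` program and its `-num_iterations`, `-inclusion_ethresh` and `-out_ascii_pssm` options; the excerpt does not say which PSI-BLAST release was used, and the file names and database name here are placeholders.

```python
import shlex


def psiblast_command(query_fasta, database, pssm_out,
                     iterations=3, inclusion_ethresh=0.001):
    """Build (but do not run) a BLAST+ psiblast argument list.

    Defaults mirror the settings quoted in the excerpt; paths are
    supplied by the caller.
    """
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_ethresh),
        "-out_ascii_pssm", pssm_out,  # write the PSSM as a text file
    ]


# Placeholder file and database names, for illustration only.
cmd = psiblast_command("protein.fasta", "uniref90", "protein.pssm")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute the search, provided BLAST+ is installed and a formatted UniRef90 database is available locally.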
Preprint
Full-text available
Motivation: Protein structures provide basic insight into how they can interact with other proteins, their functions and biological roles in an organism. Experimental methods (e.g., X-ray crystallography, nuclear magnetic resonance spectroscopy) for predicting the secondary structure (SS) of proteins are very expensive and time-consuming. Therefore, developing efficient computational approaches for predicting the secondary structure of proteins is of utmost importance. Advances in developing highly accurate SS prediction methods have mostly been focused on 3-class (Q3) structure prediction. However, 8-class (Q8) resolution of secondary structure contains more useful information and is much more challenging than the Q3 prediction. Results: We present SAINT, a highly accurate method for Q8 structure prediction, which incorporates a self-attention mechanism (a concept from natural language processing) with the Deep Inception-Inside-Inception (Deep3I) network in order to effectively capture both the short-range and long-range interactions among the amino acid residues. SAINT offers a more interpretable framework than the typical black-box deep neural network methods. Through an extensive evaluation study, we report the performance of SAINT in comparison with the existing best methods on a collection of benchmark datasets, namely, TEST2016, TEST2018, CASP12 and CASP13. Our results suggest that the self-attention mechanism improves the prediction accuracy and outperforms the existing best alternate methods. SAINT is the first of its kind and offers the best known Q8 accuracy. Thus, we believe SAINT represents a major step towards the accurate and reliable prediction of secondary structures of proteins. Availability: SAINT is freely available as an open-source project at https://github.com/SAINTProtein/SAINT.
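The difference between Q8 and Q3 accuracy mentioned in the abstract above can be illustrated with a few lines of code. This is a sketch and not the SAINT evaluation code: it assumes per-residue state strings, uses "C" for coil, and applies one common 8-to-3 state reduction (H, G, I to helix; E, B to strand; the rest to coil). Other reductions are in use, and the example sequences are invented.

```python
# One common DSSP 8-state to 3-state reduction (an assumption; variants exist).
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",   # helix-like states
    "E": "E", "B": "E",             # strand-like states
    "T": "C", "S": "C", "C": "C",   # turn, bend and coil
}


def accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def to_q3(states: str) -> str:
    """Collapse an 8-state string to 3 states using Q8_TO_Q3."""
    return "".join(Q8_TO_Q3[s] for s in states)


# Invented toy example: ten residues.
observed = "HHHGGTEEBC"
predicted = "HHHHHSEEEC"

q8 = accuracy(predicted, observed)
q3 = accuracy(to_q3(predicted), to_q3(observed))
print(f"Q8 = {q8:.2f}, Q3 = {q3:.2f}")
```

In this toy case every error is a confusion within a single 3-state class, so the prediction scores perfectly on Q3 while missing four of ten residues on Q8, which is why Q8 is the stricter measure.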
... Annotations from the UniProt [50] database were used to determine the number and location of native disulfide bonds for each protein of interest. ...
Article
Full-text available
Amyloidogenic protein aggregation impairs cell function and is a hallmark of many chronic degenerative disorders. Protein aggregation is also a major event during acute injury; however, unlike amyloidogenesis, the process of injury-induced protein aggregation remains largely undefined. To provide this insight, we profiled the insoluble proteome of several cell types after acute injury. These experiments show that the disulfide-driven process of nucleocytoplasmic coagulation (NCC) is the main form of injury-induced protein aggregation. NCC is mechanistically distinct from amyloidogenesis, but still broadly impairs cell function by promoting the aggregation of hundreds of abundant and essential intracellular proteins. A small proportion of the intracellular proteome resists NCC and is instead released from necrotic cells. Notably, the physicochemical properties of NCC-resistant proteins are contrary to those of NCC-sensitive proteins. These observations challenge the dogma that liberation of constituents during necrosis is anarchic. Rather, inherent physicochemical features including cysteine content, hydrophobicity and intrinsic disorder determine whether a protein is released from necrotic cells. Furthermore, as half of the identified NCC-resistant proteins are known autoantigens, we propose that physicochemical properties that control NCC also affect immune tolerance and other host responses important for the restoration of homeostasis after necrotic injury.
... Mitochondrial/prokaryotic type aaRSs found in our analysis, but without the predicted mitochondrial signal sequence were also assigned as putatively mitochondrial. Splice variants from a single gene and any atypical aaRSs found in UniProt [46] were verified via the EnsemblMetazoa transcript database (http://metazoa.ensembl.org/index.html). ...
Article
Full-text available
Helminth parasites are an assemblage of two major phyla of nematodes (also known as roundworms) and platyhelminths (also called flatworms). These parasites are a major human health burden, and infections caused by helminths are considered neglected tropical diseases (NTDs). These infections are typified by limited clinical treatment options and threat of drug resistance. Aminoacyl-tRNA synthetases (aaRSs) are vital enzymes that decode genetic information and enable protein translation. The specific inhibition of pathogen aaRSs bodes well for development of next generation anti-parasitics. Here, we have identified and annotated aaRSs and accessory proteins from Loa loa (nematode) and Schistosoma mansoni (flatworm) to provide a glimpse of these protein translation enzymes within these parasites. Using purified parasitic lysyl-tRNA synthetases (KRSs), we developed a series of assays that address KRS enzymatic activity, oligomeric states, crystal structure and inhibition profiles. We show that L. loa and S. mansoni KRSs are potently inhibited by the fungal metabolite cladosporin. Our co-crystal structure of Loa loa KRS-cladosporin complex reveals key interacting residues and provides a platform for structure-based drug development. This work hence provides a new direction for both novel target discovery and inhibitor development against eukaryotic pathogens that include L. loa and S. mansoni.
... Database Survey. Amino acid sequence (UniProt) and protein structure (PDB) databases were screened for proteins that were orthologous (Figure S1) and structurally similar proteins (Figures 2 and S3) to EndoMS, respectively (UniProt Consortium, 2013). STRING and IntAct databases were referenced for the experimentally detected protein-protein interactions and used to predict the functional relationships of EndoMS (Figure 5B) (Franceschini et al., 2012; Orchard et al., 2013). ...
Article
Archaeal NucS nuclease was thought to degrade the single-stranded region of branched DNA, which contains flapped and splayed DNA. However, recent findings indicated that EndoMS, the orthologous enzyme of NucS, specifically cleaves double-stranded DNA (dsDNA) containing mismatched bases. In this study, we determined the structure of the EndoMS-DNA complex. The complex structure of the EndoMS dimer with dsDNA unexpectedly revealed that the mismatched bases were flipped out into binding sites, and the overall architecture most resembled that of restriction enzymes. The structure of the apo form was similar to the reported structure of Pyrococcus abyssi NucS, indicating that movement of the C-terminal domain from the resting state was required for activity. In addition, a model of the EndoMS-PCNA-DNA complex was preliminarily verified with electron microscopy. The structures strongly support the idea that EndoMS acts in a mismatch repair pathway.
... The particular challenge in this workflow is not the input size but the computational requirements in conjunction with the size of the output, as will become apparent in the following sections. The dataset consists of files downloaded from the online and publicly accessible databases of UniProt [10] and PLAZA [36] and can also be provided by our repositories upon request. The source code of the proposed framework along with the datasets utilized in this work can be found in our repository https://www.github.com/ ...
Article
Full-text available
Life Sciences have been established and widely accepted as a foremost Big Data discipline; as such they are a constant source of the most computationally challenging problems. In order to provide efficient solutions, the community is turning towards scalable approaches such as the utilization of cloud resources in addition to any existing local computational infrastructures. Although bioinformatics workflows are generally amenable to parallelization, the challenges involved are however not only computationally, but also data intensive. In this paper we propose a data management methodology for achieving parallelism in bioinformatics workflows, while simultaneously minimizing data-interdependent file transfers. We combine our methodology with a novel two-stage scheduling approach capable of performing load estimation and balancing across and within heterogeneous distributed computational resources. Beyond an exhaustive experimentation regime to validate the scalability and speed-up of our approach, we compare it against a state-of-the-art high performance computing framework and showcase its time and cost advantages.
... According to their content, databanks can be classified as primary, those storing primary sequences, and derived, which contain information obtained by the analysis of primary sequences. Important protein-related BDB are: UniProt (The Universal Protein Resource) [4]: the biggest bioinformatics database, created by the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). It collects protein sequences of most living beings and viruses from the main publicly available databases and organizes them in a comprehensive, non-redundant database (UniParc). ...
Thesis
High throughput sequencing techniques have had a strong impact on modern biology, widening the gap between sequenced and annotated data. Automatic annotation tools are therefore of the foremost importance to guide biologists' experiments. However, most of the state-of-the-art methods rely on annotation transfer, offering reliable predictions only in homology settings. In this work we present a novel approach to protein feature prediction, which exploits Semantic Based Regularization to inject prior knowledge into the learning process. The experimental results obtained on the yeast genome show that the introduction of the constraints has a positive impact on the overall prediction quality.
... Rice gene annotations were acquired from the Rice Annotation Project Database (RAP-DB) [40], the Michigan State University (MSU) Rice Genome Annotation [41] and UniProt [42]. Chloroplast proteins were identified from uniprot (www.uniprot.org). ...
Article
Full-text available
Background Polyploidy has pivotal influences on rice (Oryza sativa L.) morphology and physiology, and is very important for understanding rice domestication and improving agricultural traits. Diploid (DP) and triploid (TP) rice show differences in morphological parameters, such as plant height, leaf length, leaf width and the physiological index of chlorophyll content. However, the underlying mechanisms determining these morphological differences remain to be defined. To better understand the proteomic changes between DP and TP, tandem mass tags (TMT) mass spectrometry (MS)/MS was used to detect the significant changes to protein expression between DP and TP. Results Results indicated that both photosynthesis and metabolic pathways were highly significantly associated with proteomic alteration between DP and TP based on biological process and pathway enrichment analysis, and 13 higher abundance chloroplast proteins involved in these two pathways were identified in TP. Quantitative real-time PCR analysis demonstrated that 5 of the 13 chloroplast proteins, ATPF, PSAA, PSAB, PSBB and RBL, were of higher abundance in TP compared with those in DP. Conclusions This study integrates morphology, physiology and proteomic profiling alteration of DP and TP to address their underlying different molecular mechanisms. Our findings revealed that ATPF, PSAA, PSAB, PSBB and RBL can induce considerable expression changes in TP and may affect the development and growth of rice through photosynthesis and metabolic pathways.
... More details about how we evaluated performance of each method can be seen in Evaluation of generated synonyms. Besides the rules presented, there are a number of manually curated external mappings from Gene Ontology concepts to other data sources such as UniProt [36], the Brenda database [37], and Wikipedia [38]. To test the usefulness of these mappings as sources of synonyms, we imputed synonyms for the Gene Ontology concept from synonyms of the linked concept in the respective data source. ...
Article
Full-text available
Background: Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text. The construction of highly specific GO terms is formulaic, consisting of parts and pieces from more simple terms. Results: We present two different types of manually generated rules to help capture the variation of how GO terms can appear in natural language text. The first set of rules takes into account the compositional nature of GO and recursively decomposes the terms into their smallest constituent parts. The second set of rules generates derivational variations of these smaller terms and compositionally combines all generated variants to form the original term. By applying both types of rules, new synonyms are generated for two-thirds of all GO terms and an increase in F-measure performance for recognition of GO on the CRAFT corpus from 0.498 to 0.636 is observed. Additionally, we evaluated the combination of both types of rules over one million full text documents from Elsevier; manual validation and error analysis show we are able to recognize GO concepts with reasonable accuracy (88 %) based on random sampling of annotations. Conclusions: In this work we present a set of simple synonym generation rules that utilize the highly compositional and formulaic nature of the Gene Ontology concepts. We illustrate how the generated synonyms aid in improving recognition of GO concepts on two different biomedical corpora. We discuss other applications of our rules for GO ontology quality assurance, explore the issue of overgeneration, and provide examples of how similar methodologies could be applied to other biomedical terminologies. Additionally, we provide all generated synonyms for use by the text-mining community.
... Because of advances in high-throughput technologies, especially in protein mass spectrometry, enormous amounts of data related to PTMs have been obtained. At present, there are multiple databases available for studying PTMs such as UniProt [8], dbPTM [9], PTMCuration [10], PTMcode [11], and PhosphoSitePlus [12]. Among these databases, PhosphoSitePlus is the largest, most frequently updated and curated PTM database which both stores non-redundant information and provides tools for studying PTMs [9, 12, 13]. ...
Article
Full-text available
Background One very important functional domain of proteins is the protein-protein interacting region (PPIR), which forms the binding interface between interacting polypeptide chains. Post-translational modifications (PTMs) that occur in the PPIR can either interfere with or facilitate the interaction between proteins. The ability to predict whether sites of protein modifications are inside or outside of PPIRs would be useful in further elucidating the regulatory mechanisms by which modifications of specific proteins regulate their cellular functions. Results Using two of the comprehensive databases for protein-protein interaction and protein modification site data (PDB and PhosphoSitePlus, respectively), we created new databases that map PTMs to their locations inside or outside of PPIRs. The mapped PTMs represented only 5 % of all known PTMs. Thus, in order to predict localization within or outside of PPIRs for the vast majority of PTMs, a machine learning strategy was used to generate predictive models from these mapped databases. For the three mapped PTM databases which had sufficient numbers of modification sites for generating models (acetylation, phosphorylation, and ubiquitylation), the resulting models yielded high overall predictive performance as judged by a combined performance score (CPS). Among the multiple properties of amino acids that were used in the classification tasks, hydrophobicity was found to contribute substantially to the performance of the final predictive models. Compared to the other classifiers we also evaluated, the SVM provided the best performance overall. Conclusions These models are the first to predict whether PTMs are located inside or outside of PPIRs, as demonstrated by their high predictive performance. The models and data presented here should be useful in prioritizing both known and newly identified PTMs for further studies to determine the functional relationship between specific PTMs and protein-protein interactions.
The implemented R package is available online (http://sysbio.chula.ac.th/PtmPPIR).
... Annotation analysis of the significantly associated genes was performed using the GeneCards, ENTREZ and UniProtKB web portals [33,54]. The MalaCards web site was used to detect association between the genes and hereditary syndromes [55]. ...
Article
Full-text available
Despite intensive research on genetics of the craniofacial morphology using animal models and human craniofacial syndromes, the genetic variation that underpins normal human facial appearance is still largely elusive. Recent development of novel digital methods for capturing the complexity of craniofacial morphology in conjunction with high-throughput genotyping methods, show great promise for unravelling the genetic basis of such a complex trait. As a part of our efforts on detecting genomic variants affecting normal craniofacial appearance, we have implemented a candidate gene approach by selecting 1,201 single nucleotide polymorphisms (SNPs) and 4,732 tag SNPs in over 170 candidate genes and intergenic regions. We used 3-dimensional (3D) facial scans and direct cranial measurements of 587 volunteers to calculate 104 craniofacial phenotypes. Following genotyping by massively parallel sequencing, genetic associations between 2,332 genetic markers and 104 craniofacial phenotypes were tested. An application of a Bonferroni-corrected genome-wide significance threshold produced significant associations between five craniofacial traits and six SNPs. Specifically, associations of nasal width with rs8035124 (15q26.1), cephalic index with rs16830498 (2q23.3), nasal index with rs37369 (5q13.2), transverse nasal prominence angle with rs59037879 (10p11.23) and rs10512572 (17q24.3), and principal component explaining 73.3% of all the craniofacial phenotypes, with rs37369 (5p13.2) and rs390345 (14q31.3) were observed. Due to over-conservative nature of the Bonferroni correction, we also report all the associations that reached the traditional genome-wide p-value threshold (<5.00E-08) as suggestive. Based on the genome-wide threshold, 8 craniofacial phenotypes demonstrated significant associations with 34 intergenic and extragenic SNPs. The majority of associations are novel, except PAX3 and COL11A1 genes, which were previously reported to affect normal craniofacial variation.
This study identified the largest number of genetic variants associated with normal variation of craniofacial morphology to date by using a candidate gene approach, including confirmation of the two previously reported genes. These results enhance our understanding of the genetics that determines normal variation in craniofacial morphology and will be of particular value in medical and forensic fields. Author Summary There is a remarkable variety of human facial appearances, almost exclusively the result of genetic differences, as exemplified by the striking resemblance of identical twins. However, the genes and specific genetic variants that affect the size and shape of the cranium and the soft facial tissue features are largely unknown. Numerous studies on animal models and human craniofacial disorders have identified a large number of genes, which may regulate normal craniofacial embryonic development. In this study we implemented a targeted candidate gene approach to select more than 1,200 polymorphisms in over 170 genes that are likely to be involved in craniofacial development and morphology. These markers were genotyped in 587 DNA samples using massively parallel sequencing and analysed for association with 104 traits generated from 3-dimensional facial images and direct craniofacial measurements. Genetic associations (p-values<5.00E-08) were observed between 8 craniofacial traits and 34 single nucleotide polymorphisms (SNPs), including two previously described genes and 26 novel candidate genes and intergenic regions. This comprehensive candidate gene study has uncovered the largest number of novel genetic variants affecting normal facial appearance to date. These results will appreciably extend our understanding of the normal and abnormal embryonic development and impact our ability to predict the appearance of an individual from a DNA sample in forensic criminal investigations and missing person cases.
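The Bonferroni correction referred to in the abstract above divides the desired family-wise error rate by the number of tests performed. The sketch below shows only that arithmetic. It assumes a family-wise rate of 0.05 and counts every marker-phenotype pair as one test; neither figure is stated in the abstract, so the resulting threshold is illustrative and may differ from the one the authors applied.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Counts quoted in the abstract.
n_markers = 2332
n_phenotypes = 104

# Assumption: one test per marker-phenotype pair, family-wise alpha of 0.05.
n_tests = n_markers * n_phenotypes
threshold = bonferroni_threshold(0.05, n_tests)
print(f"{n_tests} tests, per-test threshold {threshold:.2e}")
```

Because the threshold shrinks in proportion to the number of tests, the correction becomes very strict for studies of this size, which is consistent with the abstract describing it as over-conservative.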
... In 2013, Yu et al. constructed a dataset of Gram-negative bacterial secreted proteins which contains 839 secreted proteins [23]. The proteins are collected from three data sources, namely, SwissProt, TrEMBL [24], and RefSeq [25]. They used an improved PseAAC consisting of amino acid composition (AAC) and autocovariance (AC) to extract information from PSI-BLAST profile. ...
Article
Full-text available
Prediction of secreted protein types based solely on sequence data remains a challenging problem. In this study, we extract the long-range correlation information and linear correlation information from the position-specific score matrix (PSSM). A total of 6800 features are extracted at 17 different gaps; then, 309 features are selected by a filter feature selection method based on the training set. To verify the performance of our method, jackknife and independent dataset tests are performed on the test set and the reported overall accuracies are 93.60% and 100%, respectively. Comparison of our results with the existing method shows that our method provides favorable performance for secreted protein type prediction.
... We favor annotation from S. salmonicida because it is the closest sequenced relative with a manually annotated genome sequence [20]. The UniProtKB 20130905 database [79] and a database containing only S. salmonicida proteins were used for BLAST separately. Genes that are positioned very close to each other or overlap each other can be assembled into a single transcript, since we do not have strand-specific or paired-end reads. ...
Article
Full-text available
Background It is generally thought that the evolutionary transition to parasitism is irreversible because it is associated with the loss of functions needed for a free-living lifestyle. Nevertheless, free-living taxa are sometimes nested within parasite clades in phylogenetic trees, which could indicate that they are secondarily free-living. Herein, we test this hypothesis by studying the genomic basis for evolutionary transitions between lifestyles in diplomonads, a group of anaerobic eukaryotes. Most described diplomonads are intestinal parasites or commensals of various animals, but there are also free-living diplomonads found in oxygen-poor environments such as marine and freshwater sediments. All these nest well within groups of parasitic diplomonads in phylogenetic trees, suggesting that they could be secondarily free-living. Results We present a transcriptome study of Trepomonas sp. PC1, a diplomonad isolated from marine sediment. Analysis of the metabolic genes revealed a number of proteins involved in degradation of the bacterial membrane and cell wall, as well as an extended set of enzymes involved in carbohydrate degradation and nucleotide metabolism. Phylogenetic analyses showed that most of the differences in metabolic capacity between free-living Trepomonas and the parasitic diplomonads are due to recent acquisitions of bacterial genes via gene transfer. Interestingly, one of the acquired genes encodes a ribonucleotide reductase, which frees Trepomonas from the need to scavenge deoxyribonucleosides. The transcriptome included a gene encoding squalene-tetrahymanol cyclase. This enzyme synthesizes the sterol substitute tetrahymanol in the absence of oxygen, potentially allowing Trepomonas to thrive under anaerobic conditions as a free-living bacterivore, without depending on sterols from other eukaryotes. 
Conclusions Our findings are consistent with the phylogenetic evidence that the last common ancestor of diplomonads was dependent on a host and that Trepomonas has adapted secondarily to a free-living lifestyle. We believe that similar studies of other groups where free-living taxa are nested within parasites could reveal more examples of secondarily free-living eukaryotes. Electronic supplementary material The online version of this article (doi:10.1186/s12915-016-0284-z) contains supplementary material, which is available to authorized users.
... UniProtKB/Swiss-Prot (version: 23.10.2014) [57] extended by seven metagenomes [11, 15, 19, 20] was used as the protein database. The results of the database search were submitted to PRIDE [58] with the accession number PXD003526. ...
Article
Full-text available
Background: Methane yield and biogas productivity of biogas plants (BGPs) depend on microbial community structure and function, substrate supply, and general biogas process parameters. So far, however, relatively little is known about correlations between microbial community function and process parameters. To close this knowledge gap, microbial communities of 40 samples from 35 different industrial biogas plants were evaluated by a metaproteomics approach in this study. Results: Liquid chromatography coupled to tandem mass spectrometry (Orbitrap Elite™ Hybrid Ion Trap-Orbitrap Mass Spectrometer) of all 40 samples in triplicate enabled the identification of 3138 different metaproteins belonging to 162 biological processes and 75 different taxonomic orders. The respective database searches were performed against UniProtKB/Swiss-Prot and seven metagenome databases. Subsequent clustering and principal component analysis of these data allowed for the identification of four main clusters associated with mesophilic and thermophilic process conditions, the use of upflow anaerobic sludge blanket reactors and BGP feeding with sewage sludge. Observations confirm a previous phylogenetic study of the same BGP samples that was based on 16S rRNA gene sequencing by De Vrieze et al. (Water Res 75:312-323, 2015). In particular, we identified similar microbial key players of biogas processes, namely Bacillales, Enterobacteriales, Bacteroidales, Clostridiales, Rhizobiales and Thermoanaerobacteriales as well as Methanobacteriales, Methanosarcinales and Methanococcales. For the elucidation of the main biomass degradation pathways, the most abundant 1 % of metaproteins was assigned to the KEGG map 1200 representing the central carbon metabolism. Additionally, the effect of the process parameters (i) temperature, (ii) organic loading rate (OLR), (iii) total ammonia nitrogen (TAN), and (iv) sludge retention time (SRT) on these pathways was investigated.
For example, high TAN correlated with hydrogenotrophic methanogens and bacterial one-carbon metabolism, indicating syntrophic acetate oxidation. Conclusions: This is the first large-scale metaproteome study of BGPs. Proteotyping of BGPs reveals general correlations between the microbial community structure and its function with process parameters. The monitoring of changes on the level of microbial key functions or even of the microbial community represents a well-directed tool for the identification of process problems and disturbances. Graphical abstract: Correlation between the different orders and process parameters, as well as principal component analysis of all investigated biogas plants based on the identified metaproteins.
... An ontology is a tool to provide meaning to data, the information of which can then be subjected to algorithmic processing [6, 7]. For example, the Gene Ontology [6] provides additional information on the genomic level, the NCBI Taxonomy [8] provides information about the nomenclature of species, and UniProt [9] provides information about proteins. We believe that a similar approach should be taken for the semantic description of differences between versions of a model. ...
Article
Full-text available
Background: Open model repositories provide ready-to-reuse computational models of biological systems. Models within those repositories evolve over time, leading to different model versions. Taken together, the underlying changes reflect a model's provenance and thus can give valuable insights into the studied biology. Currently, however, changes cannot be semantically interpreted. To improve this situation, we developed an ontology of terms describing changes in models. The ontology can be used by scientists and within software to characterise model updates at the level of single changes. When studying or reusing a model, these annotations help with determining the relevance of a change in a given context. Methods: We manually studied changes in selected models from BioModels and the Physiome Model Repository. Using the BiVeS tool for difference detection, we then performed an automatic analysis of changes in all models published in these repositories. The resulting set of concepts led us to define candidate terms for the ontology. In a final step, we aggregated and classified these terms and built the first version of the ontology. Results: We present COMODI, an ontology needed because COmputational MOdels DIffer. It empowers users and software to describe changes in a model on the semantic level. COMODI also enables software to implement user-specific filter options for the display of model changes. Finally, COMODI is a step towards predicting how a change in a model influences the simulation results. Conclusion: COMODI, coupled with our algorithm for difference detection, ensures the transparency of a model's evolution, and it enhances the traceability of updates and error corrections. COMODI is encoded in OWL. It is openly available at http://comodi.sems.uni-rostock.de/ .
... The activity of protein kinases is affected by the alteration of functionally relevant residues involved, for example, in catalysis or phosphorylation. In the implementation of KinMutRF, residue annotations in UniProt [53] define functionally relevant amino acids. The residue annotations include the following categories: active sites (act_site), general (binding) or specialised binding (carbohyd, metal, np_bind), disulfide bonding, experimentally modified residues (mod_res), repeat regions (repeat), signal peptides (signal), transmembrane regions (transmem) and zinc fingers (zn_fing), among other broadly defined sites. ...
Article
Full-text available
Background: The association between aberrant signal processing by protein kinases and human diseases such as cancer was established a long time ago. However, understanding the link between sequence variants in the protein kinase superfamily and the mechanistic complex traits at the molecular level remains challenging: cells tolerate most genomic alterations and only a minor fraction disrupt molecular function sufficiently and drive disease. Results: KinMutRF is a novel random-forest method to automatically identify pathogenic variants in human kinases. Twenty-six decision trees implemented as a random forest ponder a battery of features that characterize the variants: a) at the gene level, including membership to a Kinbase group and Gene Ontology terms; b) at the PFAM domain level; and c) at the residue level, the types of amino acids involved, changes in biochemical properties, functional annotations from UniProt, Phospho.ELM and FireDB. KinMutRF identifies disease-associated variants satisfactorily (Acc: 0.88, Prec: 0.82, Rec: 0.75, F-score: 0.78, MCC: 0.68) when trained and cross-validated with the 3689 human kinase variants from UniProt that have been annotated as neutral or pathogenic. All unclassified variants were excluded from the training set. Furthermore, KinMutRF is discussed with respect to two independent kinase-specific sets of mutations not included in the training and testing, Kin-Driver (643 variants) and Pon-BTK (1495 variants). Moreover, we provide predictions for the 848 protein kinase variants in UniProt that remained unclassified. A public implementation of KinMutRF, including documentation and examples, is available online (http://kinmut2.bioinfo.cnio.es). The source code for local installation is released under a GPL version 3 license, and can be downloaded from https://github.com/Rbbt-Workflows/KinMut2. Conclusions: KinMutRF is capable of classifying kinase variation with good performance.
Predictions by KinMutRF compare favorably in a benchmark with other state-of-the-art methods (i.e. SIFT, Polyphen-2, MutationAssessor, MutationTaster, LRT, CADD, FATHMM, and VEST). Kinase-specific features rank as the most elucidatory in terms of information gain and are likely responsible for the improvement in prediction performance. This advocates for the development of family-specific classifiers able to exploit the discriminatory power of features unique to individual protein families.
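As a quick consistency check on the metrics quoted in the abstract above, the F-score can be recomputed from the reported precision and recall. This is a minimal sketch that assumes the F-score is the standard F1 (harmonic mean of precision and recall); the function name `f_score` is illustrative and is not taken from the KinMutRF code.

```python
def f_score(precision: float, recall: float) -> float:
    """Return the F1 score, i.e. the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Values as quoted in the abstract (Prec: 0.82, Rec: 0.75).
reported_precision = 0.82
reported_recall = 0.75

f1 = f_score(reported_precision, reported_recall)
print(round(f1, 2))  # prints 0.78
```

Under that assumption the recomputed value agrees with the reported F-score of 0.78.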
... Functional information for marine-accelerated and marine-decelerated genes in table 1 was taken from the UniProt and RefSeq databases, and from literature cited directly (Pruitt et al. 2007; Uniprot Consortium 2007). Computational tests for functional enrichment were performed using the hypergeometric test with the background set of genes restricted to genes that were tested for marine convergence and had at least one annotation in the corresponding annotation file. ...
Article
Mammal species have made the transition to the marine environment several times, and their lineages represent one of the classical examples of convergent evolution in morphological and physiological traits. Nevertheless, the genetic mechanisms of their phenotypic transition are poorly understood, and investigations into convergence at the molecular level have been inconclusive. While past studies have searched for convergent changes at specific amino acid sites, we propose an alternative strategy to identify those genes that experienced convergent changes in their selective pressures, visible as changes in evolutionary rate specifically in the marine lineages. We present evidence of widespread convergence at the gene level by identifying parallel shifts in evolutionary rate during three independent episodes of mammalian adaptation to the marine environment. Hundreds of genes accelerated their evolutionary rates in all three marine mammal lineages during their transition to aquatic life. These marine-accelerated genes are highly enriched for pathways that control recognized functional adaptations in marine mammals, including muscle physiology, lipid-metabolism, sensory systems, and skin and connective tissue. The accelerations resulted from both adaptive evolution as seen in skin and lung genes, and loss of function as in gustatory and olfactory genes. In regard to sensory systems, this finding provides further evidence that reduced senses of taste and smell are ubiquitous in marine mammals. Our analysis demonstrates the feasibility of identifying genes underlying convergent organism-level characteristics on a genome-wide scale and without prior knowledge of adaptations, and provides a powerful approach for investigating the physiological functions of mammalian genes.
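The citing passage for this entry mentions a hypergeometric test for functional enrichment against a restricted background gene set. A minimal sketch of such a one-sided test is given below; it is not the authors' implementation, the function name `hypergeom_enrichment_p` is hypothetical, and the gene counts in the usage lines are invented purely for illustration.

```python
from math import comb


def hypergeom_enrichment_p(background: int, annotated: int,
                           selected: int, overlap: int) -> float:
    """One-sided hypergeometric p-value P(X >= overlap).

    background: number of genes in the background set
    annotated:  background genes carrying the annotation of interest
    selected:   genes in the tested set (e.g. rate-accelerated genes)
    overlap:    selected genes that carry the annotation
    """
    upper = min(annotated, selected)
    total = comb(background, selected)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(annotated, i) * comb(background - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts, not taken from the study.
p_value = hypergeom_enrichment_p(background=1000, annotated=50,
                                 selected=100, overlap=12)
print(p_value)
```

Restricting `background` to genes that were actually tested and have at least one annotation, as the passage describes, keeps the null distribution from being inflated by genes that could never have been counted.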
Article
Full-text available
Despite being identified over a hundred years ago, there is still no commercially available vaccine for the highly contagious and deadly African swine fever virus (ASFV). This study used immunoinformatics for the rapid and inexpensive designing of a safe and effective multi-epitope subunit vaccine for ASFV. A total of 18,858 proteins from 100 well-annotated ASFV proteomes were screened using various computational tools to identify potential epitopes, or peptides capable of triggering an immune response in swine. Proteins from genotypes I and II were prioritized for their involvement in the recent global ASFV outbreaks. The screened epitopes exhibited promising qualities that positioned them as effective components of the ASFV vaccine. They demonstrated antigenicity, immunogenicity, and cytokine-inducing properties indicating their ability to induce potent immune responses. They have strong binding affinities to multiple swine allele receptors suggesting a high likelihood of yielding more amplified responses. Moreover, they were non-allergenic and non-toxic, a crucial prerequisite for ensuring safety and minimizing any potential adverse effects when the vaccine is processed within the host. Integrated with an immunogenic 50S ribosomal protein adjuvant and linkers, the epitopes formed a 364-amino acid multi-epitope subunit vaccine. The ASFV vaccine construct exhibited notable immunogenicity in immune simulation and molecular docking analyses, and stable profiles in secondary and tertiary structure assessments. Moreover, this study designed an optimized codon for efficient translation of the ASFV vaccine construct into the Escherichia coli K-12 expression system using the pET28a(+) vector. Overall, both sequence and structural evaluations suggested the potential of the ASFV vaccine construct as a candidate for controlling and eradicating outbreaks caused by the pathogen.
Article
Full-text available
A long-standing question is to what degree genetic drift and selection drive the divergence in rare accessory gene content between closely related bacteria. Rare genes, including singletons, make up a large proportion of pangenomes (all genes in a set of genomes), but it remains unclear how many such genes are adaptive, deleterious or neutral to their host genome. Estimates of species’ effective population sizes (Ne) are positively associated with pangenome size and fluidity, which has independently been interpreted as evidence for both neutral and adaptive pangenome models. We hypothesized that pseudogenes, used as a neutral reference, could be used to distinguish these models. We find that most functional categories are depleted for rare pseudogenes when a genome encodes only a single intact copy of a gene family. In contrast, transposons are enriched in pseudogenes, suggesting they are mostly neutral or deleterious to the host genome. Thus, even if individual rare accessory genes vary in their effects on host fitness, we can confidently reject a model of entirely neutral or deleterious rare genes. We also define the ratio of singleton intact genes to singleton pseudogenes (si/sp) within a pangenome, compare this measure across 668 prokaryotic species and detect a signal consistent with the adaptive value of many rare accessory genes. Taken together, our work demonstrates that comparing with pseudogenes can improve inferences of the evolutionary forces driving pangenome variation.
Article
Full-text available
Angelica sinensis roots (Angelica roots) are rich in many bioactive compounds, including phthalides, coumarins, lignans, and terpenoids. However, the molecular bases for their biosynthesis are still poorly understood. Here, an improved chromosome-scale genome for A. sinensis var. Qinggui1 is reported, with a size of 2.16 Gb, contig N50 of 4.96 Mb and scaffold N50 of 198.27 Mb, covering 99.8% of the estimated genome. Additionally, by integrating genome sequencing, metabolomic profiling, and transcriptome analysis of normally growing and early-flowering Angelica roots that exhibit dramatically different metabolite profiles, the pathways and critical metabolic genes for the biosynthesis of these major bioactive components in Angelica roots have been deciphered. Multiomic analyses have also revealed the evolution and regulation of key metabolic genes for the biosynthesis of pharmaceutically bioactive components; in particular, TPSs for terpenoid volatiles, ACCs for malonyl CoA, PKSs for phthalide, and PTs for coumarin biosynthesis were expanded in the A. sinensis genome. These findings provide new insights into the biosynthesis of pharmaceutically important compounds in Angelica roots for exploration of synthetic biology and genetic improvement of herbal quality.
Article
Machine learning algorithms play an essential role in bioinformatics and allow exploring the vast and noisy biological data in unrivalled ways. This paper is a systematic review of the applications of machine learning in the study of HIV neutralizing antibodies. This significant and vast research domain can pave the way to novel treatments and to a vaccine. We selected the relevant papers by investigating the available literature from the Web of Science and PubMed databases in the last decade. The computational methods are applied in neutralization potency prediction, neutralization span prediction against multiple viral strains, antibody-virus binding sites detection, enhanced antibodies design, and the study of the antibody-induced immune response. These methods are viewed from multiple angles spanning data processing, model description, feature selection, evaluation, and sometimes paper comparisons. The algorithms are diverse and include supervised, unsupervised, and generative types. Both classical machine learning and modern deep learning were taken into account. The review ends with our ideas regarding future research directions and challenges.
Article
Full-text available
Protein-protein interactions (PPIs) are key drivers of cell function and evolution. While it is widely assumed that most permanent PPIs are important for cellular function, it remains unclear whether transient PPIs are equally important. Here, we estimate and compare dispensable content among transient PPIs and permanent PPIs in human. Starting with a human reference interactome mapped by experiments, we construct a human structural interactome by building three-dimensional structural models for PPIs, and then distinguish transient PPIs from permanent PPIs using several structural and biophysical properties. We map common mutations from healthy individuals and disease-causing mutations onto the structural interactome, and perform structure-based calculations of the probabilities for common mutations (assumed to be neutral) and disease mutations (assumed to be mildly deleterious) to disrupt transient PPIs and permanent PPIs. Using Bayes’ theorem we estimate that a similarly small fraction (<~20%) of both transient and permanent PPIs are completely dispensable, i.e., effectively neutral upon disruption. Hence, transient and permanent interactions are subject to similarly strong selective constraints in the human interactome.
Article
Machine learning (ML) is revolutionizing our ability to understand and predict the complex relationships between protein sequence, structure, and function. Predictive sequence–function models are enabling protein engineers to efficiently search the sequence space for useful proteins with broad applications in biotechnology. In this review, we highlight the recent advances in applying ML to protein engineering. We discuss supervised learning methods that infer the sequence–function mapping from experimental data and new sequence representation strategies for data-efficient modeling. We then describe the various ways in which ML can be incorporated into protein engineering workflows, including purely in silico searches, ML-assisted directed evolution, and generative models that can learn the underlying distribution of the protein function in a sequence space. ML-driven protein engineering will become increasingly powerful with continued advances in high-throughput data generation, data science, and deep learning.
Article
Full-text available
It is unclear if coexistence theory can be applied to gut microbiomes to understand their characteristics and modulate their composition. Through experiments in gnotobiotic mice with complex microbiomes, we demonstrated that strains of Akkermansia muciniphila and Bacteroides vulgatus could only be established if microbiomes were devoid of these species. Strains of A. muciniphila showed strict competitive exclusion, while B. vulgatus strains coexisted but populations were still influenced by competitive interactions. These differences in competitive behavior were reflective of genomic variation within the two species, indicating considerable niche overlap for A. muciniphila strains and a broader niche space for B. vulgatus strains. Priority effects were detected for both species as strains’ competitive fitness increased when colonizing first, which resulted in stable persistence of the A. muciniphila strain colonizing first and competitive exclusion of the strain arriving second. Based on these observations, we devised a subtractive strategy for A. muciniphila using antibiotics and showed that a strain from an assembled community can be stably replaced by another strain. By demonstrating that competitive outcomes in gut ecosystems depend on niche differences and are historically contingent, our study provides novel information to explain the ecological characteristics of gut microbiomes and a basis for their modulation.
Article
Full-text available
The development of effective and safe vaccines is the ultimate way to efficiently stop the ongoing COVID-19 pandemic, which is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Built on the fact that SARS-CoV-2 utilizes the association of its Spike (S) protein with the human Angiotensin-converting enzyme 2 (ACE2) receptor to invade host cells, we computationally redesigned the S protein sequence to improve its immunogenicity and antigenicity. Toward this purpose, we extended an evolutionary protein design algorithm, EvoDesign, to create thousands of stable S protein variants that perturb the core protein sequence but keep the surface conformation and B cell epitopes. The T cell epitope content and similarity scores of the perturbed sequences were calculated and evaluated. Out of 22,914 designs with favorable stability energy, 301 candidates contained at least two pre-existing immunity-related epitopes and had promising immunogenic potential. The benchmark tests showed that, although the epitope restraints were not included in the scoring function of EvoDesign, the top S protein design successfully recovered 31 out of the 32 major histocompatibility complex (MHC) -II T cell promiscuous epitopes in the native S protein, where two epitopes were present in all seven human coronaviruses. Moreover, the newly designed S protein introduced nine new MHC-II T cell promiscuous epitopes that do not exist in the wildtype SARS-CoV-2. These results demonstrated a new and effective avenue to enhance a target protein’s immunogenicity using rational protein design, which could be applied for new vaccine design against COVID-19 and other human viruses.
Article
Full-text available
The practical application of nanoparticles (NPs) as chemotherapeutic drug delivery systems is often hampered by issues such as poor circulation stability and targeting inefficiency. Here, we have utilized a simple approach to prepare biocompatible and biodegradable pH-responsive hybrid NPs that overcome these issues. The NPs consist of a drug-loaded polylactic-co-glycolic acid (PLGA) core covalently ‘wrapped’ with a crosslinked bovine serum albumin (BSA) shell designed to minimize interactions with serum proteins and macrophages that inhibit target recognition. The shell is functionalized with the acidity-triggered rational membrane (ATRAM) peptide to facilitate internalization specifically into cancer cells within the acidic tumor microenvironment. Following uptake, the unique intracellular conditions of cancer cells degrade the NPs, thereby releasing the chemotherapeutic cargo. The drug-loaded NPs showed potent anticancer activity in vitro and in vivo while exhibiting no toxicity to healthy tissue. Our results demonstrate that the ATRAM-BSA-PLGA NPs are a promising targeted cancer drug delivery platform. Palanikumar et al. prepare pH-responsive nanoparticles with drug-loaded PLGA core, cross-linked BSA corona to avoid opsonisation, and functionalised with ATRAM peptide that binds the cell membrane at low pH such as tumour microenvironment. The nanoparticles display both in vitro and in vivo efficacy while evading recognition by macrophages.
Article
Full-text available
The pharmacological activity of Acacia nilotica’s phytochemical constituents has been confirmed by evidence-based studies, but the exact targets they bind and their mechanisms of action have not been determined; consequently, we aim to identify the targets responsible for the pharmacological activity via computational methods. Furthermore, we aim to predict the pharmacokinetic (ADME) properties and the safety profile in order to identify the best drug candidates. To achieve those goals, various computational methods were used, including ligand-based virtual screening and molecular docking. Moreover, the pkCSM and SwissADME web servers were used for the prediction of pharmacokinetics and safety. The total number of investigated compounds and targets was 25 and 61, respectively. According to the results, the pharmacological activity was attributed to the interaction with essential targets. Ellagic acid, Kaempferol, and Quercetin were the A. nilotica phytochemical constituents that contributed most to the therapeutic activities, and they were non-toxic as well as non-carcinogenic. The administration of Ellagic acid, Kaempferol, and Quercetin as a combined drug via novel drug delivery systems will be a valuable therapeutic choice for the treatment of recent diseases attacking the public health including cancer, multidrug-resistant bacterial infections, diabetes mellitus, and chronic inflammatory systemic disease.
Article
Full-text available
The Unfolded Protein Response (UPR) is an adaptive pathway that restores cellular homeostasis after endoplasmic reticulum (ER) stress. The ER-resident kinase/ribonuclease Ire1 is the only UPR sensor conserved during evolution. Autophagy, a lysosomal degradative pathway, also contributes to the recovery of cell homeostasis after ER-stress but the interplay between these two pathways is still poorly understood. We describe the Dictyostelium discoideum ER-stress response and characterize its single bona fide Ire1 orthologue, IreA. We found that tunicamycin (TN) triggers a gene-expression reprogramming that increases the protein folding capacity of the ER and alleviates ER protein load. Further, IreA is required for cell-survival after TN-induced ER-stress and is responsible for nearly 40% of the transcriptional changes induced by TN. The response of Dictyostelium cells to ER-stress involves the combined activation of an IreA-dependent gene expression program and the autophagy pathway. These two pathways are independently activated in response to ER-stress but, interestingly, autophagy requires IreA at a later stage for proper autophagosome formation. We propose that unresolved ER-stress in cells lacking IreA causes structural alterations of the ER, leading to a late-stage blockade of autophagy clearance. This unexpected functional link may critically affect eukaryotic cell survival under ER-stress.
Article
tRNA-derived fragments (tRFs) constitute a new class of short regulatory RNAs that are a product of nascent or mature tRNA processing. tRF sequences have been identified in all domains of life; however, most published research pertains to human, yeast and some bacterial organisms. Despite growing interest in plant tRFs and accumulating evidence of their function in plant development and stress responses, no public, web-based repository dedicated to these molecules is currently available. Here, we introduce tRex (http://combio.pl/trex) - the first comprehensive data-driven online resource specifically dedicated to tRNA-derived fragments in the model plant Arabidopsis thaliana. The portal is based on verified Arabidopsis tRNA annotation and includes in-house generated and publicly available sRNA-Seq experiments from various tissues, ecotypes, genotypes and stress conditions. Provided web-based tools are designed in a user-friendly manner and allow for seamless exploration of the data that is presented in the form of dynamic tables and cumulative coverage profiles. The tRex database is connected to external genomic and citation resources, which makes it a one-stop solution for Arabidopsis tRF-related research.
Article
Full-text available
Background While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. Results The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with unusually many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs posed the greatest difficulty for gene-finders. Conclusion This experiment establishes a baseline of gene prediction accuracy in Caenorhabditis genomes, and has guided the choice of gene-finders for the annotation of newly sequenced genomes of Caenorhabditis and other nematode species. We have created new gene sets for C. briggsae, C. remanei, C. brenneri, C. japonica, and Brugia malayi using some of the best-performing gene-finders.
Article
Full-text available
The genomes of four strains (MB11, MB14, MB30, and MB66) of the species Corynebacterium pseudotuberculosis biovar equi were sequenced on the Ion Torrent PGM platform, completely assembled, and their gene content and structure were analyzed. The strains were isolated from horses with distinct signs of infection, including ulcerative lymphangitis, external abscesses on the chest, or internal abscesses on the liver, kidneys, and lungs. The average size of the genomes was 2.3 Mbp, with 2169 (Strain MB11) to 2235 (Strain MB14) predicted coding sequences (CDSs). An optical map of the MB11 strain generated using the KpnI restriction enzyme showed that the approach used to assemble the genome was satisfactory, producing good alignment between the sequence observed in vitro and that obtained in silico. In the resulting Neighbor-Joining dendrogram, the C. pseudotuberculosis strains sequenced in this study were clustered into a single clade supported by a high bootstrap value. The structural analysis showed that the genomes of the MB11 and MB14 strains were very similar, while the MB30 and MB66 strains had several inverted regions. The observed genomic characteristics were similar to those described for other strains of the same species, despite the number of inversions found. These genomes will serve as a basis for determining the relationship between the genotype of the pathogen and the type of infection that it causes.
Article
Full-text available
Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable—those that have the crucial ability to share information, enabling smooth integration and reusability.
Article
Full-text available
Engineering the coenzyme specificity of redox enzymes plays an important role in metabolic engineering, synthetic biology, and biocatalysis, but it has rarely been applied to bioelectrochemistry. Here we develop a rational design strategy to change the coenzyme specificity of 6-phosphogluconate dehydrogenase (6PGDH) from the hyperthermophilic bacterium Thermotoga maritima from its natural coenzyme NADP⁺ to NAD⁺. Through amino acid-sequence alignment of NADP⁺- and NAD⁺-preferred 6PGDH enzymes and computer-aided substrate-coenzyme docking, the key amino acid residues responsible for binding the phosphate group of NADP⁺ were identified. Four mutants were obtained via site-directed mutagenesis. The best mutant N32E/R33I/T34I exhibited a ~6.4 × 10⁴-fold reversal of the coenzyme selectivity from NADP⁺ to NAD⁺. The maximum power density and current density of the biobattery catalyzed by the mutant were 0.135 mW cm⁻² and 0.255 mA cm⁻², ~25% higher than those obtained from the wild-type 6PGDH-based biobattery at room temperature. By using this 6PGDH mutant, the optimal temperature of running the biobattery was as high as 65 °C, leading to a high power density of 1.75 mW cm⁻². This study demonstrates coenzyme engineering of a hyperthermophilic 6PGDH and its application to high-temperature biobatteries.
Article
Full-text available
Arenibacter sp. strain C-21, isolated from surface marine sediment of Japan, accumulates iodine in the presence of glucose and iodide (I⁻). We report here the draft genome sequence of this strain to provide insight into the molecular mechanism underlying its iodine-accumulating ability.
Chapter
In general, a rare or orphan disease is any disease that affects a small percentage of the population. Since a majority of the known orphan diseases are genetic, they are present throughout the life of affected individuals. Many of the orphan diseases appear early in life and approximately 30 % of children with orphan diseases die before the age of 5. Further, a large majority of these diseases lack effective treatments. While most of genes and pathways underlying orphan diseases remain obscure, technological advances and innovative informatics approaches are expected to accelerate the rate of identification of underlying causal mutations and therapeutic discovery. Recent technological advances in DNA sequencing for instance, can aid in identifying genes associated with orphan diseases of previously unknown etiology using DNA from as few as 2–4 patients. Likewise, advanced computational statistical techniques permit integration and mining of omics data from orphan disease patients with high throughput “signatures” representing cellular responses to perturbing agents to identify therapeutic candidates for orphan diseases. In this chapter, we review some of the current bioinformatic analytical options available for orphan disease and drug research including computational approaches for candidate gene prioritization and high throughput compound screening to enable therapeutic discovery. We also discuss strategies and present examples and case studies of common drugs being repositioned for treatment of orphan diseases.
Article
Full-text available
A significant obstacle in training predictive cell models is the lack of integrated data sources. We develop semi-supervised normalization pipelines and perform experimental characterization (growth, transcriptional, proteome) to create Ecomics, a consistent, quality-controlled multi-omics compendium for Escherichia coli with cohesive meta-data information. We then use this resource to train a multi-scale model that integrates four omics layers to predict genome-wide concentrations and growth dynamics. The genetic and environmental ontology reconstructed from the omics data is substantially different and complementary to the genetic and chemical ontologies. The integration of different layers confers an incremental increase in the prediction performance, as does the information about the known gene regulatory and protein-protein interactions. The predictive performance of the model ranges from 0.54 to 0.87 for the various omics layers, which far exceeds various baselines. This work provides an integrative framework of omics-driven predictive modelling that is broadly applicable to guide biological discovery.
Article
Full-text available
We present a new strategy for systematic identification of phosphotyrosine (pTyr) by affinity purification mass spectrometry (AP-MS) using a Src homology 2 (SH2)-domain-derived pTyr superbinder as the affinity reagent. The superbinder allows for markedly deeper coverage of the Tyr phosphoproteome than anti-pTyr antibodies when an optimal amount is used. We identified ∼20,000 distinct phosphotyrosyl peptides and >10,000 pTyr sites, of which 36% were 'novel', from nine human cell lines using the superbinder approach. Tyrosine kinases, SH2 domains and phosphotyrosine phosphatases were preferably phosphorylated, suggesting that the toolkit of kinase signaling is subject to intensive regulation by phosphorylation. Cell-type-specific global kinase activation patterns inferred from label-free quantitation of Tyr phosphorylation guided the design of experiments to inhibit cancer cell proliferation by blocking the highly activated tyrosine kinases. Therefore, the superbinder is a highly efficient and cost-effective alternative to conventional antibodies for systematic and quantitative characterization of the tyrosine phosphoproteome under normal or pathological conditions.
Article
Full-text available
Weed control failures due to herbicide resistance are an increasing and worldwide problem significantly impacting crop yields. Herbicide resistance due to increased herbicide metabolism in weeds is not well characterized at the genetic level. An RNA-Seq transcriptome analysis was used to identify genes conferring metabolism-based herbicide resistance (MBHR) in a population (R) of a major global weed (Lolium rigidum), in which resistance to the herbicide diclofop-methyl was experimentally evolved through recurrent selection from a susceptible (S) progenitor population. A reference transcriptome of 19,623 contigs was assembled using 454 sequencing technology on a cDNA library and annotated using UniProt and Pfam databases. Transcriptomic-level gene expression was measured using Illumina 100 bp reads from untreated control, mock, and diclofop-methyl treatments of R and S. Due to the established importance of cytochrome P450 (CytP450), glutathione-S-transferase (GST), and glucosyltransferase (GT) genes in MBHR, 11 contigs with these annotations and higher constitutive expression in untreated R than in untreated S were selected as candidate genes for hypothesis testing, along with 17 additional differentially expressed contigs with annotations related to metabolism or signal transduction. In a forward genetics validation experiment, higher constitutive expression of nine contigs co-segregated with the resistance phenotype in an F2 population, including 3 CytP450, 3 GST, and 1 GT. At least nine genes with heritable increased constitutive expression are associated with the MBHR trait. In a physiological validation experiment where 2,4-D pre-treatment induced diclofop-methyl protection in S individuals due to increased metabolism, seven of the nine genetically-validated contigs were significantly induced. These data help explain accumulation of resistance-endowing genes and rapid evolution of MBHR, and provide the opportunity to improve diagnostics of MBHR using molecular tools such as transcriptional markers.
Article
Full-text available
The protein coding sequences of the human reference genome GRCh38, RefSeq mRNA and UniProt protein databases are sometimes inconsistent with each other, due to polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we comprehensively listed the discordant bases and regions between the GRCh38, RefSeq and UniProt reference sequences, based on the genomic coordinates of GRCh38. We observed that the RefSeq sequences are more likely to represent the major alleles than GRCh38 and UniProt, by assigning the alternative allele frequencies of the discordant bases. Since some reference sequences have minor alleles, functional and structural annotations may be performed based on rare alleles in the human population, thereby biasing these analyses. Some of the differences between the RefSeq and GRCh38 account for biological differences due to known RNA-editing sites. The definitions of the coding regions are frequently complicated by possible micro-exons within introns and by SNVs with large alternative allele frequencies near exon-intron boundaries. The mRNA or protein regions missing from GRCh38 were mainly due to small deletions, and these sequences need to be identified. Taken together, our results clarify overall consistency and remaining inconsistency between the reference sequences.
Article
Full-text available
The single subunit T7 RNA polymerase (T7RNAP) is a model enzyme for studying the transcription process and for various biochemical and biophysical studies. Heparin is a commonly used inhibitor against T7RNAP and other RNA polymerases. However, the exact interaction between heparin and T7RNAP is still not completely understood. In this work, we analyzed the binding pattern of heparin by docking heparin and a few of its low molecular weight derivatives to T7RNAP, which helps in better understanding the T7RNAP inhibition mechanism. The efficiency of the compounds was calculated by docking the selected compounds and post-docking molecular mechanics/generalized Born surface area analysis. Evaluation of the simulation trajectories and binding free energies of the complexes after simulation showed enoxaparin to be the best among low molecular weight heparins. Binding free energy analysis revealed that van der Waals interactions and polar solvation energy provided the substantial driving force for the binding process. Furthermore, per-residue free energy decomposition analysis revealed that the residues Asp 471, Asp 506, Asp 537, Tyr 571, Met 635, Asp 653, Pro 780, and Asp 812 are important for heparin interaction. Apart from these residues, the most favorable contribution in all three complexes came from Asp 506, Tyr 571, Met 635, Glu 652, and Asp 653, which can be essential for binding of heparin-like structures with T7RNAP. The results obtained from this study will be valuable for the future rational design of novel and potent inhibitors against T7RNAP and related proteins.
Article
Using bioinformatics analysis, the homologs of genes Sr33 and Sr35 were identified in the genomes of Triticum aestivum, Hordeum vulgare, and Triticum urartu. It is known that these genes confer resistance to highly virulent wheat stem rust races (Ug99). To identify amino acid sites important for this resistance, the identified homologs were compared with the Sr33 and Sr35 protein sequences. It was found that sequences S5DMA6 and E9P785 are the closest homologs of protein RGA1e, a Sr33 gene product, and sequences M7YFA9 (CNL-C) and F2E9R2 are homologs of protein CNL9, a Sr35 gene product. It is assumed that the homologs of genes Sr33 and Sr35, which were obtained from the wild relatives of wheat and barley, can confer resistance to various forms of stem rust and can be used in future breeding programs aimed at improvement of national wheat varieties.
Chapter
Epitranscriptomics is the study of global modification patterns to both coding and noncoding RNA. Understanding the epitranscriptomic profile of disease states or individual patients is imperative to understanding human health and molecular disease pathology. Modifications have long been established as important determinants of tRNA stability, dynamics, and ribosome binding and of maintenance of the translational reading frame. These modifications also serve as biomarkers for several human diseases, including type 2 diabetes, cardiac dysfunction, intellectual disability, and skin, breast, and colorectal cancers. Of particular note, several mitochondrial disorders trace their molecular pathogenesis to deficiencies in specific tRNA modifications. Pathology can also be attributed to mutations affecting protein recognition of tRNA substrates. However, protein recognition of RNA modification is at present an underdeveloped field and the subject of increasing attention. Epitranscriptomic profiling will be readily achievable with new advances in the detection of RNA modifications by peptides and mass spectrometry at the attomole level. These technologies will allow for single-cell analysis of modifications and will serve as a platform for increased sensitivity for biomarker identification. Thus, RNA modifications are a real-time code to RNA structure and function that has yet to be deciphered.
Article
Full-text available
Importance: Lactobacillus acidophilus is one of the most widely used probiotic microbes incorporated across many dairy foods and dietary supplements. This organism produces a Surface (S-) layer, which is a self-assembling crystalline array found as the outermost layer of the cell wall. The S-layer, along with co-localized associated proteins, is an important mediator of probiotic activity through intestinal adhesion and modulation of the mucosal immune system. However, there is still a dearth of information regarding the basic cellular and evolutionary function of S-layers. Here, we demonstrate that multiple autolysins, responsible for breaking down the cell wall during cell division, are associated with the S-layer. Deletion of the gene encoding one of these S-layer associated autolysins confirmed its autolytic role, and resulted in reduced binding capacity to mucin and intestinal extracellular matrices. These data suggest a functional association between the S-layer and autolytic activity through the extracellular presentation of autolysins.
Article
Full-text available
A novel mannose-specific lectin, named CGL1 (15.5 kDa), was isolated from the oyster Crassostrea gigas. Characterization of CGL1 involved isothermal titration calorimetry (ITC), glycoconjugate microarray, and frontal affinity chromatography (FAC). This analysis revealed that CGL1 has strict specificity for the mannose monomer and for high mannose-type N-glycans (HMTGs). Primary structure of CGL1 did not show any homology with known lectins but did show homology with proteins of the natterin family. Crystal structure of the CGL1 revealed a unique homodimer in which each protomer was composed of 2 domains related by a pseudo two-fold axis. Complex structures of CGL1 with mannose molecules showed that residues have 8 hydrogen bond interactions with O1, O2, O3, O4, and O5 hydroxyl groups of mannose. The complex interactions that are not observed with other mannose-binding lectins revealed the structural basis for the strict specificity for mannose. These characteristics of CGL1 may be helpful as a research tool and for clinical applications.
Article
Full-text available
Genomics, epigenomics, transcriptomics, proteomics and metabolomics efforts rapidly generate a plethora of data on the activity and levels of biomolecules within mammalian cells. At the same time, curation projects that organize knowledge from the biomedical literature into online databases are expanding. Hence, there is a wealth of information about genes, proteins and their associations, with an urgent need for data integration to achieve better knowledge extraction and data reuse. For this purpose, we developed the Harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins from over 70 major online resources. We extracted, abstracted and organized data into ∼72 million functional associations between genes/proteins and their attributes. Such attributes could be physical relationships with other biomolecules, expression in cell lines and tissues, genetic associations with knockout mouse or human phenotypes, or changes in expression after drug treatment. We stored these associations in a relational database along with rich metadata for the genes/proteins, their attributes and the original resources. The freely available Harmonizome web portal provides a graphical user interface, a web service and a mobile app for querying, browsing and downloading all of the collected data. To demonstrate the utility of the Harmonizome, we computed and visualized gene–gene and attribute–attribute similarity networks, and through unsupervised clustering, identified many unexpected relationships by combining pairs of datasets such as the association between kinase perturbations and disease signatures. We also applied supervised machine learning methods to predict novel substrates for kinases, endogenous ligands for G-protein coupled receptors, mouse phenotypes for knockout genes, and classified unannotated transmembrane proteins for likelihood of being ion channels. The Harmonizome is a comprehensive resource of knowledge about genes and proteins, and as such, it enables researchers to discover novel relationships between biological entities, as well as form novel data-driven hypotheses for experimental validation. Database URL: http://amp.pharm.mssm.edu/Harmonizome.
Article
Full-text available
Identification and analysis of host–pathogen interactions (HPI) is essential to study infectious diseases. However, HPI data are sparse in existing molecular interaction databases, especially for agricultural host–pathogen systems. Therefore, resources that annotate, predict and display the HPI that underpin infectious diseases are critical for developing novel intervention strategies. HPIDB 2.0 (http://www.agbase.msstate.edu/hpi/main.html) is a resource for HPI data, and contains 45,238 manually curated entries in the current release. Since the first description of the database in 2010, multiple enhancements to HPIDB data and interface services were made that are described here. Notably, HPIDB 2.0 now provides targeted biocuration of molecular interaction data. As a member of the International Molecular Exchange consortium, annotations provided by HPIDB 2.0 curators meet community standards to provide detailed contextual experimental information and facilitate data sharing. Moreover, HPIDB 2.0 provides access to rapidly available community annotations that capture minimum molecular interaction information to address immediate researcher needs for HPI network analysis. In addition to curation, HPIDB 2.0 integrates HPI from existing external sources and contains tools to infer additional HPI where annotated data are scarce. Compared to other interaction databases, our data collection approach ensures HPIDB 2.0 users access the most comprehensive HPI data from a wide range of pathogens and their hosts (594 pathogen and 70 host species, as of February 2016). Improvements also include enhanced search capacity, addition of Gene Ontology functional information, and implementation of network visualization. The changes made to HPIDB 2.0 content and interface ensure that users, especially agricultural researchers, are able to easily access and analyse high quality, comprehensive HPI data. All HPIDB 2.0 data are updated regularly, are publicly available for direct download, and are disseminated to other molecular interaction resources. Database URL: http://www.agbase.msstate.edu/hpi/main.html
Article
Full-text available
CRISPR interference (CRISPRi) represents a newly developed tool for targeted gene repression. It has great application potential for studying gene function and mapping gene regulatory elements. However, the optimal parameters for efficient single guide RNA (sgRNA) design for CRISPRi are not fully defined. In this study, we systematically assessed how sgRNA position affects the efficiency of CRISPRi in human cells. We analyzed 155 sgRNAs targeting 41 genes and found that CRISPRi efficiency relies heavily on the precise recruitment of the effector complex to the target gene transcription start site (TSS). Importantly, we demonstrate that the FANTOM5/CAGE promoter atlas represents the most reliable source of TSS annotations for this purpose. We also show that the proximity to the FANTOM5/CAGE-defined TSS predicts sgRNA functionality on a genome-wide scale. Moreover, we found that once the correct TSS is identified, CRISPRi efficiency can be further improved by considering sgRNA sequence preferences. Lastly, we demonstrate that CRISPRi sgRNA functionality largely depends on the chromatin accessibility of a target site, with high efficiency focused in the regions of open chromatin. In summary, our work provides a framework for efficient CRISPRi assay design based on functionally defined TSSs and features of the target site chromatin.
Article
Full-text available
The mangrove rivulus (Kryptolebias marmoratus) is one of two preferentially self-fertilizing hermaphroditic vertebrates. This mode of reproduction makes mangrove rivulus an important model for evolutionary and biomedical studies because long periods of self-fertilization result in naturally homozygous genotypes that can produce isogenic lineages without significant limitations associated with inbreeding depression. Over 400 isogenic lineages currently held in laboratories across the globe show considerable among-lineage variation in physiology, behavior, and life history traits that is maintained under common garden conditions. Temperature mediates the development of primary males and also sex change between hermaphrodites and secondary males, which makes the system ideal for the study of sex determination and sexual plasticity. Mangrove rivulus also exhibit remarkable adaptations to living in extreme environments, and the system has great promise to shed light on the evolution of terrestrial locomotion, aerial respiration, and broad tolerances to hypoxia, salinity, temperature, and environmental pollutants. Genome assembly of the mangrove rivulus allows the study of genes and gene families associated with the traits described above. Here we present a de novo assembled reference genome for the mangrove rivulus, with an approximately 900 Mb genome, including 27,328 annotated, predicted, protein-coding genes. Moreover, we are able to place more than 50% of the assembled genome onto a recently published linkage map. The genome provides an important addition to the linkage map and transcriptomic tools recently developed for this species that together provide critical resources for epigenetic, transcriptomic, and proteomic analyses. Moreover, the genome will serve as the foundation for addressing key questions in behavior, physiology, toxicology, and evolutionary biology.
Article
Full-text available
The UniProt consortium was formed in 2002 by groups from the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) at Georgetown University, and soon afterwards the website http://www.uniprot.org was set up as a central entry point to UniProt resources. Requests to this address were redirected to one of the three organisations' websites. While these sites shared a set of static pages with general information about UniProt, their pages for searching and viewing data were different. To provide users with a consistent view and to cut the cost of maintaining three separate sites, the consortium decided to develop a common website for UniProt. Following several years of intense development and a year of public beta testing, the http://www.uniprot.org domain was switched to the newly developed site described in this paper in July 2008. The UniProt consortium is the main provider of protein sequence and annotation data for much of the life sciences community. The http://www.uniprot.org website is the primary access point to this data and to documentation and basic tools for the data. These tools include full text and field-based text search, similarity search, multiple sequence alignment, batch retrieval and database identifier mapping. This paper discusses the design and implementation of the new website, which was released in July 2008, and shows how it improves data access for users with different levels of experience, as well as for machines through programmatic access. http://www.uniprot.org/ is open for both academic and commercial use. The site was built with open source tools and libraries. Feedback is very welcome and should be sent to help@uniprot.org. The new UniProt website makes accessing and understanding UniProt easier than ever. The two main lessons learned are that getting the basics right for such a data provider website has huge benefits, but is not trivial and easy to underestimate, and that there is no substitute for using empirical data throughout the development process to decide on what is and what is not working for your users.
Article
Full-text available
UniProtKB/Swiss-Prot, a curated protein database, and dictyBase, the Model Organism Database for Dictyostelium discoideum, have established a collaboration to improve data sharing. One of the major steps in this effort was the 'Dicty annotation marathon', a week-long exercise with 30 annotators aimed at achieving a major increase in the number of D. discoideum proteins represented in UniProtKB/Swiss-Prot. The marathon led to the annotation of over 1000 D. discoideum proteins in UniProtKB/Swiss-Prot. Concomitantly, there were a large number of updates in dictyBase concerning gene symbols, protein names and gene models. This exercise demonstrates how UniProtKB/Swiss-Prot can work in very close cooperation with model organism databases and how the annotation of proteins can be accelerated through those collaborations.
Article
Full-text available
Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.
Article
Full-text available
Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which provide the feed for these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. We present here our new repository for next generation sequence data, a brief summary of contents of the ENA and provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.
Article
Full-text available
Functional partnerships between proteins are at the core of complex cellular phenotypes, and the networks formed by interacting proteins provide researchers with crucial scaffolds for modeling, data reduction and annotation. STRING is a database and web resource dedicated to protein–protein interactions, including both physical and functional interactions. It weights and integrates information from numerous sources, including experimental repositories, computational prediction methods and public text collections, thus acting as a meta-database that maps all interaction evidence onto a common set of genomes and proteins. The most important new developments in STRING 8 over previous releases include a URL-based programming interface, which can be used to query STRING from other resources, improved interaction prediction via genomic neighborhood in prokaryotes, and the inclusion of protein structures. Version 8.0 of STRING covers about 2.5 million proteins from 630 organisms, providing the most comprehensive view on protein–protein interactions currently available. STRING can be reached at http://string-db.org/.
Article
Full-text available
To cope with the increasing amount of sequence data, reliable automatic annotation tools are required. The TrEMBL database, together with SWISS-PROT, contains nearly all publicly available protein sequences but, in contrast to SWISS-PROT, only limited functional annotation. To improve this situation, we had to develop a method of automatic annotation that produces highly reliable functional prediction using the language and the syntax of SWISS-PROT. An algorithm was developed and successfully used for the automatic annotation of a test set of unknown proteins. The predicted information included description, function, catalytic activity, cofactors, pathway, subcellular location, quaternary structure, similarity to other proteins, active sites, and keywords. The algorithm showed a low coverage (10%), but a high specificity and reliability. The results can be obtained by anonymous ftp from ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb. The source code is available on request from the authors.
Article
Full-text available
We have sequenced and annotated the genome of fission yeast (Schizosaccharomyces pombe), which contains the smallest number of protein-coding genes yet recorded for a eukaryote: 4,824. The centromeres are between 35 and 110 kilobases (kb) and contain related repeats including a highly conserved 1.8-kb element. Regions upstream of genes are longer than in budding yeast (Saccharomyces cerevisiae), possibly reflecting more-extended control regions. Some 43% of the genes contain introns, of which there are 4,730. Fifty genes have significant similarity with human disease genes; half of these are cancer related. We identify highly conserved genes important for eukaryotic cell organization including those required for the cytoskeleton, compartmentation, cell-cycle control, proteolysis, protein phosphorylation and RNA splicing. These genes may have originated with the appearance of eukaryotic life. Few similarly conserved genes that are important for multicellular organization were identified, suggesting that the transition from prokaryotes to eukaryotes required more new genes than did the transition from unicellular to multicellular organization.
Article
Full-text available
The Protein Information Resource (PIR) is an integrated public resource of protein informatics. To facilitate the sensible propagation and standardization of protein annotation and the systematic detection of annotation errors, PIR has extended its superfamily concept and developed the SuperFamily (PIRSF) classification system. Based on the evolutionary relationships of whole proteins, this classification system allows annotation of both specific biological and generic biochemical functions. The system adopts a network structure for protein classification from superfamily to subfamily levels. Protein family members are homologous (sharing common ancestry) and homeomorphic (sharing full-length sequence similarity with common domain architecture). The PIRSF database consists of two data sets, preliminary clusters and curated families. The curated families include family name, protein membership, parent-child relationship, domain architecture, and optional description and bibliography. PIRSF is accessible from the website at http://pir.georgetown.edu/pirsf/ for report retrieval and sequence classification. The report presents family annotation, membership statistics, cross-references to other databases, graphical display of domain architecture, and links to multiple sequence alignments and phylogenetic trees for curated families. PIRSF can be utilized to analyze phylogenetic profiles, to reveal functional convergence and divergence, and to identify interesting relationships between homeomorphic families, domains and structural classes.
Article
Full-text available
UniProt Archive (UniParc) is the most comprehensive, non-redundant protein sequence database available. Its protein sequences are retrieved from predominant, publicly accessible resources. All new and updated protein sequences are collected and loaded daily into UniParc for full coverage. To avoid redundancy, each unique sequence is stored only once with a stable protein identifier, which can be used later in UniParc to identify the same protein in all source databases. When proteins are loaded into the database, database cross-references are created to link them to the origins of the sequences. As a result, performing a sequence search against UniParc is equivalent to performing the same search against all databases cross-referenced by UniParc. UniParc contains only protein sequences and database cross-references; all other information must be retrieved from the source databases. Availability: http://www.ebi.ac.uk/uniparc/
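The storage rule described in this abstract (each unique sequence kept once under a stable identifier, with cross-references pointing back to every source record) can be illustrated with a minimal Python sketch. The class names, the `ARCH` identifier format, the SHA-256 key and the example database names are all assumptions made for illustration; they are not UniParc's actual schema or implementation.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    """One unique sequence plus the source records it was seen in."""
    identifier: str            # hypothetical stable identifier
    sequence: str
    xrefs: set = field(default_factory=set)   # {(source_db, source_id), ...}


class SequenceArchive:
    """Toy non-redundant archive: one entry per distinct sequence."""

    def __init__(self):
        self._by_checksum = {}

    def load(self, sequence, source_db, source_id):
        # Normalise so that case and surrounding whitespace do not create duplicates.
        seq = sequence.strip().upper()
        key = hashlib.sha256(seq.encode("ascii")).hexdigest()
        entry = self._by_checksum.get(key)
        if entry is None:
            # First time this sequence is seen: assign the next identifier.
            identifier = f"ARCH{len(self._by_checksum) + 1:06d}"
            entry = ArchiveEntry(identifier, seq)
            self._by_checksum[key] = entry
        # Always record where the sequence came from.
        entry.xrefs.add((source_db, source_id))
        return entry

    def __len__(self):
        return len(self._by_checksum)


archive = SequenceArchive()
first = archive.load("MKTAYIAKQR", "DB_A", "A0001")
second = archive.load("mktayiakqr", "DB_B", "B0042")   # same sequence, other source
```

In this sketch both loads return the same entry, so a search over the archive's entries covers every cross-referenced source record at once, which is the equivalence the abstract describes.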
Article
Full-text available
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
Article
Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90% or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. Supplementary data are available at Bioinformatics online.
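Clustering at a sequence identity threshold can be sketched as a greedy procedure: take sequences longest first, and let each one either join the first cluster whose representative it matches at or above the threshold or start a new cluster. The sketch below is an illustration only, not the UniRef pipeline. In particular, `positional_identity` is a crude ungapped stand-in for the alignment-based identity a real system would use, and the function names and example sequences are hypothetical.

```python
def positional_identity(a, b):
    """Crude ungapped identity: matching positions over the shorter length.

    Assumption: a real pipeline would compute identity from an alignment.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering; returns a list of (representative, members)."""
    clusters = []
    # Longest sequences first, so each representative is the longest member.
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if positional_identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


def size_reduction(sequences, threshold):
    """Fraction of entries removed by keeping one representative per cluster."""
    return 1 - len(greedy_cluster(sequences, threshold)) / len(sequences)


example = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLLLL", "GGGGGGGGGG"]
clusters_90 = greedy_cluster(example, 0.9)
clusters_50 = greedy_cluster(example, 0.5)
```

Lowering the threshold merges more sequences into each cluster, which is why the 50% level gives a larger size reduction than the 90% level.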
Article
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive body of literature. Therefore, GO annotations available at SGD now include high-throughput data as well as computational predictions provided by the GO Annotation Project (GOA UniProt; http://www.ebi.ac.uk/GOA/). Because the annotation method used to assign GO annotations varies by data source, GO resources at SGD have been modified to distinguish data sources and annotation methods. In addition to providing information for genes that have not been experimentally characterized, GO annotations from independent sources can be compared to those made by SGD to help keep the literature-based GO annotations current.
Article
The Arabidopsis Information Resource (TAIR, http://arabidopsis.org) is the model organism database for the fully sequenced and intensively studied model plant Arabidopsis thaliana. Data in TAIR is derived in large part from manual curation of the Arabidopsis research literature and direct submissions from the research community. New developments at TAIR include the addition of the GBrowse genome viewer to the TAIR site, a redesigned home page, navigation structure and portal pages to make the site more intuitive and easier to use, the launch of several TAIR web services and a new genome annotation release (TAIR7) in April 2007. A combination of manual and computational methods was used to generate this release, which contains 27 029 protein-coding genes, 3889 pseudogenes or transposable elements and 1123 ncRNAs (32 041 genes in all, 37 019 gene models). A total of 681 new genes and 1002 new splice variants were added. Overall, 10 098 loci (one-third of all loci from the previous TAIR6 release) were updated for the TAIR7 release.
Article
The Ensembl project (http://www.ensembl.org) is a comprehensive genome information system featuring an integrated set of genome annotation, databases and other information for chordate and selected model organism and disease vector genomes. As of release 47 (October 2007), Ensembl fully supports 35 species, with preliminary support for six additional species. New species in the past year include platypus and horse. Major additions and improvements to Ensembl since our previous report include extensive support for functional genomics data in the form of a specialized functional genomics database, genome-wide maps of protein–DNA interactions and the Ensembl regulatory build; support for customization of the Ensembl web interface through the addition of user accounts and user groups; and increased support for genome resequencing. We have also introduced new comparative genomics-based data mining options and report on the continued development of our software infrastructure.
Article
The Ensembl Trace Archive (http://trace.ensembl.org/) and the EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/), known together as the European Nucleotide Archive, continue to see growth in data volume and diversity. Selected major developments of 2007 are presented briefly, along with data submission and retrieval information. In the face of increasing requirements for nucleotide trace, sequence and annotation data archiving, data capture priority decisions have been taken at the European Nucleotide Archive. Priorities are discussed in terms of how reliably information can be captured, the long-term benefits of its capture and the ease with which it can be captured.
Article
KEGG (http://www.genome.jp/kegg/) is a database of biological systems that integrates genomic, chemical and systemic functional information. KEGG provides a reference knowledge base for linking genomes to life through the process of PATHWAY mapping, which is to map, for example, a genomic or transcriptomic content of genes to KEGG reference pathways to infer systemic behaviors of the cell or the organism. In addition, KEGG provides a reference knowledge base for linking genomes to the environment, such as for the analysis of drug-target relationships, through the process of BRITE mapping. KEGG BRITE is an ontology database representing functional hierarchies of various biological objects, including molecules, cells, organisms, diseases and drugs, as well as relationships among them. KEGG PATHWAY is now supplemented with a new global map of metabolic pathways, which is essentially a combined map of about 120 existing pathway maps. In addition, smaller pathway modules are defined and stored in KEGG MODULE, which also contains other functional units and complexes. The KEGG resource is being expanded to suit the needs of practical applications. KEGG DRUG contains all approved drugs in the US and Japan, and KEGG DISEASE is a new database linking disease genes, pathways, drugs and diagnostic markers.
Chapter
Experimentally verified information on protein function lags far behind the rapid accumulation of protein sequences. The simple approach to propagating information from characterized proteins to unknown proteins, namely sequence similarity search against databases of individual proteins, may fail to produce accurate results and typically is used to transfer only protein name information. A more accurate, consistent, and comprehensive approach for large-scale automated annotation makes use of protein family classification-driven rules. Unannotated proteins that satisfy a set of conditions for a particular rule can be annotated with the information appropriate for that rule. The approach leads to facile, accurate prediction and functional inference for uncharacterized proteins, allows systematic detection of genome annotation errors, and provides sensible propagation and standardization of protein annotation, including position-specific sequence features, protein names and synonyms, and Gene Ontology terms. Rule-based annotation will be discussed in the context of the PIRSF protein classification system, PIRNR Name Rule system, and the PIRSR Site Rule system.
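The idea of a classification-driven rule (a protein that satisfies every condition of a rule receives that rule's annotations) can be sketched as follows. This is a simplified illustration, not the PIRSF, PIRNR or PIRSR implementation. The `Protein` and `Rule` classes, the family label `FAM_X`, the residue condition and the annotation values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    accession: str
    sequence: str
    family: str                                   # family assigned by a classifier
    annotations: dict = field(default_factory=dict)


@dataclass
class Rule:
    family: str               # condition 1: the protein must belong to this family
    required_residues: dict   # condition 2: 1-based position -> expected residue
    annotations: dict         # information propagated when all conditions hold

    def matches(self, protein):
        if protein.family != self.family:
            return False
        for position, residue in self.required_residues.items():
            if position > len(protein.sequence):
                return False
            if protein.sequence[position - 1] != residue:
                return False
        return True


def apply_rules(protein, rules):
    """Annotate the protein with every rule whose conditions it satisfies."""
    for rule in rules:
        if rule.matches(protein):
            for key, value in rule.annotations.items():
                # Existing annotations are never overwritten.
                protein.annotations.setdefault(key, value)
    return protein


rule = Rule("FAM_X", {3: "H"}, {"name": "Hypothetical hydrolase", "site": "His-3"})
hit = apply_rules(Protein("ACC1", "MAHK", "FAM_X"), [rule])
miss = apply_rules(Protein("ACC2", "MAGK", "FAM_X"), [rule])
```

Here `hit` receives both annotations, while `miss` belongs to the same family but lacks the required residue and stays unannotated. Requiring the position-specific condition in addition to family membership is what keeps a rule from propagating a site feature to a family member that cannot carry it.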
Article
Large-scale sequencing of prokaryotic genomes demands the automation of certain annotation tasks currently manually performed in the production of the SWISS-PROT protein knowledgebase. The HAMAP project, or 'High-quality Automated and Manual Annotation of microbial Proteomes', aims to integrate manual and automatic annotation methods in order to enhance the speed of the curation process while preserving the quality of the database annotation. Automatic annotation is only applied to entries that belong to manually defined orthologous families and to entries with no identifiable similarities (ORFans). Many checks are enforced in order to prevent the propagation of wrong annotation and to spot problematic cases, which are channelled to manual curation. The results of this annotation are integrated in SWISS-PROT, and a website is provided at http://www.expasy.org/sprot/hamap/.
Article
The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase (Swiss-Prot, TrEMBL and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO). As a supplementary archive of GO annotation, GOA promotes a high level of integration of the knowledge represented in UniProt with other databases. This is achieved by converting UniProt annotation into a recognized computational format. GOA provides annotated entries for nearly 60,000 species (GOA-SPTr) and is the largest and most comprehensive open-source contributor of annotations to the GO Consortium annotation effort. By integrating GO annotations from other model organism groups, GOA consolidates specialized knowledge and expertise to ensure the data remain a key reference for up-to-date biological information. Furthermore, the GOA database fully endorses the Human Proteomics Initiative by prioritizing the annotation of proteins likely to benefit human health and disease. In addition to a non-redundant set of annotations to the human proteome (GOA-Human) and monthly releases of its GO annotation for all species (GOA-SPTr), a series of GO mapping files and specific cross-references in other databases are also regularly distributed. GOA can be queried through a simple user-friendly web interface or downloaded in a parsable format via the EBI and GO FTP websites. The GOA data set can be used to enhance the annotation of particular model organism or gene expression data sets, although increasingly it has been used to evaluate GO predictions generated from text mining or protein interaction experiments. In 2004, the GOA team will build on its success and will continue to supplement the functional annotation of UniProt and work towards enhancing the ability of scientists to access all available biological information. 
Researchers wishing to query or contribute to the GOA project are encouraged to email: goa@ebi.ac.uk.
Article
The increasing availability of polymorphism data has allowed more gene association studies to be carried out and the number of published genetic association studies is growing rapidly. Studies done secondarily to successful linkage studies over the last decade have also fueled the increase in published association studies. Although there are single-nucleotide polymorphism and human variation databases [1, 2], there is currently no public repository for genetic association data. It is difficult to query association data in a systematic manner or to integrate association data with other molecular databases. OMIM [3], the main repository of genetic information for mendelian disorders, is largely text based and is of a historical narrative design, making it difficult to compare large sets of molecular data. Moreover, OMIM archives mature, high-quality data of high significance, the standard in rare mendelian disorders. Although this data is useful, OMIM does not routinely collect findings of lower significance or negative findings. The study of nonmendelian, common complex disorders is often a struggle to find disease relevance with lower significance values, and often conflicting evidence. Negative data are often not reported or are marginalized into obscure and less accessible scientific journals, resulting in a publication bias favoring positive genetic associations [4]. Here, we describe the development of a genetic association database (GAD; http://geneticassociationdb.nih.gov) that aims to collect, standardize and archive genetic association study data and to make it easily accessible to the scientific community.
Article
Increasingly, scientists have begun to tackle gene functions and other complex regulatory processes by studying organisms at a global scale for various levels of biological organization, ranging from genomes to metabolomes and physiomes. Meanwhile, new bioinformatics methods have been developed for inferring protein function using associative analysis of functional properties to complement the traditional sequence homology-based methods. Fully exploiting the value of high-throughput systems biology data and facilitating protein functional studies require bioinformatics infrastructures that support both data integration and associative analysis. The iProClass database, designed to serve as a framework for data integration in a distributed networking environment, provides comprehensive descriptions of all proteins, with rich links to over 50 databases of protein family, function, pathway, interaction, modification, structure, genome, ontology, literature, and taxonomy. In particular, the database is organized with PIRSF family classification and maps to other family, function, and structure classification schemes. Coupled with the underlying taxonomic information for complete genomes, the iProClass system (http://pir.georgetown.edu/iproclass/) supports associative studies of protein family, domain, function, and structure. A case study of the phosphoglycerate mutases illustrates a systematic approach for protein family and phylogenetic analysis. Such studies may serve as a basis for further analysis of protein functional evolution, and its relationship to the co-evolution of metabolic pathways, cellular networks, and organisms.