Article

Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices


Abstract

A two-stage neural network has been used to predict protein secondary structure based on the position-specific scoring matrices generated by PSI-BLAST. Despite the simplicity and convenience of the approach used, the results are found to be superior to those produced by other methods, including the popular PHD method, according to our own benchmarking results and the results from the recent Critical Assessment of Techniques for Protein Structure Prediction experiment (CASP3), where the method was evaluated by stringent blind testing. Using a new testing set based on a set of 187 unique folds, and three-way cross-validation based on structural similarity criteria rather than the sequence similarity criteria used previously (no similar folds were present in both the testing and training sets), the method presented here (PSIPRED) achieved an average Q3 score of between 76.5% and 78.3%, depending on the precise definition of observed secondary structure used, which is the highest published score for any method to date. Given the success of the method in CASP3, it is reasonable to be confident that the evaluation presented here gives a fair indication of the performance of the method in general.
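The Q3 score quoted in the abstract is the percentage of residues whose predicted three-state label (helix, strand, or coil) matches the observed label. The following is a minimal sketch of that calculation, assuming both labelings are given as equal-length strings over the letters H, E, and C. The function name `q3_score` and the example strings are hypothetical and are not taken from PSIPRED.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose three-state labels agree."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count positions where the predicted state equals the observed state.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    # Toy example: made-up strings, not real benchmark data.
    observed = "CCHHHHHCCEEEECC"
    predicted = "CCHHHHCCCEEEECC"
    print(f"Q3 = {q3_score(predicted, observed):.1f}%")
```

Because Q3 is a per-residue agreement rate, its value depends on how the observed states are defined, which is consistent with the range of scores reported in the abstract.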


... SOPMA [35] and PSIPRED [36] were used in predicting the secondary structure of the E proteins across the four DENV serotypes. Figure 3 displayed the percentages of alpha helices, beta turns, and random coils for each envelope (E) protein within the four DENV serotypes. ...
... Figure 3 displayed the percentages of alpha helices, beta turns, and random coils for each envelope (E) protein within the four DENV serotypes. According to the results obtained from SOPMA [35] and visually inspected using PSIPRED [36], the E protein of DENV-2 exhibited the highest content of alpha helices at 23.43%, whereas the corresponding protein of DENV-4 contained the least number of alpha helices at 19.57%. Moreover, the E protein of DENV-4 possessed the greatest percentage of beta turns at 8.97%, while the same protein in DENV-2 had the lowest at 7.27%. ...
Article
Full-text available
Dengue virus is classified into serotypes, DENV-1, DENV-2, DENV-3, and DENV-4, causing symptoms from mild fevers to severe complications. This study comparatively analyzes the structural and functional variations of envelope (E) protein across DENV serotypes through computational methods. Multiple sequence alignment, physicochemical characterization, and secondary and tertiary structural comparison are conducted using ClustalW, ProtParam, and SWISS-MODEL. DENV-1 and DENV-3 E proteins show the highest identity, while DENV-4 E protein has the lowest conserved regions. Findings also indicate that DENV-1 has the highest ability to severely infect, specifically in a host cell attachment. Structurally, DENV-2 is the most stable, while DENV-4 is the least pathogenic. Quantitative comparisons reveal statistically significant variation across serotypes, indicating variances in virion assembly, host binding, and infectivity. The computational analysis contributes to the molecular understanding of E protein. However, further characterizing serotype-specific variations in genotype-based studies is necessary due to the molecular diversity within each serotype.
... Molecular structure predictions were carried out using the PsiPred Workbench (http://bioinf.cs.ucl.ac.uk/psipred, accessed on 4 October 2023). Predictions of secondary structure were carried out with the PsiPred 4.0 tool [29]. Information on residues that can be predicted as disordered and that are structured by joining other proteins was extracted using the DisoPred 3.1 tool [30]. ...
... Only the zinc finger C2H2-type of SUP has had its secondary structure experimentally solved by NMR [33]. So, to gain perspective on protein structure among SUP, RA1, and RA1-like, secondary structure predictions were carried out using the PsiPred Workbench (http://bioinf.cs.ucl.ac.uk/psipred) and its PsiPred 4.0 tool [29]. We also used the server's DisoPred 3.1 tool [30] to predict regions that cannot be assigned a known secondary structure, which are termed disordered regions. ...
Article
Full-text available
RAMOSA1 (RA1) is a Cys2-His2-type (C2H2) zinc finger transcription factor that controls plant meristem fate and identity and has played an important role in maize domestication. Despite its importance, the origin of RA1 is unknown, and the evolution in plants is only partially understood. In this paper, we present a well-resolved phylogeny based on 73 amino acid sequences from 48 embryophyte species. The recovered tree topology indicates that, during grass evolution, RA1 arose from two consecutive SUPERMAN duplications, resulting in three distinct grass sequence lineages: RA1-like A, RA1-like B, and RA1; however, most of these copies have unknown functions. Our findings indicate that RA1 and RA1-like play roles in the nucleus despite lacking a traditional nuclear localization signal. Here, we report that copies diversified their coding region and, with it, their protein structure, suggesting different patterns of DNA binding and protein–protein interaction. In addition, each of the retained copies diversified regulatory elements along their promoter regions, indicating differences in their upstream regulation. Taken together, the evidence indicates that the RA1 and RA1-like gene families in grasses underwent subfunctionalization and neofunctionalization enabled by gene duplication.
... Secondary structure predictions were carried out using the PsiPred server [28]. Information on residues that can be predicted as disordered and that are structured by joining other proteins were extracted using DisoPred 3.1 server [29]. ...
... Only the zinc finger C2H2-type of SUP has had its secondary structure experimentally solved by NMR [45]. So, to gain perspective on protein structure among SUP, RA1, and RA1-like, disordered regions and secondary structures were predicted using the DisoPred 3.1 tool [29] and PsiPred server [28,46], respectively (Figure 4, Figure S10, and Table S4). Overall, the N-terminal region of SUP, RA1, and RA1-like is predicted as disordered with protein binding affinities (Figure 4, Figure S10, and Table S4). ...
Preprint
Full-text available
RAMOSA1 (RA1) is a Cys2-His2-type (C2H2) zinc finger transcription factor that controls plant meristem fate and identity and has played an important role in maize domestication. Despite its importance, the origin of RA1 is unknown and the evolution in plants is partially understood. In this paper, we present a well resolved phylogeny based on 73 amino acid sequences from 48 embryophyte species. The recovered tree topology indicates that, during grass evolution, RA1 arose from two consecutive SUPERMAN duplications resulting in three distinct grass sequence lineages: RA1-like A, RA1-like B, and RA1; however, most of these copies have unknown functions. Our findings indicate that RA1 and RA1-like play roles in the nucleus despite lacking a traditional nuclear localization signal. Here we report that copies diversified their coding region and, with it, their protein structure, suggesting different patterns of DNA binding and protein-protein interaction. In addition, each of the retained copies diversified regulatory elements along their promoter regions, indicating differences in their upstream regulation. Taken together, we propose that the RA1 and RA1-like gene families in grasses may have undergone subfunctionalization and neofunctionalization enabled by gene duplication.
... The vaccine construct's secondary structure was predicted using the PSIPRED tool (PSIPRED 4.0) (http://bioinf.cs.ucl.ac.uk/psipred/) [46], the SIMPA96 tool (https://npsa-prabi.ibcp.fr/cgibin/npsa_automat.pl?page=/NPSA/npsa_simpa96.html) [47], SOPMA (https://npsapbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_sopma.html) [47] with the parameters: 4 conformational states, 8 similarity thresholds, and 17 window widths. These servers are reliable, rapid, and effective for predicting the percentages or quantities of amino acids in α helix, β-sheet, and coil structural formations [48][49][50][51]. The tertiary structure was predicted using the SCRATCH Protein Predictor's 3Dpro tool (https://scratch.proteomics.ics.uci.edu/). ...
Preprint
Full-text available
Glioblastoma multiforme (GBM) is a highly aggressive form of brain cancer classified as grade 4 glioma, with a median survival of 12-14 months. Currently there is no cure, and conventional treatment outcomes are poor, making it imperative to develop novel treatments. WT1, also known as Wilms' Tumor 1, is a protein that is often overexpressed in GBM and has minimal expression in normal tissues; it has become a promising target for immunotherapy because of its oncogenicity and immunogenicity. This study aimed to develop a multi-epitope vaccine using immuno-informatics approaches that specifically targets the WT1 protein. The WT1 sequence was used to predict B and T cell epitopes, which showed probable antigenic, non-allergic and non-toxic properties. Using suitable linkers and adjuvants, the vaccine construct (370 amino acids) was prepared and then analyzed for solubility and physicochemical properties, by molecular docking to show receptor interactions, and by molecular dynamics to show the stability of the vaccine-receptor complexes. Subsequently, the vaccine sequence was back-translated (1110 nucleotides), and codon adaptation, in-silico cloning into the pET-28a (+) vector, and immune response simulations were performed. The designed vaccine lays the groundwork for future in-vitro and in-vivo studies and could potentially be developed into a novel treatment option for GBM patients.
... The benchmark dataset comprises FASTA sequences, which must be converted into PSSMs (Position Specific Scoring Matrices) (Jones, 1999; Jeong et al., 2010) for subsequent analysis. PSSM is a commonly used matrix representation method in bioinformatics, reflecting the conservation of each amino acid at specific positions within a set of sequences. ...
Article
Full-text available
The recognition of DNA Binding Proteins (DBPs) plays a crucial role in understanding biological functions such as replication, transcription, and repair. Although current sequence-based methods have shown some effectiveness, they often fail to fully utilize the potential of deep learning in capturing complex patterns. This study introduces a novel model, LGC-DBP, which integrates Long Short-Term Memory (LSTM), Gated Inception Convolution, and Improved Channel Attention mechanisms to enhance the prediction of DBPs. Initially, the model transforms protein sequences into Position Specific Scoring Matrices (PSSM), then processed through our deep learning framework. Within this framework, Gated Inception Convolution merges the concepts of gating units with the advantages of Graph Convolutional Network (GCN) and Dilated Convolution, significantly surpassing traditional convolution methods. The Improved Channel Attention mechanism substantially enhances the model’s responsiveness and accuracy by shifting from a single input to three inputs and integrating three sigmoid functions along with an additional layer output. These innovative combinations have significantly improved model performance, enabling LGC-DBP to recognize and interpret the complex relationships within DBP features more accurately. The evaluation results show that LGC-DBP achieves an accuracy of 88.26% and a Matthews correlation coefficient of 0.701, both surpassing existing methods. These achievements demonstrate the model’s strong capability in integrating and analyzing multi-dimensional data and mark a significant advancement over traditional methods by capturing deeper, nonlinear interactions within the data.
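The excerpt above describes a PSSM as a matrix reflecting how conserved each amino acid is at each position. A deliberately simplified sketch of that idea follows: it builds log-odds scores from a toy ungapped alignment. The uniform background frequency, the fixed pseudocount, and the helper name `simple_pssm` are assumptions made for illustration. PSI-BLAST itself uses sequence weighting, empirical background frequencies, and a more elaborate pseudocount scheme, none of which is reproduced here.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simple_pssm(alignment, pseudocount=1.0):
    """Build a log-odds matrix with one row (dict) per alignment column.

    Assumes an ungapped alignment of equal-length sequences and a uniform
    background frequency of 1/20; both are simplifications.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        column = [seq[pos] for seq in alignment]
        row = {}
        for aa in AMINO_ACIDS:
            # Smoothed frequency of this residue in the column.
            freq = (column.count(aa) + pseudocount) / total
            # Positive score: seen more often than the background expects.
            row[aa] = math.log2(freq / background)
        pssm.append(row)
    return pssm


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "ACD"]  # made-up sequences
    for position, scores in enumerate(simple_pssm(toy_alignment), start=1):
        best = max(scores, key=scores.get)
        print(position, best, round(scores[best], 2))
```

Each row holds 20 scores, so a protein of length L yields an L x 20 matrix, which is the shape usually fed to downstream predictors.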
... To construct the protein model for the target sequence, Modeller 9.12 suite (25-27) was used. The secondary structure was predicted using PsiPred (28). Alignments of target and template sequences are adjusted to avoid big gaps in the secondary structure domain. ...
Article
A large number of infections are caused by Salmonella worldwide at a remarkable pace. Salmonella enterica serotype Typhi (S. typhi), a gram-negative bacterium that causes disease only in humans, is the predominant causative agent of typhoid fever. Typhoid fever is most common in poor and undeveloped countries of Asia and Africa. One of the major components of virulence factors produced during Salmonella infection is Lipid A, which acts as a strong human immuno-modulator bacterial endotoxin. Regulation of the Lipid A biosynthetic pathway occurs at the second step, catalyzed by LpxC, a Zn2+ dependent metalloamidase. Systematic screening of a natural products library database provided us with a potent lead molecule {Pubchem CID: 1788783 (ZINC02133485) ((2S)-2-[[2-[3-(4-fluorophenyl)-5-methyl-7-oxofuro[3,2-g]chromen-6-yl]acetyl]amino]-3-(5-hydroxy-1H-indol-3-yl)propanoic acid)} which actively binds (binding energy: -10.7 kJ/mol and Kd: 14.35 nM) with the LpxC enzyme and could be developed into a sound and potent inhibitor of the LpxC enzyme after the application of drug development and processing strategies. Wet lab experimentation is required to validate these results for further use.
... This result was in good accordance with the predicted secondary structure calculated using the PSIPRED server protein secondary structure prediction algorithm (http://bioinf.cs.ucl.ac.uk/psipred/) [37], which predicted four α-helices in the CdFur N-terminal domain, while the C-terminal domain consisted of both α-helices and β-sheets (Fig. 3B). The secondary structure motif distribution obtained was analogous to that shown by other Fur homologs whose 3D-structure had already been solved [24,38,39]. ...
Article
Full-text available
Clostridioides (formerly Clostridium) difficile is a leading cause of infectious diarrhea associated with antibiotic therapy. The ability of this anaerobic pathogen to acquire enough iron to proliferate under iron limitation conditions imposed by the host largely determines its pathogenicity. However, since high intracellular iron catalyzes formation of deleterious reactive hydroxyl radicals, iron uptake is tightly regulated at the transcriptional level by the ferric uptake regulator Fur. Several studies relate lacking a functional fur gene in C. difficile cells to higher oxidative stress sensitivity, colonization defect and less toxigenicity, although Fur does not appear to directly regulate either oxidative stress response genes or pathogenesis genes. In this work, we report the functional characterization of C. difficile Fur and describe an additional oxidation sensing Fur‐mediated mechanism independent of iron, which affects Fur DNA‐binding. Using electrophoretic mobility shift assays, we show that Fur binding to the promoters of fur, feoA and fldX genes, identified as iron and Fur‐regulated genes in vivo, is specific and does not require co‐regulator metal under reducing conditions. Fur treatment with H2O2 produces dose‐dependent soluble high molecular weight species unable to bind to target promoters. Moreover, Fur oligomers are dithiotreitol sensitive, highlighting the importance of some interchain disulfide bond(s) for Fur oligomerization, and hence for activity. Additionally, the physiological electron transport chain NADPH‐thioredoxin reductase/thioredoxin from Escherichia coli reduces inactive oligomerized C. difficile Fur that recovers activity. In conjunction with available transcriptomic data, these results suggest a previously underappreciated complexity in the control of some members of the Fur regulon that is based on Fur redox properties and might be fundamental for the adaptive response of C. difficile during infection.
... [60] In order to identify putative docking domains at the extremities of the toblerol PKS subunits, the boundaries of their adjacent domains were determined by multiple sequence alignment of known PKS domains using Clustal Omega [61] and HHpred [62]. Secondary structure prediction of the putative docking domains was performed with PSIPRED [36,63], and disorder and interaction propensity were predicted with IUPred3/ANCHOR2 [37]. As no confident secondary structure prediction emerged, the regions directly down-/upstream of the flanking domains were expressed as TobC C DD and TobE N DD, respectively. ...
Article
Full-text available
The fidelity of biosynthesis by modular polyketide synthases (PKSs) depends on specific, moderate affinity interactions between successive polypeptide subunits mediated by docking domains (DDs). These sequence elements are notably portable,...
... Various methods have been proposed for feature extraction in recent years. Some of the methods include mathematical statistics and spectrum analysis [2], pseudoamino acid composition [3], and position-specific scoring matrix (PSSM) (David [4]). PSSM is a commonly used method for feature extraction because it fully uses the position information of amino acids. ...
Article
Full-text available
Protein structure prediction is one of the main research areas in the field of Bio-informatics. The importance of proteins in drug design attracts researchers for finding the accurate tertiary structure of the protein which is dependent on its secondary structure. In this paper, we focus on improving the accuracy of protein secondary structure prediction. To do so, a Multi-scale convolutional neural network with a Gated recurrent neural network (MCNN-GRNN) is proposed. The novel amino acid encoding method along with layered convolutional neural network and Gated recurrent neural network blocks helps to retrieve local and global relationships between features, which in turn effectively classify the input protein sequence into 3 and 8 states. We have evaluated our algorithm on CullPDB, CB513, PDB25, CASP10, CASP11, CASP12, CASP13, and CASP14 datasets. We have compared our algorithm with different state-of-the-art algorithms like DCNN-SS, DCRNN, MUFOLD-SS, DLBLS_SS, and CGAN-PSSP. The Q3 accuracy of the proposed algorithm is 82–87% and Q8 accuracy is 69–77% on different datasets.
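The 3-state and 8-state labels mentioned above are usually related by collapsing the eight DSSP classes into three. The sketch below uses one commonly seen convention (H, G, I to helix; E, B to strand; everything else to coil). Other reductions exist, and the citing paper does not state which one it used, so the mapping and the helper name `reduce_to_three_states` are assumptions for illustration only.

```python
# One commonly used 8-to-3 reduction of DSSP codes (an assumption here;
# stricter schemes map only H to helix and only E to strand).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",  # alpha, 3-10, and pi helices
    "E": "E", "B": "E",            # extended strand and isolated bridge
    "T": "C", "S": "C", "-": "C",  # turn, bend, and unassigned
}


def reduce_to_three_states(dssp: str) -> str:
    """Collapse a DSSP 8-state string into H/E/C; unknown codes become C."""
    return "".join(EIGHT_TO_THREE.get(code, "C") for code in dssp)


if __name__ == "__main__":
    print(reduce_to_three_states("--HHHHGGTT-EEEEB-SS"))
```

Switching to a different reduction changes which residues count as helix or strand, so Q3 values computed under different schemes are not directly comparable.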
... Sequence alignment was done using Clustal Omega online (ClustalW) and compared with the sequences deposited in the National Center for Biotechnology Information (NCBI) database. The secondary and tertiary structures of the proteins were predicted by PSIPRED (Buchan et al. 2010;Jones 1999) and SWISS-MODEL (Arnold et al. 2006). For this purpose, 10 samples of each genotype (Brazil and Spain) were sequenced. ...
Article
Full-text available
Stevia plants are well-known for their ability to synthesize steviol glycosides (SGs), a natural sweetener blend. The principal SGs include stevioside (STV) and Rebaudioside-A (Reb-A), with the latter exhibiting superior sweetness and organoleptic properties. UDP glucosyltransferase-76G1 (UGT76G1) is responsible for converting STV to Reb-A, determining the intensity of sweetness. A better understanding of the structure/activity of SrUGT76G1 could provide insights into Reb-A production in stevia plants. To this end, a combination of enzymatic assays and sequencing analysis was performed using two stevia genotypes (Brazilian and Spanish) with contrasting Reb-A production capabilities (off/on). Relative expression of SrUGT76G1 gene showed remarkably higher expression (~ threefold) in Spanish samples compared to Brazilian ones. Foliar protein fractions (crude or partially purified extract) from Brazil plants were unable to convert STV into Reb-A under in vitro conditions, resulting in undetectable levels of Reb-A by HPLC. Molecular analyses revealed that the Brazilian SrUGT76G1 gene not only presents a premature stop codon, resulting in the absence of PSPG motif responsible for the binding of glycosyl groups, but also exhibits mutations affecting key amino acid residues in the acceptor-binding pocket. These alterations provide a plausible explanation for the Brazilian protein inability to catalyze the transformation of STV into Reb-A.
... This server was chosen because of its availability, blended modelling approach, & enactment in the CASP cooperation. The I-TASSER technique contains general steps such as threading, structural assemblage, model assortment, refining, & structure founded functional annotation [19,20]. The request sequence was then LOMETS [21] routed through the representative PDB structure collection. ...
Article
Full-text available
Background: ZIKV is one of the re-emerging arboviruses (viruses carried by arthropods), which is spread through the Aedes mosquito. It is an RNA virus with only one strand that is appropriate to the family Flaviviridae's Flavivirus (genus) & has been linked to other Flaviviruses such as the West Nile virus, chikungunya virus, & dengue (DENV) virus. The envelope, precursor membrane, and capsid are three structural proteins, and seven nonstructural proteins are also encoded by the Zika virus genome. Methods: We conducted an in-silico analysis of the Zika virus' MTase domain protein for this publication. We predicted that methylation would play a significant role in the available Prosite, Pfam, and InterProScan tools to aid in locating the MTase domain. Along with alignment, amino acid composition, charged amino acids, atomic level studies, & molecular weight, we also make predictions for these variables, including theoretical Pi. Results: We also examine the MTase domain's simulated structure (alpha helix, beta sheet, turn) and its specifics, including secondary structure. We also pinpoint the locations where proteins, DNA, and RNA bind. Potential phosphorylation sites can be found on the Ser, Thr, and Tyr residues in the MTase domain. Conclusion: These outcomes imply a complicated interaction between different phosphorylation modifications that modulates the activity of the MTase domain. To fully appreciate the auxiliary and practical perspectives and to clarify the varied roles of PTM in the MTase domain will be a primary goal of future study.
... COQ2 secondary structure, disordered regions and transmembrane helices were predicted using PSIPRED 4.0 [31], DISOPRED3 [32] and MEMSAT-SVM (Nugent and Jones 2009), respectively, all of which are available at the PSIPRED Workbench [33]. ...
Preprint
Full-text available
Coenzyme Q (CoQ) is a lipidic compound widely distributed in nature with crucial functions in metabolism, protection against oxidative damage and ferroptosis, and other processes. CoQ biosynthesis is a conserved and complex pathway involving several proteins. COQ2 is a member of the UbiA family of transmembrane prenyltransferases that catalyzes the condensation of the head and tail precursors of CoQ, a key step in the process because its product is the first intermediate that will be modified in the head by the next component of the synthesis process. Mutations in this protein have been linked to primary CoQ deficiency in humans, a rare disease predominantly affecting organs with a high energy demand. The reaction catalyzed by COQ2 and its mechanism are still unknown. Here we aimed at clarifying COQ2 reaction by exploring possible substrate binding sites using a strategy based on homology, comprising the identification of ligand-bound homologs with solved structures available in the Protein Data Bank (PDB) and their subsequent structural superposition to the AlphaFold predicted model for COQ2. The results highlight some residues located on the central cavity or the matrix loops that may be involved in substrate interaction, some of them mutated in primary CoQ deficiency patients. Furthermore, we analyze the structural modifications introduced by the pathogenic mutations found in humans. These findings shed new light on the understanding of COQ2 function and, thus, CoQ biosynthesis and pathogenicity of primary CoQ deficiency.
... The template with the top total score (the product of the scores of BLAST alignment, the WHAT_CHECK [37] quality, and the target coverage) from the hits found was selected as the main template. For alignment correction and modeling of loops, the target sequence profile creation and secondary structure prediction were made by running PSI-BLAST and PSI-Pred algorithms [38], respectively. After building the three initial homology models, they were sorted by their overall Z-scores; a number indicates standard deviations in the quality of the protein model and high-resolution X-ray structures [36,37,39]. ...
Article
Full-text available
Bacteriophage endolysins are potential alternatives to conventional antibiotics for treating multidrug-resistant gram-negative bacterial infections. However, their structure–function relationships are poorly understood, hindering their optimization and application. In this study, we focused on the individual functionality of the C-terminal muramidase domain of Gp127, a modular endolysin from E. coli O157:H7 bacteriophage PhaxI. This domain is responsible for the enzymatic activity, whereas the N-terminal domain binds to the bacterial cell wall. Through protein modeling, docking experiments, and molecular dynamics simulations, we investigated the activity, stability, and interactions of the isolated C-terminal domain with its ligand. We also assessed its expression, solubility, toxicity, and lytic activity using the experimental data. Our results revealed that the C-terminal domain exhibits high activity and toxicity when tested individually, and its expression is regulated in different hosts to prevent self-destruction. Furthermore, we validated the muralytic activity of the purified refolded protein by zymography and standardized assays. These findings challenge the need for the N-terminal binding domain to arrange the active site and adjust the gap between crucial residues for peptidoglycan cleavage. Our study shed light on the three-dimensional structure and functionality of muramidase endolysins, thereby enriching the existing knowledge pool and laying a foundation for accurate in silico modeling and the informed design of next-generation enzybiotic treatments.
... We used PSIPred 4.0 (51,52) or DSSP (53,54) analysis of AlphaFold (55,56) structures to predict the secondary structure of the small proteins in this study. Signal peptide prediction was performed using the SignalP-6.0 ...
Article
Full-text available
Significant efforts have been made to characterize the biophysical properties of proteins. Small proteins have received less attention because their annotation has historically been less reliable. However, recent improvements in sequencing, proteomics, and bioinformatics techniques have led to the high-confidence annotation of small open reading frames (smORFs) that encode for functional proteins, producing smORF-encoded proteins (SEPs). SEPs have been found to perform critical functions in several species, including humans. While significant efforts have been made to annotate SEPs, less attention has been given to the biophysical properties of these proteins. We characterized the distributions of predicted and curated biophysical properties, including sequence composition, structure, localization, function, and disease association of a conservative list of previously identified human SEPs. We found significant differences between SEPs and both larger proteins and control sets. Additionally, we provide an example of how our characterization of biophysical properties can contribute to distinguishing protein-coding smORFs from non-coding ones in otherwise ambiguous cases.
... COQ2's secondary structure, disordered regions and transmembrane helices were predicted using PSIPRED 4.0 [31], DISOPRED3 [32] and MEMSAT-SVM (Nugent and Jones 2009), respectively, all of which are available at the PSIPRED Workbench [33]. ...
Article
Full-text available
Coenzyme Q (CoQ) is a lipidic compound that is widely distributed in nature, with crucial functions in metabolism, protection against oxidative damage and ferroptosis and other processes. CoQ biosynthesis is a conserved and complex pathway involving several proteins. COQ2 is a member of the UbiA family of transmembrane prenyltransferases that catalyzes the condensation of the head and tail precursors of CoQ, which is a key step in the process, because its product is the first intermediate that will be modified in the head by the next components of the synthesis process. Mutations in this protein have been linked to primary CoQ deficiency in humans, a rare disease predominantly affecting organs with a high energy demand. The reaction catalyzed by COQ2 and its mechanism are still unknown. Here, we aimed at clarifying the COQ2 reaction by exploring possible substrate binding sites using a strategy based on homology, comprising the identification of available ligand-bound homologs with solved structures in the Protein Data Bank (PDB) and their subsequent structural superposition in the AlphaFold predicted model for COQ2. The results highlight some residues located on the central cavity or the matrix loops that may be involved in substrate interaction, some of which are mutated in primary CoQ deficiency patients. Furthermore, we analyze the structural modifications introduced by the pathogenic mutations found in humans. These findings shed new light on the understanding of COQ2’s function and, thus, CoQ’s biosynthesis and the pathogenicity of primary CoQ deficiency.
... Cladogram was created using EMBL Simple Phylogeny tool (Madeira et al., 2022). The secondary structure of KNL-1 in different nematode species was predicted using PsiPred (Jones, 1999). ...
Article
Full-text available
During mitosis, the Bub1–Bub3 complex concentrates at kinetochores, the microtubule-coupling interfaces on chromosomes, where it contributes to spindle checkpoint activation, kinetochore-spindle microtubule interactions, and protection of centromeric cohesion. Bub1 has a conserved N-terminal tetratricopeptide repeat (TPR) domain followed by a binding motif for its conserved interactor Bub3. The current model for Bub1–Bub3 localization to kinetochores is that Bub3, along with its bound motif from Bub1, recognizes phosphorylated “MELT” motifs in the kinetochore scaffold protein Knl1. Motivated by the greater phenotypic severity of BUB-1 versus BUB-3 loss in C. elegans, we show that the BUB-1 TPR domain directly recognizes a distinct class of phosphorylated motifs in KNL-1 and that this interaction is essential for BUB-1–BUB-3 localization and function. BUB-3 recognition of phospho-MELT motifs additively contributes to drive super-stoichiometric accumulation of BUB-1–BUB-3 on its KNL-1 scaffold during mitotic entry. Bub1’s TPR domain interacts with Knl1 in other species, suggesting that collaboration of TPR-dependent and Bub3-dependent interfaces in Bub1–Bub3 localization and functions may be conserved.
... The best model was selected based on the MolProbity score, which assesses the structural quality of the entire structure (43). The folding of the linker was assessed through secondary structure prediction tools using PSIPRED, v.3 (44), and PROF (45) which showed the absence of regions with secondary structures in the linker (data not shown). The incorporation of a lipid bilayer is crucial for understanding the extrusion mechanism of chemotherapeutics. ...
Article
Full-text available
Leishmaniasis is a neglected tropical disease infecting the world’s poorest populations. Miltefosine (ML) remains the primary oral drug against the cutaneous form of leishmaniasis. The ATP-binding cassette (ABC) transporters are key players in the xenobiotic efflux, and their inhibition could enhance the therapeutic index. In this study, the ability of beauvericin (BEA) to overcome ABC transporter-mediated resistance of Leishmania tropica to ML was assessed. In addition, the transcription profile of genes involved in resistance acquisition to ML was inspected. Finally, we explored the efflux mechanism of the drug and inhibitor. The efficacy of ML against all developmental stages of L. tropica in the presence or absence of BEA was evaluated using an absolute quantification assay. The expression of resistance genes was evaluated, comparing susceptible and resistant strains. Finally, the mechanisms governing the interaction between the ABC transporter and its ligands were elucidated using molecular docking and dynamic simulation. Relative quantification showed that the expression of the ABCG sub-family is mostly modulated by ML. In this study, we used BEA to impede resistance of Leishmania tropica. The IC50 values, following BEA treatment, were significantly reduced from 30.83, 48.17, and 16.83 µM using ML to 8.14, 11.1, and 7.18 µM when using a combinatorial treatment (ML + BEA) against promastigotes, axenic amastigotes, and intracellular amastigotes, respectively. We also demonstrated a favorable BEA-binding enthalpy to L. tropica ABC transporter compared to ML. Our study revealed that BEA partially reverses the resistance development of L. tropica to ML by blocking the alternate ATP hydrolysis cycle.
... However, this method has shown limited accuracy in prediction. Another commonly used technique involves utilizing PSSM profile features [8] or HHM profile features [9]. These profile features incorporate information derived from analyzing sequence alignments obtained from a large protein sequence database. ...
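As a rough illustration of how a profile can be derived from aligned sequences, the sketch below builds a toy position-specific scoring matrix from a small alignment. It is a simplified example and not the procedure used by PSI-BLAST or by the cited work: the uniform background frequency, the flat pseudocount, and the function name `pssm_from_alignment` are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def pssm_from_alignment(aligned_seqs, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score.

    Assumptions (illustrative only): uniform background frequencies,
    a flat pseudocount, and no sequence weighting.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)
    matrix = []
    for col in range(length):
        # Count residues in this column; gaps and unknown symbols are ignored.
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


toy_alignment = ["ACD", "ACE", "AGD"]
pssm = pssm_from_alignment(toy_alignment)
```

Each column of the result is a 20-value vector, so a window of neighbouring columns can be concatenated into a fixed-length input for a classifier.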
Article
In bioinformatics, protein secondary structure prediction plays a significant role in understanding protein function and interactions. This study presents the TE_SS approach, which uses a transformer encoder-based model and the Ankh protein language model to predict protein secondary structures. The research focuses on the prediction of nine classes of structures, according to the Dictionary of Secondary Structure of Proteins (DSSP) version 4. The model's performance was rigorously evaluated using various datasets. Additionally, this study compares the model with the state-of-the-art methods in the prediction of eight structure classes. The findings reveal that TE_SS excels in nine- and three-class structure predictions while also showing remarkable proficiency in the eight-class category. This is underscored by its performance in Qs and SOV evaluation metrics, demonstrating its capability to discern complex protein sequence patterns. This advancement provides a significant tool for protein structure analysis, thereby enriching the field of bioinformatics.
... The AAindex database [31] comprises an assortment of numerical indices that encapsulate a variety of physicochemical and biochemical attributes of amino acids as well as amino acid pairs. The One-of-K encoding scheme [32] is a fundamental yet crucial method for feature extraction which represents each of the 20 standard amino acids as a 20-length vector, where a single position is assigned the value 1 to indicate presence, and the remainder are set to zero. BLOSUM62 [33] is an extensively utilized scoring matrix for amino acid substitutions, which is constructed from empirical comparisons of aligned blocks of conserved protein sequences. ...
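The One-of-K scheme described in this excerpt can be sketched in a few lines. This is a minimal illustration only: the alphabetical residue ordering and the function names `one_hot` and `encode_sequence` are assumptions, not taken from the cited work.

```python
# Assumed ordering: the 20 standard residues sorted by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """Return a 20-length vector with a single 1 at the residue's position."""
    vec = [0] * len(AMINO_ACIDS)
    vec[INDEX[residue]] = 1  # raises KeyError for non-standard residues
    return vec


def encode_sequence(seq):
    """Encode a protein sequence as a list of one-hot vectors."""
    return [one_hot(r) for r in seq.upper()]


encoded = encode_sequence("ACD")
```

A sequence of length n becomes an n x 20 matrix of zeros and ones; how non-standard residues such as X or B should be handled is left open here.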
Article
COVID-19, caused by the highly contagious SARS-CoV-2 virus, is distinguished by its positive-sense, single-stranded RNA genome. A thorough understanding of SARS-CoV-2 pathogenesis is crucial for halting its proliferation. Notably, the 3C-like protease of the coronavirus (denoted as $3CL^{pro}$) is instrumental in the viral replication process. Precise delineation of $3CL^{pro}$ cleavage sites is imperative for elucidating the transmission dynamics of SARS-CoV-2. While machine learning tools have been deployed to identify potential $3CL^{pro}$ cleavage sites, these existing methods often fall short in terms of accuracy. To improve the performance of these predictions, we propose a novel analytical framework, the Transformer and Deep Forest Fusion Model (TDFFM). Within TDFFM, we utilize the AAindex and the BLOSUM62 matrix to encode protein sequences. These encoded features are subsequently input into two distinct components: a Deep Forest, which is an effective decision tree ensemble methodology, and a Transformer equipped with a Multi-Level Attention Model (TMLAM). The integration of the attention mechanism allows our model to more accurately identify positive samples, thus enhancing the overall predictive performance. Evaluation on a test set demonstrates that our TDFFM achieves an accuracy of 0.955, an AUC of 0.980, and an F1-score of 0.367, substantiating the model's superior prediction capabilities.
... The resulting diagrams of protein domains were visualized using TBtools v1.108 [85], a biosequence structure illustrator. The protein secondary structure was visualized using the PSIPRED program [86]. AlphaFold prediction was performed by running the AlphaFold2 notebook on Google Colaboratory cloud computing facilities with default parameters. ...
Article
Background Metazoan adenosine-to-inosine (A-to-I) RNA editing resembles A-to-G mutation and increases proteomic diversity in a temporal-spatial manner, allowing organisms to adapt to changing environments. The RNA editomes in many major animal clades remain unexplored, hampering the understanding of the evolution and adaptation of this essential post-transcriptional modification. Methods We assembled the chromosome-level genome of Coridius chinensis, which belongs to Hemiptera, the fifth largest insect order, where RNA editing has not yet been studied. We generated ten head RNA-Seq libraries with DNA-Seq from the matched individuals. Results We identified thousands of high-confidence RNA editing sites in C. chinensis. Overrepresentation of nonsynonymous editing was observed, but conserved recoding across different orders was very rare. Under cold stress, the global editing efficiency was down-regulated and the general transcriptional processes were shut down. Nevertheless, we found an interesting site with “conserved editing but non-conserved recoding” in potassium channel Shab which was significantly up-regulated in cold, serving as a candidate functional site in response to temperature stress. Conclusions RNA editing in C. chinensis largely recodes the proteome. The first RNA editome in Hemiptera indicates independent origin of beneficial recoding during insect evolution, which advances our understanding of the evolution, conservation, and adaptation of RNA editing.
... In addition to sequence-based features, structural information can provide valuable insights into the function and activity of peptides. Several techniques have been employed to predict and incorporate structural features, as detailed below. Secondary structure prediction: Various computational methods, such as PSIPRED (52) and SPIDER3 (44), have been used to predict the secondary structure elements (e.g., α-helices, β-sheets) of peptides, which can be used as input features for machine learning models. ...
Preprint
Prediction, design and discovery of antimicrobial peptides against AMR. https://ssrn.com/abstract=4740758
... A study found that the side chains of these two types of amino acids tend to be more adjacent to other proteins [42]. These results supported the conclusion that hydrophilic amino acids on the rim segment of the interaction interface are essential for PPI and might enhance the interaction efficiency [43][44][45]. However, the interface core residues are more conserved and less mobile than edge residues. ...
Article
Perilipin-2 (PLIN2) can anchor to lipid droplets (LDs) and plays a crucial role in regulating nascent LD formation. Bimolecular fluorescence complementation (BiFC) and flow cytometry were used to verify the PLIN2-CGI-58 interaction efficiency in bovine adipocytes. A GST pulldown assay was used to examine the function of the key site arginine315 in the PLIN2-CGI-58 interaction. Experiments were also performed to investigate the function of these PLIN2 mutations in LD formation during adipocyte differentiation: LDs were measured after staining with BODIPY, and lipogenesis-related genes were also examined. Results showed that leucine (L371A, L311A) and glycine (G369A, G376A) mutations reduced interaction efficiencies. The serine (S367A) mutation enhanced the interaction efficiency. The arginine (R315A) mutation resulted in loss of fluorescence in the cytoplasm and disrupted the interaction with CGI-58, as verified by pulldown assay. The R315W mutation resulted in a significant increase in the number of LDs compared with wild-type (WT) PLIN2 or the R315A mutation. Lipogenesis-related genes were either up- or downregulated when mutated PLIN2 interacted with CGI-58. Arginine315 in PLIN2 is required for the PLIN2-CGI-58 interface and could regulate nascent LD formation and lipogenesis. This study is the first to examine amino acids on the PLIN2 interface during interaction with CGI-58 in bovine and highlights the role played by PLIN2 in the regulation of bovine adipocyte lipogenesis.
... The secondary structure properties of Sal k 1 were predicted using PSIPRED and NetSurfP ver. 1.1 [27,28]. ...
Article
Allergens originating from Salsola kali (Russian thistle) pollen grains are one of the most important sources of aeroallergens causing pollinosis in desert and semi-desert regions. T-cell epitope-based vaccines (TEV) are among the more effective of the therapeutic approaches developed to alleviate allergic diseases. The physicochemical properties and the B and T cell epitopes of Sal k 1 (a major allergen of S. kali) were predicted using immunoinformatic tools. A TEV was constructed using the linkers EAAAK, GPGPG and the most suitable CD4⁺ T cell epitopes. RS04 adjuvant was added as a TLR4 agonist to the amino (N) and carboxyl (C) terminus of the TEV protein. The secondary and tertiary structures, solubility, allergenicity, toxicity, stability, physicochemical properties, docking with immune receptors, BLASTp against the human and microbiota proteomes, and in silico cloning of the designed TEV were assessed using immunoinformatic analyses. Two CD4⁺ T cell epitopes of Sal k 1 that had high affinity for different alleles of MHC-II were selected and used in the TEV. The molecular docking of the TEV with HLA-DRB1 and TLR4 showed strong interactions and a stable binding pose of the TEV with these receptors. Moreover, the codon-optimized TEV sequence was cloned between the NcoI and XhoI restriction sites of the pET-28a(+) expression plasmid. The designed TEV can be used as a promising candidate in allergen-specific immunotherapy against S. kali. Nonetheless, the effectiveness of this vaccine should be validated through immunological bioassays.
... Cladogram was created using EMBL Simple Phylogeny tool (Madeira et al., 2022). Secondary structure of KNL-1 in different nematode species was predicted using PsiPred (Jones, 1999). ...
Preprint
During mitosis, the Bub1-Bub3 complex concentrates at kinetochores, the microtubule-coupling interfaces on chromosomes, where it contributes to spindle checkpoint activation, kinetochore-spindle microtubule interactions, and protection of centromeric cohesion. Bub1 has a conserved N-terminal tetratricopeptide (TPR) domain followed by a binding motif for its conserved interactor Bub3. The current model for Bub1-Bub3 localization to kinetochores is that Bub3, along with its bound motif from Bub1, recognizes phosphorylated “MELT” motifs in the kinetochore scaffold protein Knl1. Motivated by the greater phenotypic severity of BUB-1 versus BUB-3 loss in C. elegans, we show that the BUB-1 TPR domain directly recognizes a distinct class of phosphorylated motifs in KNL-1 and that this interaction is essential for BUB-1–BUB-3 localization and function. BUB-3 recognition of phospho-MELT motifs additively contributes to drive super-stoichiometric accumulation of BUB-1–BUB-3 on its KNL-1 scaffold during mitotic entry. Bub1’s TPR domain interacts with Knl1 in other species, suggesting that collaboration of TPR-dependent and Bub3-dependent interfaces in Bub1-Bub3 localization and functions may be conserved.
... These methods involve extracting manually crafted features from protein sequences and structures (e.g. position-specific scoring matrix [24] and peptide backbone torsion angles [10]), which are further fed to machine learning models (e.g. support vector machine [25] and random forest [26]) to carry out DNA-binding site prediction, including classical examples such as DNAPred [13], TargetDNA [27], MetaDBSite [28] and TargetS [29]. ...
Article
Efficient and accurate recognition of protein–DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein–DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.
... The final construct is subjected to multi-tier analysis to examine the vaccine-like characteristics. For instance, the physicochemical properties, including instability index, molecular weight, pI, half-life, and secondary structure distribution, of the vaccine construct were assessed using the Expasy ProtParam and PSIPRED online tools [34,35]. In addition, the immunogenic and allergenic properties of the vaccine construct were examined. ...
Article
Introduction: Targeting the tumor microenvironment is beneficial and presents an ideal setting for the development of futuristic immunotherapy. Here, we make use of Nuclear prelamin A recognition factor (NARF), a protein linked to the coactivation of transcriptional regulators in human breast cancer stem cells (CSC), in our investigation. Methods: In this study, we initially computed the epitope regions possessing the ability to stimulate both T and B cells within the NARF protein. These identified epitope areas were fused with adjuvants such as RpfB and RpfE as well as linkers like AAY, GPGPG, KK, and EAAAK. The constructed vaccine was further characterized by assessing its physicochemical properties and population coverage. The potential interactions of the designed vaccine with different toll-like receptors were examined by a sequence of computational studies. Of note, docking studies were employed to understand its mechanism of action. Molecular dynamics and immune simulation studies were conducted to better understand its structural stability and immune responses. The resultant vaccine was back-translated, codon-optimised and introduced into the pET-28(+) vector. Results and discussion: We hypothesize from the results that the designed NARF protein-based vaccine in our analysis could effectively provoke immune responses in the target organism through TLR-7 binding and promote MHC class-II mediated antigen presentation. Indeed, comprehensive evaluations conducted in both in vitro and in vivo settings are imperative to substantiate the safety and efficacy of the developed vaccine.
Article
Deep learning approaches have spurred substantial advances in the single-state prediction of biomolecular structures. The function of biomolecules is, however, dependent on the range of conformations they can assume. This is especially true for peptides, a highly flexible class of molecules that are involved in numerous biological processes and are of high interest as therapeutics. Here we introduce PepFlow, a transferable generative model that enables direct all-atom sampling from the allowable conformational space of input peptides. We train the model in a diffusion framework and subsequently use an equivalent flow to perform conformational sampling. To overcome the prohibitive cost of generalized all-atom modelling, we modularize the generation process and integrate a hypernetwork to predict sequence-specific network parameters. PepFlow accurately predicts peptide structures and effectively recapitulates experimental peptide ensembles at a fraction of the running time of traditional approaches. PepFlow can also be used to sample conformations that satisfy constraints such as macrocyclization.
Article
Adenosine-to-inosine (A-to-I) RNA editing recodes the genome and confers flexibility for the organisms to adapt to the environment. It is believed that RNA recoding sites are well suited for facilitating adaptive evolution by increasing the proteomic diversity in a temporal-spatial manner. The function and essentiality of a few conserved recoding sites are recognized. However, the experimentally discovered functional sites only make up a small corner of the total sites, and there is still the need to expand the repertoire of such functional sites with bioinformatic approaches. In this study, we define a new category of RNA editing sites termed 'conserved editing with non-conserved recoding' and systematically identify such sites in Drosophila editomes, figuring out their selection pressure and signals of adaptation at inter-species and intra-species levels. Surprisingly, conserved editing sites with non-conserved recoding are not suppressed and are even slightly overrepresented in Drosophila. DNA mutations leading to such cases are also favoured during evolution, suggesting that the function of those recoding events in different species might be diverged, specialized, and maintained. Finally, structural prediction suggests that such recoding in potassium channel Shab might increase ion permeability and compensate the effect of low temperature. In conclusion, conserved editing with non-conserved recoding might be functional as well. Our study provides novel aspects in considering the adaptive evolution of RNA editing sites and meanwhile expands the candidates of functional recoding sites for future validation.
Preprint
Mucormycosis is an invasive fungal infection with considerably high mortality rates in immunocompromised individuals. Due to the COVID-19 pandemic, the disease has resurfaced recently, and the lack of appropriate antifungals has resulted in poor outcomes in patients. The iron uptake mechanism in Rhizopus delemar, the predominant causal agent, is crucial for its survival and pathogenesis in the human host. The current study focuses on the structural dynamics of high affinity iron permease (Ftr1), which acts as a virulence factor in this fatal fungal disease. Ftr1 is a transmembrane protein which is responsible for the transport of Fe ion from the extracellular milieu to the cytoplasm under iron-starving conditions in Rhizopus. In this work, the three-dimensional modelling of Ftr1 was carried out and it was found to possess seven transmembrane helices with the N-terminal lying in the extracellular region and the C-terminal in the intracellular region. Moreover, the present study delineates the interaction of glutamic acid residues, found in the REGLE motif of the fourth transmembrane helix, with Fe. The molecular dynamics (MD) simulation study revealed that the glycine present in the motif destabilizes the helix, thereby bringing E157 closer to the positively charged ion. Understanding the interaction between Fe ion and Ftr1 would be helpful in designing effective small molecule drugs against this novel therapeutic target for treating mucormycosis.
Article
Background: Monkeypox has emerged as a noteworthy worldwide issue due to its daily escalating case count. This illness presents diverse symptoms, including skin manifestations, which have the potential to spread through contact. The transmission of this infectious agent is intricate, and it readily transfers between individuals. Methods: The hypothetical protein MPXV-SI-2022V502225_00135 strain of monkeypox underwent structural and functional analysis using NCBI-CD Search, Pfam, and InterProScan. Quality assessment utilized PROCHECK, QMEAN, Verify3D, and ERRAT, followed by protein-ligand docking, visualization, and a 100-nanosecond simulation on Schrodinger Maestro. Results: Different physicochemical properties were estimated, indicating a stable molecular weight (49147.14) and theoretical pI (5.62), with functional annotation tools predicting the target protein to contain the Chordopox_A20R domain. In secondary structure analysis, the helix coil was found to be predominant. The three-dimensional (3D) structure of the protein was obtained using a template protein (PDB ID: 6zyc.1), which became more stable after YASARA energy minimization and was validated by quality assessment tools like PROCHECK, QMEAN, Verify3D, and ERRAT. Protein-ligand docking was conducted using PyRx 9.0 software to examine the binding and interactions between a ligand and a hypothetical protein, focusing on various amino acids. The model structure, active site, and binding site were visualized using the CASTp server, FTsite, and PyMOL. A 100-nanosecond simulation was performed with ligand CID_16124688 to evaluate the efficiency of this protein. Conclusion: The analysis revealed significant binding interactions and enhanced stability, aiding in drug or vaccine design for effective antiviral treatment and patient management.
Article
BACKGROUND Human induced pluripotent stem cell (hiPSC) technology is a valuable tool for generating patient-specific stem cells, facilitating disease modeling, and investigating disease mechanisms. However, iPSCs carrying specific mutations may limit their clinical applications due to certain inherent characteristics. AIM To investigate the impact of MERTK mutations on hiPSCs and determine whether hiPSC-derived extracellular vesicles (EVs) influence anomalous cell junction and differentiation potential. METHODS We employed a non-integrating reprogramming technique to generate peripheral blood-derived hiPSCs with and hiPSCs without a MERTK mutation. Chromosomal karyotype analysis, flow cytometry, and immunofluorescent staining were utilized for hiPSC identification. Transcriptomics and proteomics were employed to elucidate the expression patterns associated with cell junction abnormalities and cellular differentiation potential. Additionally, EVs were isolated from the supernatant, and their RNA and protein cargos were examined to investigate the involvement of hiPSC-derived EVs in stem cell junction and differentiation. RESULTS The generated hiPSCs, both with and without a MERTK mutation, exhibited normal karyotype and expressed pluripotency markers; however, hiPSCs with a MERTK mutation demonstrated anomalous adhesion capability and differentiation potential, as confirmed by transcriptomic and proteomic profiling. Furthermore, hiPSC-derived EVs were involved in various biological processes, including cell junction and differentiation. CONCLUSION HiPSCs with a MERTK mutation displayed altered junction characteristics and aberrant differentiation potential. Furthermore, hiPSC-derived EVs played a regulatory role in various biological processes, including cell junction and differentiation.
Article
The PSIPRED Workbench is a long-established and popular bioinformatics web service offering a wide range of machine learning based analyses for characterizing protein structure and function. In this paper we provide an update of the recent additions and developments to the webserver, with a focus on new Deep Learning based methods. We briefly discuss some trends in server usage since the publication of AlphaFold2 and we give an overview of some upcoming developments for the service. The PSIPRED Workbench is available at http://bioinf.cs.ucl.ac.uk/psipred.
Preprint
Translation is regulated mainly in the initiation step, and its dysregulation is implicated in many human diseases. Several proteins have been found to regulate translational initiation, including Pdcd4 (programmed cell death gene 4). Pdcd4 is a tumor suppressor protein that prevents cell growth, invasion, and metastasis. It is downregulated in most tumor cells, while global translation in the cell is upregulated. To understand the mechanisms underlying translational control by Pdcd4, we used single-particle cryo-electron microscopy to determine the structure of human Pdcd4 bound to 40S small ribosomal subunit, including Pdcd4-40S and Pdcd4-40S-eIF4A-eIF3-eIF1 complexes. The structures reveal the binding site of Pdcd4 at the mRNA entry site in the 40S, where the C-terminal domain (CTD) interacts with eIF4A at the mRNA entry site, while the N-terminal domain (NTD) is inserted into the mRNA channel and decoding site. The structures, together with quantitative binding and in vitro translation assays, shed light on the critical role of the NTD for the recruitment of Pdcd4 to the ribosomal complex and suggest a model whereby Pdcd4 blocks the eIF4F-independent role of eIF4A during recruitment and scanning of the 5' UTR of mRNA.
Chapter
Computational biology has changed the way healthcare systems and biomedical engineering work. Nature-inspired intelligent computing (NIIC) approaches to predict potential biomarkers and drug targets could be an amazing bridge between biology/nature and today's advanced and sophisticated fields such as artificial intelligence, deep learning, computational vision, and others. The analysis of disease biomarkers is an emerging area of interest. Several molecular assessments have been developed to identify biomarkers that respond to specific therapies. Recognizing these molecules and understanding their molecular mechanisms is critical for disease prognosis and late-stage therapeutics development. Breakthroughs in genomics and transcriptional analysis have greatly expanded our understanding of poorly understood genomic matter or dark matter. The systematic identification of disease-associated lncRNAs has advanced our understanding of the underlying molecular mechanisms of complex diseases, but it has also been shown to have an inherent advantage over protein-coding genes in the diagnosis, prognosis, and treatment of disease. Given the lower efficiency and increased time and cost burden of biological experiments, computational inference of disease-associated RNAs using nature-inspired smart computational methods has emerged as a promising approach to accelerate the study of lncRNA functions and complement the experimental analytical value. In this chapter, we have discussed the basics of NIIC techniques, their role in diagnosing various diseases, and their future role in the healthcare industry.
Preprint
Chromatin architecture in the cells of animals and fungi influences gene expression. The molecular factors that influence higher genome architecture in plants and their effects on gene expression remain unknown. Cohesin complexes, conserved in eukaryotes, are essential factors in genome structuring. Here, we investigated the relevance of the plant-specific somatic cohesin subunit SYN4 for chromatin organisation in Arabidopsis thaliana. Plants mutated in SYN4 were studied using HRM, Hi-C, RNA sequencing, untargeted and targeted metabolomics and physiological assays to understand the role of this plant-specific cohesin. We show that syn4 mutants exhibit altered intra- and interchromosomal interactions, expressed as sharply reduced contacts between telomeres and chromosome arms but not between centromeres, and differences in the placement and number of topologically associated domain (TAD)-like structures. By transcriptome sequencing, we also show that syn4 mutants have altered gene expression, including numerous genes that control abiotic stress responses. The response to drought stress in Arabidopsis is strongly influenced by the genome structure in syn4 mutants, potentially due to altered expression of CYP707A3, an ABA 8’-hydroxylase. In brief: The 3D architecture of the genome extensively influences gene expression in animals and fungi. We show that the plant-specific cohesin subunit SYN4 affects intra- and interchromosomal interactions including telomere clustering, with consequences for the expression of genes of transient and induced biological pathways and the biosynthesis of bioactive compounds. Highly condensed genome structures at the CYP707A3 locus positively affect the stress response to water deprivation regulated by abscisic acid.
Article
TonB dependent transporters (TBDTs) mediate energised transport of essential nutrients into Gram-negative bacteria. TBDTs are increasingly being exploited for the delivery of antibiotics into drug resistant bacteria. While much is known about ground state complexes of TBDTs few details have emerged about the transport process itself. Here, we exploit bacteriocin parasitization of a TBDT to probe the mechanics of transport. Previous work has shown that the N-terminal domain of Pseudomonas aeruginosa-specific bacteriocin pyocin S2 (PyoS2NTD) is imported through the pyoverdine receptor FpvAI. PyoS2NTD transport follows opening of a proton-motive force (PMF)-dependent pore through FpvAI and delivery of its own TonB box which engages TonB. We use molecular models and simulations to formulate a complete translocation pathway for PyoS2NTD which we validate using protein engineering and cytotoxicity measurements. We show that following partial removal of the FpvAI plug domain which occludes the channel, the pyocin’s N-terminus enters the channel by electrostatic steering and ratchets to the periplasm. Application of force, mimicking that exerted by TonB, leads to unravelling of PyoS2NTD as it squeezes through the channel. Remarkably, while some parts of PyoS2NTD must unfold, complete unfolding is not required for transport, a result we confirmed by disulphide bond engineering. Moreover, the section of the FpvAI plug that remains embedded in the channel appears to serve as a buttress against which PyoS2NTD is pushed to destabilize the domain. Our study reveals the limits of structural deformation that accompanies import through a TBDT and the role the TBDT itself plays in accommodating transport.
Chapter
The prediction of protein secondary structure from a protein sequence provides useful information for predicting the three-dimensional structure and function of the protein. In recent decades, protein secondary structure prediction systems have been improved benefiting from the advances in computational techniques as well as the growth and increased availability of solved protein structures in protein data banks. Existing methods for predicting the secondary structure of proteins can be roughly subdivided into statistical, nearest-neighbor, machine learning, meta-predictors, and deep learning approaches. This chapter provides an overview of these computational approaches to predict the secondary structure of proteins, focusing on deep learning techniques, with highlights on key aspects in each approach.
Article
The extent to which prophage proteins interact with eukaryotic macromolecules is largely unknown. In this work, we show that cytoplasmic incompatibility factor A (CifA) and B (CifB) proteins, encoded by prophage WO of the endosymbiont Wolbachia, alter long noncoding RNA (lncRNA) and DNA during Drosophila sperm development to establish a paternal-effect embryonic lethality known as cytoplasmic incompatibility (CI). CifA is a ribonuclease (RNase) that depletes a spermatocyte lncRNA important for the histone-to-protamine transition of spermiogenesis. Both CifA and CifB are deoxyribonucleases (DNases) that elevate DNA damage in late spermiogenesis. lncRNA knockdown enhances CI, and mutagenesis links lncRNA depletion and subsequent sperm chromatin integrity changes to embryonic DNA damage and CI. Hence, prophage proteins interact with eukaryotic macromolecules during gametogenesis to create a symbiosis that is fundamental to insect evolution and vector control.
Article
The absence of robust interspecific isolation barriers among pantherines, including the iconic South American jaguar (Panthera onca), led us to study molecular evolution of typically rapidly evolving reproductive proteins within this subfamily and related groups. In this study, we delved into the evolutionary forces acting on the zona pellucida (ZP) gamete interaction protein family and the sperm‐oocyte fusion protein pair IZUMO1‐JUNO across the Carnivora order, distinguishing between Caniformia and Feliformia suborders and anticipating few significant diversifying changes in the Pantherinae subfamily. A chromosome‐resolved jaguar genome assembly facilitated coding sequences, enabling the reconstruction of protein evolutionary histories. Examining sequence variability across more than 30 Carnivora species revealed that Feliformia exhibited significantly lower diversity compared to its sister taxon, Caniformia. Molecular evolution analyses of ZP2 and ZP3, subunits directly involved in sperm‐recognition, unveiled diversifying positive selection in Feliformia, Caniformia and Pantherinae, although no significant changes were linked to sperm binding. Structural cross‐linking ZP subunits, ZP4 and ZP1, exhibited lower levels or complete absence of positive selection. Notably, the fusion protein IZUMO1 displayed prominent positive selection signatures and sites in basal lineages of both Caniformia and Feliformia, extending along the Caniformia subtree but absent in Pantherinae. Conversely, JUNO did not exhibit any positive selection signatures across tested lineages and clades. Eight Caniformia‐specific positive selected sites in IZUMO1 were detected within two JUNO‐interaction clusters. Our findings provide for the first time insights into the evolutionary trajectories of ZP proteins and the IZUMO1‐JUNO gamete interaction pair within the Carnivora order.
Article
(1) Co-operation between a laboratory interested in developing the theory for protein secondary structure prediction methods and a laboratory interested in applying and comparing such methods has led to the development of a simple predictive algorithm.
(2) Four-state predictions, in which each residue is unambiguously assigned one conformational state of α-helix, extended chain, reverse turn or coil, predict 49% of residue states correctly (in a sample of 26 proteins) when the overall helix and extended-chain content is not taken into account.
(3) When the relative abundances of helix, extended chain, reverse turn and coil observed by X-ray crystallography are taken into account, a single constant for each protein and type of conformation can be used to bias the prediction. When predictions are optimized in this way, 63% of all residue states are unambiguously and correctly assigned.
(4) By analysing the nature of the bias required, proteins can be classified into helix-rich types, pleated-sheet-rich types, and so on. It is shown that, if the type of protein can be determined even approximately by circular dichroism, 57% of residue states can be correctly predicted without taking into account the X-ray structure. Further, comparable predictions can be obtained if, instead of circular dichroism, preliminary predictions are made to assess the protein type.
(5) It is emphasized that the numbers quoted here depend on the method used to assess accuracy, and the algorithm is shown to be at least as good as, and usually superior to, the reported prediction methods assessed in the same way.
(6) Ways of further enhancing predictions by the use of additional information from hydrophobic triplets and homologous sequences are also explored. Hydrophobic triplet information does not significantly improve predictive power and it is concluded that this information is used by proteins in the next stage of folding. On the other hand, the use of homologous sequences appears to be very promising.
(7) The implication of these results in protein folding is discussed.
Article
The inverse protein folding problem, the problem of finding which amino acid sequences fold into a known three-dimensional (3D) structure, can be effectively attacked by finding sequences that are most compatible with the environments of the residues in the 3D structure. The environments are described by: (i) the area of the residue buried in the protein and inaccessible to solvent; (ii) the fraction of side-chain area that is covered by polar atoms (O and N); and (iii) the local secondary structure. Examples of this 3D profile method are presented for four families of proteins: the globins, cyclic AMP (adenosine 3',5'-monophosphate) receptor-like proteins, the periplasmic binding proteins, and the actins. This method is able to detect the structural similarity of the actins and 70-kilodalton heat shock proteins, even though these protein families share no detectable sequence similarity.
Article
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
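To make the idea of a position-specific score matrix concrete, the following is a minimal sketch rather than the PSI-BLAST procedure itself. It assumes an ungapped toy alignment, a uniform background frequency, and a flat pseudocount; PSI-BLAST's sequence weighting and data-dependent pseudocounts are not reproduced. The names `build_pssm` and `score_segment` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Assumptions for this sketch: equal-length ungapped sequences,
    uniform background (1/20), flat pseudocount per residue type.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment aligned to the matrix start."""
    return sum(pssm[i][aa] for i, aa in enumerate(segment))


# Hypothetical toy alignment of four related sequences.
alignment = ["MKVL", "MKIL", "MRVL", "MKVM"]
pssm = build_pssm(alignment)
```

Residues conserved in a column (M at the first position here) receive positive scores and unobserved residues receive negative ones, so `score_segment(pssm, "MKVL")` is higher than the score of an unrelated segment.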
Article
A new dataset of 396 protein domains is developed and used to evaluate the performance of the protein secondary structure prediction algorithms DSC, PHD, NNSSP, and PREDATOR. The maximum theoretical Q3 accuracy for combination of these methods is shown to be 78%. A simple consensus prediction on the 396 domains, with automatically generated multiple sequence alignments gives an average Q3 prediction accuracy of 72.9%. This is a 1% improvement over PHD, which was the best single method evaluated. Segment Overlap Accuracy (SOV) is 75.4% for the consensus method on the 396-protein set. The secondary structure definition method DSSP defines 8 states, but these are reduced by most authors to 3 for prediction. Application of the different published 8- to 3-state reduction methods shows variation of over 3% on apparent prediction accuracy. This suggests that care should be taken to compare methods by the same reduction method. Two new sequence datasets (CB513 and CB251) are derived which are suitable for cross-validation of secondary structure prediction methods without artifacts due to internal homology. A fully automatic World Wide Web service that predicts protein secondary structure by a combination of methods is available via http://barton.ebi.ac.uk/. Proteins 1999;34:508–519. © 1999 Wiley-Liss, Inc.
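The effect of the 8- to 3-state reduction on apparent Q3 can be illustrated with a small sketch. The two mappings below are illustrative assumptions, not necessarily the reductions compared in the paper, and the DSSP string and prediction are made-up data; `reduce_states` and `q3` are hypothetical helper names.

```python
# Two illustrative 8-to-3 reductions of DSSP codes (assumed for this sketch).
# Any DSSP code not listed maps to coil ("C").
REDUCTION_A = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
REDUCTION_B = {"H": "H", "E": "E"}


def reduce_states(dssp, mapping):
    """Collapse an 8-state DSSP string into the three states H, E, C."""
    return "".join(mapping.get(state, "C") for state in dssp)


def q3(observed, predicted):
    """Percentage of residues whose three-state assignment matches."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


# Hypothetical 16-residue example.
dssp = "HHHHGGGTTEEEEBSS"
predicted = "HHHHHHHCCEEEECCC"

q3_a = q3(reduce_states(dssp, REDUCTION_A), predicted)
q3_b = q3(reduce_states(dssp, REDUCTION_B), predicted)
```

The same prediction scores differently under the two reductions, which is why methods should be compared using the same reduction.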
Article
In this study we present an accurate secondary structure prediction procedure by using a query and related sequences. The most novel aspect of our approach is its reliance on local pairwise alignment of the sequence to be predicted with each related sequence rather than utilization of a multiple alignment. The residue-by-residue accuracy of the method is 75% in three structural states after jack-knife tests. The gain in prediction accuracy compared with the existing techniques, which are at best 72%, is achieved by secondary structure propensities based on both local and long-range effects, utilization of similar sequence information in the form of carefully selected pairwise alignment fragments, and reliance on a large collection of known protein primary structures. The method is especially appropriate for large-scale sequence analysis efforts such as genome characterization, where precise and significant multiple sequence alignments are not available or achievable. Proteins 27:329–335, 1997. © 1997 Wiley-Liss, Inc.
Article
The helix, s Applequist, 1963) in which the Zimm-Bragg parameters σ and s are defined respectively as the cooperativity factor for helix initiation, and the equilibrium constant for converting a coil residue to a helical ...
Article
We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.
Article
For a successful analysis of the relation between amino acid sequence and protein structure, an unambiguous and physically meaningful definition of secondary structure is essential. We have developed a set of simple and physically motivated criteria for secondary structure, programmed as a pattern-recognition process of hydrogen-bonded and geometrical features extracted from x-ray coordinates. Cooperative secondary structure is recognized as repeats of the elementary hydrogen-bonding patterns “turn” and “bridge.” Repeating turns are “helices,” repeating bridges are “ladders,” connected ladders are “sheets.” Geometric structure is defined in terms of the concepts torsion and curvature of differential geometry. Local chain “chirality” is the torsional handedness of four consecutive Cα positions and is positive for right-handed helices and negative for ideal twisted β-sheets. Curved pieces are defined as “bends.” Solvent “exposure” is given as the number of water molecules in possible contact with a residue. The end result is a compilation of the primary structure, including SS bonds, secondary structure, and solvent exposure of 62 different globular proteins. The presentation is in linear form: strip graphs for an overall view and strip tables for the details of each of 10,925 residues. The dictionary is also available in computer-readable form for protein structure prediction work.
Article
Protein sequences contain surprisingly many local regions of low compositional complexity. These include different types of residue clusters, some of which contain homopolymers, short period repeats or aperiodic mosaics of a few residue types. Several different formal definitions of local complexity and probability are presented here and are compared for their utility in algorithms for localization of such regions in amino acid sequences and sequence databases. The definitions are:—(1) those derived from enumeration a priori by a treatment analogous to statistical mechanics, (2) a log likelihood definition of complexity analogous to informational entropy, (3) multinomial probabilities of observed compositions, (4) an approximation resembling the χ2 statistic and (5) a modification of the coefficient of divergence. These measures, together with a method based on similarity scores of self-aligned sequences at different offsets, are shown to be broadly similar for first-pass, approximate localization of low-complexity regions in protein sequences, but they give significantly different results when applied in optimal segmentation algorithms. These comparisons underpin the choice of robust optimization heuristics in an algorithm, SEG, designed to segment amino acid sequences fully automatically into subsequences of contrasting complexity. After the abundant low-complexity segments have been partitioned from the Swissprot database, the remaining high-complexity sequence set is adequately approximated by a first-order random model.
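As a rough illustration of the entropy-style complexity definition, the sketch below flags sliding windows whose Shannon entropy falls at or below a cutoff. It is not the SEG algorithm: the two-stage trigger and extension and the optimal segmentation step are omitted, and the window length and threshold are parameters assumed for this example. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the segment's composition."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Start indices of windows with entropy at or below the threshold.

    window and threshold are assumed values for this sketch.
    """
    return [
        start
        for start in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[start:start + window]) <= threshold
    ]


# Hypothetical sequence with a poly-Q run starting at index 10.
sequence = "MKTAYIAKQR" + "Q" * 12 + "GLV"
flagged = low_complexity_windows(sequence)
```

A homopolymer window has entropy 0 and is always flagged, while a window of mostly distinct residues approaches the maximum of log2(20) bits and is left alone.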
Article
Recently a new method called the self-optimized prediction method (SOPM) has been described to improve the success rate in the prediction of the secondary structure of proteins. In this paper we report improvements brought about by predicting all the sequences of a set of aligned proteins belonging to the same family. This improved SOPM method (SOPMA) correctly predicts 69.5% of amino acids for a three-state description of the secondary structure (α-helix, β-sheet and coil) in a whole database containing 126 chains of non-homologous (less than 25% identity) proteins. Joint prediction with SOPMA and a neural networks method (PHD) correctly predicts 82.2% of residues for 74% of co-predicted amino acids. Predictions are available by email to deleage@ibcp.fr or on a Web page (http://www.ibcp.fr/predict.html).
Article
The prediction of protein tertiary structure from sequence using molecular energy calculations has not yet been successful; an alternative strategy of recognizing known motifs or folds in sequences looks more promising. We present here a new approach to fold recognition, whereby sequences are fitted directly onto the backbone coordinates of known protein structures. Our method for protein fold recognition involves automatic modelling of protein structures using a given sequence, and is based on the frameworks of known protein folds. The plausibility of each model, and hence the degree of compatibility between the sequence and the proposed structure, is evaluated by means of a set of empirical potentials derived from proteins of known structure. The novel aspect of our approach is that the matching of sequences to backbone coordinates is performed in full three-dimensional space, incorporating specific pair interactions explicitly.
Article
The secondary structure and elements of tertiary structure have been predicted for the catalytic domain of protein kinases using a method that extracts structural information from the patterns of conservation and variation in an alignment of homologous proteins. The central features of this structural prediction are: (a) the catalytic domains of protein kinases do not incorporate a Rossmann fold; (b) the core of the structure is founded on beta sheets built from pairs of bent antiparallel beta strands; (c) five helices, including an especially long helix (alignment positions 129-152) that lie on the outside of the folded core. These proteins are important in many aspects of metabolic regulation.
Article
The prediction of protein secondary structure (alpha-helices, beta-sheets and coil) is improved by 9% to 66% using the information available from a family of homologous sequences. The approach is based both on averaging the Garnier et al. (1978) secondary structure propensities for aligned residues and on the observation that insertions and high sequence variability tend to occur in loop regions between secondary structures. Accordingly, an algorithm first aligns a family of sequences and a value for the extent of sequence conservation at each position is obtained. This value modifies a Garnier et al. prediction on the averaged sequence to yield the improved prediction. In addition, from the sequence conservation and the predicted secondary structure, many active site regions of enzymes can be located (26 out of 43) with limited over-prediction (8 extra). The entire algorithm is fully automatic and is applicable to all structural classes of globular proteins.
Article
The helix, β-sheet, and coil conformational parameters, Pα, Pβ, and Pc, for the 20 naturally occurring amino acids have been computed from the frequency of occurrence of each amino acid residue in the α, β, and coil conformations in 15 proteins, whose structure has been determined by X-ray crystallography. These values have been utilized to provide a simple procedure, devoid of complex computer calculations, to predict the secondary structure of proteins from their known amino acid sequences. The computed Pα values are within 10% of the experimental Zimm-Bragg helix growth parameters, s, evaluated from poly(α-amino acids). The environmental effects on the s values of polypeptides and proteins are discussed, showing that Pα values may be more reliable in predicting protein conformation. A detailed analysis of the helix and β-sheet boundary residues in proteins provides amino acid frequencies at the N- and C-terminal ends which are used to delineate helical and β regions. Charged residues are found with the greatest frequency at both helical ends, but are mostly absent in β-sheet regions. The frequencies at the helical ends may also be correlated to the experimental Zimm-Bragg helix initiation parameter, σ, evaluated from poly(α-amino acids). A mechanism of protein folding is proposed, whereby helix nucleation starts at the centers of the helix (where the Pα values are highest) and propagates in both directions, until strong helix breakers (where Pα values are lowest) terminate the growth at both ends. Similarly, residues with the highest Pβ values will initiate β regions and residues with the lowest Pβ values will terminate β regions. The helical region with the highest α potential (i.e., largest 〈Pα〉) is proposed as the site of the first fold during protein renaturation. The mechanism of folding of myoglobin is discussed. Thus, the protein conformational parameters and the conformational boundary frequencies determined for the first time in their hierarchical order in this paper will enable accurate prediction of protein secondary structure as well as providing insights into tertiary folding.
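A conformational parameter of this kind can be sketched as a normalized frequency: the fraction of a residue type found in a given state, divided by the fraction of all residues found in that state. The code below assumes that normalization and uses a made-up sequence with a made-up three-state assignment; `conformational_parameters` is a hypothetical name, and the published parameter tables are not reproduced here.

```python
from collections import Counter


def conformational_parameters(sequences, structures, state="H"):
    """Per-residue propensity for one state, as a normalized frequency.

    P(aa) = (fraction of aa found in `state`) / (fraction of all residues
    found in `state`). Values above 1 favour the state, below 1 disfavour it.
    """
    total = Counter()
    in_state = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    overall = sum(in_state.values()) / sum(total.values())
    if overall == 0:
        raise ValueError("state does not occur in the data set")
    return {aa: (in_state[aa] / total[aa]) / overall for aa in total}


# Hypothetical data: one sequence with its three-state assignment.
p_alpha = conformational_parameters(["AEAGAG"], ["HHHCCC"], state="H")
```

In this toy data E is always helical, so its value is 2.0 (twice the average helix frequency), while G never is, so its value is 0.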
Article
Algorithms are suggested for identifying α-helical and β-structural regions in native globular proteins. α-Helical and β-structural regions are predicted, with accuracy of ~80 and ~85% respectively, for 25 proteins, the three-dimensional structures of which have been determined by X-ray diffraction crystallography. Secondary structure is predicted in 25 proteins with unknown three-dimensional structure.
Article
Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall, 1995.
Article
A protein sequence with at least 40% identity to a known structure can now be modelled automatically, with an accuracy approaching that of a low-resolution X-ray structure or a medium-resolution nuclear magnetic resonance structure. In general, these models have good stereochemistry and an overall structural accuracy that is as high as the similarity between the template and the actual structure being predicted. As a result, the number of sequences that can be modelled is an order of magnitude larger than the number of experimentally determined protein structures. In addition, evaluation techniques are available that can estimate errors in different regions of the model. Thus, the number of applications where homology modelling is proving useful is growing rapidly.
Article
Recently Yi & Lander used a neural network and nearest-neighbor method with a scoring system that combined a sequence-similarity matrix with the local structural environment scoring scheme described by Bowie and co-workers for predicting protein secondary structure. We have improved their scoring system by taking into consideration N- and C-terminal positions of alpha-helices and beta-strands and also beta-turns as distinctive types of secondary structure. Another improvement, which also decreases the time of computation, is achieved by restricting the database to a smaller subset of proteins that are similar to the query sequence. Using multiple sequence alignments rather than single sequences and a simple jury decision procedure, our method reaches a sustained overall three-state accuracy of 72.2%, which is better than that observed for the most accurate multilayered neural-network approach, tested on the same data set of 126 non-homologous protein chains.
Article
This paper describes a new method for the prediction of the secondary structure and topology of integral membrane proteins based on the recognition of topological models. The method employs a set of statistical tables (log likelihoods) compiled from well-characterized membrane protein data, and a novel dynamic programming algorithm to recognize membrane topology models by expectation maximization. The statistical tables show definite biases toward certain amino acid species on the inside, middle, and outside of a cellular membrane. Using a set of 83 integral membrane protein sequences taken from a variety of bacterial, plant, and animal species, and a strict jackknifing procedure, where each protein (along with any detectable homologues) is removed from the training set used to calculate the tables before prediction, the method successfully predicted 64 of the 83 topologies, and of the 37 complex multispanning topologies 34 were predicted correctly.
Article
Secondary structure prediction recently has surpassed the 70% level of average accuracy, evaluated on the single residue states helix, strand and loop (Q3). But the ultimate goal is reliable prediction of tertiary (three-dimensional, 3D) structure, not 100% single residue accuracy for secondary structure. A comparison of pairs of structurally homologous proteins with divergent sequences reveals that considerable variation in the position and length of secondary structure segments can be accommodated within the same 3D fold. It is therefore sufficient to predict the approximate location of helix, strand, turn and loop segments, provided they are compatible with the formation of 3D structure. Accordingly, we define here a measure of segment overlap (Sov) that is somewhat insensitive to small variations in secondary structure assignments. The new segment overlap measure ranges from an ignorance level of 37% (random protein pairs) via a current level of 72% for a prediction method based on sequence profile input to neural networks (PHD) to an average 90% level for homologous protein pairs. We conclude that the highest scores one can reasonably expect for secondary structure prediction are a single residue accuracy of Q3 > 85% and a fractional segment overlap of Sov > 90%.
Article
We have trained a two-layered feed-forward neural network on a non-redundant data base of 130 protein chains to predict the secondary structure of water-soluble proteins. A new key aspect is the use of evolutionary information in the form of multiple sequence alignments that are used as input in place of single sequences. The inclusion of protein family information in this form increases the prediction accuracy by six to eight percentage points. A combination of three levels of networks results in an overall three-state accuracy of 70.8% for globular proteins (sustained performance). If four membrane protein chains are included in the evaluation, the overall accuracy drops to 70.2%. The prediction is well balanced between alpha-helix, beta-strand and loop: 65% of the observed strand residues are predicted correctly. The accuracy in predicting the content of three secondary structure types is comparable to that of circular dichroism spectroscopy. The performance accuracy is verified by a sevenfold cross-validation test, and an additional test on 26 recently solved proteins. Of particular practical importance is the definition of a position-specific reliability index. For half of the residues predicted with a high level of reliability the overall accuracy increases to better than 82%. A further strength of the method is the more realistic prediction of segment length. The protein family prediction method is available for testing by academic researchers via an electronic mail server.
Article
A strategy is presented for protein fold recognition from secondary structure assignments (alpha-helix and beta-strand). The method can detect similarities between protein folds in the absence of sequence similarity. Secondary structure mapping first identifies all possible matches (maps) between a query string of secondary structures and the secondary structures of protein domains of known three-dimensional structure. The maps are then passed through a series of structural filters to remove those that do not obey simple rules of protein structure. The surviving maps are ranked by scores from the alignment of predicted and experimental accessibilities. Searches made with secondary structure assignments for a test set of 11 fold-families put the correct sequence-dissimilar fold in the first rank 8/11 times. With cross-validated predictions of secondary structure this drops to 4/11 which compares favourably with the widely used THREADER program (1/11). The structural class is correctly predicted 10/11 times by the method in contrast to 5/11 for THREADER. The new technique obtains comparable accuracy in the alignment of amino acid residues and secondary structure elements. Searches are also performed with published secondary structure predictions for the von-Willebrand factor type A domain, the proteasome 20 S alpha subunit and the phosphotyrosine interaction domain. These searches demonstrate how the method can find the correct fold for a protein from a carefully constructed secondary structure prediction, multiple sequence alignment and distant restraints. Scans with experimentally determined secondary structures and accessibility, recognise the correct fold with high alignment accuracy (86% on secondary structures). This suggests that the accuracy of mapping will improve alongside any improvements in the prediction of secondary structure or accessibility. Application to NMR structure determination is also discussed.
Article
This paper evaluates the results of a protein structure prediction contest. The predictions were made using threading procedures, which employ techniques for aligning sequences with 3D structures to select the correct fold of a given sequence from a set of alternatives. Nine different teams submitted 86 predictions, on a total of 21 target proteins with little or no sequence homology to proteins of known structure. The 3D structures of these proteins were newly determined by experimental methods, but not yet published or otherwise available to the predictors. The predictions, made from the amino acid sequence alone, thus represent a genuine test of the current performance of threading methods. Only a subset of all the predictions is evaluated here. It corresponds to the 44 predictions submitted for the 11 target proteins seen to adopt known folds. The predictions for the remaining 10 proteins were not analyzed, although weak similarities with known folds may also exist in these proteins. We find that threading methods are capable of identifying the correct fold in many cases, but not reliably enough as yet. Every team predicts correctly a different set of targets, with virtually all targets predicted correctly by at least one team. Also, common folds such as TIM barrels are recognized more readily than folds with only a few known examples. However, quite surprisingly, the quality of the sequence-structure alignments, corresponding to correctly recognized folds, is generally very poor, as judged by comparison with the corresponding 3D structure alignments. Thus, threading can presently not be relied upon to derive a detailed 3D model from the amino acid sequence. This raises a very intriguing question: how is fold recognition achieved? Our analysis suggests that it may be achieved because threading procedures maximize hydrophobic interactions in the protein core, and are reasonably good at recognizing local secondary structure.
Article
A protein secondary structure prediction method from multiply aligned homologous sequences is presented with an overall per residue three-state accuracy of 70.1%. There are two aims: to obtain high accuracy by identification of a set of concepts important for prediction followed by use of linear statistics; and to provide insight into the folding process. The important concepts in secondary structure prediction are identified as: residue conformational propensities, sequence edge effects, moments of hydrophobicity, position of insertions and deletions in aligned homologous sequence, moments of conservation, auto-correlation, residue ratios, secondary structure feedback effects, and filtering. Explicit use of edge effects, moments of conservation, and auto-correlation are new to this paper. The relative importance of the concepts used in prediction was analyzed by stepwise addition of information and examination of weights in the discrimination function. The simple and explicit structure of the prediction allows the method to be reimplemented easily. The accuracy of a prediction is predictable a priori. This permits evaluation of the utility of the prediction: 10% of the chains predicted were identified correctly as having a mean accuracy of > 80%. Existing high-accuracy prediction methods are "black-box" predictors based on complex nonlinear statistics (e.g., neural networks in PHD: Rost & Sander, 1993a). For medium- to short-length chains (≥ 90 residues and < 170 residues), the prediction method is significantly more accurate (P < 0.01) than the PHD algorithm (probably the most commonly used algorithm). In combination with the PHD, an algorithm is formed that is significantly more accurate than either method, with an estimated overall three-state accuracy of 72.4%, the highest accuracy reported for any prediction method.
Article
Existing approaches to protein secondary structure prediction from the amino acid sequence usually rely on the statistics of local residue interactions within a sliding window and the secondary structural state of the central residue. The practically achieved accuracy limit of such single residue and single sequence prediction methods is 65% in three structural states (alpha-helix, beta-strand and coil). Further improvement in the prediction quality is likely to require exploitation of various aspects of three-dimensional protein architecture. Here we make such an attempt and present an accurate algorithm for secondary structure prediction based on recognition of potentially hydrogen-bonded residues in a single amino acid sequence. The unique feature of our approach involves database-derived statistics on residue type occurrences in different classes of beta-bridges to delineate interacting beta-strands. The alpha-helical structures are also recognized on the basis of amino acid occurrences in hydrogen-bonded pairs (i, i + 4). The algorithm has a prediction accuracy of 68% in three structural states, relies only on a single protein sequence as input and has the potential to be improved by 5-7% if homologous aligned sequences are also considered.
Article
In this study we present an accurate secondary structure prediction procedure by using a query and related sequences. The most novel aspect of our approach is its reliance on local pairwise alignment of the sequence to be predicted with each related sequence rather than utilization of a multiple alignment. The residue-by-residue accuracy of the method is 75% in three structural states after jack-knife tests. The gain in prediction accuracy compared with the existing techniques, which are at best 72%, is achieved by secondary structure propensities based on both local and long-range effects, utilization of similar sequence information in the form of carefully selected pairwise alignment fragments, and reliance on a large collection of known protein primary structures. The method is especially appropriate for large-scale sequence analysis efforts such as genome characterization, where precise and significant multiple sequence alignments are not available or achievable.
Article
The accuracy of secondary structure prediction methods has been improved significantly by the use of aligned protein sequences. The PHD method and the NNSSP method reach 71 to 72% of sustained overall three-state accuracy when multiple sequence alignments are used with neural networks and nearest-neighbor algorithms, respectively. We introduce a variant of the nearest-neighbor approach that can achieve similar accuracy using a single sequence as the query input. We compute the 50 best non-intersecting local alignments of the query sequence with each sequence from a set of proteins with known 3D structures. Each position of the query sequence is aligned with the database amino acids in alpha-helical, beta-strand or coil states. The predicted type of secondary structure is selected as the type of aligned position with the maximal total score. On the dataset of 124 non-membrane non-homologous proteins, used earlier as a benchmark for secondary structure predictions, our method reaches an overall three-state accuracy of 71.2%. The performance accuracy is verified by an additional test on 461 non-homologous proteins giving an accuracy of 71.0%. The main strength of the method is the high level of prediction accuracy for proteins without any known homolog. Using multiple sequence alignments as input the method has a prediction accuracy of 73.5%. Prediction of secondary structure by the SSPAL method is available via the Baylor College of Medicine World Wide Web server.
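The selection rule described above (assign each query position the state with the maximal total alignment score) can be sketched roughly as follows. This is only an illustration under simplifying assumptions: alignments are treated as ungapped fragments given as hypothetical `(query_start, template_states, score)` tuples, every residue in a fragment contributes the fragment's score, and positions not covered by any fragment default to coil. None of these details are taken from the abstract.

```python
from collections import defaultdict

def predict_from_alignments(query_length, alignments):
    """Pick, per query position, the state with the largest summed score.

    alignments: iterable of (query_start, template_states, score) tuples,
    a hypothetical ungapped representation chosen for this sketch.
    """
    totals = [defaultdict(float) for _ in range(query_length)]
    for start, states, score in alignments:
        for offset, state in enumerate(states):
            pos = start + offset
            if 0 <= pos < query_length:
                totals[pos][state] += score
    # Assumption: positions with no aligned residue are called coil ("C").
    return "".join(max(t, key=t.get) if t else "C" for t in totals)

# Toy input: one helical fragment and two overlapping strand fragments.
toy_alignments = [(0, "HHHH", 3.0), (2, "EEEE", 2.0), (3, "EEE", 2.0)]
prediction = predict_from_alignments(6, toy_alignments)
```

In the toy input the helical fragment wins wherever it outweighs a single strand fragment, and the two strand fragments together outvote it where they overlap.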
Article
In fold recognition by threading one takes the amino acid sequence of a protein and evaluates how well it fits into one of the known three-dimensional (3D) protein structures. The quality of sequence-structure fit is typically evaluated using inter-residue potentials of mean force or other statistical parameters. Here, we present an alternative approach to evaluating sequence-structure fitness. Starting from the amino acid sequence we first predict secondary structure and solvent accessibility for each residue. We then thread the resulting one-dimensional (1D) profile of predicted structure assignments into each of the known 3D structures. The optimal threading for each sequence-structure pair is obtained using dynamic programming. The overall best sequence-structure pair constitutes the predicted 3D structure for the input sequence. The method is fine-tuned by adding information from direct sequence-sequence comparison and applying a series of empirical filters. Although the method relies on reduction of 3D information into 1D structure profiles, its accuracy is, surprisingly, not clearly inferior to methods based on evaluation of residue interactions in 3D. We therefore hypothesise that existing 1D-3D threading methods essentially do not capture more than the fitness of an amino acid sequence for a particular 1D succession of secondary structure segments and residue solvent accessibility. The prediction-based threading method on average finds any structurally homologous region at first rank in 29% of the cases (including sequence information). For the 22% first hits detected at highest scores, the expected accuracy rose to 75%. However, the task of detecting entire folds rather than homologous fragments was managed much better; 45 to 75% of the first hits correctly recognised the fold.
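The idea of threading a predicted 1D profile into known structures with dynamic programming can be illustrated with a small sketch. It is not the authors' implementation: it uses only three-state secondary structure strings (no solvent accessibility), a plain Needleman-Wunsch global alignment, and arbitrary match, mismatch, and gap values; the fold library and its names are hypothetical.

```python
def thread_profile(predicted, template, match=2, mismatch=-1, gap=-2):
    """Global alignment score between two 1D structure strings.

    The scoring values are illustrative assumptions, not published parameters.
    """
    n, m = len(predicted), len(template)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if predicted[i - 1] == template[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align the two positions
                score[i - 1][j] + gap,    # gap in the template
                score[i][j - 1] + gap,    # gap in the predicted profile
            )
    return score[n][m]

def best_fold(predicted, library):
    """Return the library entry whose 1D string aligns best to the prediction."""
    return max(library, key=lambda name: thread_profile(predicted, library[name]))

# Hypothetical fold library of observed three-state strings.
library = {
    "all_alpha": "HHHHHHHHCC",
    "alpha_beta": "HHHHCCEEEE",
    "all_beta": "EEEECCEEEE",
}
top_hit = best_fold("HHHHCCEEEC", library)
```

Ranking every library entry by this score and taking the top one mirrors the "overall best sequence-structure pair" step, with the caveat that a realistic scheme would weight states and accessibility rather than use a flat match score.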
Article
Protein evolution gives rise to families of structurally related proteins, within which sequence identities can be extremely low. As a result, structure-based classifications can be effective at identifying unanticipated relationships in known structures and in optimal cases function can also be assigned. The ever increasing number of known protein structures is too large to classify all proteins manually, therefore, automatic methods are needed for fast evaluation of protein structures. We present a semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures (CATH). The four main levels of our classification are protein class (C), architecture (A), topology (T) and homologous superfamily (H). Class is the simplest level, and it essentially describes the secondary structure composition of each domain. In contrast, architecture summarises the shape revealed by the orientations of the secondary structure units, such as barrels and sandwiches. At the topology level, sequential connectivity is considered, such that members of the same architecture might have quite different topologies. When structures belonging to the same T-level have suitably high similarities combined with similar functions, the proteins are assumed to be evolutionarily related and put into the same homologous superfamily. Analysis of the structural families generated by CATH reveals the prominent features of protein structure space. We find that nearly a third of the homologous superfamilies (H-levels) belong to ten major T-levels, which we call superfolds, and furthermore that nearly two-thirds of these H-levels cluster into nine simple architectures. A database of well-characterised protein structure families, such as CATH, will facilitate the assignment of structure-function/evolution relationships to both known and newly determined protein structures.
Article
A simple approach to protein tertiary structure prediction is described, based on the assembly of recognized supersecondary structural fragments taken from highly resolved protein structures by using a simulated annealing algorithm. The results of blind-testing this method on CASP2 target T0042 (pig NK-lysin) are presented. The predicted structure had a Cα root-mean-square deviation of only 6.2 Å from the experimental structure (and less than 5.0 Å over the first 66 residues), and clearly had the correct fold when judged by using a number of objective measures. Despite the significant degree of success in this case, there is clearly much more development required before predictions with the accuracy of a good homology model can be made with this kind of approach.
Article
A new dataset of 396 protein domains is developed and used to evaluate the performance of the protein secondary structure prediction algorithms DSC, PHD, NNSSP, and PREDATOR. The maximum theoretical Q3 accuracy for combination of these methods is shown to be 78%. A simple consensus prediction on the 396 domains, with automatically generated multiple sequence alignments gives an average Q3 prediction accuracy of 72.9%. This is a 1% improvement over PHD, which was the best single method evaluated. Segment Overlap Accuracy (SOV) is 75.4% for the consensus method on the 396-protein set. The secondary structure definition method DSSP defines 8 states, but these are reduced by most authors to 3 for prediction. Application of the different published 8- to 3-state reduction methods shows variation of over 3% on apparent prediction accuracy. This suggests that care should be taken to compare methods by the same reduction method. Two new sequence datasets (CB513 and CB251) are derived which are suitable for cross-validation of secondary structure prediction methods without artifacts due to internal homology. A fully automatic World Wide Web service that predicts protein secondary structure by a combination of methods is available via http://barton.ebi.ac.uk/.
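To make the point about 8- to 3-state reduction concrete, the sketch below computes Q3 (the percentage of residues whose predicted three-state label matches the observed one) under two reduction mappings. Both mappings are illustrative assumptions rather than the specific schemes compared in the paper, and the DSSP string and prediction are made-up toy data.

```python
# Two illustrative 8-to-3 reductions; any DSSP state not listed maps to coil.
REDUCTION_A = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
REDUCTION_B = {"H": "H", "E": "E"}

def reduce_states(dssp, mapping):
    """Collapse a DSSP 8-state string to three states (H, E, C)."""
    return "".join(mapping.get(state, "C") for state in dssp)

def q3(predicted, observed):
    """Percentage of positions where the predicted and observed states agree."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

# Hypothetical observed DSSP assignment and three-state prediction.
dssp = "HHHHGGGTTEEEEBSS"
pred = "HHHHHHHCCEEEECCC"

q3_a = q3(pred, reduce_states(dssp, REDUCTION_A))
q3_b = q3(pred, reduce_states(dssp, REDUCTION_B))
```

The same prediction scores differently under the two mappings, which is why accuracies should only be compared when the same reduction is applied.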
Article
... cases (including sequence information). For the 22% first hits detected at highest scores, the expected accuracy rose to 75%. However, the task of detecting entire folds rather than homologous fragments was managed much better; 45 to 75% of the first hits correctly recognised the fold. © 1997 Academic Press Limited. Keywords: protein structure prediction; threading; remote homology detection; fold recognition; secondary structure. *Corresponding author. Introduction: Reducing the sequence-structure gap by homology modelling. Large scale gene-sequencing projects accumulate gene data, and consequently protein sequences, at a breathtaking pace (Oliver et al., 1992; Fleischmann et al., 1995; Dujon, 1996; Johnston, 1996). However, information about three dimensional (3D) structure is available for only a small fraction of known proteins (Bernstein et al., 1977). Thus, although experime ...
Altschul et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389-3402.