PISCES: a protein sequence culling server

BIOINFORMATICS
APPLICATIONS NOTE
Vol. 19 no. 12 2003, pages 1589–1591
DOI: 10.1093/bioinformatics/btg224
© Oxford University Press 2003; all rights reserved.
PISCES: a protein sequence culling server
Guoli Wang and Roland L. Dunbrack Jr
Institute for Cancer Research, Fox Chase Cancer Center, 7701 Burholme Avenue,
Philadelphia, PA 19111, USA
Received on October 4, 2002; revised on January 30, 2003; accepted on March 18, 2003
ABSTRACT
Summary: PISCES is a public server for culling sets of
protein sequences from the Protein Data Bank (PDB) by
sequence identity and structural quality criteria. PISCES
can provide lists culled from the entire PDB or from lists of
PDB entries or chains provided by the user. The sequence
identities are obtained from PSI-BLAST alignments with
position-specific substitution matrices derived from the
non-redundant protein sequence database. PISCES
therefore provides better lists than servers that use
BLAST, which is unable to identify many relationships
below 40% sequence identity and often overestimates
sequence identity by aligning only well-conserved
fragments. PDB sequences are updated weekly. PISCES
can also cull non-PDB sequences provided by the user
as a list of GenBank identifiers, a FASTA format file, or
BLAST/PSI-BLAST output.
Availability: The server is located at
http://www.fccc.edu/research/labs/dunbrack/pisces
Contact: rl_dunbrack@fccc.edu
*To whom correspondence should be addressed.
For many purposes, it is useful to obtain a subset of
sequences from some larger set that are related to one
another by no more than some fixed percentage sequence
identity. For the Protein Data Bank (PDB; Berman et
al., 2000), it is often the case that additional criteria are
desirable, such as resolution or length cutoffs. Several
web sites have provided such lists derived from the
PDB in recent years. The PDB-Select lists have had a
widespread impact on statistical analysis of the PDB by
providing pre-defined lists of chains with fixed maximum
percentage sequence identities (Hobohm et al., 1992;
Hobohm and Sander, 1994). The PDB itself provides
subsets of sequences that fulfill a certain query and
can be culled at 90, 70, or 50% sequence identity. The
PDB-REPRDB server (Noguchi et al., 2001) provides a
number of features, including an ability for the user to set
parameters to generate customized lists. PDB-REPRDB
uses a Needleman–Wunsch global alignment algorithm
(Needleman and Wunsch, 1970), does not provide
sequences, and appears to be updated approximately
monthly. The ASTRAL website provides lists of protein
domain sequences at fixed cutoffs of sequence identity
or E-values derived from BLAST pairwise alignments
(Brenner et al., 2000). For several years, we have provided
a fixed set of lists on our ‘CulledPDB’ website as well as
a server for creating subsets of the entire PDB based on a
user’s input criteria. CulledPDB used BLAST to produce
alignments.
In this paper, we describe a new server called PISCES
that provides the following functionalities: (1) culling the
entire PDB according to user input criteria; (2) culling
a list of user-provided PDB chain identifiers, according
to user-input criteria; for instance, a user could use the
PDB’s search facility to select all human proteins, and then
submit this list to our server to obtain a subset according to
sequence identity and structural quality criteria; (3) culling
any set of sequences provided by the user in FASTA
format, as a set of GenBank identifiers, or as BLAST/PSI-
BLAST output.
Our goal in culling the PDB is to provide the longest
lists possible of the highest resolution structures that
fulfill the sequence identity and structural quality cutoffs.
We continue to pre-compile sequence identities of all
PDB chains versus all other PDB chains. Sequences
are obtained from the Uniformity Project mmCIF files
provided by RCSB (Bhat et al., 2001; Westbrook et
al., 2002), and are updated weekly. The resolution and
R-value data are also obtained from the Uniformity
Project files, since the RCSB has gone to some effort
to place these values in a standard format in these files.
Some missing values are obtained from the PDBFINDER
database (Hooft et al., 1996).
To provide better estimates of sequence identity at
longer evolutionary distances, we now use PSI-BLAST
(Altschul et al., 1997) to calculate these identities. PSI-
BLAST is used locally to build a position specific
similarity matrix (PSSM) or profile from homologous
sequences in NCBI’s non-redundant protein sequence
database (Wheeler et al., 2002) with each unique PDB
sequence as query. Three iterations are performed for each
query, with an E-value cutoff of 0.0001 for inclusion in the
profile. We control for drift in the PSSM by checking to
see whether hits in previous rounds with E-values better
than 0.0001 appear with E-values worse than 0.0001 in
subsequent rounds. If so, we take the last profile not
exhibiting drift. The resulting matrix is used to search the
PDB for sequences related to each query with an E-value
better than 1.0 and alignment length greater than 20, and
the resulting sequence identities and alignment lengths
from the PSI-BLAST output are stored. PISCES updates
sequences in the PDB weekly, and these sequences are
put through the same process and added to the database
of alignment identities.
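The drift check described above can be sketched in a few lines of Python. This is a minimal illustration, not the PISCES code: the input layout (one dictionary of hit identifier to E-value per PSI-BLAST round), the hit names, and the function name `last_round_without_drift` are assumptions made for this example. Only the 0.0001 threshold comes from the text.

```python
E_CUTOFF = 1e-4  # profile-inclusion E-value threshold quoted in the text


def last_round_without_drift(rounds, cutoff=E_CUTOFF):
    """Return the index of the last PSI-BLAST round showing no drift.

    `rounds` holds one dict per iteration, mapping a hit identifier to its
    E-value in that iteration (a hypothetical layout for this sketch).
    Drift is flagged when a hit that scored better than `cutoff` in an
    earlier round reappears with an E-value worse than `cutoff`.
    Returns -1 when `rounds` is empty.
    """
    trusted = set()  # hits seen so far with an E-value better than the cutoff
    last_good = -1
    for i, hits in enumerate(rounds):
        drifted = any(h in hits and hits[h] > cutoff for h in trusted)
        if drifted:
            break  # keep the profile from the previous round
        last_good = i
        trusted.update(h for h, e in hits.items() if e < cutoff)
    return last_good


# Hypothetical E-values for three iterations of one query.
rounds = [
    {"hitA": 1e-30, "hitB": 1e-6},
    {"hitA": 1e-35, "hitB": 1e-5, "hitC": 1e-8},
    {"hitA": 1e-20, "hitB": 0.5, "hitC": 1e-9},  # hitB has drifted out
]
print(last_round_without_drift(rounds))  # 1
```

A hit that disappears from a later round entirely is not treated as drift here, because the text only mentions hits that reappear with worse E-values; that reading is an assumption of the sketch.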
Criteria that apply to all sequences in a PDB entry
include the experiment type (X-ray, NMR, etc.), resolution,
and R-value. Criteria that apply to individual chains in a
PDB entry include sequence length and Cα-only status.
While resolution and R-values as structure quality
criteria have their drawbacks and other criteria have been
proposed (Brenner et al., 2000), most users are much better
acquainted with these traditional measures, which are
adequate for most purposes. These criteria are applied first
either to the entire PDB or to an input list of PDB entries
or chains from the user. In either case, the result is a list
of PDB chains that fit the criteria that can then be culled
according to mutual sequence identity. This always results
in a longer list than applying the sequence identity criteria
first and then applying the single sequence or single entry
criteria to the resulting list.
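As an illustration of the first step, the entry-level and chain-level criteria can be expressed as a single predicate applied to every chain before any sequence-identity culling. The record layout, field names, chain identifiers, and default thresholds below are hypothetical and chosen only for this sketch; they are not taken from the PISCES server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chain:
    chain_id: str                # hypothetical identifier: entry code plus chain letter
    method: str                  # entry-level: "XRAY", "NMR", ...
    resolution: Optional[float]  # entry-level, in angstroms; None if not defined
    r_value: Optional[float]     # entry-level; None if not defined
    length: int                  # chain-level: number of residues
    ca_only: bool                # chain-level: True if only C-alpha atoms are present


def passes_criteria(c, max_resolution=3.0, max_r_value=0.3, min_length=40,
                    allow_non_xray=False, allow_ca_only=False):
    """Apply the chain-level checks, then the entry-level checks."""
    if c.ca_only and not allow_ca_only:
        return False
    if c.length < min_length:
        return False
    if c.method != "XRAY":
        return allow_non_xray  # resolution and R-value are not checked here
    if c.resolution is None or c.resolution > max_resolution:
        return False
    if c.r_value is None or c.r_value > max_r_value:
        return False
    return True


candidates = [
    Chain("1ABCA", "XRAY", 1.8, 0.19, 150, False),
    Chain("2XYZB", "XRAY", 3.5, 0.25, 200, False),  # resolution too poor
    Chain("3NMRA", "NMR", None, None, 90, False),   # non-X-ray entry
    Chain("4SHTA", "XRAY", 1.5, 0.18, 20, False),   # too short
]
selected = [c.chain_id for c in candidates if passes_criteria(c)]
print(selected)  # ['1ABCA']
```

Only the chains that survive this filter are passed on to the sequence-identity step, which is the ordering the text argues for.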
We use the method of Hobohm and Sander (Hobohm et
al., 1992; Hobohm and Sander, 1994) to cull the sequences
that pass the criteria described above by sequence identity.
The list is first sorted according to resolution from best
to worst. Sequences with the same resolution are sorted
according to R-value. Non-X-ray structures, if requested,
follow the list of X-ray structures. The first sequence is
flagged as included in the culled list. Each sequence after
it in the list is flagged excluded if it has a sequence identity
with the first sequence higher than the desired cutoff.
The program then moves to each subsequent sequence in
the list and repeats this procedure. As described above,
the server also now provides a facility for culling non-
PDB sequences. In this case, the sequence identities are
calculated with PSI-BLAST but the PSSMs are created
from the set of input sequences, rather than the entire non-
redundant sequence database.
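The culling loop described above can be sketched as a greedy pass over the sorted list. This is an illustrative reading of the procedure, not the PISCES source: the `Chain` record, the identity lookup keyed by unordered pairs, and the chain names are assumptions. The sketch also assumes that a sequence already flagged as excluded is skipped rather than used to exclude others, and that pairs absent from the lookup are unrelated.

```python
from typing import NamedTuple


class Chain(NamedTuple):
    chain_id: str
    resolution: float     # angstroms; smaller is better
    r_value: float
    is_xray: bool = True


def cull(chains, identity, cutoff):
    """Greedy culling in the style of Hobohm and Sander.

    `identity` maps frozenset({id_a, id_b}) to percent identity.
    Returns the identifiers kept, best structure first.
    """
    # X-ray chains first, ordered by resolution and then R-value;
    # non-X-ray chains follow.
    ordered = sorted(chains, key=lambda c: (not c.is_xray, c.resolution, c.r_value))
    excluded = set()
    kept = []
    for i, current in enumerate(ordered):
        if current.chain_id in excluded:
            continue
        kept.append(current.chain_id)
        for later in ordered[i + 1:]:
            pair = frozenset((current.chain_id, later.chain_id))
            if identity.get(pair, 0.0) > cutoff:
                excluded.add(later.chain_id)
    return kept


# Hypothetical chains and pairwise identities (percent).
chains = [
    Chain("chainD", 2.5, 0.22),
    Chain("chainB", 1.5, 0.20),
    Chain("chainA", 1.5, 0.18),
    Chain("chainC", 2.0, 0.20),
]
identity = {
    frozenset(("chainA", "chainB")): 95.0,
    frozenset(("chainB", "chainC")): 30.0,
    frozenset(("chainC", "chainD")): 50.0,
    frozenset(("chainA", "chainD")): 10.0,
}
print(cull(chains, identity, cutoff=25.0))  # ['chainA', 'chainC']
print(cull(chains, identity, cutoff=90.0))  # ['chainA', 'chainC', 'chainD']
```

Because the list is sorted before the pass, whenever two related chains compete the one with the better resolution (or the better R-value at equal resolution) is the one retained.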
The website provides four options for users according to
the most common requests. The first option is normal PDB
sequence culling; users can specify their own parameters,
such as sequence identity, resolution, and R-value, to get a
list of sequences in current PDB files. The second option
provides an input form for a user’s list of PDB entries
or chains. The third option provides an input form for a
list of GenBank accession numbers. These numbers can
include other information on each line, as long as the first
element on the line is the accession code. For instance, a
user can cut and paste the list of hits from BLAST or PSI-
BLAST output that includes protein names and E-values.
The server will use the GenBank accession numbers to
retrieve the sequences from the non-redundant protein
sequence database with the NCBI program fastacmd. The
fourth option allows the user to input protein sequences
in FASTA format or as BLAST or PSI-BLAST output. In
the case of BLAST/PSI-BLAST output, PISCES will take
the sequences from the ‘Sbjct’ lines as the set to be culled.
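The two input conventions mentioned above (the accession code as the first element of each line, and subject sequences taken from the 'Sbjct' lines of BLAST output) can be illustrated with a short parser. This is a rough sketch under stated assumptions: the regular expression targets the classic plain-text BLAST report layout, alignment blocks are assumed to start at a line beginning with "Score =", and the sample accessions and sequences are invented. How PISCES itself assembles the fragments is not described in the text.

```python
import re

# Matches lines such as "Sbjct: 3  MRVLGA 8" or "Sbjct  3  MRVLGA  8".
SBJCT = re.compile(r"^Sbjct:?\s+\d+\s+([A-Za-z*-]+)\s+\d+\s*$")


def first_tokens(lines):
    """Return the first whitespace-delimited element of each non-empty line."""
    return [ln.split()[0] for ln in lines if ln.strip()]


def sbjct_sequences(blast_text):
    """Collect one ungapped subject sequence per alignment block."""
    seqs, current = [], []
    for line in blast_text.splitlines():
        if line.strip().startswith("Score ="):
            if current:
                seqs.append("".join(current))
            current = []
            continue
        m = SBJCT.match(line)
        if m:
            current.append(m.group(1).replace("-", ""))  # drop gap characters
    if current:
        seqs.append("".join(current))
    return seqs


# Invented hit-list lines: accession first, free text afterwards.
hit_lines = ["ABC12345.1  hypothetical kinase   210  2e-50", "", "XYZ00001  unnamed protein  88  3e-12"]
print(first_tokens(hit_lines))  # ['ABC12345.1', 'XYZ00001']

# Invented alignment text with two blocks.
report = "\n".join([
    " Score = 50 bits",
    "Query: 1  MKVL-A 5",
    "Sbjct: 3  MRVLGA 8",
    "Query: 6  TT 7",
    "Sbjct: 9  T- 9",
    " Score = 20 bits",
    "Sbjct  1  AC  2",
])
print(sbjct_sequences(report))  # ['MRVLGAT', 'AC']
```

Report layouts differ between BLAST versions, so the pattern would need adjusting for other output formats.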
Once the user’s input is completed, the server performs
all calculations and sends the user an E-mail that includes
links to the following files that the user can then download:
(1) the input list of chain identifiers (either PDB or
GenBank or user-defined), if provided by the user; for
PDB chains, structure quality data are included;
(2) the output list of chain identifiers; the user criteria are
given in the file title;
(3) the output sequences in FASTA format.
Finally, we discuss the effect of choices made in
determining sequence identities for the culling procedure.
There are two components to this determination: the
program used to make the sequence alignments and the
normalization procedure used for the sequence identity
calculation. In choosing a program for sequence alignment,
we believe that a local alignment program such as BLAST
or PSI-BLAST is a better choice than a global alignment
program such as Clustal W. The reason is that a pair of
proteins may share a homologous domain but may each
contain other unrelated domains. A global alignment program
will attempt to align the complete sequences and
therefore provide very low sequence identities, even though the
shared domain may be highly homologous. We prefer to
eliminate a sequence based on an accurate sequence
identity of the shared domain. In Table 1 we show the number
of sequences in lists culled at 20–90% sequence identity
using Clustal W (Thompson et al., 1994), BLAST, and
PSI-BLAST. At most levels of sequence identity the lists
provided by PDB-REPRDB using Clustal W are longer
than the BLAST or PSI-BLAST lists, because sequence identities
are underestimated by the global alignments. As an
example, chain 1D4XG (126 amino acids long) shares 98%
sequence identity over 124 amino acids with a domain of
1D0NA (729 amino acids long). PDB-REPRDB puts both
sequences in a 90% cull list. PISCES excludes 1D4XG, as
it should. For reasons that are not clear, PDB-REPRDB
provides very short lists at sequence identity below 30%.
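The 1D4XG/1D0NA example can be made concrete with a small calculation. A global alignment has at least as many columns as the longer sequence, and the number of identical positions cannot exceed the length of the shorter one, so when identity is taken as identities divided by alignment length there is a hard ceiling on the value a global alignment can report. The normalization actually used by PDB-REPRDB is not stated in the text, so the convention below is an assumption made for illustration, as is the function name.

```python
def max_global_identity(len_a, len_b):
    """Upper bound (percent) on identities / global-alignment length.

    Identities cannot exceed the shorter sequence length, and the global
    alignment cannot be shorter than the longer sequence.
    """
    return 100.0 * min(len_a, len_b) / max(len_a, len_b)


# Chain lengths quoted in the text: 1D4XG (126 residues), 1D0NA (729 residues).
bound = max_global_identity(126, 729)
print(round(bound, 1))  # 17.3
```

Under that convention the pair could never reach a 90% cutoff, even though the local alignment reports 98% identity over 124 residues.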
Secondly, we would like to have alignments of all
homologous pairs in the PDB above 20% sequence identity.
Structure alignments with some similarity criterion would
achieve this. However, PSI-BLAST, which is much faster
than structural alignment, is able to identify most such
relationships with reasonable accuracy and completeness
(Sauder et al., 2000). In contrast, BLAST is often unable
to identify many relationships below 40% sequence
identity. When this occurs a culled list will contain sequences
that should have been eliminated, if the sequence
relationships had been identified.
Table 1. Lengths of lists obtained using different sequence alignment
methods
% Ident.    BLAST    REPRDB    PSI-BLAST
20%          1533        74         1973
25%          1699      1259         2351
30%          2032      2711         2660
40%          2848      3292         3178
50%          3451      3698         3587
60%          3832      4057         3901
70%          4149      4387         4193
80%          4406      4700         4443
90%          4791      5350         4820
Criteria for inclusion in the lists: resolution 3.0 Å; including Cα chains;
excluding non-X-ray entries.
Even when BLAST does provide an alignment for a
sequence pair, it may only align a short, well conserved
fragment. The resulting sequence identity however depends on
the normalization procedure used. Previously our
CulledPDB server performed alignments with BLAST and used
the sequence identity as provided by the program, which
is calculated by dividing the number of identities in the
alignment by the length of the alignment, including gaps.
For a short highly conserved fragment, this sequence
identity is therefore overestimated, resulting in removing too
many sequences from the culled lists. As shown in
Table 1, the PSI-BLAST lists at low sequence identity are
much longer than the BLAST lists, while at high sequence
identity they are very nearly the same size, as expected.
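The alignment-length normalization described above (identities divided by the number of alignment columns, gaps included) can be written directly from two aligned strings. The function name and the example strings are invented for this sketch.

```python
def identity_over_alignment(aligned_a, aligned_b):
    """Percent identity = identical columns / alignment length, gaps included."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# Invented 8-column local alignment with one gap and one mismatch.
print(identity_over_alignment("ACDE-FGH", "ACDEKFGY"))  # 75.0
```

Nothing in this value reflects how much of either full sequence the fragment covers, which is why a short conserved fragment can produce a high identity.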
The ASTRAL site uses BLAST but normalizes by the
average of the lengths of the full sequences aligned (i.e.
not just the aligned segments). When BLAST aligns only
a fragment, this results in sequence identities that are
significantly underestimated. Given that BLAST may
also fail to align many pairs at sequence identity below
40%, it is likely that the ASTRAL lists include many
sequences that under other protocols would be eliminated.
To test this, we used PISCES to recull their 20% list
at 20% sequence identity according to our PSI-BLAST
alignments. A list of 1914 valid input sequences (only
complete chains were used) from ASTRAL resulted in
1617 sequences output from PISCES. While many of the
rejections of sequences were from marginal E-values and
sequence identities, a little over half had sequence
identities above 25% and one third had E-values better than
1.0e-05.
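The effect of normalizing by the average of the full sequence lengths, as the text describes for ASTRAL, can be seen with a small numerical example. The fragment size, identity count, and chain lengths below are hypothetical, and the function is a sketch of the formula as stated in the text rather than ASTRAL's actual code.

```python
def identity_over_mean_length(n_identical, len_a, len_b):
    """Percent identity = identities / average of the two full sequence lengths."""
    return 100.0 * n_identical / ((len_a + len_b) / 2.0)


# Hypothetical case: BLAST aligns only a 30-column fragment with 27 identities
# between chains of 280 and 320 residues.
over_alignment = 100.0 * 27 / 30
over_mean_length = identity_over_mean_length(27, 280, 320)
print(over_alignment, over_mean_length)  # 90.0 9.0
```

The same alignment therefore lands either far above or far below a 20% cutoff depending only on the denominator, which is the contrast drawn in the preceding paragraphs.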
ACKNOWLEDGEMENTS
This work was supported by NIH grants R01-HG02302
and CA06972.
REFERENCES
Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z.,
Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-
BLAST: a new generation of database programs. Nucleic Acids
Res., 25, 3389–3402.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N.,
Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The protein
data bank. Nucleic Acids Res., 28, 235–242.
Bhat,T.N., Bourne,P., Feng,Z., Gilliland,G., Jain,S.,
Ravichandran,V., Schneider,B., Schneider,K., Thanki,N., Weissig,H.,
Westbrook,J. and Berman,H.M. (2001) The PDB data uniformity
project. Nucleic Acids Res., 29, 214–218.
Brenner,S.E., Koehl,P. and Levitt,M. (2000) The astral compendium
for protein structure and sequence analysis. Nucleic Acids Res.,
28, 254–256.
Hobohm,U. and Sander,C. (1994) Enlarged representative set of
protein structures. Protein Sci., 3, 522–524.
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992)
Selection of representative protein data sets. Protein Sci., 1, 409–417.
Hooft,R.W., Vriend,G., Sander,C. and Abola,E.E. (1996) Errors in
protein structures. Nature, 381, 272.
Needleman,S.B. and Wunsch,C.D. (1970) A general method
applicable to the search for similarities in the amino acid sequences of
two proteins. J. Mol. Biol., 48, 443–453.
Noguchi,T., Matsuda,H. and Akiyama,Y. (2001) PDB-REPRDB: a
database of representative protein chains from the protein data
bank (PDB). Nucleic Acids Res., 29, 219–220.
Sauder,J.M., Arthur,J.W. and Dunbrack,Jr,R.L. (2000) Large-scale
comparison of protein sequence alignment algorithms with
structure alignments. Proteins, 40, 6–22.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Clustal W:
improving the sensitivity of progressive multiple sequence
alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res., 22,
4673–4680.
Westbrook,J., Feng,Z., Jain,S., Bhat,T.N., Thanki,N.,
Ravichandran,V., Gilliland,G.L., Bluhm,W., Weissig,H., Greer,D.S. et al.
(2002) The protein data bank: unifying the archive. Nucleic Acids
Res., 30, 245–248.
Wheeler,D.L., Church,D.M., Lash,A.E., Leipe,D.D., Madden,T.L.,
Pontius,J.U., Schuler,G.D., Schriml,L.M., Tatusova,T.A.,
Wagner,L. et al. (2002) Database resources of the national center
for biotechnology information: 2002 update. Nucleic Acids Res.,
30, 13–16.