BIOINFORMATICS
APPLICATIONS NOTE
Vol. 19 no. 12 2003, pages 1589–1591
DOI: 10.1093/bioinformatics/btg224
PISCES: a protein sequence culling server
Guoli Wang and Roland L. Dunbrack Jr∗
Institute for Cancer Research, Fox Chase Cancer Center, 7701 Burholme Avenue,
Philadelphia, PA 19111, USA
Received on October 4, 2002; revised on January 30, 2003; accepted on March 18, 2003
ABSTRACT
Summary: PISCES is a public server for culling sets of
protein sequences from the Protein Data Bank (PDB) by
sequence identity and structural quality criteria. PISCES
can provide lists culled from the entire PDB or from lists of
PDB entries or chains provided by the user. The sequence
identities are obtained from PSI-BLAST alignments with
position-specific substitution matrices derived from the
non-redundant protein sequence database. PISCES
therefore provides better lists than servers that use
BLAST, which is unable to identify many relationships
below 40% sequence identity and often overestimates
sequence identity by aligning only well-conserved fragments.
PDB sequences are updated weekly. PISCES
can also cull non-PDB sequences provided by the user
as a list of GenBank identifiers, a FASTA format file, or
BLAST/PSI-BLAST output.
Availability: The server is located at http://www.fccc.edu/
research/labs/dunbrack/pisces
Contact: rl
dunbrack@fccc.edu
For many purposes, it is useful to obtain a subset of
sequences from some larger set that are related to one
another by no more than some fixed percentage sequence
identity. For the Protein Data Bank (PDB; Berman et
al., 2000), it is often the case that additional criteria are
desirable, such as resolution or length cutoffs. Several
web sites have provided such lists derived from the
PDB in recent years. The PDB-Select lists have had a
widespread impact on statistical analysis of the PDB by
providing pre-defined lists of chains with fixed maximum
percentage sequence identities (Hobohm et al., 1992;
Hobohm and Sander, 1994). The PDB itself provides
subsets of sequences that fulfill a certain query and
can be culled at 90, 70, or 50% sequence identity. The
PDB-REPRDB server (Noguchi et al., 2001) provides a
number of features, including an ability for the user to set
parameters to generate customized lists. PDB-REPRDB
uses a Needleman–Wunsch global alignment algorithm
(Needleman and Wunsch, 1970), does not provide
sequences, and appears to be updated approximately
monthly. The ASTRAL website provides lists of protein
domain sequences at fixed cutoffs of sequence identity
or E-values derived from BLAST pairwise alignments
(Brenner et al., 2000). For several years, we have provided
a fixed set of lists on our ‘CulledPDB’ website as well as
a server for creating subsets of the entire PDB based on a
user’s input criteria. CulledPDB used BLAST to produce
alignments.
∗To whom correspondence should be addressed.
In this paper, we describe a new server called PISCES
that provides the following functionalities: (1) culling the
entire PDB according to user input criteria; (2) culling
a list of user-provided PDB chain identifiers, according
to user-input criteria; for instance, a user could use the
PDB’s search facility to select all human proteins, and then
submit this list to our server to obtain a subset according to
sequence identity and structural quality criteria; (3) culling
any set of sequences provided by the user in FASTA
format, as a set of GenBank identifiers, or as BLAST/PSI-
BLAST output.
Our goal in culling the PDB is to provide the longest
lists possible of the highest resolution structures that
fulfill the sequence identity and structural quality cutoffs.
We continue to pre-compile sequence identities of all
PDB chains versus all other PDB chains. Sequences
are obtained from the Uniformity Project mmCIF files
provided by RCSB (Bhat et al., 2001; Westbrook et
al., 2002), and are updated weekly. The resolution and
R-value data are also obtained from the Uniformity
Project files, since the RCSB has gone to some effort
to place these values in a standard format in these files.
Some missing values are obtained from the PDBFINDER
database (Hooft et al., 1996).
To provide better estimates of sequence identity at
longer evolutionary distances, we now use PSI-BLAST
(Altschul et al., 1997) to calculate these identities. PSI-BLAST
is used locally to build a position-specific
similarity matrix (PSSM), or profile, from homologous
sequences in NCBI’s non-redundant protein sequence
database (Wheeler et al., 2002) with each unique PDB
sequence as query. Three iterations are performed for each
query, with an E-value cutoff of 0.0001 for inclusion in the
profile. We control for drift in the PSSM by checking to
see whether hits in previous rounds with E-values better
than 0.0001 appear with E-values worse than 0.0001 in
subsequent rounds. If so, we take the last profile not
exhibiting drift. The resulting matrix is used to search the
PDB for sequences related to each query with an E-value
better than 1.0 and alignment length greater than 20, and
the resulting sequence identities and alignment lengths
from the PSI-BLAST output are stored. PISCES updates
sequences in the PDB weekly, and these sequences are
put through the same process and added to the database
of alignment identities.
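The drift check described above can be sketched in a few lines. This is only an illustration, not the server's code: the function name, the representation of each round as a dictionary of E-values, and the decision to treat a hit that vanishes from a later round as drifted are all assumptions made for the example.

```python
from typing import Dict, List


def last_round_without_drift(rounds: List[Dict[str, float]],
                             cutoff: float = 1e-4) -> int:
    """Return the index of the last PSI-BLAST round that shows no drift.

    rounds[i] maps a hit identifier to its E-value in iteration i.
    Drift is flagged when a hit that scored better than `cutoff` in an
    earlier round scores worse than `cutoff` in the current round.
    Assumption: a hit missing from the current round counts as drifted.
    """
    trusted = set()   # hits seen with E-value better than the cutoff
    last_good = 0
    for i, hits in enumerate(rounds):
        drifted = any(hits.get(h, float("inf")) > cutoff for h in trusted)
        if drifted:
            break
        last_good = i
        trusted |= {h for h, e in hits.items() if e < cutoff}
    return last_good


# Hypothetical three-round search: hit "b" degrades in the third round.
rounds = [
    {"a": 1e-10, "b": 1e-6},
    {"a": 1e-12, "b": 1e-5, "c": 1e-7},
    {"a": 1e-12, "b": 0.5, "c": 1e-8},
]
print(last_round_without_drift(rounds))
```

In this sketch the profile from the returned round would be the one used to search the PDB sequences.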
Criteria that apply to all sequences in a PDB entry include
the experiment type (X-ray, NMR, etc.), resolution,
and R-value. Criteria that apply to individual chains in a
PDB entry include sequence length and Cα-only status.
While resolution and R-values as structure quality criteria
have their drawbacks and other criteria have been proposed
(Brenner et al., 2000), most users are much better
acquainted with these traditional measures, which are adequate
for most purposes. These criteria are applied first
either to the entire PDB or to an input list of PDB entries
or chains from the user. In either case, the result is a list
of PDB chains that fit the criteria that can then be culled
according to mutual sequence identity. This always results
in a longer list than applying the sequence identity criteria
first and then applying the single sequence or single entry
criteria to the resulting list.
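A minimal sketch of this pre-filtering step follows. The record layout, the function names, and all default thresholds are hypothetical choices for illustration; in particular, excluding X-ray entries with a missing resolution or R-value is an assumption, not something the text specifies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chain:
    chain_id: str                # entry code plus chain letter, e.g. "1ABCA"
    method: str                  # entry-level: "XRAY", "NMR", ...
    resolution: Optional[float]  # entry-level, in angstroms
    r_value: Optional[float]     # entry-level
    length: int                  # chain-level
    ca_only: bool                # chain-level


def passes_criteria(chain: Chain, max_resolution: float = 3.0,
                    max_r_value: float = 0.3, min_length: int = 40,
                    allow_non_xray: bool = False,
                    allow_ca_only: bool = False) -> bool:
    """Apply chain-level and entry-level criteria to one chain."""
    if chain.length < min_length:
        return False
    if chain.ca_only and not allow_ca_only:
        return False
    if chain.method != "XRAY":
        # Resolution and R-value are not checked for non-X-ray entries.
        return allow_non_xray
    if chain.resolution is None or chain.resolution > max_resolution:
        return False
    if chain.r_value is None or chain.r_value > max_r_value:
        return False
    return True


def prefilter(chains: List[Chain], **criteria) -> List[Chain]:
    """Keep only chains that satisfy the criteria, before identity culling."""
    return [c for c in chains if passes_criteria(c, **criteria)]


chains = [
    Chain("1ABCA", "XRAY", 1.8, 0.19, 150, False),
    Chain("2DEFA", "XRAY", 3.5, 0.25, 200, False),
    Chain("3GHIA", "NMR", None, None, 90, False),
    Chain("4JKLA", "XRAY", 2.0, 0.20, 120, True),
]
print([c.chain_id for c in prefilter(chains)])
```

Filtering first means that a low-quality chain can never displace a related chain that does satisfy the criteria, which is why this order gives the longer list.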
We use the method of Hobohm and Sander (Hobohm et
al., 1992; Hobohm and Sander, 1994) to cull the sequences
that pass the criteria described above by sequence identity.
The list is first sorted according to resolution from best
to worst. Sequences with the same resolution are sorted
according to R-value. Non-X-ray structures, if requested,
follow the list of X-ray structures. The first sequence is
flagged as included in the culled list. Each sequence after
it in the list is flagged excluded if it has a sequence identity
with the first sequence higher than the desired cutoff.
The program then moves to each subsequent sequence in
the list and repeats this procedure. As described above,
the server also now provides a facility for culling non-
PDB sequences. In this case, the sequence identities are
calculated with PSI-BLAST but the PSSMs are created
from the set of input sequences, rather than the entire non-
redundant sequence database.
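The sort-and-flag procedure described above can be sketched as follows. This is an illustrative reading of the text rather than the server's implementation: the tuple layout, the function name, and the convention that a pair with no stored alignment has 0% identity are assumptions.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# (chain_id, resolution, r_value, is_xray); layout chosen for this example
ChainRecord = Tuple[str, Optional[float], Optional[float], bool]


def cull(chains: List[ChainRecord],
         identity: Dict[FrozenSet[str], float],
         cutoff: float) -> List[str]:
    """Greedy culling: best structures first, related later chains excluded.

    `identity` maps an unordered pair of chain identifiers to a percentage
    sequence identity. Pairs absent from the mapping are treated as
    unrelated (assumption).
    """
    inf = float("inf")
    ordered = sorted(
        chains,
        key=lambda c: (not c[3],                        # X-ray entries first
                       c[1] if c[1] is not None else inf,  # best resolution
                       c[2] if c[2] is not None else inf),  # then R-value
    )
    excluded = set()
    kept = []
    for i, (cid, _, _, _) in enumerate(ordered):
        if cid in excluded:
            continue
        kept.append(cid)
        # Flag every later chain that is too similar to the one just kept.
        for later_id, _, _, _ in ordered[i + 1:]:
            if identity.get(frozenset((cid, later_id)), 0.0) > cutoff:
                excluded.add(later_id)
    return kept


# Hypothetical input: four chains and two stored pairwise identities.
chains = [
    ("1ABCA", 1.5, 0.18, True),
    ("2XYZA", 2.0, 0.20, True),
    ("3DEFA", 1.5, 0.20, True),
    ("4NMRA", None, None, False),
]
identity = {
    frozenset(("1ABCA", "2XYZA")): 95.0,
    frozenset(("3DEFA", "4NMRA")): 40.0,
}
print(cull(chains, identity, cutoff=25.0))
```

Because the list is sorted before flagging, the chain retained from any group of related sequences is always the one with the best resolution (and, on ties, the best R-value).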
The website provides four options for users according to
the most common requests. The first option is normal PDB
sequence culling; users can specify their own parameters,
such as sequence identity, resolution, and R-value, to get a
list of sequences in current PDB files. The second option
provides an input form for a user’s list of PDB entries
or chains. The third option provides an input form for a
list of GenBank accession numbers. These numbers can
include other information on each line, as long as the first
element on the line is the accession code. For instance, a
user can cut and paste the list of hits from BLAST or PSI-
BLAST output that includes protein names and E-values.
The server will use the GenBank accession numbers to
retrieve the sequences from the non-redundant protein
sequence database with the NCBI program fastacmd. The
fourth option allows the user to input protein sequences
in FASTA format or as BLAST or PSI-BLAST output. In
the case of BLAST/PSI-BLAST output, PISCES will take
the sequences from the ‘Sbjct’ lines as the set to be culled.
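The two input conventions described above, taking the first element of each line as the accession code and taking sequences from the ‘Sbjct’ lines, might be handled along these lines. This is a sketch only: the function names are hypothetical, and it assumes a BLAST text layout in which each hit begins with a ‘>’ line and the aligned residues are the third field of a ‘Sbjct’ line. How the server treats gaps or multiple alignments to the same hit is not stated in the text; here gaps are dropped and segments are concatenated.

```python
from typing import Dict, List


def accession_codes(pasted: str) -> List[str]:
    """Return the first whitespace-delimited element of each non-blank line."""
    codes = []
    for line in pasted.splitlines():
        fields = line.split()
        if fields:
            codes.append(fields[0])
    return codes


def sbjct_sequences(blast_output: str) -> Dict[str, str]:
    """Map each hit identifier to the residues found on its 'Sbjct' lines."""
    sequences: Dict[str, str] = {}
    current = None
    for line in blast_output.splitlines():
        if line.startswith(">"):
            # A new hit: use the first token after '>' as its identifier.
            parts = line[1:].split()
            current = parts[0] if parts else None
            if current is not None:
                sequences.setdefault(current, "")
        elif line.startswith("Sbjct") and current is not None:
            fields = line.split()
            if len(fields) >= 3:
                # Third field holds the aligned residues; drop gap characters.
                sequences[current] += fields[2].replace("-", "")
    return sequences


# Made-up input in the assumed layouts.
hits = "gi|123|ref|NP_000001.1| kinase  200  1e-50\n\nP12345 some protein 3e-10\n"
blast = (
    ">gi|1 protein one\n"
    "Query: 1 MKVLA 5\n"
    "Sbjct: 3 MKV-A 6\n"
    "\n"
    "Query: 6 GGH 8\n"
    "Sbjct: 7 GGH 9\n"
    ">gi|2 protein two\n"
    "Sbjct: 1 AAA 3\n"
)
print(accession_codes(hits))
print(sbjct_sequences(blast))
```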
Once the user’s input is completed, the server performs
all calculations and sends the user an E-mail that includes
links to the following files that the user can then download:
• the input list of chain identifiers (either PDB or
GenBank or user-defined), if provided by the user; for
PDB chains structure quality data are included;
• the output list of chain identifiers; the user criteria are
given in the file title;
• the output sequences in FASTA format.
Finally, we discuss the effect of choices made in determining
sequence identities for the culling procedure.
There are two components to this determination: the program
used to make the sequence alignments and the normalization
procedure used for the sequence identity calculation.
In choosing a program for sequence alignment,
we believe that a local alignment program such as BLAST
or PSI-BLAST is a better choice than a global alignment
program such as Clustal W. The reason is that a pair of proteins
may share a homologous domain but may each contain
other unrelated domains. A global alignment program
will attempt to align the complete sequences and therefore
provide very low sequence identities, even though the
shared domain may be highly homologous. We prefer to
eliminate a sequence based on an accurate sequence identity
of the shared domain. In Table 1 we show the number
of sequences in lists culled at 20–90% sequence identity
using Clustal W (Thompson et al., 1994), BLAST, and
PSI-BLAST. At most levels of sequence identity the lists
provided by PDB-REPRDB using Clustal W are longer
than the BLAST or PSI-BLAST lists, because sequence identities
are underestimated by the global alignments. As an example,
chain 1D4XG (126 amino acids long) shares 98%
sequence identity over 124 amino acids with a domain of
1D0NA (729 amino acids long). PDB-REPRDB puts both
sequences in a 90% cull list. PISCES excludes 1D4XG, as
it should. For reasons that are not clear, PDB-REPRDB
provides very short lists at sequence identity below 30%.
Secondly, we would like to have alignments of all homologous
pairs in the PDB above 20% sequence identity.
Structure alignments with some similarity criterion would
achieve this. However, PSI-BLAST, which is much faster
than structural alignment, is able to identify most such
relationships with reasonable accuracy and completeness
(Sauder et al., 2000). In contrast, BLAST is often unable
to identify many relationships below 40% sequence identity.
When this occurs, a culled list will contain sequences
that should have been eliminated if the sequence relationships
had been identified.
Table 1. Lengths of lists obtained using different sequence alignment
methods
% Ident.   BLAST   REPRDB   PSI-BLAST
20%        1533    74       1973
25%        1699    1259     2351
30%        2032    2711     2660
40%        2848    3292     3178
50%        3451    3698     3587
60%        3832    4057     3901
70%        4149    4387     4193
80%        4406    4700     4443
90%        4791    5350     4820
Criteria for inclusion in the lists: resolution 3.0 Å; including Cα chains;
excluding non-X-ray entries.
Even when BLAST does provide an alignment for a sequence
pair, it may only align a short, well-conserved fragment.
The resulting sequence identity, however, depends on
the normalization procedure used. Previously our CulledPDB
server performed alignments with BLAST and used
the sequence identity as provided by the program, which
is calculated by dividing the number of identities in the
alignment by the length of the alignment, including gaps.
For a short, highly conserved fragment, this sequence identity
is therefore overestimated, resulting in removing too
many sequences from the culled lists. As shown in Table 1,
the PSI-BLAST lists at low sequence identity are
much longer than the BLAST lists, while at high sequence
identity they are very nearly the same size, as expected.
The ASTRAL site uses BLAST but normalizes by the
average of the lengths of the full sequences aligned (i.e.
not just the aligned segments). When BLAST aligns only
a fragment, this results in sequence identities that are
significantly underestimated. Given that BLAST may
also fail to align many pairs at sequence identity below
40%, it is likely that the ASTRAL lists include many
sequences that under other protocols would be eliminated.
To test this, we used PISCES to recull their 20% list
at 20% sequence identity according to our PSI-BLAST
alignments. A list of 1914 valid input sequences (only
complete chains were used) from ASTRAL resulted in
1617 sequences output from PISCES. While many of the
rejections of sequences were from marginal E-values and
sequence identities, a little over half had sequence identities
above 25% and one third had E-values better than
1.0e-05.
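The effect of the two normalizations discussed above can be made concrete with a small calculation. The sketch below reuses the lengths quoted earlier for chains 1D4XG and 1D0NA purely as convenient numbers; the count of 122 identical positions is an assumption consistent with "98% over 124 amino acids", and the function names are hypothetical. It is not a claim about what any particular server reports for this pair.

```python
def identity_by_alignment_length(identities: int, alignment_length: int) -> float:
    """Percent identity normalized by the aligned region, gaps included."""
    return 100.0 * identities / alignment_length


def identity_by_average_length(identities: int, len_a: int, len_b: int) -> float:
    """Percent identity normalized by the mean of the full sequence lengths."""
    return 100.0 * identities / ((len_a + len_b) / 2.0)


# Assumed: 122 identical positions in a 124-residue local alignment
# between a 126-residue chain and a 729-residue chain.
local_pct = identity_by_alignment_length(122, 124)
average_pct = identity_by_average_length(122, 126, 729)
print(f"normalized by alignment length: {local_pct:.1f}%")
print(f"normalized by average length:   {average_pct:.1f}%")
```

With these numbers, the same alignment falls above a 90% cutoff under the first normalization and below a 30% cutoff under the second, which illustrates how strongly the choice of denominator affects which sequences are culled.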
ACKNOWLEDGEMENTS
This work was supported by NIH grants R01-HG02302
and CA06972.
REFERENCES
Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z.,
Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST:
a new generation of database programs. Nucleic Acids
Res., 25, 3389–3402.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N.,
Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The protein
data bank. Nucleic Acids Res., 28, 235–242.
Bhat,T.N., Bourne,P., Feng,Z., Gilliland,G., Jain,S., Ravichandran,V.,
Schneider,B., Schneider,K., Thanki,N., Weissig,H.,
Westbrook,J. and Berman,H.M. (2001) The PDB data uniformity
project. Nucleic Acids Res., 29, 214–218.
Brenner,S.E., Koehl,P. and Levitt,M. (2000) The astral compendium
for protein structure and sequence analysis. Nucleic Acids Res.,
28, 254–256.
Hobohm,U. and Sander,C. (1994) Enlarged representative set of
protein structures. Protein Sci., 3, 522–524.
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection
of representative protein data sets. Protein Sci., 1, 409–417.
Hooft,R.W., Vriend,G., Sander,C. and Abola,E.E. (1996) Errors in
protein structures. Nature, 381, 272.
Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable
to the search for similarities in the amino acid sequences of
two proteins. J. Mol. Biol., 48, 443–453.
Noguchi,T., Matsuda,H. and Akiyama,Y. (2001) PDB-REPRDB: a
database of representative protein chains from the protein data
bank (PDB). Nucleic Acids Res., 29, 219–220.
Sauder,J.M., Arthur,J.W. and Dunbrack,Jr,R.L. (2000) Large-scale
comparison of protein sequence alignment algorithms with
structure alignments. Proteins, 40, 6–22.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Clustal W:
improving the sensitivity of progressive multiple sequence
alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res., 22,
4673–4680.
Westbrook,J., Feng,Z., Jain,S., Bhat,T.N., Thanki,N., Ravichandran,V.,
Gilliland,G.L., Bluhm,W., Weissig,H., Greer,D.S. et al.
(2002) The protein data bank: unifying the archive. Nucleic Acids
Res., 30, 245–248.
Wheeler,D.L., Church,D.M., Lash,A.E., Leipe,D.D., Madden,T.L.,
Pontius,J.U., Schuler,G.D., Schriml,L.M., Tatusova,T.A., Wagner,L.
et al. (2002) Database resources of the national center
for biotechnology information: 2002 update. Nucleic Acids Res.,
30, 13–16.