
PISCES: a protein sequence culling server

September 2003
Bioinformatics 19(12):1589-91


DOI:10.1093/bioinformatics/btg224


Authors:

Guoli Wang

Roland Dunbrack

Fox Chase Cancer Center



BIOINFORMATICS APPLICATIONS NOTE

Vol. 19 no. 12 2003, pages 1589–1591

DOI: 10.1093/bioinformatics/btg224

PISCES: a protein sequence culling server

Guoli Wang and Roland L. Dunbrack Jr∗

Institute for Cancer Research, Fox Chase Cancer Center, 7701 Burholme Avenue, Philadelphia, PA 19111, USA

Received on October 4, 2002; revised on January 30, 2003; accepted on March 18, 2003

ABSTRACT

Summary: PISCES is a public server for culling sets of protein sequences from the Protein Data Bank (PDB) by sequence identity and structural quality criteria. PISCES can provide lists culled from the entire PDB or from lists of PDB entries or chains provided by the user. The sequence identities are obtained from PSI-BLAST alignments with position-specific substitution matrices derived from the non-redundant protein sequence database. PISCES therefore provides better lists than servers that use BLAST, which is unable to identify many relationships below 40% sequence identity and often overestimates sequence identity by aligning only well-conserved fragments. PDB sequences are updated weekly. PISCES can also cull non-PDB sequences provided by the user as a list of GenBank identifiers, a FASTA format file, or BLAST/PSI-BLAST output.

Availability: The server is located at http://www.fccc.edu/research/labs/dunbrack/pisces

Contact: rl_dunbrack@fccc.edu

For many purposes, it is useful to obtain a subset of sequences from some larger set that are related to one another by no more than some fixed percentage sequence identity. For the Protein Data Bank (PDB; Berman et al., 2000), it is often the case that additional criteria are desirable, such as resolution or length cutoffs. Several web sites have provided such lists derived from the PDB in recent years. The PDB-Select lists have had a widespread impact on statistical analysis of the PDB by providing pre-defined lists of chains with fixed maximum percentage sequence identities (Hobohm et al., 1992; Hobohm and Sander, 1994). The PDB itself provides subsets of sequences that fulfill a certain query and can be culled at 90, 70, or 50% sequence identity. The PDB-REPRDB server (Noguchi et al., 2001) provides a number of features, including an ability for the user to set parameters to generate customized lists. PDB-REPRDB uses a Needleman–Wunsch global alignment algorithm (Needleman and Wunsch, 1970), does not provide sequences, and appears to be updated approximately monthly. The ASTRAL website provides lists of protein domain sequences at fixed cutoffs of sequence identity or E-values derived from BLAST pairwise alignments (Brenner et al., 2000). For several years, we have provided a fixed set of lists on our ‘CulledPDB’ website as well as a server for creating subsets of the entire PDB based on a user’s input criteria. CulledPDB used BLAST to produce alignments.

∗To whom correspondence should be addressed.

In this paper, we describe a new server called PISCES that provides the following functionalities: (1) culling the entire PDB according to user input criteria; (2) culling a list of user-provided PDB chain identifiers according to user-input criteria; for instance, a user could use the PDB’s search facility to select all human proteins, and then submit this list to our server to obtain a subset according to sequence identity and structural quality criteria; (3) culling any set of sequences provided by the user in FASTA format, as a set of GenBank identifiers, or as BLAST/PSI-BLAST output.

Our goal in culling the PDB is to provide the longest lists possible of the highest resolution structures that fulfill the sequence identity and structural quality cutoffs. We continue to pre-compile sequence identities of all PDB chains versus all other PDB chains. Sequences are obtained from the Uniformity Project mmCIF files provided by RCSB (Bhat et al., 2001; Westbrook et al., 2002), and are updated weekly. The resolution and R-value data are also obtained from the Uniformity Project files, since the RCSB has gone to some effort to place these values in a standard format in these files. Some missing values are obtained from the PDBFINDER database (Hooft et al., 1996).

To provide better estimates of sequence identity at longer evolutionary distances, we now use PSI-BLAST (Altschul et al., 1997) to calculate these identities. PSI-BLAST is used locally to build a position-specific similarity matrix (PSSM) or profile from homologous sequences in NCBI’s non-redundant protein sequence database (Wheeler et al., 2002) with each unique PDB sequence as query. Three iterations are performed for each query, with an E-value cutoff of 0.0001 for inclusion in the profile. We control for drift in the PSSM by checking to see whether hits in previous rounds with E-values better than 0.0001 appear with E-values worse than 0.0001 in subsequent rounds. If so, we take the last profile not exhibiting drift. The resulting matrix is used to search the PDB for sequences related to each query with an E-value better than 1.0 and alignment length greater than 20, and the resulting sequence identities and alignment lengths from the PSI-BLAST output are stored. PISCES updates sequences in the PDB weekly, and these sequences are put through the same process and added to the database of alignment identities.
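The drift check and the final hit filter described above can be sketched as follows. This is a minimal illustration, not the PISCES implementation: the function names, the per-round dictionary layout, and the convention of returning a round number (rather than a profile) are all assumptions made for the example. Only the three numeric cutoffs come from the text.

```python
from typing import Dict, List

INCLUSION_EVALUE = 1e-4    # profile inclusion cutoff quoted in the text
SEARCH_EVALUE = 1.0        # cutoff for the final search against PDB sequences
MIN_ALIGNMENT_LENGTH = 20  # alignments must be longer than this


def last_round_without_drift(rounds: List[Dict[str, float]],
                             cutoff: float = INCLUSION_EVALUE) -> int:
    """Return the 1-based number of the last round that shows no drift.

    rounds[i] maps a hit identifier to the E-value reported in round i+1
    (hypothetical layout). A round drifts if a hit that was better than
    `cutoff` in an earlier round reappears with an E-value worse than `cutoff`.
    """
    trusted = set()  # hits that beat the cutoff in some earlier round
    last_good = 0
    for number, hits in enumerate(rounds, start=1):
        # Hits that are simply absent from this round do not count as drift.
        if any(hits.get(hit, 0.0) > cutoff for hit in trusted):
            break
        last_good = number
        trusted.update(h for h, e in hits.items() if e < cutoff)
    return last_good


def keep_pdb_hit(evalue: float, alignment_length: int) -> bool:
    """Apply the storage criteria: E-value better than 1.0, length above 20."""
    return evalue < SEARCH_EVALUE and alignment_length > MIN_ALIGNMENT_LENGTH


# Hypothetical hits: "b" is significant in rounds 1 and 2 but not in round 3.
example_rounds = [
    {"a": 1e-10, "b": 1e-6},
    {"a": 1e-12, "b": 1e-5, "c": 1e-7},
    {"a": 1e-20, "b": 0.5, "c": 1e-9},
]
print(last_round_without_drift(example_rounds))  # → 2
```

How a round number maps onto "the last profile not exhibiting drift" depends on how the profiles are checkpointed, which the text does not specify.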

Criteria that apply to all sequences in a PDB entry include the experiment type (X-ray, NMR, etc.), resolution, and R-value. Criteria that apply to individual chains in a PDB entry include sequence length and Cα-only status. While resolution and R-values as structure quality criteria have their drawbacks and other criteria have been proposed (Brenner et al., 2000), most users are much better acquainted with these traditional measures, which are adequate for most purposes. These criteria are applied first either to the entire PDB or to an input list of PDB entries or chains from the user. In either case, the result is a list of PDB chains that fit the criteria that can then be culled according to mutual sequence identity. This always results in a longer list than applying the sequence identity criteria first and then applying the single sequence or single entry criteria to the resulting list.
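The split between entry-level and chain-level criteria can be illustrated with a small filter. This is a sketch under stated assumptions: the `Chain` record, its field names, the chain identifiers, and the choice to skip resolution and R-value checks for non-X-ray entries are all hypothetical and are not taken from the server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chain:
    chain_id: str                # hypothetical identifier, e.g. "1ABCA"
    method: str                  # entry-level: "X-ray", "NMR", ...
    resolution: Optional[float]  # entry-level, in angstroms
    r_value: Optional[float]     # entry-level
    length: int                  # chain-level: number of residues
    ca_only: bool                # chain-level: only C-alpha atoms present


def passes_criteria(chain: Chain, *, max_resolution: float, max_r_value: float,
                    min_length: int, allow_non_xray: bool = False,
                    allow_ca_only: bool = False) -> bool:
    """Return True if the chain satisfies every single-chain/entry criterion."""
    # Chain-level criteria.
    if chain.length < min_length:
        return False
    if chain.ca_only and not allow_ca_only:
        return False
    # Entry-level criteria. Assumption: non-X-ray entries carry no usable
    # resolution or R-value, so they pass only if explicitly allowed.
    if chain.method != "X-ray":
        return allow_non_xray
    if chain.resolution is None or chain.resolution > max_resolution:
        return False
    if chain.r_value is None or chain.r_value > max_r_value:
        return False
    return True


candidates = [
    Chain("1ABCA", "X-ray", 1.8, 0.19, 120, False),
    Chain("2DEFB", "X-ray", 2.5, 0.22, 300, False),
    Chain("3GHIA", "NMR", None, None, 90, False),
]
kept = [c.chain_id for c in candidates
        if passes_criteria(c, max_resolution=2.0, max_r_value=0.25, min_length=40)]
print(kept)  # → ['1ABCA']
```

Running this filter before the identity cull means a high-resolution chain that fails a length or Cα-only test cannot displace an acceptable homolog from the final list.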

We use the method of Hobohm and Sander (Hobohm et al., 1992; Hobohm and Sander, 1994) to cull the sequences that pass the criteria described above by sequence identity. The list is first sorted according to resolution from best to worst. Sequences with the same resolution are sorted according to R-value. Non-X-ray structures, if requested, follow the list of X-ray structures. The first sequence is flagged as included in the culled list. Each sequence after it in the list is flagged excluded if it has a sequence identity with the first sequence higher than the desired cutoff. The program then moves to each subsequent sequence in the list and repeats this procedure. As described above, the server also now provides a facility for culling non-PDB sequences. In this case, the sequence identities are calculated with PSI-BLAST but the PSSMs are created from the set of input sequences, rather than the entire non-redundant sequence database.
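The sort-then-exclude procedure can be sketched in a few lines. This is an illustrative reading of the description, not the server's code: the `Entry` record, the identity lookup keyed by unordered chain pairs, and the rules that already-excluded sequences are skipped and that pairs without a stored alignment count as unrelated are assumptions.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Optional


class Entry(NamedTuple):
    chain_id: str
    is_xray: bool
    resolution: Optional[float]  # angstroms; None for non-X-ray entries
    r_value: Optional[float]


def cull(chains: List[Entry],
         identity: Dict[FrozenSet[str], float],
         cutoff: float) -> List[str]:
    """Greedy cull: keep the best-ranked chain, drop later chains above cutoff."""
    def rank(c: Entry):
        # X-ray first, then best (lowest) resolution, then lowest R-value.
        return (not c.is_xray,
                c.resolution if c.resolution is not None else float("inf"),
                c.r_value if c.r_value is not None else float("inf"))

    ordered = sorted(chains, key=rank)
    excluded = set()
    kept = []
    for i, current in enumerate(ordered):
        if current.chain_id in excluded:
            continue  # assumption: excluded chains never exclude others
        kept.append(current.chain_id)
        for later in ordered[i + 1:]:
            pair = frozenset((current.chain_id, later.chain_id))
            # Pairs with no stored alignment are treated as unrelated.
            if identity.get(pair, 0.0) > cutoff:
                excluded.add(later.chain_id)
    return kept


# Hypothetical chains and pairwise percent identities.
chains = [
    Entry("C", True, 2.0, 0.20),
    Entry("D", False, None, None),
    Entry("B", True, 1.2, 0.20),
    Entry("A", True, 1.2, 0.18),
]
identity = {
    frozenset(("A", "B")): 95.0,
    frozenset(("B", "C")): 30.0,
    frozenset(("A", "D")): 50.0,
}
print(cull(chains, identity, cutoff=40.0))  # → ['A', 'C']
```

With a 40% cutoff, B and D are both removed by A, so the 30% relationship between B and C never comes into play.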

The website provides four options for users according to the most common requests. The first option is normal PDB sequence culling; users can specify their own parameters, such as sequence identity, resolution, and R-value, to get a list of sequences in current PDB files. The second option provides an input form for a user’s list of PDB entries or chains. The third option provides an input form for a list of GenBank accession numbers. These numbers can include other information on each line, as long as the first element on the line is the accession code. For instance, a user can cut and paste the list of hits from BLAST or PSI-BLAST output that includes protein names and E-values. The server will use the GenBank accession numbers to retrieve the sequences from the non-redundant protein sequence database with the NCBI program fastacmd. The fourth option allows the user to input protein sequences in FASTA format or as BLAST or PSI-BLAST output. In the case of BLAST/PSI-BLAST output, PISCES will take the sequences from the ‘Sbjct’ lines as the set to be culled.
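The two input conventions described above (first token per line as the accession code, and sequences taken from ‘Sbjct’ lines) can be sketched as follows. The parsing rules here are assumptions for illustration: the regular expression targets the classic plain-text pairwise layout, the sample identifiers are invented, and multiple alignments to the same hit are simply concatenated, which a real parser would need to handle more carefully.

```python
import re
from typing import Dict, List

# Matches lines such as "Sbjct: 5  MKV-ASGT 11" or "Sbjct  9  AATA 12".
SBJCT_LINE = re.compile(r"^Sbjct:?\s+\d+\s+([A-Za-z*\-]+)\s+\d+\s*$")


def accession_codes(text: str) -> List[str]:
    """Return the first whitespace-delimited token of every non-blank line."""
    return [line.split()[0] for line in text.splitlines() if line.strip()]


def sbjct_sequences(blast_text: str) -> Dict[str, str]:
    """Collect ungapped subject sequences, keyed by the hit's defline token."""
    sequences: Dict[str, str] = {}
    current = None
    for line in blast_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else None
            if current is not None:
                sequences.setdefault(current, "")
        elif current is not None:
            match = SBJCT_LINE.match(line)
            if match:
                # Drop gap characters so only residues remain.
                sequences[current] += match.group(1).replace("-", "")
    return sequences


hit_list = "P12345 some protein name 1e-20\n\nQ9XYZ1\n"
blast_output = "\n".join([
    ">gi|123|ref|XP_1| hypothetical protein",
    " Score = 50 bits",
    "Query: 1  MKVLA-GT 7",
    "Sbjct: 5  MKV-ASGT 11",
    ">gi|456| other protein",
    "Query  1  AAAA 4",
    "Sbjct  9  AATA 12",
])
print(accession_codes(hit_list))
print(sbjct_sequences(blast_output))
```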

Once the user’s input is completed, the server performs all calculations and sends the user an E-mail that includes links to the following files that the user can then download:

• the input list of chain identifiers (either PDB or GenBank or user-defined), if provided by the user; for PDB chains structure quality data are included;

• the output list of chain identifiers; the user criteria are given in the file title;

• the output sequences in FASTA format.

Finally, we discuss the effect of choices made in determining sequence identities for the culling procedure. There are two components to this determination: the program used to make the sequence alignments and the normalization procedure used for the sequence identity calculation. In choosing a program for sequence alignment, we believe that a local alignment program such as BLAST or PSI-BLAST is a better choice than a global alignment program such as Clustal W. The reason is that a pair of proteins may share a homologous domain but may each contain other unrelated domains. A global alignment program will attempt to align the complete sequences and therefore provide very low sequence identities, even though the shared domain may be highly homologous. We prefer to eliminate a sequence based on an accurate sequence identity of the shared domain. In Table 1 we show the number of sequences in lists culled at 20–90% sequence identity using Clustal W (Thompson et al., 1994), BLAST, and PSI-BLAST. At most levels of sequence identity the lists provided by PDB-REPRDB using Clustal W are longer than the BLAST or PSI-BLAST lists, because sequence identities are underestimated by the global alignments. As an example, chain 1D4XG (126 amino acids long) shares 98% sequence identity over 124 amino acids with a domain of 1D0NA (729 amino acids long). PDB-REPRDB puts both sequences in a 90% cull list. PISCES excludes 1D4XG as it should. For reasons that are not clear, PDB-REPRDB provides very short lists at sequence identity below 30%.
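A rough calculation with the 1D4XG/1D0NA example shows why the choice of alignment matters. This is a sketch under assumptions: the count of 122 identical positions is inferred from "98% over 124 amino acids", and the global figure assumes that identity is divided by the number of alignment columns including gaps, which may not match how any particular global alignment program reports identity.

```python
def percent_identity(identities: int, denominator: int) -> float:
    """Identity as a percentage of the chosen denominator."""
    return 100.0 * identities / denominator


short_len, long_len = 126, 729  # chain lengths quoted in the text
aligned_len = 124               # local alignment length quoted in the text
identities = 122                # assumed: roughly 98% of 124 positions

# Local view: identity over the aligned domain only.
local_identity = percent_identity(identities, aligned_len)

# Global view: aligning the full chains needs at least long_len columns and
# can contain at most short_len identical positions, so this is an upper bound.
global_upper_bound = percent_identity(short_len, long_len)

print(round(local_identity, 1), round(global_upper_bound, 1))  # → 98.4 17.3
```

Under that assumption the pair could never exceed about 17% identity in a global alignment, far below a 90% cutoff, even though the shared domain is nearly identical.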

Secondly, we would like to have alignments of all homologous pairs in the PDB above 20% sequence identity. Structure alignments with some similarity criterion would achieve this. However, PSI-BLAST, which is much faster than structural alignment, is able to identify most such relationships with reasonable accuracy and completeness (Sauder et al., 2000). In contrast, BLAST is often unable to identify many relationships below 40% sequence identity. When this occurs a culled list will contain sequences that should have been eliminated, if the sequence relationships had been identified.

Table 1. Lengths of lists obtained using different sequence alignment methods

% Ident.    BLAST    REPRDB    PSI-BLAST
20%         1533     74        1973
25%         1699     1259      2351
30%         2032     2711      2660
40%         2848     3292      3178
50%         3451     3698      3587
60%         3832     4057      3901
70%         4149     4387      4193
80%         4406     4700      4443
90%         4791     5350      4820

Criteria for inclusion in the lists: resolution cutoff of 3.0 Å; including Cα chains; excluding non-X-ray entries.
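The counts in Table 1 can be compared directly. The short sketch below tabulates how many more chains the PSI-BLAST list contains than the BLAST list at each cutoff; the dictionary layout and the function name are illustrative choices, while the numbers are copied from the table.

```python
from typing import Dict, Tuple

# cutoff (%) -> (BLAST, REPRDB, PSI-BLAST) list lengths from Table 1
TABLE_1: Dict[int, Tuple[int, int, int]] = {
    20: (1533, 74, 1973),
    25: (1699, 1259, 2351),
    30: (2032, 2711, 2660),
    40: (2848, 3292, 3178),
    50: (3451, 3698, 3587),
    60: (3832, 4057, 3901),
    70: (4149, 4387, 4193),
    80: (4406, 4700, 4443),
    90: (4791, 5350, 4820),
}


def psi_blast_gain(table: Dict[int, Tuple[int, int, int]]) -> Dict[int, int]:
    """PSI-BLAST list length minus BLAST list length at each cutoff."""
    return {cutoff: psi - blast
            for cutoff, (blast, _reprdb, psi) in table.items()}


gain = psi_blast_gain(TABLE_1)
for cutoff in sorted(gain):
    print(f"{cutoff}%: +{gain[cutoff]}")
```

The difference is several hundred chains at the 20–30% cutoffs and shrinks to a few dozen at 70–90%.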

Even when BLAST does provide an alignment for a sequence pair, it may only align a short, well conserved fragment. The resulting sequence identity, however, depends on the normalization procedure used. Previously our CulledPDB server performed alignments with BLAST and used the sequence identity as provided by the program, which is calculated by dividing the number of identities in the alignment by the length of the alignment, including gaps. For a short highly conserved fragment, this sequence identity is therefore overestimated, resulting in removing too many sequences from the culled lists. As shown in Table 1, the PSI-BLAST lists at low sequence identity are much longer than the BLAST lists, while at high sequence identity they are very nearly the same size, as expected.
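The normalization by alignment length can be made concrete with a small function that counts identical columns in a pair of aligned strings. This is an illustrative sketch: the function name and the toy alignment are invented, and "-" is assumed to be the gap character.

```python
from typing import Tuple


def identity_over_alignment(query_aln: str, sbjct_aln: str) -> Tuple[int, int, float]:
    """Return (identities, columns, percent identity) for two aligned strings.

    The denominator is the full alignment length, gap columns included.
    """
    if len(query_aln) != len(sbjct_aln):
        raise ValueError("aligned strings must have the same length")
    if not query_aln:
        raise ValueError("alignment is empty")
    columns = len(query_aln)
    identities = sum(1 for q, s in zip(query_aln, sbjct_aln)
                     if q == s and q != "-")
    return identities, columns, 100.0 * identities / columns


# Toy 8-column fragment with two gap columns.
print(identity_over_alignment("MKVLA-GT", "MKV-ASGT"))  # → (6, 8, 75.0)
```

Because only the aligned columns enter the denominator, a short conserved fragment can yield a high percentage regardless of how long or how divergent the rest of the two chains is.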

The ASTRAL site uses BLAST but normalizes by the average of the lengths of the full sequences aligned (i.e. not just the aligned segments). When BLAST aligns only a fragment, this results in sequence identities that are significantly underestimated. Given that BLAST may also fail to align many pairs at sequence identity below 40%, it is likely that the ASTRAL lists include many sequences that under other protocols would be eliminated. To test this, we used PISCES to recull their 20% list at 20% sequence identity according to our PSI-BLAST alignments. A list of 1914 valid input sequences (only complete chains were used) from ASTRAL resulted in 1617 sequences output from PISCES. While many of the rejections of sequences were from marginal E-values and sequence identities, a little over half had sequence identities above 25% and one third had E-values better than 1.0e-05.
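Normalizing by the average full-sequence length, as described for ASTRAL above, can be sketched as follows. The function name and the fragment in the example (24 identities between chains of 200 and 300 residues) are hypothetical; only the recull counts of 1914 and 1617 come from the text.

```python
def identity_over_mean_length(identities: int, len_a: int, len_b: int) -> float:
    """Percent identity normalized by the average of the two full lengths."""
    mean_length = (len_a + len_b) / 2.0
    if mean_length <= 0:
        raise ValueError("sequence lengths must be positive")
    return 100.0 * identities / mean_length


# Hypothetical fragment: 24 identical positions, chains of 200 and 300 residues.
fragment_identity = identity_over_mean_length(24, 200, 300)
print(round(fragment_identity, 1))  # → 9.6

# Recull figures quoted in the text: 1914 input chains, 1617 retained.
removed = 1914 - 1617
print(removed, round(100.0 * removed / 1914, 1))  # → 297 15.5
```

The same 24 identities would score far higher if divided by the length of the aligned fragment alone, which is why the two normalizations err in opposite directions when only a fragment is aligned.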

ACKNOWLEDGEMENTS

This work was supported by NIH grants R01-HG02302 and CA06972.

REFERENCES

Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of database programs. Nucleic Acids Res., 25, 3389–3402.

Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.

Bhat,T.N., Bourne,P., Feng,Z., Gilliland,G., Jain,S., Ravichandran,V., Schneider,B., Schneider,K., Thanki,N., Weissig,H., Westbrook,J. and Berman,H.M. (2001) The PDB data uniformity project. Nucleic Acids Res., 29, 214–218.

Brenner,S.E., Koehl,P. and Levitt,M. (2000) The astral compendium for protein structure and sequence analysis. Nucleic Acids Res., 28, 254–256.

Hobohm,U. and Sander,C. (1994) Enlarged representative set of protein structures. Protein Sci., 3, 522–524.

Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection of representative protein data sets. Protein Sci., 1, 409–417.

Hooft,R.W., Vriend,G., Sander,C. and Abola,E.E. (1996) Errors in protein structures. Nature, 381, 272.

Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol., 48, 443–453.

Noguchi,T., Matsuda,H. and Akiyama,Y. (2001) PDB-REPRDB: a database of representative protein chains from the protein data bank (PDB). Nucleic Acids Res., 29, 219–220.

Sauder,J.M., Arthur,J.W. and Dunbrack,Jr,R.L. (2000) Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins, 40, 6–22.

Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.

Westbrook,J., Feng,Z., Jain,S., Bhat,T.N., Thanki,N., Ravichandran,V., Gilliland,G.L., Bluhm,W., Weissig,H., Greer,D.S. et al. (2002) The protein data bank: unifying the archive. Nucleic Acids Res., 30, 245–248.

Wheeler,D.L., Church,D.M., Lash,A.E., Leipe,D.D., Madden,T.L., Pontius,J.U., Schuler,G.D., Schriml,L.M., Tatusova,T.A., Wagner,L. et al. (2002) Database resources of the national center for biotechnology information: 2002 update. Nucleic Acids Res., 30, 13–16.
