
On the Use of Knowledge-Based Potentials for the Evaluation of Models of Protein-Protein, Protein-DNA, and Protein-RNA Interactions

Provided for non-commercial research and educational use only.
Not for reproduction, distribution or commercial use.
This chapter was originally published in the book Advances in Protein Chemistry and
Structural Biology, Vol. 94 published by Elsevier, and the attached copy is provided
by Elsevier for the author's benefit and for the benefit of the author's institution, for
non-commercial research and educational use including without limitation use in
instruction at your institution, sending it to specific colleagues who know you, and
providing a copy to your institution’s administrator.
All other uses, reproduction and distribution, including without limitation commercial
reprints, selling or licensing copies or access, or posting on open internet sites, your
personal or institution’s website or repository, are prohibited. For exceptions,
permission may be sought for such use through Elsevier's permissions site at:
http://www.elsevier.com/locate/permissionusematerial
From Oriol Fornes, Javier Garcia-Garcia, Jaume Bonet, Baldo Oliva, On the Use of
Knowledge-Based Potentials for the Evaluation of Models of Protein–Protein,
Protein–DNA, and Protein–RNA Interactions. In Rossen Donev, editor: Advances in
Protein Chemistry and Structural Biology, Vol. 94, Burlington: Academic Press,
2014, pp. 77-120. ISBN: 978-0-12-800168-4 © Copyright 2014 Elsevier Inc.
Academic Press
CHAPTER FOUR
On the Use of Knowledge-Based Potentials for the Evaluation of
Models of Protein–Protein, Protein–DNA, and Protein–RNA Interactions
Oriol Fornes, Javier Garcia-Garcia, Jaume Bonet, Baldo Oliva¹
Structural Bioinformatics Lab. (GRIB), Departament de Ciències Experimentals i de la Salut,
Universitat Pompeu Fabra, Barcelona, Catalunya, Spain
¹Corresponding author: e-mail address: baldo.oliva@upf.edu
Contents
1. Introduction
2. Knowledge-Based Potentials
2.1 Split-statistical potentials
3. Modeling of Protein Interactions Using Templates
3.1 Models of binary complexes
3.2 Models of multimeric complexes
4. Modeling Interactions of Proteins Using Docking
4.1 Protein–protein docking
4.2 Protein–nucleic acid docking
5. Prediction of Protein-Binding Regions
5.1 Identification of protein interfaces
5.2 Prediction of DNA/RNA-binding proteins
6. Characterization of Transcription Factor-Binding Sites
6.1 Application of knowledge-based potentials on DREAM5 targets
7. Adapting Split-Statistical Potentials for Protein–DNA Interactions
7.1 Application of split-statistical potentials on DREAM5 targets
8. Conclusions
Acknowledgments
References
Advances in Protein Chemistry and Structural Biology, Volume 94. © 2014 Elsevier Inc.
ISSN 1876-1623. All rights reserved.
http://dx.doi.org/10.1016/B978-0-12-800168-4.00004-4
Author's personal copy
Abstract
Proteins are the bricks and mortar of cells, playing structural and functional roles.
In order to perform their function, they interact with each other as well as with other
biomolecules such as DNA or RNA. Therefore, to fathom the function of a protein, we
need to know its partners and the atomic details of its interactions (i.e., the structure
of the complex). However, the number of protein interactions with an experimentally
determined three-dimensional structure is small. Therefore, computational techniques
such as homology modeling are essential to fill this gap. Protein interactions can be
modeled using as templates the interactions of homologous proteins, if the structure
of the complex is known, or using docking methods. In both approaches, the estimation
of the quality of models is essential. There are several ways to address this problem.
In this review, we focus on the use of knowledge-based potentials for the analysis of
protein interactions. We describe the procedure to derive statistical potentials and split
them into different energetic terms that can be used for different purposes. We
extensively discuss the fields where knowledge-based potentials have been successfully
applied to (1) model protein–protein, protein–DNA, and protein–RNA interactions
and (2) predict binding sites (in the protein and in the DNA). Moreover, we provide
ready-to-use resources for docking and benchmarking protein interactions.
1. INTRODUCTION
During the past decade, hundreds of sequenced genomes have come to
light, producing a vast number of protein sequences. Therefore, unraveling
the function of these proteins has become one of the major challenges in
biology. It is widely accepted that the function of a protein can be predicted
from its structure (Watson, Laskowski, & Thornton, 2005). But proteins
rarely act alone; instead, they form networks of physical interactions with
other biomolecules (i.e., protein–protein, protein–DNA, and protein–RNA
interactions). Thus, in order to have a better understanding of the
function of a protein, it is also necessary to know with whom it is associated
and how, even at the atomic level.
The number of proteins with an experimentally determined three-
dimensional (3D) structure in the Protein Data Bank (PDB) (Berman
et al., 2000) is very low in comparison to the number of known protein
sequences, even for well-characterized organisms (Sharan, Ulitsky, &
Shamir, 2007), and even lower in the case of protein binary complexes
(Kirsanov et al., 2012; Mosca, Céol, Stein, Olivella, & Aloy, 2013). The
disproportion between the number of solved 3D structures and protein
sequences has encouraged the development of many strategies to model
the structure of proteins from their sequence (Dunbrack, 2006; Ginalski,
2006). These strategies have become the basis for the modeling of protein
interactions. Protein–protein interactions can be modeled by using as tem-
plates complexes of homologous proteins with known structure. This
approach relies on the principle that, given a pair of interacting proteins,
their homologs will also interact (interologs approach; Garcia-Garcia,
Schleker, Klein-Seetharaman, & Oliva, 2012; Matthews et al., 2001), and
it is assumed that they will do so in a similar fashion. Occasionally, the models
of protein–protein interactions can be constructed by superimposition of the
models of the unbound partners over the structure of a template complex.
They can also be obtained by docking the structure of one of the two
proteins onto the other (Vajda & Kozakov, 2009) via previous modeling
of their unbound structures (if necessary). We recently reviewed in detail
the modeling of tertiary and quaternary structures of proteins and their role
in protein–protein interaction networks (Garcia-Garcia, Bonet, et al., 2012;
Planas-Iglesias, Bonet, Feliu, Gursoy, & Oliva, 2012). Similarly, we can also
obtain the models of protein–DNA and protein–RNA interactions, which
also requires modeling the nucleotide sequences of interest (Feig,
Karanicolas, & Brooks, 2004; Lu & Olson, 2008).
Paired with the modeling of 3D structures, the estimation of their quality
has become crucial. In this particular context, several methods have been
developed to score models based on energies. One approach to address this
problem is based on the derivation of knowledge-based potentials (also
referred to as statistical potentials or potentials of mean force) (Sippl,
1990). Knowledge-based potentials have been used to (1) discriminate
whether or not a model has the correct fold (Panjkovich, Melo, &
Marti-Renom, 2008; Shen & Sali, 2006); (2) detect localized errors in pro-
tein structures (Wiederstein & Sippl, 2007); (3) predict the stability of
mutant proteins (Zhou & Zhou, 2002); (4) select the closest near-native
models from a set of decoys (Aloy & Oliva, 2009; Ferrada & Melo,
2009); (5) model protein–protein interactions (Aloy & Russell, 2003; Lu,
Lu, & Skolnick, 2003); (6) analyze the outcome of docking experiments,
including protein–protein (analyzed in Moal, Torchala, Bates, &
Fernández-Recio, 2013), protein–DNA (Robertson & Varani, 2007;
Takeda, Corona, & Guo, 2013; Xu, Yang, Liang, & Zhou, 2009), and
protein–RNA (Pérez-Cano, Solernou, Pons, & Fernández-Recio, 2010;
Tuszynska & Bujnicki, 2011; Zheng, Robertson, & Varani, 2007); (7) infer
the ability of proteins to bind DNA (Gao & Skolnick, 2008, 2009; Zhao,
Yang, & Zhou, 2010) and RNA (Zhao, Yang, & Zhou, 2011); (8) recognize
the binding regions in proteins (Feliu, Aloy, & Oliva, 2011; Pérez-Cano &
Fernández-Recio, 2010); and (9) identify transcription factor-binding sites
(Alamanova, Stegmaier, & Kel, 2010; Angarica, Pérez, Vasconcelos,
Collado-Vides, & Contreras-Moreira, 2008; Chen, Chien, et al., 2012;
Liu, Guo, Li, & Xu, 2008; Xu et al., 2009).
In the following sections, we review the use of knowledge-based poten-
tials for the analysis of protein interactions. In Section 2, we introduce
knowledge-based potentials and we split them into different energetic terms.
Sections 3–6 are devoted to different fields where knowledge-based poten-
tials have successfully been applied. Specifically, we focus on (1) modeling
of protein interactions (including homology modeling and integrative
modeling); (2) docking of protein interactions (including protein–protein,
protein–DNA, and protein–RNA); (3) prediction of protein-binding
regions; (4) characterization of transcription factor-binding sites; and (5)
prediction of DNA-binding sites. In Section 7, we adapt the procedure
to split-statistical potentials (Aloy & Oliva, 2009) to predict protein–
DNA interactions and DNA-binding sites.
2. KNOWLEDGE-BASED POTENTIALS
A knowledge-based potential is an energy function derived from the
analysis of known protein structures. There are many methods to obtain
such potentials including the quasi-chemical (Miyazawa & Jernigan,
1985) and the potential of mean force (PMF) approximations (Sippl,
1990). We have used the general definition of knowledge-based potential
described in Aloy and Oliva (2009) (i.e., Eq. 4.1):
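\[
\begin{aligned}
\mathrm{PMF}(a,b) &= \mathrm{PMF}_{\mathrm{std}}(d_{a,b}) - k_B T \log\left(\frac{P(a,b \mid d_{a,b})}{P(a)\,P(b)}\right) \\
\mathrm{PMF}_{\mathrm{std}}(d_{a,b}) &= k_B T \log\left(\frac{P(d_{a,b})}{\mathrm{weight}_{\mathrm{ref}}}\right)
\end{aligned}
\tag{4.1}
\]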
Where “k_B” is the Boltzmann constant, “T” is the standard temperature, and
“d_{a,b}” is the pairwise distance between a pair of residues “a,b”, whose
respective probabilities are “P(a)” and “P(b)”. “P(a,b|d_{a,b})” is the
conditional probability of finding residues “a,b” at a maximum distance “d_{a,b}”,
and “P(d_{a,b})” is the probability of observing any pair of residues up to that
distance. Finally, “weight_ref” is the reference state function. The
probabilities “P(*)” are approximated from the observed frequencies of
interactions in a nonredundant set of PDB structures. Moreover, the distance can
be calculated as the minimum distance between any pair of heavy atoms or as
the pairwise distance between two specific atoms (e.g., between Cβ atoms;
Cα for glycine residues). Also, Aloy and Oliva (2009) proved that the
reference state function can be neglected for the comparison of decoys (see
further in Section 2.1).
The application of Eq. (4.1) over all interacting pairs of residues “a” and
“b” in a protein structure results in an estimation of its quality given in terms
of energy (i.e., Eq. 4.2):
\[
E = \sum_{a,b} \mathrm{PMF}(a,b)
\tag{4.2}
\]
It has to be noted that, while for protein folding residues “a” and “b” belong
to the same protein (or single protein chain), for protein–protein interac-
tions, residues “a” and “b” belong to a pair of interacting proteins “A”
and “B” (or different protein chains), respectively.
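As a minimal sketch of how Eqs. (4.1) and (4.2) might be evaluated in practice, the Python snippet below estimates the probabilities from contact counts for a single distance cutoff and sums the resulting potential over the residue pairs of a model. The function names, the toy inputs, and the choices k_B T = 1 and weight_ref = 1 are assumptions made for illustration; they are not taken from Aloy and Oliva (2009).

```python
import math
from collections import Counter

KT = 1.0  # assumption: energies are reported in units of k_B*T


def build_pmf(contacts, residue_counts, n_all_pairs, weight_ref=1.0):
    """Return PMF(a, b) of Eq. (4.1) for one distance cutoff.

    contacts: list of (a, b) residue-type pairs observed within the cutoff.
    residue_counts: Counter of residue types in the whole data set.
    n_all_pairs: number of residue pairs at any distance (hypothetical input).
    """
    n_contacts = len(contacts)
    n_residues = sum(residue_counts.values())
    # Count each contact in both orders so that PMF(a, b) == PMF(b, a).
    ordered = Counter()
    for a, b in contacts:
        ordered[(a, b)] += 1
        ordered[(b, a)] += 1
    p_d = n_contacts / n_all_pairs                # P(d_ab)
    pmf_std = KT * math.log(p_d / weight_ref)     # PMF_std(d_ab)

    def pmf(a, b):
        p_ab_d = ordered[(a, b)] / (2 * n_contacts)   # P(a,b | d_ab)
        p_a = residue_counts[a] / n_residues          # P(a)
        p_b = residue_counts[b] / n_residues          # P(b)
        # Pairs never observed give log(0) and raise ValueError here;
        # a real potential would need pseudocounts (not handled).
        return pmf_std - KT * math.log(p_ab_d / (p_a * p_b))

    return pmf


def energy(pairs, pmf):
    """Eq. (4.2): sum PMF(a, b) over the interacting residue pairs of a model."""
    return sum(pmf(a, b) for a, b in pairs)
```

With these toy definitions, a pair type that is observed in contact more often than expected from P(a)P(b) receives a lower value than a pair type that is depleted among contacts.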
2.1. Split-statistical potentials
Aloy and Oliva (2009) demonstrated that, using the Bayes theorem,
Eq. (4.1) can be decomposed into several energetic terms, one of them
including the reference state. We have selected some of these terms as poten-
tials of mean force to score the quality of decoys (i.e., Eq. 4.3):
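\[
\begin{aligned}
\mathrm{PMF}_{\mathrm{pair}}(a,b) &= -k_B T \log\left(\frac{P(a,b \mid d_{a,b})}{P(a)\,P(b)\,P(d_{a,b})}\right) \\
\mathrm{PMF}_{\mathrm{local}}(a,b) &= k_B T \log\left(\frac{P(a \mid \theta_a)}{P(a)}\right) + k_B T \log\left(\frac{P(b \mid \theta_b)}{P(b)}\right) \\
\mathrm{PMF}_{\mathrm{3D}}(a,b) &= -k_B T \log\left(P(d_{a,b})\right) \\
\mathrm{PMF}_{\mathrm{3DC}}(a,b) &= -k_B T \log\left(\frac{P(\theta_a,\theta_b \mid d_{a,b})}{P(\theta_a,\theta_b)}\right) \\
\mathrm{PMF}_{\mathrm{S3DC}}(a,b) &= -k_B T \log\left(\frac{P(a,b \mid d_{a,b},\theta_a,\theta_b)\,P(\theta_a,\theta_b)}{P(a,b \mid \theta_a,\theta_b)\,P(\theta_a,\theta_b \mid d_{a,b})}\right)
\end{aligned}
\tag{4.3}
\]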
Where “θ_a” and “θ_b” are the environments of a pair of residues “a,b”, as
defined by their hydrophobicity (i.e., polar or nonpolar), degree of exposure
(i.e., buried or exposed), and surrounding secondary structure (i.e., α-helix,
β-sheet, or coil). As an example, “P(a,b|d_{a,b},θ_a,θ_b)” is the conditional
probability of finding residues “a,b”, in their respective environments “θ_a” and
“θ_b”, at a maximum distance “d_{a,b}” (see Aloy & Oliva, 2009 for more details).
The statistical potentials “E_pair”, “E_local”, “E_3D”, “E_3DC”, and “E_S3DC”
are defined using Eq. (4.2), replacing “PMF” with the term of Eq. (4.3) that
carries the same subscript (i.e., “E_” is obtained from “PMF_”). We name these
potentials “split-statistical potentials”.
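To make the notion of residue environment more concrete, the following Python sketch enumerates the 12 environments implied by the three descriptors above (2 hydrophobicity classes, 2 exposure classes, 3 secondary structure classes) and evaluates the distance-independent local term of Eq. (4.3) from a list of observed (residue, environment) pairs. The labels, the function names, and the convention k_B T = 1 are illustrative assumptions rather than the implementation of Aloy and Oliva (2009).

```python
import math
from itertools import product

KT = 1.0  # assumption: energies are reported in units of k_B*T

# Hypothetical labels for the three descriptors of an environment.
ENVIRONMENTS = list(product(("polar", "nonpolar"),
                            ("buried", "exposed"),
                            ("helix", "sheet", "coil")))


def pmf_local_term(residue, env, observations):
    """k_B*T*log(P(a | env) / P(a)) from (residue, environment) observations."""
    n_total = len(observations)
    n_env = sum(1 for _, e in observations if e == env)
    n_res = sum(1 for r, _ in observations if r == residue)
    n_both = sum(1 for r, e in observations if r == residue and e == env)
    # Unobserved residues or environments are not handled in this sketch.
    p_res_given_env = n_both / n_env   # P(a | theta_a)
    p_res = n_res / n_total            # P(a)
    return KT * math.log(p_res_given_env / p_res)


def pmf_local(a, env_a, b, env_b, observations):
    """Local term for a residue pair: the sum of the two single-residue terms."""
    return (pmf_local_term(a, env_a, observations)
            + pmf_local_term(b, env_b, observations))
```

A positive single-residue term indicates that the residue type is enriched in that environment relative to its overall frequency; a negative term indicates depletion.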
The statistical potential “E_S3DC” can be understood as a refinement of the
residue-pair statistical potential “E_pair”. It takes into account not only the
residues that interact but also their environments. The statistical potential
“E_3DC” depends only on the occurrence of interacting environments without
considering the specific interacting residues. The score “E_local” is
distance independent, and it reflects the probability of placing a residue in a
specific environment. The energy term “E_3D” concerns only the distance
at which pairs of residues interact, and it increases together with the number
of interacting residue pairs.
The statistical potentials described in Eq. (4.3) differ in order of magnitude,
and their values cannot be used directly for the comparison of conformational
decoys. Therefore, they are translated into Z-scores. The Z-score of
an energy (or score within a distribution) is defined as the difference between
the energy (i.e., “E_”) and the average of energies in the distribution (i.e.,
“μ”), divided by the standard deviation of the distribution (i.e., “σ”). In
general, the background used to calculate a Z-score is a random distribution,
which in the case of folds or interactions is obtained by shuffling the
residues of one or two sequences, respectively. The translation of energies into
Z-scores neglects the “E_3D” term because it is independent of the sequence
(i.e., E_3D = μ for any distribution of shuffled sequences). Also, Aloy and
Oliva (2009) demonstrated that the distribution of the Z-score of the reference
state was similar to the random distribution and could be neglected too.
Therefore, neither the energy “E_3D” nor the reference state was considered
when selecting the best model among conformational decoys.
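The Z-score translation described above can be sketched in a few lines of Python. The snippet shuffles the residues of a sequence to build the background distribution and returns (E − μ)/σ. The function names, the toy energy function, the number of shuffles, and the convention of returning 0.0 when σ = 0 (as happens for a sequence-independent term such as “E_3D”) are assumptions made for illustration only.

```python
import random
import statistics


def z_score(energy_fn, sequence, n_shuffles=200, seed=0):
    """Z-score of energy_fn(sequence) against shuffled versions of the sequence."""
    rng = random.Random(seed)
    e_native = energy_fn(sequence)
    background = []
    for _ in range(n_shuffles):
        shuffled = list(sequence)
        rng.shuffle(shuffled)          # same composition, random order
        background.append(energy_fn(shuffled))
    mu = statistics.mean(background)
    sigma = statistics.pstdev(background)
    if sigma == 0:
        # The score does not depend on the sequence (e.g., an E_3D-like term),
        # so it carries no information for ranking; 0.0 is our convention.
        return 0.0
    return (e_native - mu) / sigma


def toy_energy(sequence, contacts=((0, 4), (1, 5))):
    """Hypothetical energy: -1 for every contact formed by two leucines."""
    return sum(-1.0 for i, j in contacts
               if sequence[i] == "L" and sequence[j] == "L")
```

For example, `z_score(toy_energy, "LLKKLLKK")` is negative because the given sequence places leucines at both contacts, which most shuffled sequences do not.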
In a recent work, Feliu et al. (2011) modified the split-statistical potentials
of Eq. (4.3) for their application to protein–protein interactions. The
frequencies of amino acid pairs were extracted from residues belonging to different
chains in the interface of protein complexes from the 3DID database
(Stein, Céol, & Aloy, 2011). In protein–protein interactions, the Z-score
of the reference state was still irrelevant. Therefore, the energetic term that
included the reference state was assumed to be irrelevant too when ranking
decoys of interactions (i.e., docking poses). However, the score “E_3D” was
associated with the extent of the interacting interface (it is proportional
to the number of residues involved in the interface), and it was still valuable
in the analysis of protein-docking decoys (see further in Section 4.1).
3. MODELING OF PROTEIN INTERACTIONS USING
TEMPLATES
The continuous increase of structural data on protein complexes in the
PDB has been exploited for modeling the structure of protein–protein
interactions as well as the interactions of proteins with other biomolecules
(i.e., protein–DNA or protein–RNA interactions) based on homology.
However, when structures of the interaction are not available, docking
methods can be used (see further in Section 4). In the past years, new
approaches have been developed to assemble large macromolecular com-
plexes by combining different experimental data to apply restraints upon
complex assembly.
3.1. Models of binary complexes
The most common way of modeling protein interactions is via comparative
modeling (reviewed in Planas-Iglesias et al., 2012). This approach can only be
applied as long as there is a homologous structure of the interaction. Then,
applications such as MODELLER (Eswar et al., 2006) are able to directly
model the interaction of interest. Nevertheless, homology modeling is limited
when the available homologs are too remote to help assign the
correct fold (Rost, 1999). Still, even distantly related proteins may use the
same binding regions to interact (Gao & Skolnick, 2010; Tuncbag, Gursoy,
Guney, Nussinov, & Keskin, 2008; Zhang, Petrey, Norel, & Honig, 2010),
which has been exploited by different authors to model protein–protein
interactions. For example, in M-TASSER (Chen & Skolnick, 2008),
protein sequences are threaded against a monomer template library. All
threading solutions belonging to the same dimer template are then identi-
fied. However, if both monomers share less than 30% sequence identity with
their templates on the dimer, the threaded dimer is evaluated with statistical
potentials (Lu et al., 2003) and, when necessary, discarded. Next, the tertiary
structure of each protein is obtained by rearrangement of continuous tem-
plate fragments (Zhang & Skolnick, 2004). Finally, the quaternary structure
is assembled by superimposition of both protein structures over the dimer
template.
Recently, three different methods have been proposed for model-
building the structure of protein–protein interactions on a genome-wide
scale. In PRISM (Tuncbag, Gursoy, Nussinov, & Keskin, 2011), the struc-
tures (or models) of two proteins are aligned against a set of known protein–
protein interfaces (i.e., template set). If the two complementary sides of a
template interface are structurally similar to the proteins (each side to a dif-
ferent protein), then the proteins are predicted to interact and the interaction
is modeled using the binding site, as dictated by the template interface. All
models produced with this approach are refined to account for flexible
changes and finally ranked. In PrePPI (Zhang et al., 2012), the individual
structures of the proteins are searched in the PDB or in a database of homol-
ogy models (i.e., SkyBase (Lee et al., 2010) and ModBase (Pieper et al.,
2011)). This step is followed by the identification of close and remote homo-
logs of the two partners. Then, if a PDB structure contains the interaction
between the homologs of each partner, it is used as template and the inter-
action is modeled by superimposition. In order to calculate the reliability of
the model, five different empirical structure-based scores are assigned and
combined using a Bayesian network, which scores the quality of the structural
model of the interaction. Finally, in Interactome3D (Mosca, Céol, &
Aloy, 2013), the interaction is modeled in a similar fashion to PrePPI, but
the structural coverage of the approach is increased by using templates of
interacting domains from 3DID (Mosca, Céol, Stein, et al., 2013). The
resulting models are finally evaluated with InterPreTS (Aloy & Russell, 2003).
3.2. Models of multimeric complexes
Methods described in Section 3.1 are useful for complexes formed by few
molecules. However, the assembly of large macromolecular complexes
requires an integrative structural modeling approach. The main idea behind
this methodology is to characterize the structural and topological features of
the complex in order to reduce the number of plausible solutions. For exam-
ple, the Integrative Modeling Platform (IMP) (Russel et al., 2012) has been
used to describe the yeast nuclear pore complex (Alber et al., 2007) and the
structure of chromatin at the megabase scale (Baù et al., 2011). The assembly of
a complex in IMP is a cyclic procedure involving four different steps (reviewed
in detail in Planas-Iglesias et al., 2012):
(1) Collecting the information regarding the complex. This step includes
collecting experimental data from SAXS profiles (Schneidman-Duhovny,
Hammel, & Sali, 2011), proteomics data (Alber, Förster,
Korkin, Topf, & Sali, 2008), EM images (Lasker, Phillips, et al.,
2010), density maps (Lasker, Sali, & Wolfson, 2010), nuclear magnetic
resonance (NMR) spectroscopy (Simon, Madl, Mackereth, Nilges, &
Sattler, 2010), or even 5C data (Baù et al., 2011). It also involves
including physical–chemical information, such as molecular mechanics
force fields (Brooks et al., 1983) and potentials of mean force or
statistically derived potentials (Shen & Sali, 2006).
(2) Selecting a method to represent the data and using the information collected
in the previous step, translating it into spatial restraints. IMP uses
structures solved at different resolutions. High-resolution structures
can be represented by atoms, but low-resolution structures are represented
by groups of atoms, such as residues, motifs, or even domains.
The translation of information into spatial restraints is used to test the
consistency of the model.
(3) Constructing a model that is consistent with the aforementioned spatial
restraints. The entire rotational and translational 3D space is searched in
order to position and orientate each individual structure inside the
complex.
(4) Evaluating the modeled complex. In theory, if there is only one
native state of the complex, we should obtain a single model satisfying
all restraints. In contrast, if the data used to encode the restraints are
insufficient, more than one possible solution, or none at all, can be obtained.
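The evaluation of a model against spatial restraints, as outlined in steps (2) to (4), can be illustrated with a small generic sketch. The Python snippet below scores a coarse-grained model against upper-bound distance restraints and checks whether all of them are satisfied. It does not use the IMP library or reproduce its API; the data layout, the function names, and the quadratic penalty are hypothetical choices made for this example.

```python
import math


def restraint_score(coords, restraints):
    """Sum of squared violations of upper-bound distance restraints.

    coords: {component_name: (x, y, z)} for a coarse-grained model.
    restraints: list of (name_i, name_j, max_distance) tuples.
    """
    total = 0.0
    for name_i, name_j, max_distance in restraints:
        distance = math.dist(coords[name_i], coords[name_j])
        # Only distances beyond the upper bound are penalized.
        total += max(0.0, distance - max_distance) ** 2
    return total


def satisfies_all(coords, restraints, tolerance=1e-6):
    """True if the model violates none of the restraints (within tolerance)."""
    return restraint_score(coords, restraints) <= tolerance
```

In a sampling procedure, many candidate placements of the components would be scored in this way, and only the models with a score close to zero would be retained for further analysis.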
4. MODELING INTERACTIONS OF PROTEINS USING
DOCKING
In contrast to the previous methods, which require the structural
knowledge of the interaction, docking is used for modeling the structure
of an interaction formed by two or more molecules (e.g., two proteins)
when the structure of the interaction is not available but the structures
of the individual molecules are known (or can be modeled). Docking
addresses the problem of finding the best-fit orientation of one molecule
with respect to the other. This idea was first introduced more than 30 years
ago by Wodak and Janin (1978). Since then, docking algorithms have improved
greatly (summarized in Table 4.1). The simplest method of docking
two structures is to treat them as rigid bodies, usually using the Fast Fourier
Transform technique (e.g., MolFit (Katchalski-Katzir et al., 1992),
FTDock (Gabb et al., 1997), PIPER (Kozakov et al., 2006), and ZDOCK
(Mintseris et al., 2007)) or geometric matching (e.g., Hex (Ritchie &
Kemp, 2000) and FRODOCK (Garzon et al., 2009)). Moreover, several
methods have been developed that take into consideration the flexibility
of proteins, including Monte Carlo-based methods (e.g., RosettaDock;
Gray et al., 2003), the High Ambiguity Driven biomolecular DOCKing
(HADDOCK) (Dominguez, Boelens, & Bonvin, 2003), and the use of
normal modes describing the conformational changes undergone upon binding
(e.g., SwarmDock; Moal & Bates, 2010). However, it has been shown
that for approximately 65% of interactions, proteins undergo little or no
conformational change when they associate, while only for 15% of
interactions, proteins undergo flexible deformations (Stein, Rueda,
Panjkovich, Orozco, & Aloy, 2011). Because rigid-body docking is the
first step of a docking procedure, prior to the introduction of flexibility,
we focus this section on rigid-body docking, which accounts for at
least 65% of protein–protein interactions.

Table 4.1 Docking methods
Program | Algorithm | Evaluation | Server | References
Rigid-body docking methods
ClusPro | FFT | Geometric fit, van der Waals, atomic desolvation energy, electrostatics, and knowledge-based potentials | http://cluspro.bu.edu/login.php | Comeau, Gatchell, Vajda, and Camacho (2004a)
CS | GM | Atomic desolvation energy | | Shentu, Al Hasan, Bystroff, and Zaki (2008)
DOT2 | FFT | Electrostatics and atomic desolvation energies | | Roberts, Thompson, Pique, Perez, and Ten Eyck (2013)
FRODOCK | GM | van der Waals, electrostatics, and atomic desolvation energies | http://frodock.chaconlab.org | Garzon et al. (2009)
FTDock | FFT | Hydrogen bonding, electrostatics, and RPScore (Moont, Gabb, & Sternberg, 1999) | | Gabb, Jackson, and Sternberg (1997)
GRAMM-X | FFT | Lennard-Jones potential, evolutionary conservation, knowledge-based potentials, van der Waals, and atomic contact energy | http://vakser.bioinformatics.ku.edu/resources/gramm/grammx | Tovchigrechko and Vakser (2006)
Hex | GM | Geometric fit and electrostatics | http://hexserver.loria.fr | Macindoe, Mavridis, Venkatraman, Devignes, and Ritchie (2010)
LZerD | GM | Geometric fit and atomic desolvation energy | | Venkatraman, Yang, Sael, and Kihara (2009)
MolFit | FFT | | | Katchalski-Katzir et al. (1992)
PatchDock | GM | Geometric fit and atomic desolvation energy | http://bioinfo3d.cs.tau.ac.il/PatchDock | Schneidman-Duhovny, Inbar, Nussinov, and Wolfson (2005)
PIPER | FFT | Geometric fit, electrostatics, and atomic desolvation energy | | Kozakov, Brenke, Comeau, and Vajda (2006)
pyDOCK | FFT | Electrostatics, desolvation energies, ODA (Fernandez-Recio, Totrov, Skorodumov, & Abagyan, 2005), and SIPPER (Pons, Talavera, de la Cruz, Orozco, & Fernandez-Recio, 2011) | http://life.bsc.es/servlet/pydock/home | Jiménez-García, Pons, and Fernández-Recio (2013)
shDock | GM | Collision filtering | | Gu, Koehl, Hass, and Amenta (2012)
SP-dock | GM | Atomic desolvation energy, electrostatics, hydrophobicity, and Lennard-Jones potential | | Axenopoulos, Daras, Papadopoulos, and Houstis (2013)
ZDOCK | FFT | Linear combination of atomistic potentials, and ZRANK2 (Pierce & Weng, 2008) | http://zdock.umassmed.edu | Mintseris et al. (2007)
Flexible docking methods
3D-Garden | MC | Lennard-Jones potential and electrostatics | http://www.sbg.bio.ic.ac.uk/3dgarden | Lesk and Sternberg (2008)
ATTRACT | EM | Hydrophobic and hydrophilic contacts | | Schneider, Saladin, Fiorucci, Prévost, and Zacharias (2012)
FireDock | EM | van der Waals, electrostatics, atomic desolvation energies, hydrogen and disulfide bonds, π-stacking and aliphatic interactions, rotamer probabilities, etc. | http://bioinfo3d.cs.tau.ac.il/FireDock | Mashiach, Schneidman-Duhovny, Andrusier, Nussinov, and Wolfson (2008)
HADDOCK | MCS | van der Waals and electrostatics | http://haddock.chem.uu.nl | De Vries, van Dijk, and Bonvin (2010)
RosettaDock | MCS | van der Waals, hydrogen bonds, rotamer, knowledge-based potentials, electrostatics, and atomic solvation energies | http://antibody.graylab.jhu.edu | Lyskov and Gray (2008)
SwarmDock | NM | van der Waals and electrostatics | http://bmm.cancerresearchuk.org/SwarmDock | Torchala, Moal, Chaleil, Fernandez-Recio, and Bates (2013)
EM, energy minimization; FFT, Fast Fourier Transform; GM, geometric matching; MC, marching cubes; MCS, Monte Carlo simulation; NM, normal modes.
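To illustrate what a rigid-body translational search evaluates, the sketch below scores every integer translation of a ligand over a receptor, both discretized on a grid. Grid-based programs compute this kind of correlation score for all translations at once with the Fast Fourier Transform; here it is computed by brute force to keep the example short, and rotations are not sampled. The cell sets, the function names, the reward of +1 per surface contact, and the core penalty of 9 are assumptions made for illustration, not the scoring scheme of any program in Table 4.1.

```python
from itertools import product


def shape_score(surface, core, ligand, shift, core_penalty=9.0):
    """Shape complementarity of a translated ligand on a receptor grid.

    surface, core: sets of (x, y, z) receptor cells (surface layer, interior).
    ligand: iterable of (x, y, z) ligand cells before translation.
    """
    score = 0.0
    for x, y, z in ligand:
        cell = (x + shift[0], y + shift[1], z + shift[2])
        if cell in core:
            score -= core_penalty   # clash with the receptor interior
        elif cell in surface:
            score += 1.0            # favorable surface contact
    return score


def best_translation(surface, core, ligand, max_shift=3, core_penalty=9.0):
    """Brute-force scan of all translations within +/- max_shift grid cells."""
    shifts = product(range(-max_shift, max_shift + 1), repeat=3)
    scored = ((s, shape_score(surface, core, ligand, s, core_penalty))
              for s in shifts)
    return max(scored, key=lambda item: item[1])
```

A full rigid-body search would repeat this scan for a set of ligand rotations and keep the top-scoring poses of each rotation as docking decoys.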
A typical docking procedure between two molecules involves several
steps (Vajda & Kozakov, 2009). It begins with a rigid-body docking search
over the entire rotational and translational 3D space for the orientation and
position of one structure (i.e., ligand, usually the smaller structure) with
respect to the other (i.e., target or receptor). The resulting conformational
predictions (i.e., docking poses or decoys) are then ranked using scoring
functions with the objective of assigning the highest scores to the poses most
similar to the native structure. These poses are named closest to native or
near-native structures. The definition of near-native solution relies on the
small structural differences of a decoy with respect to the 3D structure of
the binary complex (i.e., the native conformation). Several criteria can be
used to calculate these structural differences, but the most common measure,
as established in the Critical Assessment of Predicted Interactions
(CAPRI) (Janin et al., 2003), is the root mean square deviation
(RMSD) between the decoy and the native conformation. The residues
selected for the comparison can vary: (1) when the whole
structure of the receptor is used as reference to superimpose the poses,
the RMSD shows the deviation in the location of the ligand (ligand-RMSD);
and (2) when all residues in the interface of the native structure
are selected, the RMSD shows the difference in the arrangement of the
interface (I-RMSD). In CAPRI, a near-native prediction is achieved if the
I-RMSD and the ligand-RMSD are smaller than 2 and 5 Å, respectively.
Currently, this implies the prediction of more than 30% of the native
residue–residue pairwise contacts and at least 50% of correctly identified contact
residues (Janin et al., 2003). The best docking poses are then refined,
allowing for conformational changes of the two unbound structures upon
binding (Dobbins, Lesk, & Sternberg, 2008; Shen, Paschalidis, Vakili, &
Vajda, 2008). Nevertheless, as has been observed in CAPRI, there are
still some difficulties concerning the use of these methods (Janin, 2010;
Lensink & Wodak, 2010). On the one hand, programs devoted to rigid-body
docking do not simulate the conformational changes that can occur during
complex formation. On the other hand, each available docking method
is highly dependent on its scoring function and none of them can produce a
single correct solution among all the predictions (Moal et al., 2013).
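The RMSD-based definition of a near-native pose can be sketched as follows. The Python snippet computes the RMSD between two equally ordered sets of coordinates and applies the 2 Å (I-RMSD) and 5 Å (ligand-RMSD) thresholds quoted above. It assumes that the decoy and the native complex are already expressed in the same receptor frame, so no superposition step is performed; the function names are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def is_near_native(ligand_rmsd, interface_rmsd,
                   max_ligand_rmsd=5.0, max_interface_rmsd=2.0):
    """Apply the thresholds quoted in the text (values in angstroms)."""
    return ligand_rmsd < max_ligand_rmsd and interface_rmsd < max_interface_rmsd
```

Here the ligand-RMSD would be obtained by calling `rmsd` on the ligand atoms of the decoy and of the native complex, and the I-RMSD by calling it on the interface atoms after superimposing them, a step this sketch leaves out.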
4.1. Protein–protein docking
Rigid-docking methods yield a large number of predictions (from 10,000 to
more than 50,000), including many false positives. Thus, an important
course of action is to identify those docking poses that are closer to the native
structure (i.e., near-native) before any refinement takes place. At this point,
the number of selected conformations typically spans between 10 and 2000.
There are two non-exclusive strategies to perform such selection. The first
strategy consists in reranking the docking conformations with a scoring
function (e.g., CHARMM (Brooks et al., 1983), AMBER (Cornell
et al., 1995), FOLD-X (Guerois, Nielsen, & Serrano, 2002) or ZRANK
(Pierce & Weng, 2007, 2008)). The second strategy relies on clustering
similar solutions by means of I-RMSD (Comeau, Gatchell, Vajda, & Camacho,
2004b) or ligand-RMSD (Ritchie & Kemp, 2000) in order to reduce
redundant solutions and detect energetically favorable regions on the surface
of the receptor (Moal & Bates, 2010).
4.1.1 Benchmarking
In order to assess the ability of docking approaches to distinguish between
near-native and non-near-native structures, several benchmarks have been
created (see Table 4.2). These datasets are usually composed of a
nonredundant set of real interactions for which the structure of the interaction
and the unbound molecules (in most cases) are available. Benchmark targets
are classified in three categories of difficulty based on the best I-RMSD
obtained with the unbound conformations of the two proteins: easy,
medium, and hard (hard cases usually involve large conformational changes
between the bound and the unbound forms of the molecules).
4.1.2 Application of split-statistical potentials to rank docking decoys
In a recent work (Feliu et al., 2011), split-statistical potentials performed
better than scoring functions encoding atomistic energy terms when applied to
rank protein–protein docking poses from targets of the hard category of
difficulty of the protein-docking benchmark version 3.0 (Hwang, Pierce,
Mintseris, Janin, & Weng, 2008). Furthermore, the analysis over the whole
benchmark revealed that “E_pair” and “E_S3DC” provided a fair amount of
nonoverlapping results. Based on this observation, Feliu et al. (2011) defined
a new ranking strategy, “MixRank”. In this strategy, they first ranked the
decoys with each of the two statistical potentials separately, and then
selected the top-scored decoy from each list alternately. In order to avoid
redundant predictions, they ignored decoys with less than 5 Å ligand-RMSD
from any previous selection, which provided a better selection of
near-native decoys (Feliu & Oliva, 2010).
“MixRank” outperformed, for the medium and hard targets of the benchmark,
other ranking methods such as RPScore (Moont et al., 1999), which
is another statistical potential, or ZRANK (Mintseris et al., 2007), which is an
atomistic-detailed scoring function. The main reason behind this result was
the use of a rigid-body docking method (i.e., FTDock). Atomistic-detailed
scoring functions, such as ZRANK, require an accurate model of
the interaction to correctly rank the poses, which implies a flexible docking,
while coarse-grained potentials, such as “E_pair” and “E_S3DC”, are less
affected by the quality of the model. Recently, Moal et al. (2013) presented an
evaluation of 115 different scoring functions for ranking docking poses.
Interestingly, “MixRank” and “E_S3DC” performed among the best 40 approaches
in the analysis of docking decoys generated from the protein–protein docking
benchmark version 4.0 (Hwang et al., 2010) with a flexible-docking approach
(Moal & Bates, 2010). Still, the best results were obtained by newer scoring
functions such as ZRANK2 (Pierce & Weng, 2008), SIPPER (Pons et al., 2011),
and other atomistic potentials.

Table 4.2 Benchmark datasets for docking
Reference                                               Interaction type  Description                                                                        Benchmark link
Hwang, Vreven, Janin, and Weng (2010)                   Protein–protein   176 complexes (121 easy, 30 medium, 25 hard)                                       http://zlab.umassmed.edu/benchmark/
van Dijk and Bonvin (2008)                              Protein–DNA       47 complexes (13 easy, 22 medium, 12 hard)                                         http://haddock.science.uu.nl/dna/benchmark.html
Kim, Corona, Hong, and Guo (2011)                       Protein–DNA       38 complexes for rigid (21 easy, 17 hard) and flexible docking (18 easy, 19 hard)  http://bioinfozen.uncc.edu/tf-dna-benchmarks/
Barik, Nithin, Manasa, and Bahadur (2012)               Protein–RNA       45 complexes                                                                       http://www.facweb.iitkgp.ernet.in/rbahadur/benchmark.html
Pérez-Cano, Jiménez-García, and Fernández-Recio (2012)  Protein–RNA       106 complexes (35 by homology modeling; 64 easy, 24 medium, 18 hard)               http://life.bsc.es/pid/protein-rna-benchmark/
Huang and Zou (2013)                                    Protein–RNA       72 complexes (49 easy, 12 medium, 7 hard)                                          http://zoulab.dalton.missouri.edu/RNAbenchmark/
Benchmarks for protein–protein, protein–DNA, and protein–RNA docking.
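The “MixRank” selection described in this section can be sketched as follows. This is only an illustration of the alternating selection with a 5 Å ligand-RMSD redundancy filter, not the implementation of Feliu et al. (2011); the function name, its arguments, and the toy one-dimensional "ligand-RMSD" used in the example are assumptions made here.

```python
def mix_rank(ranking_a, ranking_b, ligand_rmsd, n_select, cutoff=5.0):
    """Alternately pick the best remaining decoy from two ranked lists.

    ranking_a, ranking_b: decoy identifiers sorted best-first by each potential.
    ligand_rmsd: hypothetical callable returning the ligand-RMSD (in angstroms)
                 between two decoys; it must return 0 for identical decoys.
    A decoy is skipped when it lies closer than `cutoff` to any decoy that has
    already been selected.
    """
    selected = []
    iterators = [iter(ranking_a), iter(ranking_b)]
    exhausted = [False, False]
    turn = 0
    while len(selected) < n_select and not all(exhausted):
        if not exhausted[turn]:
            for decoy in iterators[turn]:
                redundant = any(ligand_rmsd(decoy, s) < cutoff for s in selected)
                if not redundant:
                    selected.append(decoy)
                    break
            else:
                # The list ran out without yielding a non-redundant decoy.
                exhausted[turn] = True
        turn = 1 - turn
    return selected

# Toy example: decoys placed on a line, "RMSD" is their separation.
position = {"d1": 0.0, "d2": 1.0, "d3": 10.0, "d4": 20.0}
toy_rmsd = lambda a, b: abs(position[a] - position[b])
picked = mix_rank(["d1", "d2", "d3", "d4"], ["d2", "d4", "d3", "d1"], toy_rmsd, 3)
```

Because the set of selected decoys only grows, a decoy that is redundant once stays redundant, so each list can be consumed with a single pass.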
4.2. Protein–nucleic acid docking
While the field of protein–protein docking is advancing fast, the progress of
docking nucleic acids onto proteins lags behind. The flexibility of nucleic
acids, and the difficulty to recognize their interaction surface, has limited
the number of docking studies involving proteins and nucleic acids
(DNA (Knegtel, Antoon, Rullmann, Boelens, & Kaptein, 1994; Poulain,
Saladin, Hartmann, & Prévost, 2008; Parisien, Freed, & Sosnick, 2012;
van Dijk & Bonvin, 2010; van Dijk, Visscher, Kastritis, & Bonvin, 2013)
and RNA (Pérez-Cano et al., 2010)). Similarly, there are only a few
knowledge-based potentials specifically intended to rank protein–nucleic
acid docking solutions.
Regarding the field of protein–DNA docking, Robertson and Varani
(2007) and Xu et al. (2009) designed two different all-atom statistical poten-
tials that showed similar results in identifying near-native structures from a
set of decoys generated with FTDock. Nevertheless, as shown in the previ-
ous section, atomistic-detailed potentials require more accurate conforma-
tions to correctly rank docking poses, while residue–residue potentials are
coarse-grained and less sensitive to small conformational changes, which
allows them to capture the dynamic nature of protein–DNA interactions
more accurately (Poulain et al., 2008). In this context, Takeda et al.
(2013) derived a residue-pair potential that accommodated the interaction
angles between amino acids and nucleotides. Their approach also showed
better performance than atomistic potentials in rigid-body docking between
protein and DNA.
With respect to protein–RNA docking, Zheng et al. (2007) adapted the
statistical potential for scoring protein–DNA interactions (Robertson &
Varani, 2007) to protein–RNA interactions. Their potential performed
similarly to the more complex scoring function for protein–RNA interactions of
ROSETTA (Chen, Kortemme, Robertson, Baker, & Varani, 2004). In
addition, Tuszynska and Bujnicki (2011) built two statistical potentials
dependent on the interaction distance and angles of the contact site of the
nucleotide with the amino acids of the protein that penalized steric
clashes occurring during docking.
5. PREDICTION OF PROTEIN-BINDING REGIONS
One of the major challenges to understand protein interactions is the
identification of the specific binding regions (i.e., interfaces). In the previous
section, we have seen that docking methods try to find the best possible
fitting between two or more molecules by exploring the whole rotational
and translational 3D space. Therefore, these methods benefit from the
knowledge about the interacting interfaces, which saves computational time
and eliminates many potentially wrong solutions. In particular, for protein–
DNA and protein–RNA interactions, the problem is two-sided: at the side
of the protein and at the side of the nucleic acid. In this section, we will focus
on the interface at the side of the protein, either for the interaction with
other proteins or for the interaction with nucleic acids. Several approaches
have been developed for the prediction of protein-binding regions, but in
the case of protein–DNA/RNA binding, the problem has been associated to
whether the protein will interact with the nucleic acid or not. In this section,
we address the two problems separately: first, the prediction of binding sites
and, second, the prediction of proteins that bind nucleic acids.
5.1. Identification of protein interfaces
The most straightforward methods to experimentally define the interacting
region of a protein are based on the determination of its 3D structure (i.e.,
X-ray crystallography and NMR spectroscopy). Other experimental
approaches such as deletion experiments, alanine-scanning mutations,
yeast two-hybrid or protein footprinting can be used to determine which
domains are involved in the interaction without the requirement of
structure (reviewed in Garcia-Garcia, Bonet, et al., 2012). Alternatively,
computational tools provide a significant advantage in terms of time- and
cost-effectiveness. We have split these computational tools according to
their input requirements into sequence-based and structure-based methods.
5.1.1 Methods based on sequence
It is known that protein interfaces share specific features that distinguish them
from the rest of the protein (e.g., there is higher conservation of residues in
interface regions due to evolutionary constraints; Valdar & Thornton,
2001). In addition, protein–protein interaction interfaces have been shown
to bear specific physicochemical properties due to different amino
acid composition propensities (Jones & Thornton, 1997). Moreover, as the
conservation of residues is strongly dependent on their structural and func-
tional importance, the degree of conservation has been used not only to pre-
dict binding sites but also to infer functional annotation. This is the case of
ConSurf (Ashkenazy, Erez, Martz, Pupko, & Ben-Tal, 2010), a method that
estimates the evolutionary rate of each protein residue derived from multiple
sequence alignments using an empirical Bayesian or a maximum likelihood
approach. Another method, FINDSITE (Brylinski & Skolnick, 2008), uses
a different strategy based on binding-site similarity among superimposed
groups of template structures identified by threading, which allows for the
analysis of groups with low similarity. The combination of FINDSITE with
databases such as DrugBank (Knox et al., 2011) and ChEMBL (Gaulton et al.,
2011) has been useful in high-throughput virtual ligand screening (Zhou &
Skolnick, 2013). A recent method, PIPE-Sites (Amos-Binks et al., 2011),
exploits protein–protein interaction networks to detect reoccurring polypep-
tide sequences in order to infer specific binding sites. Finally, PSIFR (Pandit
et al., 2010) combines different methodologies in a single server, including
structure-based prediction tools such as TASSER (Zhang & Skolnick,
2004) and functional inference tools such as FINDSITE, among others.
The observed amino acid conservation in protein–DNA interfaces
(Luscombe & Thornton, 2002) has also been exploited by many authors to
predict nucleic acid-binding residues of a protein with different machine
learning approaches. For example, BindN (Wang & Brown, 2006) predicts
DNA- and RNA-binding residues using a support vector machine approach
based on biochemical features of nucleic acid-binding amino acids, such as
side chain pKa value, hydrophobicity index, and molecular mass. An evolution
of the previous method, BindN+ (Wang, Huang, Yang, & Yang, 2010),
incorporates evolutionary information as well. Similarly, DP-Bind (Hwang,
Gou, & Kuznetsov, 2007) relies on support vector machine, kernel logistic
regression, and penalized logistic regression based on amino acid composition
and evolutionary profiles. Another approach, NAPS (Carson, Langlois, & Lu,
2010), combines a decision tree algorithm with bootstrap aggregation and
cost-sensitive learning. Finally, metaDBSite (Si, Zhang, Lin, Schroeder, &
Huang, 2011) predicts DNA-binding residues by integrating the prediction
of six different methods (including BindN and DP-Bind).
5.1.2 Methods based on structure
Methods based on structure use features extracted from known 3D interfaces
to predict protein-binding regions. In particular, Fernandez-Recio et al.
(2005) used the Optimal Docking Area (ODA) of a protein based on atomic
solvation parameters. This method looks for favorable energy changes
when the residues involved in the interface become buried upon binding.
In addition, a few methods for predicting protein–DNA/RNA-binding
regions are based on structure too. For instance, DISPLAR (Tjong &
Zhou, 2007) uses neural networks trained on known structures of protein–
DNA interactions to predict the residues that contact DNA. The inputs to
the neural network include position-specific sequence profiles and solvent
accessibilities of each residue and its spatial neighbors. DNABINDPROT
(Ozbek, Soner, Erman, & Haliloglu, 2010) exploits Gaussian network models
to predict DNA-binding residues, based on the fluctuations of residues
in high-frequency modes. In DR_bind (Chen, Wright, & Lim, 2012), the
identification of DNA-binding residues is based on electrostatics, sequence
conservation, and structural geometry. Regarding the prediction of
RNA-binding sites, an evolution of ODA, Optimal Protein-RNA Area (OPRA)
(Pérez-Cano & Fernández-Recio, 2010), uses statistical potentials derived
from the differential propensities of amino acids at protein–RNA interfaces,
weighted by their accessible surface area, to predict RNA-binding regions in
proteins. Furthermore, OPRA was used in protein–RNA docking and
successfully selected near-native conformations of protein–RNA interactions by
simply using the correct prediction of the protein residues involved in the
interaction (Pérez-Cano et al., 2010).
5.1.3 Application of split-statistical potentials to predict
protein-binding sites
The specific properties exhibited by protein interfaces are present in the
split-statistical potentials derived from known interacting domains (Feliu et al.,
2011). In fact, the statistical potential “E_local” is based on the probability
of an amino acid to be in a certain environment, as defined by its
hydrophobicity, degree of exposure, and secondary structure (see Section 2.1).
In order to show the ability of split-statistical potentials in identifying
protein interfaces, we have tested both ODA and the potential “E_local” on the
unbound structures retrieved from the protein docking benchmark version 3.0
(Hwang et al., 2008). The ODA predictions were obtained using the
pyDock software (Cheng, Blundell, & Fernandez-Recio, 2007). In the case
of “E_local”, the prediction of the binding site (i.e., “BS-E_local”) was
obtained by scoring and ranking into a list each residue in the protein surface.
The score of a residue in the surface was calculated by averaging the Z-scores
of “E_local” of the residues within a radius of 15 Å, as defined by the
distances between their Cβ atoms (Cα for glycines). Then, binding regions were
defined iteratively, starting from the top ranked residue in the list. The first
binding site was defined by the surface residues within a radius of 15 Å
around the top ranked residue. Residues belonging to a binding site were
removed from the list and the iteration was repeated until the next residue
in the list had a negative score or there were no more residues left. The score
of a binding region was defined as the sum of scores of its residues.
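The “BS-E_local” procedure described above can be sketched in a few lines. The sketch makes several assumptions that are not stated in the text: all residues passed to the functions are surface residues, each is represented by a single coordinate (Cβ, or Cα for glycine), the Z-scores are oriented so that higher values are more interface-like, and each new binding region is built only from residues that are still in the list. All names are hypothetical.

```python
import math

def smooth_scores(coords, zscores, radius=15.0):
    """Score of a residue = mean Z-score of the residues within `radius` of it."""
    scores = {}
    for res in coords:
        neighbours = [zscores[other] for other in coords
                      if math.dist(coords[res], coords[other]) <= radius]
        scores[res] = sum(neighbours) / len(neighbours)
    return scores

def binding_regions(coords, scores, radius=15.0):
    """Iteratively grow regions around the top-ranked remaining residue."""
    remaining = sorted(scores, key=scores.get, reverse=True)
    regions = []
    # Stop when the list is empty or the next residue has a negative score.
    while remaining and scores[remaining[0]] >= 0:
        seed = remaining[0]
        members = [res for res in remaining
                   if math.dist(coords[seed], coords[res]) <= radius]
        region_score = sum(scores[res] for res in members)
        regions.append((members, region_score))
        remaining = [res for res in remaining if res not in members]
    return regions

# Toy example with four residues placed along the x axis (distances in angstroms).
coords = {"A": (0.0, 0.0, 0.0), "B": (5.0, 0.0, 0.0),
          "C": (40.0, 0.0, 0.0), "D": (80.0, 0.0, 0.0)}
zscores = {"A": 3.0, "B": 1.0, "C": 1.0, "D": -1.0}
scores = smooth_scores(coords, zscores)
regions = binding_regions(coords, scores)
```

In this toy case residues A and B form the first region, C forms a second one on its own, and D is never used because its score is negative.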
In Fig. 4.1, we show the performance of ODA, “BS-E_local”, and their
combination (i.e., residues predicted by both methods to be in the binding
site), in terms of percentage of proteins with a minimum positive predictive
value (PPV) of the predicted residues to be involved in the real binding site
(see details in the legend). Results were compared with a background
distribution of random predictions with a similar distribution of binding sites
along the sequence (i.e., preserving the topology) as the predictions produced
by each method. On the one hand, it is noteworthy that the combination of
ODA and “BS-E_local” yielded predictions that reached PPVs higher than
75% for about 40% of proteins of the benchmark. On the other hand, each
method could be applied to more than 80% of proteins of the benchmark,
but they achieved PPVs higher than 75% for less than 40% of the proteins.
Besides, the individual performances of ODA and “BS-E_local” were not
strikingly different from random predictions, while the combination of both
methods differed considerably from the distribution of topologically similar
predictions (i.e., random predictions), thus being more significant.

Figure 4.1 Coverage of the prediction of binding sites versus its minimum PPV. The Y
axes show the ratio of proteins with a PPV equal to or greater than a threshold (X axes).
We have used ODA (Fernandez-Recio et al., 2005) with a minimum pyDockODA score of
10 (A), the prediction based on BS-E_local with a minimum score of 2 (B), and the
binding sites predicted by both (C). The testing dataset contains 85 nonredundant proteins
extracted from the docking benchmark 3.0 for which we know the real binding region.
PPV is defined as the proportion of correctly predicted residues for each protein over the
total number of predicted residues. The binding interface for a protein is defined as the
set of residues found to be closer than 12 Å to any other interacting protein reported
in the PDB database (Berman et al., 2000), which includes the interacting partner in the
benchmark. In order to validate the quality of the prediction, we have calculated the
background distribution of obtaining the same PPV thresholds by a random selection
of the same number of residues as the actual prediction with ODA (A), BS-E_local (B), or
both (C). The background distribution is shown in boxplots, and it is calculated using
sliding windows of the size of each fragment of predicted residues. This definition
allows us to compare predictions with similar topology. A horizontal dashed line
indicates the applicability of each method (i.e., proportion of proteins with at least
one residue predicted in the binding site).
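The quantities plotted in Fig. 4.1 follow directly from the definitions given in the legend. The sketch below computes the PPV of one prediction, the combination of two predictions as the residues predicted by both methods, and the fraction of proteins reaching a minimum PPV. It assumes that proteins without any predicted residue count as failures in that fraction; the function names and the residue numbers are hypothetical.

```python
def ppv(predicted, interface):
    """Correctly predicted residues over all predicted residues.

    Returns None when nothing was predicted (method not applicable).
    """
    predicted = set(predicted)
    if not predicted:
        return None
    return len(predicted & set(interface)) / len(predicted)

def combine(prediction_a, prediction_b):
    """Residues predicted to be in the binding site by both methods."""
    return set(prediction_a) & set(prediction_b)

def fraction_reaching(ppvs, threshold):
    """Fraction of proteins whose PPV is equal to or greater than `threshold`."""
    reached = sum(1 for value in ppvs if value is not None and value >= threshold)
    return reached / len(ppvs)

# Toy example for a single protein (residue numbers are made up).
interface = {2, 3, 9}
oda_like = {1, 2, 3, 4}
bs_like = {2, 3, 7}
both = combine(oda_like, bs_like)
```

Here the first toy prediction has a PPV of 0.5, while the combination keeps only residues 2 and 3 and reaches a PPV of 1.0, which mirrors the trade-off between applicability and precision discussed in the text.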
5.2. Prediction of DNA/RNA-binding proteins
DNA- and RNA-binding proteins can be discriminated from others just
from their amino acid sequences using different features, such as amino acid
composition (Ahmad, Gromiha, & Sarai, 2004; Yu, Cao, Cai, Shi, & Li,
2006) or evolutionary profiles (Kumar, Gromiha, & Raghava, 2007,
2011; Nimrod, Schushan, Szilágyi, Leslie, & Ben-Tal, 2010). Also, the
ability of a protein to bind nucleic acids can be predicted using statistical
potentials. For example, DBD-Hunter (Gao & Skolnick, 2008) is a Web
server for predicting DNA-binding proteins that combines structural
comparisons and evaluation with statistical potentials. Briefly, it scans a given
protein structure against a template library composed of 179 protein–DNA
complex structures using a structural alignment program (Zhang &
Skolnick, 2005). All templates that produce a good structural alignment with
the query protein are then evaluated with statistical potentials. Specifically,
the statistical potential is applied to score all protein–DNA contacts within a
distance of 4.5 Å. The potential also considers whether the contact occurs
through the phosphate, sugar, pyrimidine, or imidazole groups. This
approach performed better than classical sequence homology-based
approaches (i.e., PSI-BLAST; Altschul et al., 1997). An improved version
of the previous method, DBD-Threader (Gao & Skolnick, 2009), has the
advantage that it only requires the sequence of a protein as input. The
sequence is then threaded against the previous template library and, for
the best solutions, the interaction score between the threaded sequence
and the template DNA is calculated. The exact same procedure of
DBD-Hunter, but using an all-atom statistical potential (Xu et al., 2009), has also
been applied to predict DNA- (Zhao et al., 2010) as well as RNA-binding
proteins (Zhao et al., 2011).
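To make the contact-based scoring step more concrete, the following sketch sums a pair potential over all protein–DNA contacts within 4.5 Å, with the DNA side labeled by chemical group. It is not the DBD-Hunter code: the data layout, the function name, the treatment of missing pairs as zero, and the energy value in the toy potential are assumptions introduced for illustration only.

```python
import math

def contact_score(protein_atoms, dna_atoms, potential, cutoff=4.5):
    """Sum a knowledge-based pair potential over protein-DNA contacts.

    protein_atoms: list of (residue_name, (x, y, z)).
    dna_atoms:     list of (group, (x, y, z)), where group is a label such as
                   "phosphate" or "sugar".
    potential:     dict mapping (residue_name, group) to an energy; pairs that
                   are absent from the dict contribute zero (an assumption).
    """
    score = 0.0
    for residue, protein_xyz in protein_atoms:
        for group, dna_xyz in dna_atoms:
            if math.dist(protein_xyz, dna_xyz) <= cutoff:
                score += potential.get((residue, group), 0.0)
    return score

# Toy example: only the arginine is within 4.5 A of the phosphate.
protein_atoms = [("ARG", (0.0, 0.0, 0.0)), ("LEU", (10.0, 0.0, 0.0))]
dna_atoms = [("phosphate", (3.0, 0.0, 0.0)), ("sugar", (20.0, 0.0, 0.0))]
toy_potential = {("ARG", "phosphate"): -1.2}  # made-up energy value
score = contact_score(protein_atoms, dna_atoms, toy_potential)
```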
6. CHARACTERIZATION OF TRANSCRIPTION
FACTOR-BINDING SITES
In the previous section, we have focused on identifying the binding
regions of proteins. However, in protein–DNA/RNA interactions, the
nucleic acid also contains specific regions that are recognized by the protein.
In particular, transcription factors (TFs) can promote or restrain gene tran-
scription by binding to specific nucleotide sequences (i.e., binding sites)
distributed along the genome. Binding sites are often represented with a
position weight matrix (PWM) reflecting the observed degeneracy among
the recognition sites of TFs. PWMs have been exploited by many methods
to search for novel targets of TFs (reviewed in Bulyk, 2003). Therefore, the
identification of TF-binding sites is an important step towards the under-
standing of many biological processes. During the past years, several exper-
imental methods have emerged with the objective to characterize
TF-binding sites (reviewed in Xie, Hu, Qian, Blackshaw, & Zhu, 2011).
Nevertheless, their application is laborious and expensive and, as a result,
they have only been applied to a small fraction of human proteins
(Hu et al., 2009). As an alternative, computational tools can be employed
to predict TF-binding sites. A well-established procedure consists in
searching for over-represented DNA sequences in the promoter regions
of genes regulated by a TF with a motif discovery algorithm (analyzed in
Das & Dai, 2007), but the success of these approaches depends on the avail-
ability of enough sequences for pattern discovery, mainly derived from
ChIP-seq, ChIP-exo, and protein-binding microarrays (Grau, Posch,
Grosse, & Keilwagen, 2013).
Another successful strategy currently employed is the analysis of
TF–DNA complex structures with statistical potentials. Briefly, the TF is
put face to face with different DNA sequences and the binding energies
of the resulting complexes are analyzed. Those sequences with the best
energy are considered to be bound by the TF and are incorporated into a
PWM. For example, Angarica et al. (2008) created an algorithm that, given
a TF–DNA complex, mutated all nucleotide positions one by one using the
3DNA package (Lu & Olson, 2008), until all possibilities were covered (i.e.,
A, C, G, and T). The mutated sequences were then scored with a
knowledge-based potential and the 50 best oligonucleotides were used to
construct a PWM. In another work, Liu et al. (2008) developed a method
based on protein–DNA docking coupled with threading of DNA sequences.
They were able to predict 50% of experimentally determined sites for the
cAMP regulatory protein (CRP) in the top 1% among all 639,232 possible
solutions. They also made a de novo prediction by modeling the ferric uptake
regulator in complex with DNA, which showed similar results as CRP.
Later on, Xu et al. (2009) calculated the PWMs for different TFs by
decomposing the binding energies of the FIRE potential into individual
contributions of each base. The FIRE potential was first described by
Zhou and Zhou (2002), and it was used mostly on homology modeling.
Afterwards, FIRE was readjusted so that it could be applied to predict pro-
tein–protein and protein–DNA interactions (Zhang, Liu, Zhu, & Zhou,
2005). More recently, Alamanova et al. (2010) used an all-atom statistical
potential (Robertson & Varani, 2007) in combination with the MMTSB
tool set (Feig et al., 2004) to recover the PWMs of various members of
two widely studied families of TFs, p53 and NF-κB. In particular, they
were able to create very accurate PWMs for the p53 tetramer and the p50 dimer
as well as for the p50–p65 and p50–RelB heterodimers. They also obtained very
good results with p63 and p73 dimers built by homology modeling using the
p53 DNA-binding domain as template. Finally, Chen, Chien, et al. (2012)
established a procedure to predict PWM when no protein–DNA complex is
available. They superimposed the unbound structure of a TF over the closest
homolog TF structure in complex with DNA. Then, the PWM was esti-
mated as in the work of Xu et al. (2009).
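The general strategy outlined at the beginning of this section, in which candidate DNA sequences are scored on the TF–DNA complex and the best ones are collected into a PWM, can be sketched as follows. The scoring step is not shown: the sketch starts from a list of already scored sequences and assumes that lower energies are better. The function name, the optional pseudocount, and the toy energies are hypothetical; the default of 50 sequences follows the number quoted for Angarica et al. (2008).

```python
def build_pwm(scored_sequences, n_best=50, pseudocount=0.0):
    """Build a position weight matrix from the best-scoring sequences.

    scored_sequences: list of (sequence, energy) pairs, all sequences of equal
                      length; lower energy is assumed to mean better binding.
    Returns a list with one dict per position, mapping base to frequency.
    """
    best = sorted(scored_sequences, key=lambda item: item[1])[:n_best]
    length = len(best[0][0])
    pwm = []
    for i in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for sequence, _energy in best:
            counts[sequence[i]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in "ACGT"})
    return pwm

# Toy example: keep the three best of four scored trinucleotides.
scored = [("ACG", -3.0), ("ACT", -2.5), ("AGG", -2.0), ("TTT", 5.0)]
pwm = build_pwm(scored, n_best=3)
```

A nonzero pseudocount avoids zero frequencies when the resulting matrix is later converted to log-odds scores.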
Although knowledge-based potentials have been a good alternative to
infer TF-binding sites, their application still has some limitations. One of
them is the lack of templates due to the small number of TF–DNA complex
structures available in the PDB. To avoid any bias, statistical potentials are
usually derived from a nonredundant dataset of structures. This redundancy
is generally removed on the TF side of the complex. Yet, TFs can recognize
different binding sites, and in addition, members of the same family of TFs
can bind to distinct DNA sequences (Luscombe & Thornton, 2002). For this
reason, the removal of redundancy can generate statistical potentials
suffering from low counts and, at the same time, low diversity of binding
patterns. Another problem arises because statistical potentials are applied
under the assumption that the contributions of the different DNA base pairs to
the binding energy of the complex are independent from each other, which
is not true (Benos, Bulyk, & Stormo, 2002). Recently, AlQuraishi and
McAdams (2013) addressed the coverage problem by combining TF–DNA
structures with experimentally determined PWMs. The inclusion of PWM
data adapted the statistical potential to the varying binding preferences of
TFs for different binding sites. Still, they highlighted that the use of PWMs
cannot account for interposition dependencies among base pairs.
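The independence assumption discussed above can be made explicit with a small sketch. Under this assumption, the binding energy of a sequence is the sum of per-position contributions, and each position can be converted into a PWM column on its own. The Boltzmann weighting and the `kT` parameter used here are a common modeling choice adopted for illustration, not necessarily the conversion used in the works cited in this section; all names and energy values are hypothetical.

```python
import math

def additive_energy(sequence, position_energies):
    """Energy of a sequence when positions contribute independently."""
    return sum(position_energies[i][base] for i, base in enumerate(sequence))

def energies_to_pwm(position_energies, kT=1.0):
    """Convert per-position base energies into frequencies (Boltzmann weights)."""
    pwm = []
    for energies in position_energies:
        weights = {base: math.exp(-energies[base] / kT) for base in "ACGT"}
        partition = sum(weights.values())
        pwm.append({base: weights[base] / partition for base in "ACGT"})
    return pwm

# Toy example: position 0 has no preference, position 1 favors A.
position_energies = [
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": -1.0, "C": 0.0, "G": 0.0, "T": 0.0},
]
pwm = energies_to_pwm(position_energies)
energy_aa = additive_energy("AA", position_energies)
```

Any dependency between positions, such as a base at position 0 changing the preference at position 1, cannot be represented by `position_energies`, which is the limitation noted in the text.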
6.1. Application of knowledge-based potentials on
DREAM5 targets
In order to evaluate the real capability of statistical potentials in PWM prediction,
we have tested two available online methods, 3DTF (Gabdoulline, Eckweiler,
Kel, & Stegmaier, 2012) and PiDNA (Lin & Chen, 2013), on 83 mouse TFs
from the DREAM5 TF–DNA Motif Recognition Challenge (Weirauch
et al., 2013). These two servers only require a TF–DNA complex structure
in PDB format as input and return the predicted PWM of the TF as output.
6.1.1 Modeling TF–DNA complexes
Since there were no TF–DNA complex structures available in the PDB for
the majority of the DREAM5 targets, we developed a novel modeling pro-
tocol that allowed us to obtain a TF–DNA model for a total of 71 DREAM5
targets. An overview of the procedure is shown in Fig. 4.2. In step 1, for each
TF target, we searched the best template in a database of TF–DNA com-
plexes using BLAST (Altschul et al., 1997). The database was obtained by
selecting from the PDB all TF–DNA complex structures annotated in the
TFinDit depository (Turner, Kim, & Guo, 2012) that, according to
3DNA (Lu & Olson, 2008), contained a double-stranded DNA of at least
eight base pairs. Then, we identified all dimers in the database by grouping
any two protein chains from the same PDB that (1) had at least one common
contact with the DNA and (2) had more than five residue–residue contacts
between them so as to form a binary complex (Mosca, Céol, & Aloy, 2013). In
step 2, BLAST hits were filtered according to two criteria: (1) enough per-
centage of sequence identity and (2) no gaps or insertions in the region of the
interface. With respect to the percentage of identity, based on a recent work
where we observed that TFs sharing little sequence identity can still bind to
the same genes (Gitter et al., 2009), we included distantly related sequences
according to Rost’s sequence identity curve (Rost, 1999), using parameters
adjusted to ensure a 99% precision rate (i.e., n = 5). In step 3, the template
sequence that passed the filter and had the best BLAST e-value was realigned
with the TF using matcher, from the EMBOSS package (Rice, Longden, &
Bleasby, 2000). In step 4, the alignment was used to create a structural model
of the TF with MODELLER (Eswar et al., 2006). Models were created
applying 3D restraints between Cα atoms. These restraints are the pairwise
distances from the Cα atom of each residue to the Cα atoms of any residues
within a radius of 15 Å, conserved between the template and the model. In
step 5, the final TF–DNA complex was obtained by superimposition of the
protein model on the template using PyMOL (Schrödinger, 2010).
Additionally, for TFs from the bHLH and bZIP families, since they recognize
DNA as homo- or heterodimers, we modeled the dimer: if the two monomers
were found among the hits that passed the filter in step 2, the dimer was
obtained as before (i.e., steps 3–5), but using both template hits; otherwise,
if only one monomer could be modeled, it was superimposed as in step 5 on
both template chains of the dimer. Table 4.3 shows the 71 DREAM5 targets
that could be modeled following this procedure, the percentage of sequence
identity, and coverage of the pairwise alignments between the TFs and their
templates, and the resulting RMSD of the superimposition.

Figure 4.2 Pipeline for modeling transcription factors. Step 1: sequence homology search.
Step 2: filter results of step 1 by sequence identity and coverage of the protein–DNA
interface. Step 3: optimization of the alignment. Step 4: model-building of the
three-dimensional structure of the TF. Step 5: superimposition of the model over the
template. If the TF works as a homodimer and only one monomer can be modeled using
a heterodimer as template, the model is superimposed on each chain of the template to
construct the homodimer. Structural images were created with the UCSF Chimera
package (Fraenkel & Pabo, 1998; Glover & Harrison, 1995; Pettersen et al., 2004).

6.1.2 Analysis of PWM predictions
A first analysis revealed that PiDNA is very sensitive to the format of input
files. It occasionally failed even though, in all the models we had produced,
the DNA molecule had at least eight base pairs in the correct format (according
to 3DNA). In contrast, 3DTF could interpret all except one model, but it
produced uniform PWMs for 46 TFs (i.e., PWMs with null capacity of
discrimination). As a result of this analysis, the applicability of PiDNA (28/71)
was slightly better than that of 3DTF (24/71). Furthermore, out of the 13
different families of TFs taken from the DREAM5 challenge that could be modeled,
3DTF and PiDNA could only make predictions for seven of them. Table 4.3
shows the quality of the predictions by means of comparing the PWMs
produced by 3DTF and PiDNA with the real PWMs, using Tomtom (Gupta,
Stamatoyannopoulos, Bailey, & Noble, 2007), as distributed in the MEME
package (Bailey et al., 2009). Tomtom calculates the similarity between a
pair of PWMs by means of a P-value. Using a P-value threshold of 10⁻³,
PiDNA correctly predicted the PWM for 10 TFs and 3DTF for 7 (five
of which were common to both of them). Besides, we observed that
PWM predictions deteriorated as the quality of the alignment between the
target and the template decreased. One possible reason is that both 3DTF and
PiDNA rely on all-atom statistical potentials and they are sensitive to the
wrong orientation of amino acid side chains that could occur upon modeling.

Table 4.3 PWM predictions for targets of the DREAM5 challenge
TF      Family    PDB   Chain  %ID  %Cov  RMSD  P-values (3DTF; PiDNA; E_S3DC)
Egr2    C2H2 ZF   1p47  A      94   100   0.10  1.8 × 10⁻²
Esr1    NR        1hcq  A      100  98    0.02  2.1 × 10⁻⁴
Esrrb   NR        3dzy  A      36   99    1.08  -; 1.5 × 10⁻³; 3.2 × 10⁻³
Esrrg   NR        3dzy  A      36   99    0.34  1.7 × 10⁻³; 1.4 × 10⁻⁴; -
Foxc2   Forkhead  1vtn  C      72   96    0.01  -; 8.4 × 10⁻³; -
Foxo1   Forkhead  3co6  C      100  100   1.59  1.4 × 10⁻⁴; 1.9 × 10⁻⁴
Foxo3   Forkhead  2uzk  A      100  98    0.04  -; 7.7 × 10⁻⁶; 6.6 × 10⁻³
Foxo4   Forkhead  2uzk  A      83   88    0.04  -; 4.5 × 10⁻⁴; 4.9 × 10⁻³
Foxo6   Forkhead  3co6  C      91   100   1.55  3.9 × 10⁻³; 3.1 × 10⁻³
Gata4   GATA      4hc7  A      86   97    1.11  -; 2.9 × 10⁻³; -
Hmga2   AT hook   2eze  A      80   80    0.35  8.2 × 10⁻⁴
Klf12   C2H2 ZF   2wbu  A      78   97    0.06  3.4 × 10⁻⁴; 1.1 × 10⁻³; -
Klf8    C2H2 ZF   2wbu  A      75   97    0.09  7.7 × 10⁻⁵; 1.1 × 10⁻⁴; -
Nr2e1   NR        3e00  A      57   23    0.01  1.8 × 10⁻⁴; -; -
Nr2f1   NR        3dzy  A      45   99    2.33  -; 3.0 × 10⁻³; -
Nr2f6   NR        3e00  A      39   54    4.61  -; 3.3 × 10⁻⁷; -
Pou3f1  Hom       2xsd  C      100  100   0.07  -; 1.9 × 10⁻³
Sox6    Sox       3f27  D      56   98    0.04  -; 3.9 × 10⁻³; -
Sp1     C2H2 ZF   2wbu  A      57   97    0.08  -; 4.8 × 10⁻³; 2.6 × 10⁻⁴
Tbx1    T-box     4a04  B      100  98    0.02  9.3 × 10⁻⁷; 9.6 × 10⁻⁶; 4.5 × 10⁻³
Tbx20   T-box     4a04  A      66   99    0.02  7.7 × 10⁻⁶; 2.2 × 10⁻⁵; 7.7 × 10⁻⁴
Tbx4    T-box     2x6v  A      93   99    0.01  8.2 × 10⁻⁷; 1.3 × 10⁻⁶; -
Tbx5    T-box     2x6v  A      100  100   0.01  1.1 × 10⁻⁶; 2.2 × 10⁻⁶; -
Tcf3    bHLH      2ql2  C      100  100   0.09  2.2 × 10⁻³; -; -
                        D      *    *     3.12
Tcfec   bHLH      4ati  B      88   100   0.18  1.5 × 10⁻³; -
                        A      85   100   0.10
Zfp202  C2H2 ZF   2i13  A      51   100   0.38  1.1 × 10⁻²
Transcription factors (TF) from the DREAM5 challenge and their families are shown in the first columns. Families “NR” and “Hom” stand for “nuclear receptor” and
“homeodomain”, respectively. PDB codes and chains of the templates used to model the TFs are shown in the next columns. This is followed by the quality of the model
shown by means of the percentage of sequence identity (%ID) and template coverage (%Cov) of the sequence alignment, and the RMSD of the superimposition. For
dimers, the information regarding each monomer can be found in separate lines. Asterisks indicate that the homodimer was built by superimposing the model of one chain
to both chains of the template heterodimer. The significance of similarity between the predicted and the real PWMs is shown with the P-value for 3DTF and PiDNA, and
the statistical potential “E_S3DC”. A hyphen indicates that the P-value is not significant and the cell is left empty when the method failed to produce a PWM.
P-values are listed in column order (3DTF, PiDNA, E_S3DC); a row with fewer than three entries contains empty cells, whose column is not indicated.
Note: Only TFs with significant predictions are shown.
7. ADAPTING SPLIT-STATISTICAL POTENTIALS FOR
PROTEIN–DNA INTERACTIONS
As shown in Section 6.1, much improvement is required in the area of
TF-binding site prediction based on structure (i.e., via statistical potentials).
In this section, we propose a series of changes to the previously described
split-statistical potentials for protein folding (Aloy & Oliva, 2009) and
protein–protein interactions (Feliu et al., 2011) in order to adapt them to
protein–DNA interactions.
The application of split-statistical potentials to protein–DNA interactions
requires the definition of an environment for nucleotides. Moreover,
in order to address the additivity problem (Benos et al., 2002), we have
defined statistical potentials for dinucleotides (i.e., two consecutive
nucleotides along the DNA sequence). The DNA environment of a
dinucleotide is therefore defined by its constituent bases (i.e., any ordered
pair of the four nucleotides) and three features of the interaction
between the amino acid and the dinucleotide: (1) the strand (i.e., forward
or reverse) that is closer to the amino acid; (2) the DNA groove (i.e., major
or minor) where the amino acid is located (or close to); and (3) the closest
chemical group of the dinucleotide (i.e., nucleobase or deoxyribose
phosphate) to the amino acid (see Fig. 4.3 and Section 7.1.1 for more details).
The definition of environments yields several residue–environment
combinations. For amino acids, we consider 20 residues and 6 different envi-
ronments as before (i.e., helix, coil, or strand, and being buried or exposed).
This produces a total of 120 combinations of amino acids and environments.
In contrast, we consider 16 dinucleotides (i.e., $4^2$ different combinations of
two nucleotides) and 8 environments, obtained by combining 2 options for the
closest strand, 2 for the closest DNA groove, and 2 for the closest chemical
group of the dinucleotide. These definitions produce a total of 128
dinucleotide–environment combinations.
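The counts above can be checked by enumerating the combinations explicitly. The following is a minimal sketch; the label strings (e.g., "helix", "forward") and variable names are illustrative choices, not identifiers taken from the original implementation.

```python
from itertools import product

# Amino acid side: 20 residues x 3 secondary structures x 2 burial states.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues
SECONDARY_STRUCTURE = ("helix", "coil", "strand")
BURIAL = ("buried", "exposed")

# DNA side: 16 dinucleotides x 2 strands x 2 grooves x 2 chemical groups.
NUCLEOTIDES = "ACGT"
STRANDS = ("forward", "reverse")
GROOVES = ("major", "minor")
CHEMICAL_GROUPS = ("nucleobase", "backbone")

amino_acid_states = list(product(AMINO_ACIDS, SECONDARY_STRUCTURE, BURIAL))
dinucleotides = ["".join(pair) for pair in product(NUCLEOTIDES, repeat=2)]
dinucleotide_states = list(product(dinucleotides, STRANDS, GROOVES, CHEMICAL_GROUPS))

print(len(amino_acid_states))    # 120 residue-environment combinations
print(len(dinucleotide_states))  # 128 dinucleotide-environment combinations
```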
Given a particular interaction between an amino acid “a” and a dinucleotide
“mn” (where “m” and “n” can be any nucleotide), we define the statistical
potentials $E_{pair}$, $E_{local}$, $E_{3D}$, $E_{3DC}$, and $E_{S3DC}$ as in
Section 2.1 by replacing “b” with “mn” in Eqs. (4.2) and (4.3). The
contributions of the reference state and of the $E_{3D}$ potential are ignored,
as are the contributions of the $E_{local}$ terms. On the one hand, the
$E_{local}$ contribution of the DNA is not considered because, as long as it is
accessible, any nucleotide sequence can be bound by a TF (Urnov, Rebar, Holmes,
Zhang, & Gregory, 2010) and, as a result, the environment conditions of
the base pairs are not relevant. On the other hand, for a given TF, the
$E_{local}$ term that depends on the protein is the same for every candidate
DNA-binding site and is therefore irrelevant for discriminating among them.
Consequently, we have selected the statistical potential $E_{S3DC}$ to evaluate
the prediction of DNA-binding sites for the targets of the DREAM5 challenge.
7.1. Application of split-statistical potentials on
DREAM5 targets
As a pilot test, we have applied these split-statistical potentials to predict the
PWMs of the 71 DREAM5 targets modeled in Section 6.1.
7.1.1 Split-statistical potentials for protein–DNA interactions
We derived the potentials from a nonredundant set of templates of the
TFinDit repository (Turner et al., 2012) (see Section 6.1.1). Specifically,
templates were split into chains and redundancy was removed so that any
two chains shared less than 35% of protein–DNA contacts. A contact was
defined between an amino acid and a dinucleotide if the Cβ atom of the
amino acid (Cα for glycines) was at 15 Å or less from the center of the
dinucleotide and its complementary bases (i.e., the geometrical center defined
by the four phosphate atoms of the two nucleotides and their partners
in the complementary DNA strand; see Fig. 4.3B). Figure 4.3 shows
how the features that define the DNA environment in the statistical
potential are calculated. We used 3DNA (Lu &
Olson, 2008) to define which DNA residues constituted the reference strand
(i.e., forward) and which the complementary strand (i.e., reverse). Moreover, for
calculating the potential, we referred to “mn” as the pair of nucleotides from
the reference strand of the dinucleotide. Also, we used the distances between
the Cβ atom of the amino acid and the phosphate atoms of each dinucleotide
to decide which of the two DNA strands was the closest (see Fig. 4.3C).
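The contact criterion and the choice of the closest strand can be sketched as follows. This is only an illustration under stated assumptions: coordinates are plain (x, y, z) tuples in Å, the function names are hypothetical, and ties between strands are resolved in favor of the forward strand, a case the text does not address.

```python
import math

def centroid(points):
    """Geometrical center of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def is_contact(cb_xyz, forward_phosphates, reverse_phosphates, cutoff=15.0):
    """True if the C-beta atom (C-alpha for glycine) lies within `cutoff`
    angstroms of the center defined by the four phosphate atoms: two from
    the dinucleotide and two from its partners on the complementary strand."""
    center = centroid(list(forward_phosphates) + list(reverse_phosphates))
    return math.dist(cb_xyz, center) <= cutoff

def closest_strand(cb_xyz, forward_phosphates, reverse_phosphates):
    """Strand whose phosphate atoms are nearest to the C-beta atom.
    Assumption: a tie is assigned to the forward strand."""
    d_forward = min(math.dist(cb_xyz, p) for p in forward_phosphates)
    d_reverse = min(math.dist(cb_xyz, p) for p in reverse_phosphates)
    return "forward" if d_forward <= d_reverse else "reverse"

# Toy coordinates, not taken from any real structure.
fwd = [(0.0, 0.0, 0.0), (0.0, 0.0, 3.4)]
rev = [(18.0, 0.0, 0.0), (18.0, 0.0, 3.4)]
print(is_contact((5.0, 0.0, 1.7), fwd, rev), closest_strand((5.0, 0.0, 1.7), fwd, rev))
```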
In order to identify the closest DNA groove (see Fig. 4.3A), we adapted a
definition of groove widths (El Hassan & Calladine, 1998): First, we selected
the closest phosphate from each strand to the Cβ atom of the amino acid; let this
be at position “i” for strand S and at position “j” for strand S′. Second, if
i < j, we calculated the distances between the phosphate atom at position “i”
in S and the phosphate atoms at positions “i+3” (i.e., $D_{i+3}$) and “i+4”
(i.e., $D_{i+4}$) in S′. Finally, if $D_{i+3} > D_{i+4}$, the amino acid was
located in the major groove; otherwise, it was located in the minor groove.
Additionally, if in the second step i > j, instead of calculating the distances
to positions “i+3” and “i+4” in S′, we used the distances to “i−3” and “i−4”
in S′ and applied the same criterion to
select the DNA groove where the amino acid was located.
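The groove assignment described above can be sketched as follows. The data layout is an assumption: each strand is a dictionary mapping a position index to the phosphate coordinates, with both strands sharing one numbering scheme. The text does not specify the case i = j, so the sketch treats it like i < j, and it returns None when the required positions fall beyond the end of the DNA; both choices are assumptions rather than part of the published procedure.

```python
import math

def closest_groove(cb_xyz, strand_s, strand_s_prime):
    """Assign 'major' or 'minor' groove to an amino acid.

    strand_s, strand_s_prime: dicts {position: (x, y, z)} with the phosphate
    coordinates of each strand (hypothetical layout, shared numbering).
    """
    # Step 1: closest phosphate of each strand to the C-beta atom.
    i = min(strand_s, key=lambda k: math.dist(cb_xyz, strand_s[k]))
    j = min(strand_s_prime, key=lambda k: math.dist(cb_xyz, strand_s_prime[k]))

    # Step 2: look 3 and 4 positions ahead (i < j) or behind (i > j) in S'.
    # Assumption: i == j is handled like i < j.
    step = 1 if i <= j else -1
    try:
        d3 = math.dist(strand_s[i], strand_s_prime[i + 3 * step])
        d4 = math.dist(strand_s[i], strand_s_prime[i + 4 * step])
    except KeyError:
        return None  # positions outside the modeled DNA

    # Step 3: criterion as stated in the text.
    return "major" if d3 > d4 else "minor"
```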
Figure 4.3 Definition of different DNA parameters used for deriving split-statistical
potentials. Distances between amino acids and DNA are represented as blue lines, solid
when displaying the minimal distance and dashed otherwise. Internal distances in the
DNA are shown in orange. Environment features of the DNA for a contact between an
amino acid and a dinucleotide at position i (see details in text for each definition):
groove in contact with the amino acid (A); distance between the amino acid and the
dinucleotide (B); strand in contact with the amino acid (C); and DNA chemical group
in contact with the amino acid (D). Structural images were created with the UCSF
Chimera package (Fraenkel & Pabo, 1998; Pettersen et al., 2004).

The interaction between the amino acid and the DNA could be either
with the backbone of the DNA (i.e., any atom of the deoxyribose
phosphate) or with the nucleobase (i.e., any atom of the nitrogenous base). This
was defined by the minimum distance between the atoms of the amino acid
and the atoms of the dinucleotide. If the closest atom of the dinucleotide
belonged to the phosphate group or the deoxyribose, the interaction was
with the backbone; otherwise, it was with the nucleobase (see Fig. 4.3D).
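A minimal sketch of this classification is given below. It assumes heavy atoms only and the current PDB atom names for the deoxyribose phosphate (older files use O1P and O2P instead of OP1 and OP2 and would need to be mapped first); the function name and input layout are hypothetical.

```python
import math

# Heavy atoms of the deoxyribose phosphate in current PDB nomenclature.
BACKBONE_ATOMS = {
    "P", "OP1", "OP2", "OP3",
    "O5'", "C5'", "C4'", "O4'", "C3'", "O3'", "C2'", "C1'",
}

def closest_chemical_group(amino_acid_atoms, dinucleotide_atoms):
    """Return 'backbone' or 'nucleobase' for an amino acid-dinucleotide contact.

    amino_acid_atoms: list of (x, y, z) coordinates.
    dinucleotide_atoms: list of (atom_name, (x, y, z)) pairs.
    """
    def distance_to_amino_acid(named_atom):
        _, xyz = named_atom
        return min(math.dist(xyz, a) for a in amino_acid_atoms)

    # Atom of the dinucleotide at minimum distance from any amino acid atom.
    name, _ = min(dinucleotide_atoms, key=distance_to_amino_acid)
    return "backbone" if name in BACKBONE_ATOMS else "nucleobase"
```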
Finally, to make the potentials independent of the arbitrary designation
of forward and reverse strands by 3DNA, we also considered, for
each protein–DNA contact, its complementary contact (i.e., the contact that
would have been recorded if the complementary strand had been taken as the
reference). For example, the complementary contact of an amino acid with two
adenosines, through the forward strand, the major groove, and the backbone,
would be with two thymidines, through the reverse strand, the major
groove, and the backbone. This directly enlarges the knowledge base of
interactions and naturally increases the number of amino acid–dinucleotide
pairs in the structural database.
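The construction of the complementary contact can be illustrated with a short sketch. The tuple layout and names are hypothetical. The sketch also assumes that the dinucleotide on the other strand is read in its own 5′→3′ direction (i.e., as the reverse complement); the AA/TT example in the text is compatible with this reading but does not by itself confirm it.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
OTHER_STRAND = {"forward": "reverse", "reverse": "forward"}

def complementary_contact(amino_acid, dinucleotide, strand, groove, group):
    """Contact that would be recorded if the other strand were the reference.

    The dinucleotide is replaced by its reverse complement (assumption) and
    the strand label is swapped; groove and chemical group are unchanged.
    """
    reverse_complement = "".join(COMPLEMENT[base] for base in reversed(dinucleotide))
    return (amino_acid, reverse_complement, OTHER_STRAND[strand], groove, group)

print(complementary_contact("ARG", "AA", "forward", "major", "backbone"))
# ('ARG', 'TT', 'reverse', 'major', 'backbone')
```

Counting each observed contact together with its complementary contact makes the derived counts symmetric with respect to the choice of reference strand.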
7.1.2 PWM prediction
The PWM of each TF was calculated by adapting the procedure of Xu et al.
(2009) to account for the interaction of amino acids with dinucleotides. We
used the scores of the interaction of the protein with a dinucleotide to cal-
culate the probability of a single nucleotide position in the PWM:
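$$
P(a_i) = \frac{\sum_{m_{i-1}} \exp\left(-\mathrm{PMF}(a, m_{i-1}a_i)\right) + \sum_{m_{i+1}} \exp\left(-\mathrm{PMF}(a, a_i m_{i+1})\right)}{\sum_{n_i}\left[\sum_{m_{i-1}} \exp\left(-\mathrm{PMF}(a, m_{i-1}n_i)\right) + \sum_{m_{i+1}} \exp\left(-\mathrm{PMF}(a, n_i m_{i+1})\right)\right]}
$$
where $P(a_i)$ is the probability of nucleotide “a” at position “i”. The sums
over $m_{i-1}$ and $m_{i+1}$ run over the four nucleotides at the neighboring
positions, and the sum over $n_i$ in the denominator normalizes over the four
nucleotides at position “i”.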
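The calculation of a PWM column from dinucleotide scores can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes that the total score of the protein with each of the 16 dinucleotides has already been computed for every dinucleotide step, that lower scores are more favorable (hence the negative exponent), and that terminal positions use only the single dinucleotide step available to them. The function name and input layout are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

def pwm_from_dinucleotide_scores(step_scores):
    """Build a PWM from dinucleotide scores.

    step_scores[k][d] is the score of dinucleotide d (e.g. "AC") placed at
    DNA positions (k, k+1). The returned PWM has len(step_scores) + 1 rows,
    each a dict {nucleotide: probability}.
    """
    n_positions = len(step_scores) + 1
    pwm = []
    for i in range(n_positions):
        weights = {}
        for a in NUCLEOTIDES:
            w = 0.0
            if i > 0:  # dinucleotide formed with the previous position
                w += sum(math.exp(-step_scores[i - 1][m + a]) for m in NUCLEOTIDES)
            if i < n_positions - 1:  # dinucleotide formed with the next position
                w += sum(math.exp(-step_scores[i][a + m]) for m in NUCLEOTIDES)
            weights[a] = w
        total = sum(weights.values())  # normalization over the four nucleotides
        pwm.append({a: weights[a] / total for a in NUCLEOTIDES})
    return pwm

# Toy example: one dinucleotide step in which "AC" is favored.
scores = {x + y: 0.0 for x in NUCLEOTIDES for y in NUCLEOTIDES}
scores["AC"] = -2.0
for row in pwm_from_dinucleotide_scores([scores]):
    print({nucleotide: round(p, 3) for nucleotide, p in row.items()})
```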
The PWMs produced for each modeled TF target were analyzed as in
Section 6.1.2 (see Table 4.3). In contrast to 3DTF and PiDNA, we could
apply the $E_{S3DC}$ score to all modeled DREAM5 targets, which implies
that we covered 13 out of 15 families of TFs (six more families than 3DTF
and PiDNA combined). Moreover, we obtained significant results
for five different TFs, three of which could not be retrieved with either 3DTF
or PiDNA (see Table 4.3). In Fig. 4.4, we compare the logos produced by
$E_{S3DC}$ with the logos produced by 3DTF and PiDNA for three specific
DREAM5 targets (Foxo1, Nr2e1, and Tbx20). As observed in Table 4.3,
PiDNA produced significant logos for all three targets, while $E_{S3DC}$ and
3DTF each predicted significant logos for two TFs: $E_{S3DC}$ for Foxo1 and
Tbx20, and 3DTF for Nr2e1 and Tbx20. $E_{S3DC}$ also predicted
a logo for Nr2e1, but it was not significant because it failed to predict two
out of the four nucleotides of the central motif “GTCA” (it could only predict
“GT”). The full comparison of $E_{S3DC}$ with 3DTF and PiDNA can be
found in Table 4.3; it shows a reasonable improvement not only in terms
of applicability but also in terms of specificity, as measured by significant
matches to the real PWMs.
8. CONCLUSIONS
We have reviewed the use of knowledge-based potentials as a tool for
the analysis of protein–protein, protein–DNA, and protein–RNA interac-
tions. We have explored the general definition of knowledge-based poten-
tials and described the procedure to split them into different energetic terms.
We have extensively discussed the application of statistical potentials in (1)
the evaluation of protein modeling, including homology and integrative
modeling; (2) the evaluation of protein–protein and protein–nucleic acids
docking; (3) the prediction of protein-binding regions; and (4) the charac-
terization of TF-binding sites. Finally, we have provided several resources
available online for docking, ranking scoring functions of interactions,
and benchmark databases.
Figure 4.4 Examples of PWM logos. PWM logos for Foxo1, Nr2e1, and Tbx20, as described
in the DREAM5 challenge, compared with the predictions produced by the statistical
potential $E_{S3DC}$ and the state-of-the-art methods 3DTF and PiDNA. Logos were created
with the R software environment (Bembom, 2007; R Core Team, 2013).

We have shown that modeling of protein interactions is still limited by
the lack of 3D data. Still, this problem can be addressed using docking
approaches. We have also shown that docking methods benefit from the
prediction of binding sites. In this context, we have proposed two pilots
for predicting binding regions, one for proteins and another for DNA, based
on split-statistical potentials. On the one hand, we have tested our approach
for predicting binding sites in proteins, BS-$E_{local}$, and we have compared
it with another state-of-the-art methodology. This test revealed that, while
neither method yields significant predictions on its own, their combination
improves the significance of the binding regions predicted for a protein. We
have been able to achieve PPVs higher than 80% for almost one quarter of the
proteins tested in our benchmark (for half of them, the methods produced
different results, and for half of the proteins for which this was applied, the
result was successful). On the other hand, we have proposed a modification of
the split-statistical potentials for protein–DNA interactions. We have applied
them to predict DNA-binding sites by modeling the structure of the TF and
constructing an artificial PWM logo. Our results were comparable to those of
state-of-the-art methods such as 3DTF and PiDNA; additionally, we
extended the application to several targets of the DREAM5 challenge for
which 3DTF and PiDNA could not be applied or did not produce significant
results.
We have examined the main features involved in TF-binding sites and in the
modeling of protein–protein and protein–DNA complexes. In conclusion,
despite all the advances in the area, there is still wide room for improvement
in the exploitation of statistical potentials, especially in the field of
protein–DNA interactions.
ACKNOWLEDGMENTS
O. F. and B. O. acknowledge the support of FEDER BIO2011-22568 grant from the
Spanish Ministry of Science and Innovation (MICINN). J.G.G. acknowledges support by
“Departament d’Educació i Universitats de la Generalitat de Catalunya i del Fons Social
Europeu” through FI fellowships. J.B. is supported by BIO08-0206 grant from MICINN.
We are very grateful to Dr. Jun-tao Guo (UNCC) for providing us a comprehensive list
of PDB codes for all transcription factors from the TFinDit repository. We are also
thankful to Dr. Fernández Recio (BSC) for providing us the latest version of pyDockOda.
REFERENCES
Ahmad, S., Gromiha, M. M., & Sarai, A. (2004). Analysis and prediction of DNA-binding pro-
teins and their binding residues based on composition, sequence and structural information.
Bioinformatics,20(4), 477–486. http://dx.doi.org/10.1093/bioinformatics/btg432.
Alamanova, D., Stegmaier, P., & Kel, A. (2010). Creating PWMs of transcription factors
using 3D structure-based computation of protein-DNA free binding energies. BMC
Bioinformatics,11(1), 225. http://dx.doi.org/10.1186/1471-2105-11-225.
Alber, F., Dokudovskaya, S., Veenhoff, L. M., Zhang, W., Kipper, J., & Devos, D. (2007).
Determining the architectures of macromolecular assemblies. Nature,450(7170),
683–694. http://dx.doi.org/10.1038/nature06404.
Alber, F., Förster, F., Korkin, D., Topf, M., & Sali, A. (2008). Integrating diverse data for
structure determination of macromolecular assemblies. Annual Review of Biochemistry,
77(1), 443–477. http://dx.doi.org/10.1146/annurev.biochem.77.060407.135530.
Aloy, P., & Oliva, B. (2009). Splitting statistical potentials into meaningful scoring functions:
Testing the prediction of near-native structures from decoy conformations. BMC Struc-
tural Biology,9(1), 71. http://dx.doi.org/10.1186/1472-6807-9-71.
Aloy, P., & Russell, R. B. (2003). InterPreTS: Protein interaction prediction through tertiary
structure. Bioinformatics,19(1), 161–162. http://dx.doi.org/10.1093/bioinformatics/
19.1.161.
AlQuraishi, M., & McAdams, H. H. (2013). Three enhancements to the inference of statis-
tical protein-DNA potentials. Proteins: Structure, Function, and Bioinformatics,81(3),
426–442. http://dx.doi.org/10.1002/prot.24201.
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., et al. (1997).
Gapped BLAST and PSI-BLAST: A new generation of protein database search programs.
Nucleic Acids Research,25(17), 3389–3402. http://dx.doi.org/10.1093/nar/25.17.3389.
Amos-Binks, A., Patulea, C., Pitre, S., Schoenrock, A., Gui, Y., & Green, J. R. (2011). Bind-
ing site prediction for protein-protein interactions and novel motif discovery using
re-occurring polypeptide sequences. BMC Bioinformatics,12(1), 225. http://dx.doi.
org/10.1186/1471-2105-12-225.
Angarica, V. E., Pérez, A. G., Vasconcelos, A. T., Collado-Vides, J., & Contreras-Moreira, B.
(2008). Prediction of TF target sites based on atomistic models of protein-DNA
complexes. BMC Bioinformatics,9(1), 436. http://dx.doi.org/10.1186/1471-2105-9-436.
Ashkenazy, H., Erez, E., Martz, E., Pupko, T., & Ben-Tal, N. (2010). ConSurf 2010:
Calculating evolutionary conservation in sequence and structure of proteins and nucleic
acids. Nucleic Acids Research,38(Suppl. 2), W529–W533. http://dx.doi.org/10.1093/
nar/gkq399.
Axenopoulos, A., Daras, P., Papadopoulos, G. E., & Houstis, E. N. (2013). SP-dock:
Protein-protein docking using shape and physicochemical complementarity.
IEEE/ACM Transactions on Computational Biology and Bioinformatics,10(1), 135–150.
http://dx.doi.org/10.1109/TCBB.2012.149.
Bailey, T. L., Boden, M., Buske, F. A., Frith, M., Grant, C. E., & Clementi, L. (2009).
MEME Suite: Tools for motif discovery and searching. Nucleic Acids Research,
37(Suppl. 2), W202–W208. http://dx.doi.org/10.1093/nar/gkp335.
Barik, A., Nithin, C., Manasa, P., & Bahadur, R. P. (2012). A protein-RNA docking bench-
mark (I): Nonredundant cases. Proteins: Structure, Function, and Bioinformatics,80(7),
1866–1871. http://dx.doi.org/10.1002/prot.24083.
Baù, D., Sanyal, A., Lajoie, B. R., Capriotti, E., Byron, M., & Lawrence, J. B. (2011). The
three-dimensional folding of the α-globin gene domain reveals formation of chromatin
globules. Nature Structural & Molecular Biology,18(1), 107–114. http://dx.doi.org/
10.1038/nsmb.1936.
Bembom, O. (2007). seqLogo: Sequence logos for DNA sequence alignments.
Benos, P. V., Bulyk, M. L., & Stormo, G. D. (2002). Additivity in protein-DNA interactions:
How good an approximation is it? Nucleic Acids Research,30(20), 4442–4451. http://dx.
doi.org/10.1093/nar/gkf578.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., & Weissig, H. (2000).
The Protein Data Bank. Nucleic Acids Research,28(1), 235–242. http://dx.doi.org/
10.1093/nar/28.1.235.
Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S., &
Karplus, M. (1983). CHARMM: A program for macromolecular energy, minimization,
and dynamics calculations. Journal of Computational Chemistry,4(2), 187–217. http://dx.
doi.org/10.1002/jcc.540040211.
Brylinski, M., & Skolnick, J. (2008). A threading-based method (FINDSITE) for ligand-
binding site prediction and functional annotation. Proceedings of the National Academy of
Sciences of the United States of America,105(1), 129–134. http://dx.doi.org/10.1073/
pnas.0707684105.
Bulyk, M. L. (2003). Computational prediction of transcription-factor binding site locations.
Genome Biology,5(1), 201. http://dx.doi.org/10.1186/gb-2003-5-1-201.
Carson, M. B., Langlois, R., & Lu, H. (2010). NAPS: A residue-level nucleic acid-binding
prediction server. Nucleic Acids Research,38(Suppl. 2), W431–W435. http://dx.doi.org/
10.1093/nar/gkq361.
Chen, C.-Y., Chien, T.-Y., Lin, C.-K., Lin, C.-W., Weng, Y.-Z., & Chang, D. (2012).
Predicting target DNA sequences of DNA-binding proteins based on unbound struc-
tures. PLoS One,7(2), e30446. http://dx.doi.org/10.1371/journal.pone.0030446.
Chen, Y., Kortemme, T., Robertson, T., Baker, D., & Varani, G. (2004). A new hydrogen-
bonding potential for the design of protein-RNA interactions predicts specific contacts
and discriminates decoys. Nucleic Acids Research,32(17), 5147–5162. http://dx.doi.org/
10.1093/nar/gkh785.
Chen, H., & Skolnick, J. (2008). M-TASSER: An algorithm for protein quaternary structure
prediction. Biophysical Journal,94(3), 918–928. http://dx.doi.org/10.1529/
biophysj.107.114280.
Chen, Y. C., Wright, J. D., & Lim, C. (2012). DR_bind: A web server for predicting DNA-
binding residues from the protein structure based on electrostatics, evolution and geometry.
Nucleic Acids Research,40(W1), W249–W256. http://dx.doi.org/10.1093/nar/gks481.
Cheng, T. M.-K., Blundell, T. L., & Fernandez-Recio, J. (2007). pyDock: Electrostatics and
desolvation for effective scoring of rigid-body protein-protein docking. Proteins: Struc-
ture, Function, and Bioinformatics,68(2), 503–515. http://dx.doi.org/10.1002/prot.21419.
Comeau, S. R., Gatchell, D. W., Vajda, S., & Camacho, C. J. (2004a). ClusPro: A fully auto-
mated algorithm for protein-protein docking. Nucleic Acids Research,32(Suppl. 2),
W96–W99. http://dx.doi.org/10.1093/nar/gkh354.
Comeau, S. R., Gatchell, D. W., Vajda, S., & Camacho, C. J. (2004b). ClusPro: An auto-
mated docking and discrimination method for the prediction of protein complexes.
Bioinformatics,20(1), 45–50. http://dx.doi.org/10.1093/bioinformatics/btg371.
Cornell, W. D., Cieplak, P., Bayly, C. I., Gould, I. R., Merz, K. M., & Ferguson, D. M.
(1995). A second generation force field for the simulation of proteins, nucleic acids,
and organic molecules. Journal of the American Chemical Society,117(19), 5179–5197.
http://dx.doi.org/10.1021/ja00124a002.
Das, M. K., & Dai, H.-K. (2007). A survey of DNA motif finding algorithms. BMC Bioin-
formatics,8(Suppl. 7), S21. http://dx.doi.org/10.1186/1471-2105-8-S7-S21.
De Vries, S. J., van Dijk, M., & Bonvin, A. M. J. J. (2010). The HADDOCK web server for
data-driven biomolecular docking. Nature Protocols,5(5), 883–897. http://dx.doi.org/
10.1038/nprot.2010.32.
Dobbins, S. E., Lesk, V. I., & Sternberg, M. J. E. (2008). Insights into protein flexibility: The
relationship between normal modes and conformational change upon protein-protein
docking. Proceedings of the National Academy of Sciences of the United States of America,
105(30), 10390–10395. http://dx.doi.org/10.1073/pnas.0802496105.
Dominguez, C., Boelens, R., & Bonvin, A. M. J. J. (2003). HADDOCK: A protein-protein
docking approach based on biochemical or biophysical information. Journal of the Amer-
ican Chemical Society,125(7), 1731–1737. http://dx.doi.org/10.1021/ja026939x.
Dunbrack, R. L., Jr. (2006). Sequence comparison and protein structure prediction. Current
Opinion in Structural Biology,16(3), 374–384. http://dx.doi.org/10.1016/j.sbi.2006.05.006.
El Hassan, M., & Calladine, C. (1998). Two distinct modes of protein-induced bending in
DNA. Journal of Molecular Biology,282(2), 331–343. http://dx.doi.org/10.1006/
jmbi.1998.1994.
Eswar, N., Webb, B., Marti-Renom, M. A., Madhusudhan, M., Eramian, D., Shen, M.-Y.,
et al. (2006). Comparative Protein Structure Modeling Using Modeller. Current Protocols
in Bioinformatics,15, 5.6.1–5.6.30.
Feig, M., Karanicolas, J., & Brooks, C. L., III (2004). MMTSB Tool Set: Enhanced sampling
and multiscale modeling methods for applications in structural biology. Journal of
Molecular Graphics and Modelling,22(5), 377–395. http://dx.doi.org/10.1016/
j.jmgm.2003.12.005.
Feliu, E., Aloy, P., & Oliva, B. (2011). On the analysis of protein-protein interactions via
knowledge-based potentials for the prediction of protein-protein docking. Protein
Science,20(3), 529–541. http://dx.doi.org/10.1002/pro.585.
Feliu, E., & Oliva, B. (2010). How different from random are docking predictions when
ranked by scoring functions? Proteins: Structure, Function, and Bioinformatics,78(16),
3376–3385. http://dx.doi.org/10.1002/prot.22844.
Fernandez-Recio, J., Totrov, M., Skorodumov, C., & Abagyan, R. (2005). Optimal docking
area: A new method for predicting protein-protein interaction sites. Proteins: Structure,
Function, and Bioinformatics,58(1), 134–143. http://dx.doi.org/10.1002/prot.20285.
Ferrada, E., & Melo, F. (2009). Effective knowledge-based potentials. Protein Science,18(7),
1469–1485. http://dx.doi.org/10.1002/pro.166.
Fraenkel, E., & Pabo, C. O. (1998). Comparison of X-ray and NMR structures for the
Antennapedia homeodomain-DNA complex. Nature Structural & Molecular Biology,
5(8), 692–697. http://dx.doi.org/10.1038/1382.
Gabb, H. A., Jackson, R. M., & Sternberg, M. J. E. (1997). Modelling protein docking using
shape complementarity, electrostatics and biochemical information. Journal of Molecular
Biology,272(1), 106–120. http://dx.doi.org/10.1006/jmbi.1997.1203.
Gabdoulline, R., Eckweiler, D., Kel, A., & Stegmaier, P. (2012). 3DTF: A web server for
predicting transcription factor PWMs using 3D structure-based energy calculations.
Nucleic Acids Research,40(W1), W180–W185. http://dx.doi.org/10.1093/nar/gks551.
Gao, M., & Skolnick, J. (2008). DBD-Hunter: A knowledge-based method for the predic-
tion of DNA-protein interactions. Nucleic Acids Research,36(12), 3978–3992. http://dx.
doi.org/10.1093/nar/gkn332.
Gao, M., & Skolnick, J. (2009). A threading-based method for the prediction of DNA-
binding proteins with application to the human genome. PLoS Computational Biology,
5(11), e1000567. http://dx.doi.org/10.1371/journal.pcbi.1000567.
Gao, M., & Skolnick, J. (2010). Structural space of protein-protein interfaces is degenerate,
close to complete, and highly connected. Proceedings of the National Academy of Sciences of
the United States of America,107(52), 22517–22522. http://dx.doi.org/10.1073/
pnas.1012820107.
Garcia-Garcia, J., Bonet, J., Guney, E., Fornes, O., Planas, J., & Oliva, B. (2012). Networks
of protein-protein interactions: From uncertainty to molecular details. Molecular Informat-
ics,31(5), 342–362. http://dx.doi.org/10.1002/minf.201200005.
Garcia-Garcia, J., Schleker, S., Klein-Seetharaman, J., & Oliva, B. (2012). BIPS: BIANA
Interolog Prediction Server. A tool for protein-protein interaction inference. Nucleic
Acids Research,40(W1), W147–W151. http://dx.doi.org/10.1093/nar/gks553.
Garzon, J. I., López-Blanco, J. R., Pons, C., Kovacs, J., Abagyan, R., Fernandez-Recio, J.,
et al. (2009). FRODOCK: A new approach for fast rotational protein-protein dock-
ing. Bioinformatics,25(19), 2544–2551. http://dx.doi.org/10.1093/bioinformatics/
btp447.
Gaulton, A., Bellis, L. J., Bento, A. P., Chambers, J., Davies, M., & Hersey, A. (2011).
ChEMBL: A large-scale bioactivity database for drug discovery. Nucleic Acids Research,
40(D1), D1100–D1107. http://dx.doi.org/10.1093/nar/gkr777.
Ginalski, K. (2006). Comparative modeling for protein structure prediction. Current Opinion
in Structural Biology,16(2), 172–177. http://dx.doi.org/10.1016/j.sbi.2006.02.003.
Gitter, A., Siegfried, Z., Klutstein, M., Fornes, O., Oliva, B., Simon, I., et al. (2009). Backup
in gene regulatory networks explains differences between binding and knockout results.
Molecular Systems Biology,5(1), 276. http://dx.doi.org/10.1038/msb.2009.33.
Glover, J. N. M., & Harrison, S. C. (1995). Crystal structure of the heterodimeric bZIP tran-
scription factor c-Fos-c-Jun bound to DNA. Nature,373(6511), 257–261. http://dx.doi.
org/10.1038/373257a0.
Grau, J., Posch, S., Grosse, I., & Keilwagen, J. (2013). A general approach for discriminative
de novo motif discovery from high-throughput data. Nucleic Acids Research,41(21), e197.
http://dx.doi.org/10.1093/nar/gkt831.
Gray, J. J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C. A., et al.
(2003). Protein-protein docking with simultaneous optimization of rigid-body displace-
ment and side-chain conformations. Journal of Molecular Biology,331(1), 281–299. http://
dx.doi.org/10.1016/S0022-2836(03)00670-3.
Gu, S., Koehl, P., Hass, J., & Amenta, N. (2012). Surface-histogram: A new shape descriptor
for protein-protein docking. Proteins: Structure, Function, and Bioinformatics,80(1),
221–238. http://dx.doi.org/10.1002/prot.23192.
Guerois, R., Nielsen, J. E., & Serrano, L. (2002). Predicting changes in the stability of
proteins and protein complexes: A study of more than 1000 mutations. Journal of Molecular
Biology,320(2), 369–387. http://dx.doi.org/10.1016/S0022-2836(02)00442-4.
Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L., & Noble, W. S. (2007). Quantifying
similarity between motifs. Genome Biology,8(2), R24. http://dx.doi.org/10.1186/
gb-2007-8-2-r24.
Hu, S., Xie, Z., Onishi, A., Yu, X., Jiang, L., & Lin, J. (2009). Profiling the human protein-
DNA interactome reveals ERK2 as a transcriptional repressor of interferon signaling.
Cell,139(3), 610–622. http://dx.doi.org/10.1016/j.cell.2009.08.037.
Huang, S.-Y., & Zou, X. (2013). A nonredundant structure dataset for benchmarking
protein-RNA computational docking. Journal of Computational Chemistry,34(4),
311–318. http://dx.doi.org/10.1002/jcc.23149.
Hwang, S., Gou, Z., & Kuznetsov, I. B. (2007). DP-Bind: A web server for sequence-based
prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics,23(5),
634–636. http://dx.doi.org/10.1093/bioinformatics/btl672.
Hwang, H., Pierce, B., Mintseris, J., Janin, J., & Weng, Z. (2008). Protein-protein docking
benchmark version 3.0. Proteins: Structure, Function, and Bioinformatics,73(3), 705–709.
http://dx.doi.org/10.1002/prot.22106.
Hwang, H., Vreven, T., Janin, J., & Weng, Z. (2010). Protein-protein docking benchmark
version 4.0. Proteins: Structure, Function, and Bioinformatics,78(15), 3111–3114. http://dx.
doi.org/10.1002/prot.22830.
Janin, J. (2010). Protein-protein docking tested in blind predictions: The CAPRI experi-
ment. Molecular BioSystems,6(12), 2351–2362. http://dx.doi.org/10.1039/C005060C.
Janin, J., Henrick, K., Moult, J., Eyck, L. T., Sternberg, M. J. E., & Vajda, S. (2003). CAPRI:
A Critical Assessment of PRedicted Interactions. Proteins: Structure, Function, and Bioin-
formatics,52(1), 2–9. http://dx.doi.org/10.1002/prot.10381.
Jiménez-García, B., Pons, C., & Fernández-Recio, J. (2013). pyDockWEB: A web server for
rigid-body protein-protein docking using electrostatics and desolvation scoring.
Bioinformatics,29(13), 1698–1699. http://dx.doi.org/10.1093/bioinformatics/btt262.
Jones, S., & Thornton, J. M. (1997). Analysis of protein-protein interaction sites using surface
patches. Journal of Molecular Biology,272(1), 121–132. http://dx.doi.org/10.1006/
jmbi.1997.1234.
Katchalski-Katzir, E., Shariv, I., Eisenstein, M., Friesem, A. A., Aflalo, C., & Vakser, I. A.
(1992). Molecular surface recognition: Determination of geometric fit between proteins
and their ligands by correlation techniques. Proceedings of the National Academy of Sciences of
the United States of America,89(6), 2195–2199.
Kim, R., Corona, R. I., Hong, B., & Guo, J. (2011). Benchmarks for flexible and rigid
transcription factor-DNA docking. BMC Structural Biology,11(1), 45. http://dx.doi.
org/10.1186/1472-6807-11-45.
Kirsanov, D. D., Zanegina, O. N., Aksianov, E. A., Spirin, S. A., Karyagina, A. S., &
Alexeevski, A. V. (2012). NPIDB: Nucleic acid-protein interaction database. Nucleic
Acids Research,41(D1), D517–D523. http://dx.doi.org/10.1093/nar/gks1199.
Knegtel, R. M. A., Antoon, J., Rullmann, C., Boelens, R., & Kaptein, R. (1994). MONTY:
A Monte Carlo approach to protein-DNA recognition. Journal of Molecular Biology,
235(1), 318–324. http://dx.doi.org/10.1016/S0022-2836(05)80035-X.
Knox, C., Law, V., Jewison, T., Liu, P., Ly, S., & Frolkis, A. (2011). DrugBank 3.0:
A comprehensive resource for “Omics” research on drugs. Nucleic Acids Research,
39(Suppl. 1), D1035–D1041. http://dx.doi.org/10.1093/nar/gkq1126.
Kozakov, D., Brenke, R., Comeau, S. R., & Vajda, S. (2006). PIPER: An FFT-based protein
docking program with pairwise potentials. Proteins: Structure, Function, and Bioinformatics,
65(2), 392–406. http://dx.doi.org/10.1002/prot.21117.
Kumar, M., Gromiha, M. M., & Raghava, G. P. (2007). Identification of DNA-binding pro-
teins using support vector machines and evolutionary profiles. BMC Bioinformatics,8(1),
463. http://dx.doi.org/10.1186/1471-2105-8-463.
Kumar, M., Gromiha, M. M., & Raghava, G. P. S. (2011). SVM based prediction of RNA-
binding proteins using binding residues and evolutionary information. Journal of Molecular
Recognition,24(2), 303–313. http://dx.doi.org/10.1002/jmr.1061.
Lasker, K., Phillips, J. L., Russel, D., Velázquez-Muriel, J., Schneidman-Duhovny, D., &
Tjioe, E. (2010). Integrative structure modeling of macromolecular assemblies from pro-
teomics data. Molecular & Cellular Proteomics,9(8), 1689–1702. http://dx.doi.org/
10.1074/mcp.R110.000067.
Lasker, K., Sali, A., & Wolfson, H. J. (2010). Determining macromolecular assembly
structures by molecular docking and fitting into an electron density map. Proteins:
Structure, Function, and Bioinformatics,78(15), 3205–3211. http://dx.doi.org/10.1002/
prot.22845.
Lee, H., Li, Z., Silkov, A., Fischer, M., Petrey, D., Honig, B., et al. (2010). High-throughput
computational structure-based characterization of protein families: START domains and
implications for structural genomics. Journal of Structural and Functional Genomics,11(1),
51–59. http://dx.doi.org/10.1007/s10969-010-9086-7.
Lensink, M. F., & Wodak, S. J. (2010). Docking and scoring protein interactions: CAPRI
2009. Proteins: Structure, Function, and Bioinformatics,78(15), 3073–3084. http://dx.doi.
org/10.1002/prot.22818.
Lesk, V. I., & Sternberg, M. J. E. (2008). 3D-Garden: A system for modelling protein-protein
complexes based on conformational refinement of ensembles generated with the
marching cubes algorithm. Bioinformatics,24(9), 1137–1144. http://dx.doi.org/
10.1093/bioinformatics/btn093.
Lin, C.-K., & Chen, C.-Y. (2013). PiDNA: Predicting protein-DNA interactions with
structural models. Nucleic Acids Research,41(W1), W523–W530. http://dx.doi.org/
10.1093/nar/gkt388.
Liu, Z., Guo, J.-T., Li, T., & Xu, Y. (2008). Structure-based prediction of transcription fac-
tor binding sites using a protein-DNA docking approach. Proteins: Structure, Function, and
Bioinformatics,72(4), 1114–1124. http://dx.doi.org/10.1002/prot.22002.
Lu, H., Lu, L., & Skolnick, J. (2003). Development of unified statistical potentials describing
protein-protein interactions. Biophysical Journal,84(3), 1895–1901. http://dx.doi.org/
10.1016/S0006-3495(03)74997-2.
Lu, X.-J., & Olson, W. K. (2008). 3DNA: A versatile, integrated software system for the
analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nature
Protocols,3(7), 1213–1227. http://dx.doi.org/10.1038/nprot.2008.104.
Luscombe, N. M., & Thornton, J. M. (2002). Protein-DNA interactions: Amino acid con-
servation and the effects of mutations on binding specificity. Journal of Molecular Biology,
320(5), 991–1009. http://dx.doi.org/10.1016/S0022-2836(02)00571-5.
Lyskov, S., & Gray, J. J. (2008). The RosettaDock server for local protein-protein docking.
Nucleic Acids Research,36(Suppl. 2), W233–W238. http://dx.doi.org/10.1093/nar/
gkn216.
Macindoe, G., Mavridis, L., Venkatraman, V., Devignes, M.-D., & Ritchie, D. W. (2010).
HexServer: An FFT-based protein docking server powered by graphics processors. Nucleic
Acids Research,38(Suppl. 2), W445–W449. http://dx.doi.org/10.1093/nar/gkq311.
Mashiach, E., Schneidman-Duhovny, D., Andrusier, N., Nussinov, R., & Wolfson, H. J.
(2008). FireDock: A web server for fast interaction refinement in molecular docking.
Nucleic Acids Research,36(Suppl. 2), W229–W232. http://dx.doi.org/10.1093/nar/
gkn186.
Matthews, L. R., Vaglio, P., Reboul, J., Ge, H., Davis, B. P., & Garrels, J. (2001). Identi-
fication of potential interaction networks using sequence-based searches for conserved
protein-protein interactions or “interologs”. Genome Research, 11(12), 2120–2126.
http://dx.doi.org/10.1101/gr.205301.
Mintseris, J., Pierce, B., Wiehe, K., Anderson, R., Chen, R., & Weng, Z. (2007).
Integrating statistical pair potentials into protein complex prediction. Proteins: Structure,
Function, and Bioinformatics,69(3), 511–520. http://dx.doi.org/10.1002/prot.21502.
Miyazawa, S., & Jernigan, R. L. (1985). Estimation of effective interresidue contact energies
from protein crystal structures: Quasi-chemical approximation. Macromolecules,18(3),
534–552. http://dx.doi.org/10.1021/ma00145a039.
Moal, I. H., & Bates, P. A. (2010). SwarmDock and the use of normal modes in protein-
protein docking. International Journal of Molecular Sciences,11(10), 3623–3648. http://
dx.doi.org/10.3390/ijms11103623.
Moal, I. H., Torchala, M., Bates, P. A., & Fernández-Recio, J. (2013). The scoring of poses
in protein-protein docking: Current capabilities and future directions. BMC Bioinformat-
ics,14(1), 286. http://dx.doi.org/10.1186/1471-2105-14-286.
Moont, G., Gabb, H. A., & Sternberg, M. J. E. (1999). Use of pair potentials across protein
interfaces in screening predicted docked complexes. Proteins: Structure, Function, and Bio-
informatics,35(3), 364–373. http://dx.doi.org/10.1002/(SICI)1097-0134(19990515)
35:3<364::AID-PROT11>3.0.CO;2-4.
Mosca, R., Céol, A., & Aloy, P. (2013). Interactome3D: Adding structural details to protein
networks. Nature Methods, 10(1), 47–53. http://dx.doi.org/10.1038/nmeth.2289.
Mosca, R., Céol, A., Stein, A., Olivella, R., & Aloy, P. (2013). 3did: A catalog of domain-based
interactions of known three-dimensional structure. Nucleic Acids Research, 42(D1),
D374–D379. http://dx.doi.org/10.1093/nar/gkt887.
Nimrod, G., Schushan, M., Szilágyi, A., Leslie, C., & Ben-Tal, N. (2010). iDBPs: A web
server for the identification of DNA binding proteins. Bioinformatics, 26(5), 692–693.
http://dx.doi.org/10.1093/bioinformatics/btq019.
Ozbek, P., Soner, S., Erman, B., & Haliloglu, T. (2010). DNABINDPROT: Fluctuation-
based predictor of DNA-binding residues within a network of interacting residues. Nucleic
Acids Research,38(Suppl. 2), W417–W423. http://dx.doi.org/10.1093/nar/gkq396.
Pandit, S. B., Brylinski, M., Zhou, H., Gao, M., Arakaki, A. K., & Skolnick, J. (2010).
PSiFR: An integrated resource for prediction of protein structure and function.
Bioinformatics,26(5), 687–688. http://dx.doi.org/10.1093/bioinformatics/btq006.
Panjkovich, A., Melo, F., & Marti-Renom, M. A. (2008). Evolutionary potentials: Structure
specific knowledge-based potentials exploiting the evolutionary record of sequence
homologs. Genome Biology,9(4), R68. http://dx.doi.org/10.1186/gb-2008-9-4-r68.
Parisien, M., Freed, K. F., & Sosnick, T. R. (2012). On docking, scoring and assessing
protein-DNA complexes in a rigid-body framework. PLoS One,7(2), e32647. http://
dx.doi.org/10.1371/journal.pone.0032647.
Pérez-Cano, L., & Fernández-Recio, J. (2010). Optimal protein-RNA area, OPRA:
A propensity-based method to identify RNA-binding sites on proteins. Proteins: Structure,
Function, and Bioinformatics, 78(1), 25–35. http://dx.doi.org/10.1002/prot.22527.
Pérez-Cano, L., Jiménez-García, B., & Fernández-Recio, J. (2012). A protein-RNA
docking benchmark (II): Extended set from experimental and homology modeling data.
Proteins: Structure, Function, and Bioinformatics, 80(7), 1872–1882. http://dx.doi.org/
10.1002/prot.24075.
Pérez-Cano, L., Solernou, A., Pons, C., & Fernández-Recio, J. (2010). Structural prediction
of protein-RNA interaction by computational docking with propensity-based statistical
potentials. Pacific Symposium on Biocomputing, 15, 269–280.
Pettersen, E. F., Goddard, T. D., Huang, C. C., Couch, G. S., Greenblatt, D. M.,
Meng, E. C., et al. (2004). UCSF chimera—A visualization system for exploratory
research and analysis. Journal of Computational Chemistry,25(13), 1605–1612. http://
dx.doi.org/10.1002/jcc.20084.
Pieper, U., Webb, B. M., Barkan, D. T., Schneidman-Duhovny, D., Schlessinger, A., &
Braberg, H. (2011). ModBase, a database of annotated comparative protein structure
models, and associated resources. Nucleic Acids Research,39(Suppl. 1), D465–D474.
http://dx.doi.org/10.1093/nar/gkq1091.
Pierce, B., & Weng, Z. (2007). ZRANK: Reranking protein docking predictions with an
optimized energy function. Proteins: Structure, Function, and Bioinformatics,67(4),
1078–1086. http://dx.doi.org/10.1002/prot.21373.
Pierce, B., & Weng, Z. (2008). A combination of rescoring and refinement significantly
improves protein docking performance. Proteins: Structure, Function, and Bioinformatics,
72(1), 270–279. http://dx.doi.org/10.1002/prot.21920.
Planas-Iglesias, J., Bonet, J., Marín-López, M. A., Feliu, E., Gursoy, A., & Oliva, B. (2012).
Structural bioinformatics of proteins: Predicting the tertiary and quaternary structure of
proteins from sequence. In W. Cai (Ed.), Protein-protein interactions—Computational and
experimental tools. http://www.intechopen.com/books/protein-protein-interactions-
computational-and-experimental-tools/structural-bioinformatics-of-proteins-predicting-
the-tertiary-and-quaternary-structure-of-proteins-f.
Pons, C., Talavera, D., de la Cruz, X., Orozco, M., & Fernandez-Recio, J. (2011). Scoring
by intermolecular pairwise propensities of exposed residues (SIPPER): A new efficient
potential for protein-protein docking. Journal of Chemical Information and Modeling,51(2),
370–377. http://dx.doi.org/10.1021/ci100353e.
Poulain, P., Saladin, A., Hartmann, B., & Prévost, C. (2008). Insights on protein-DNA
recognition by coarse grain modelling. Journal of Computational Chemistry,29(15),
2582–2592. http://dx.doi.org/10.1002/jcc.21014.
R Core Team. (2013). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: The European Molecular Biology
Open Software Suite. Trends in Genetics,16(6), 276–277. http://dx.doi.org/10.1016/
S0168-9525(00)02024-2.
Ritchie, D. W., & Kemp, G. J. L. (2000). Protein docking using spherical polar Fourier cor-
relations. Proteins: Structure, Function, and Bioinformatics,39(2), 178–194. http://dx.doi.
org/10.1002/(SICI)1097-0134(20000501)39:2<178::AID-PROT8>3.0.CO;2-6.
Roberts, V. A., Thompson, E. E., Pique, M. E., Perez, M. S., & Ten Eyck, L. F. (2013).
DOT2: Macromolecular docking with improved biophysical models. Journal of Compu-
tational Chemistry,34(20), 1743–1758. http://dx.doi.org/10.1002/jcc.23304.
Robertson, T. A., & Varani, G. (2007). An all-atom, distance-dependent scoring function for
the prediction of protein–DNA interactions from structure. Proteins: Structure, Function,
and Bioinformatics,66(2), 359–374. http://dx.doi.org/10.1002/prot.21162.
Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Engineering,12(2),
85–94. http://dx.doi.org/10.1093/protein/12.2.85.
Russel, D., Lasker, K., Webb, B., Velázquez-Muriel, J., Tjioe, E., & Schneidman-Duhovny, D.
(2012). Putting the pieces together: Integrative modeling platform software for structure
determination of macromolecular assemblies. PLoS Biology, 10(1), e1001244.
http://dx.doi.org/10.1371/journal.pbio.1001244.
Schneider, S., Saladin, A., Fiorucci, S., Prévost, C., & Zacharias, M. (2012). ATTRACT
and PTOOLS: Open source programs for protein-protein docking. In R. Baron
(Ed.), Computational drug discovery and design (pp. 221–232). New York: Springer.
http://link.springer.com/protocol/10.1007/978-1-61779-465-0_15.
Schneidman-Duhovny, D., Hammel, M., & Sali, A. (2011). Macromolecular docking
restrained by a small angle X-ray scattering profile. Journal of Structural Biology,173(3),
461–471. http://dx.doi.org/10.1016/j.jsb.2010.09.023.
Schneidman-Duhovny, D., Inbar, Y., Nussinov, R., & Wolfson, H. J. (2005). PatchDock
and SymmDock: Servers for rigid and symmetric docking. Nucleic Acids Research,
33(Suppl. 2), W363–W367. http://dx.doi.org/10.1093/nar/gki481.
Schrödinger, LLC. (2010). The PyMOL molecular graphics system (Version 1.3r1).
Sharan, R., Ulitsky, I., & Shamir, R. (2007). Network-based prediction of protein function.
Molecular Systems Biology,3(1). http://dx.doi.org/10.1038/msb4100129.
Shen, Y., Paschalidis, I. C., Vakili, P., & Vajda, S. (2008). Protein docking by the underes-
timation of free energy funnels in the space of encounter complexes. PLoS Computational
Biology,4(10), e1000191. http://dx.doi.org/10.1371/journal.pcbi.1000191.
Shen, M., & Sali, A. (2006). Statistical potential for assessment and prediction of protein
structures. Protein Science,15(11), 2507–2524. http://dx.doi.org/10.1110/ps.062416606.
Shentu, Z., Al Hasan, M., Bystroff, C., & Zaki, M. J. (2008). Context shapes: Efficient com-
plementary shape matching for protein-protein docking. Proteins: Structure, Function, and
Bioinformatics,70(3), 1056–1073. http://dx.doi.org/10.1002/prot.21600.
Si, J., Zhang, Z., Lin, B., Schroeder, M., & Huang, B. (2011). MetaDBSite: A meta approach
to improve protein DNA-binding sites prediction. BMC Systems Biology, 5(Suppl. 1), S7.
http://www.biomedcentral.com/1752-0509/5/S1/S7/abstract.
Simon, B., Madl, T., Mackereth, C. D., Nilges, M., & Sattler, M. (2010). An efficient pro-
tocol for NMR-spectroscopy-based structure determination of protein complexes in
solution. Angewandte Chemie, International Edition,49(11), 1967–1970. http://dx.doi.
org/10.1002/anie.200906147.
Sippl, M. J. (1990). Calculation of conformational ensembles from potentials of mean force.
An approach to the knowledge-based prediction of local structures in globular proteins.
Journal of Molecular Biology,213(4), 859–883.
Stein, A., Céol, A., & Aloy, P. (2011). 3did: Identification and classification of domain-based
interactions of known three-dimensional structure. Nucleic Acids Research, 39(Suppl. 1),
D718–D723. http://dx.doi.org/10.1093/nar/gkq962.
Stein, A., Rueda, M., Panjkovich, A., Orozco, M., & Aloy, P. (2011). A systematic study of the
energetics involved in structural changes upon association and connectivity in protein
interaction networks. Structure, 19(6), 881–889. http://dx.doi.org/10.1016/j.str.2011.03.009.
Takeda, T., Corona, R. I., & Guo, J. (2013). A knowledge-based orientation potential for
transcription factor-DNA docking. Bioinformatics,29(3), 322–330. http://dx.doi.org/
10.1093/bioinformatics/bts699.
Tjong, H., & Zhou, H.-X. (2007). DISPLAR: An accurate method for predicting DNA-
binding sites on protein surfaces. Nucleic Acids Research,35(5), 1465–1477. http://dx.
doi.org/10.1093/nar/gkm008.
Torchala, M., Moal, I. H., Chaleil, R. A. G., Fernandez-Recio, J., & Bates, P. A. (2013).
SwarmDock: A server for flexible protein-protein docking. Bioinformatics,29(6),
807–809. http://dx.doi.org/10.1093/bioinformatics/btt038.
Tovchigrechko, A., & Vakser, I. A. (2006). GRAMM-X public web server for protein-
protein docking. Nucleic Acids Research,34(Web Server issue), W310–W314. http://
dx.doi.org/10.1093/nar/gkl206.
Tuncbag, N., Gursoy, A., Guney, E., Nussinov, R., & Keskin, O. (2008). Architectures and
functional coverage of protein-protein interfaces. Journal of Molecular Biology,381(3),
785–802. http://dx.doi.org/10.1016/j.jmb.2008.04.071.
Tuncbag, N., Gursoy, A., Nussinov, R., & Keskin, O. (2011). Predicting protein-protein
interactions on a proteome scale by matching evolutionary and structural similarities
at interfaces using PRISM. Nature Protocols,6(9), 1341–1354. http://dx.doi.org/
10.1038/nprot.2011.367.
Turner, D., Kim, R., & Guo, J. (2012). TFinDit: Transcription factor-DNA interaction
data depository. BMC Bioinformatics,13(1), 220. http://dx.doi.org/10.1186/1471-2105-
13-220.
Tuszynska, I., & Bujnicki, J. M. (2011). DARS-RNP and QUASI-RNP: New statistical
potentials for protein-RNA docking. BMC Bioinformatics,12(1), 348. http://dx.doi.
org/10.1186/1471-2105-12-348.
Urnov, F. D., Rebar, E. J., Holmes, M. C., Zhang, H. S., & Gregory, P. D. (2010). Genome
editing with engineered zinc finger nucleases. Nature Reviews Genetics,11(9), 636–646.
http://dx.doi.org/10.1038/nrg2842.
Vajda, S., & Kozakov, D. (2009). Convergence and combination of methods in protein-
protein docking. Current Opinion in Structural Biology,19(2), 164–170. http://dx.doi.
org/10.1016/j.sbi.2009.02.008.
Valdar, W. S. J., & Thornton, J. M. (2001). Protein-protein interfaces: Analysis of amino acid
conservation in homodimers. Proteins: Structure, Function, and Bioinformatics,42(1),
108–124. http://dx.doi.org/10.1002/1097-0134(20010101)42:1<108::AID-PROT110>3.0.CO;2-O.
van Dijk, M., & Bonvin, A. M. J. J. (2008). A protein-DNA docking benchmark. Nucleic
Acids Research,36(14), e88. http://dx.doi.org/10.1093/nar/gkn386.
van Dijk, M., & Bonvin, A. M. J. J. (2010). Pushing the limits of what is achievable in
protein-DNA docking: Benchmarking HADDOCK’s performance. Nucleic Acids
Research,38(17), 5634–5647. http://dx.doi.org/10.1093/nar/gkq222.
van Dijk, M., Visscher, K. M., Kastritis, P. L., & Bonvin, A. M. J. J. (2013). Solvated protein-
DNA docking using HADDOCK. Journal of Biomolecular NMR,56(1), 51–63. http://dx.
doi.org/10.1007/s10858-013-9734-x.
Venkatraman, V., Yang, Y. D., Sael, L., & Kihara, D. (2009). Protein-protein docking using
region-based 3D Zernike descriptors. BMC Bioinformatics,10(1), 407. http://dx.doi.org/
10.1186/1471-2105-10-407.
Wang, L., & Brown, S. J. (2006). BindN: A web-based tool for efficient prediction of DNA
and RNA binding sites in amino acid sequences. Nucleic Acids Research,34(Suppl. 2),
W243–W248. http://dx.doi.org/10.1093/nar/gkl298.
Wang, L., Huang, C., Yang, M. Q., & Yang, J. Y. (2010). BindN+ for accurate prediction of
DNA and RNA-binding residues from protein sequence features. BMC Systems Biology,
4(Suppl. 1), S3. http://dx.doi.org/10.1186/1752-0509-4-S1-S3.
Watson, J. D., Laskowski, R. A., & Thornton, J. M. (2005). Predicting protein function from
sequence and structural data. Current Opinion in Structural Biology,15(3), 275–284. http://
dx.doi.org/10.1016/j.sbi.2005.04.003.
Weirauch, M. T., Cote, A., Norel, R., Annala, M., Zhao, Y., & Riley, T. R. (2013).
Evaluation of methods for modeling transcription factor sequence specificity. Nature
Biotechnology,31(2), 126–134. http://dx.doi.org/10.1038/nbt.2486.
Wiederstein, M., & Sippl, M. J. (2007). ProSA-web: Interactive web service for the recog-
nition of errors in three-dimensional structures of proteins. Nucleic Acids Research,
35(Suppl. 2), W407–W410. http://dx.doi.org/10.1093/nar/gkm290.
Wodak, S. J., & Janin, J. (1978). Computer analysis of protein-protein interaction. Journal of
Molecular Biology,124(2), 323–342. http://dx.doi.org/10.1016/0022-2836(78)90302-9.
Xie, Z., Hu, S., Qian, J., Blackshaw, S., & Zhu, H. (2011). Systematic characterization of
protein-DNA interactions. Cellular and Molecular Life Sciences,68(10), 1657–1668.
http://dx.doi.org/10.1007/s00018-010-0617-y.
Xu, B., Yang, Y., Liang, H., & Zhou, Y. (2009). An all-atom knowledge-based energy func-
tion for protein-DNA threading, docking decoy discrimination, and prediction of
transcription-factor binding profiles. Proteins: Structure, Function, and Bioinformatics,
76(3), 718–730. http://dx.doi.org/10.1002/prot.22384.
Yu, X., Cao, J., Cai, Y., Shi, T., & Li, Y. (2006). Predicting rRNA-, RNA-, and DNA-
binding proteins from primary structure with support vector machines. Journal of
Theoretical Biology,240(2), 175–184. http://dx.doi.org/10.1016/j.jtbi.2005.09.018.
Zhang, C., Liu, S., Zhu, Q., & Zhou, Y. (2005). A knowledge-based energy function for
protein-ligand, protein-protein, and protein-DNA complexes. Journal of Medicinal Chem-
istry,48(7), 2325–2335. http://dx.doi.org/10.1021/jm049314d.
Zhang, Q. C., Petrey, D., Deng, L., Qiang, L., Shi, Y., & Thu, C. A. (2012). Structure-based
prediction of protein-protein interactions on a genome-wide scale. Nature,490(7421),
556–560. http://dx.doi.org/10.1038/nature11503.
Zhang, Q. C., Petrey, D., Norel, R., & Honig, B. H. (2010). Protein interface conservation
across structure space. Proceedings of the National Academy of Sciences of the United States of
America,107(24), 10896–10901. http://dx.doi.org/10.1073/pnas.1005894107.
Zhang, Y., & Skolnick, J. (2004). Automated structure prediction of weakly homologous
proteins on a genomic scale. Proceedings of the National Academy of Sciences of the United
States of America,101(20), 7594–7599. http://dx.doi.org/10.1073/pnas.0305695101.
Zhang, Y., & Skolnick, J. (2005). TM-align: A protein structure alignment algorithm based
on the TM-score. Nucleic Acids Research,33(7), 2302–2309. http://dx.doi.org/10.1093/
nar/gki524.
Zhao, H., Yang, Y., & Zhou, Y. (2010). Structure-based prediction of DNA-binding
proteins by structural alignment and a volume-fraction corrected DFIRE-based energy
function. Bioinformatics,26(15), 1857–1863. http://dx.doi.org/10.1093/bioinformatics/
btq295.
Zhao, H., Yang, Y., & Zhou, Y. (2011). Structure-based prediction of RNA-binding
domains and RNA-binding sites and application to structural genomics targets. Nucleic
Acids Research,39(8), 3017–3025. http://dx.doi.org/10.1093/nar/gkq1266.
Zheng, S., Robertson, T. A., & Varani, G. (2007). A knowledge-based potential function
predicts the specificity and relative binding energy of RNA-binding proteins. FEBS
Journal,274(24), 6378–6391. http://dx.doi.org/10.1111/j.1742-4658.2007.06155.x.
Zhou, H., & Skolnick, J. (2013). FINDSITEcomb: A threading/structure-based, proteomic-
scale virtual ligand screening approach. Journal of Chemical Information and Modeling,53(1),
230–240. http://dx.doi.org/10.1021/ci300510n.
Zhou, H., & Zhou, Y. (2002). Distance-scaled, finite ideal-gas reference state improves
structure-derived potentials of mean force for structure selection and stability prediction.
Protein Science,11(11), 2714–2726. http://dx.doi.org/10.1110/ps.0217002.
... The model of the pioneering regulatory complex locates Oct4, which is missed in the experimental structure, suggesting a potential role for nucleosome opening. potentials [28][29][30][31][32] . In previous works we developed a set of potentials 33 to analyse protein structures and their interactions [34][35][36] . ...
... We defined the contacts between TF and DNA using three residues: one amino acid and two contiguous nucleotides of the same strand. The distance of a contact is the distance between the Cβ atom of the amino acid residue and the average position of the atoms of the nitrogen-bases of the two nucleotides and their complementary pairs in the opposite strand 28 . Additional features are considered for a contact, such as the secondary structure and solvent accessibility of the amino-acid or the DNA closest groove (major or minor) of the two nucleotides. ...
... We used the definition of statistical potentials described by Feliu et al. 68 and Fornes et al. 28 . These were calculated with the distribution of contacts at less than 30 Å, using an interval criterion or a distance threshold. ...
Preprint
Full-text available
Transcription factor (TF) binding is a key component of genomic regulation. There are numerous high-throughput experimental methods to characterize TF-DNA binding specificities. Their application, however, is both laborious and expensive, which makes profiling all TFs challenging. For instance, the binding preferences of ~25% human TFs remain unknown; they neither have been determined experimentally nor inferred computationally. Here, we introduce ModCRE, a web server implementing a structure homology-modelling approach to predict TF motifs and automatically model higher-order TF regulatory complexes. Starting from a TF sequence or structure, ModCRE predicts a set of motifs for that TF. The predicted motifs are then used to scan the DNA for occurrences of each of them, and the best matches are either profiled with a binding score or collected for their subsequent modeling into a higher-order regulatory complex with DNA, as well as other TFs and co-factors. Moreover, we demonstrate that incorporating high-throughput TF binding data, such as from protein binding microarrays, addresses the protein-DNA structure scarcity problem for deriving statistical potentials. In turn, these statistical potentials are proven to be capable predictors of TF motifs. We also show the conditional advantage of using ModCRE over a nearest-neighbor approach for predicting TF binding sites as well as an improvement in prediction accuracy when using a rank-enrichment selection system. Finally, as case examples, we apply ModCRE to model the interferon beta enhanceosome and the complex of SOX2 and 11 with a nucleosome.
... Knowledge-based potentials have also been widely used to study protein/nucleic acid interactions [26,27], with some specific applications on protein-RNA recognition [28]. One of the main limitations of this method is that they rely on docking models and a detailed calculation of all the atomic interactions, and therefore have a strong dependence on the precise structural data that the potential was based on [29]. The recent release of RoseTTAFoldNA [30] has also implied a huge advance in the field, providing high accuracy models with atomic resolution for protein/nucleic-acid complexes, which is extremely useful for proteins where it is clear which RNA the protein binds, but that is not always the case. ...
Article
Full-text available
RNA recognition motifs (RRM) are the most prevalent class of RNA binding domains in eukaryotes. Their RNA binding preferences have been investigated for almost two decades, and even though some RRM domains are now very well described, their RNA recognition code has remained elusive. An increasing number of experimental structures of RRM-RNA complexes has become available in recent years. Here, we perform an in-depth computational analysis to derive an RNA recognition code for canonical RRMs. We present and validate a computational scoring method to estimate the binding between an RRM and a single stranded RNA, based on structural data from a carefully curated multiple sequence alignment, which can predict RRM binding RNA sequence motifs based on the RRM protein sequence. Given the importance and prevalence of RRMs in humans and other species, this tool could help design RNA binding motifs with uses in medical or synthetic biology applications, leading towards the de novo design of RRMs with specific RNA recognition.
... Further, we have refined the PatchDock output using the FireDock (https://bioinfo3d.cs.tau.ac.il/FireDock/) web server where the global energy values were depicted. 22 The more negative (lower) is the global energy value, higher is the binding affinity (less is the binding energy) of the two proteins. ...
Article
Full-text available
Accumulation of diverse mutations across the structural and non-structural genes is leading to rapid evolution of SARS-CoV-2, altering its pathogenicity. We performed whole genome sequencing of 239 SARS-CoV-2 RNA samples collected from both adult and pediatric patients across eastern India (West Bengal), during the second pandemic wave in India (April-May 2021). In addition to several common spike mutations within the Delta variant, a unique constellation of 8 co-appearing non-spike mutations was identified, which revealed a high degree of positive mutual correlation. Our results also demonstrated the dynamics of SARS-CoV-2 variants among unvaccinated pediatric patients. 41.4% of our studied Delta strains harbored this signature set of 8 co-appearing non-spike mutations and phylogenetically out-clustered other Delta sub-lineages like 21J, 21A or 21I. This is the first report from eastern India that portrayed a landscape of co-appearing mutations in the non-Spike proteins, which might have led to the evolution of a distinct Delta sub-cluster. Accumulation of such mutations in SARS-CoV-2 may lead to the emergence of “vaccine-evading variants”. Hence, monitoring of such non-Spike mutations will be significant in the formulation of any future vaccines against those SARS-CoV-2 variants that might evade the current vaccine-induced immunity, among both the pediatric and adult populations. This article is protected by copyright. All rights reserved.
Article
Transcription factor (TF) binding is a key component of genomic regulation. There are numerous high-throughput experimental methods to characterize TF–DNA binding specificities. Their application, however, is both laborious and expensive, which makes profiling all TFs challenging. For instance, the binding preferences of ∼25% human TFs remain unknown; they neither have been determined experimentally nor inferred computationally. We introduce a structure-based learning approach to predict the binding preferences of TFs and the automated modelling of TF regulatory complexes. We show the advantage of using our approach over the classical nearest-neighbor prediction in the limits of remote homology. Starting from a TF sequence or structure, we predict binding preferences in the form of motifs that are then used to scan a DNA sequence for occurrences. The best matches are either profiled with a binding score or collected for their subsequent modeling into a higher-order regulatory complex with DNA. Co-operativity is modelled by: (i) the co-localization of TFs and (ii) the structural modeling of protein–protein interactions between TFs and with co-factors. We have applied our approach to automatically model the interferon-β enhanceosome and the pioneering complexes of OCT4, SOX2 (or SOX11) and KLF4 with a nucleosome, which are compared with the experimentally known structures.
Article
Proteins usually perform their cellular functions by interacting with other proteins. Accurate identification of protein-protein interaction sites (PPIs) from sequence is import for designing new drugs and developing novel therapeutics. A lot of computational models for PPIs prediction have been developed because experimental methods are slow and expensive. Most models employ a sliding window approach in which local neighbors are concatenated to present a target residue. However, those neighbors are not distinguished by pairwise information between a neighbor and the target. In this study, we propose a novel PPIs prediction model AttCNNPPISP, which combines attention mechanism and convolutional neural networks (CNNs). The attention mechanism dynamically captures the pairwise correlation of each neighbor-target pair within a sliding window, and therefore makes a better understanding of the local environment of target residue. And then, CNNs take the local representation as input to make prediction. Experiments are employed on several public benchmark datasets. Compared with the state-of-the-art models, AttCNNPPISP improves the prediction performance. Also, the experimental results demonstrate that the attention mechanism is effective in terms of constructing comprehensive context information of target residue.
Chapter
Circular RNA (circRNA) is an RNA molecule different from linear RNA with covalently closed loop structure. CircRNAs can act as sponging miRNAs and can interact with RNA binding protein. Previous studies have revealed that circRNAs play important role in the development of different diseases. The biological functions of circRNAs can be investigated with the help of circRNA-protein interaction. Due to scarce circRNA data, long circRNA sequences and the sparsely distributed binding sites on circRNAs, much fewer endeavors are found in studying the circRNA-protein interaction compared to interaction between linear RNA and protein. With the increase in experimental data on circRNA, machine learning methods are widely used in recent times for predicting the circRNA-protein interaction. The existing methods either use RNA sequence or protein sequence for predicting the binding sites. In this paper, we present a new method PCPI (Predicting CircRNA and Protein Interaction) to predict the interaction between circRNA and protein using support vector machine (SVM) classifier. We have used both the RNA and protein sequences to predict their interaction. The circRNA sequences were converted in pseudo peptide sequences based on codon translation. The pseudo peptide and the protein sequences were classified based on dipole moments and the volume of the side chains. The 3-mers of the classified sequences were used as features for training the model. Several machine learning model were used for classification. Comparing the performances, we selected SVM classifier for predicting circRNA-protein interaction. Our method achieved 93% prediction accuracy.
Article
Proteins commonly perform biological functions through protein-protein interactions (PPIs). Knowledge of PPI sites is imperative for understanding protein functions, disease mechanisms, and drug design. Traditional experimental methods for studying PPI sites still have considerable drawbacks, including long experimental time and high labor costs. Therefore, many computational methods have been proposed for predicting PPI sites. However, achieving high prediction performance and overcoming severe data imbalance remain challenging issues. In this paper, we propose a new sequence-based deep learning model called CLPPIS (standing for CNN-LSTM ensemble based PPI Sites prediction). CLPPIS consists of CNN and LSTM components, which can capture spatial features and sequential features simultaneously. Further, it uses a novel feature group as input, which comprises 7 physicochemical, biophysical, and statistical properties. In addition, it adopts a batch-weighted loss function to reduce the interference of imbalanced data. Our work suggests that the integration of protein spatial features and sequential features provides important information for PPI site prediction. Evaluation on three public benchmark datasets shows that our CLPPIS model significantly outperforms existing state-of-the-art methods.
Article
The recognition of protein-protein interaction sites (PPIs) is beneficial for the interpretation of protein functions and the development of new drugs. Traditional biological experiments to identify PPI sites are expensive and inefficient, which has motivated various computational methods to predict PPIs. However, the accurate prediction of PPI sites remains a major challenge because of the sample imbalance issue. In this work, we design a novel model that combines convolutional neural networks (CNNs) with Batch Normalization to predict PPI sites, and employ the oversampling technique Borderline-SMOTE to address the sample imbalance issue. In particular, to better characterize the amino acid residues on the protein chains, we employ a sliding window approach for feature extraction of target residues and their contextual residues. We verify the effectiveness of our method by comparing it with the existing state-of-the-art schemes. On three public datasets, our method achieves accuracies of 88.6%, 89.9%, and 86.7%, respectively, all higher than those of the existing schemes. Moreover, the ablation experiment results suggest that Batch Normalization can greatly improve the generalization and the prediction stability of our model.
Article
Protein-protein docking, which aims to predict the structure of a protein-protein complex from its unbound components, remains an unresolved challenge in structural bioinformatics. An important step is the ranking of docked poses using a scoring function, for which many methods have been developed. There is a need to explore the differences and commonalities of these methods with each other, as well as with functions developed in the fields of molecular dynamics and homology modelling. We present an evaluation of 115 scoring functions on an unbound docking decoy benchmark covering 118 complexes for which a near-native solution can be found, yielding top 10 success rates of up to 58%. Hierarchical clustering is performed, so as to group together functions which identify near-natives in similar subsets of complexes. Three set theoretic approaches are used to identify pairs of scoring functions capable of correctly scoring different complexes. This shows that functions in different clusters capture different aspects of binding and are likely to work together synergistically. All functions designed specifically for docking perform well, indicating that functions are transferable between sampling methods. We also identify promising methods from the field of homology modelling. Further, differential success rates by docking difficulty and solution quality suggest a need for flexibility-dependent scoring. Investigating pairs of scoring functions, the set theoretic measures identify known scoring strategies as well as a number of novel approaches, indicating promising augmentations of traditional scoring methods. Such augmentation and parameter combination strategies are discussed in the context of the learning-to-rank paradigm.
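The top 10 success rate quoted in this abstract is the fraction of complexes for which at least one near-native pose appears among the ten best-scored poses. A minimal sketch of that calculation follows; the input layout and the name `top_n_success_rate` are assumptions for illustration and are not taken from the benchmark itself.

```python
def top_n_success_rate(ranked_poses, n=10):
    """Fraction of complexes with a near-native pose among the top n.

    ranked_poses: dict mapping a complex id to a list of booleans, ordered
    from best-scored to worst-scored pose; True marks a near-native pose.
    """
    if not ranked_poses:
        raise ValueError("no complexes given")
    hits = sum(1 for flags in ranked_poses.values() if any(flags[:n]))
    return hits / len(ranked_poses)
```

Computing this per scoring function and comparing which complexes each function solves is one simple way to see whether two functions succeed on different subsets of the benchmark.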
Article
Evolutionary information derived from the large number of available protein sequences and structures could powerfully guide both analysis and prediction of protein–protein interfaces. To test the relevance of this information, we assess the conservation of residues at protein–protein interfaces compared with other residues on the protein surface. Six homodimer families are analyzed: alkaline phosphatase, enolase, glutathione S-transferase, copper-zinc superoxide dismutase, Streptomyces subtilisin inhibitor, and triose phosphate isomerase. For each family, random simulation is used to calculate the probability (P value) that the level of conservation observed at the interface occurred by chance. The results show that interface conservation is higher than expected by chance and usually statistically significant at the 5% level or better. The effect on the P values of using different definitions of the interface and of excluding active site residues is discussed. Proteins 2001;42:108–124. © 2000 Wiley-Liss, Inc.
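The random simulation mentioned in this abstract can be illustrated with a small resampling sketch. It is a simplification: it draws random subsets of surface residues rather than spatially contiguous patches, and the conservation scores, the function name, and the add-one correction are assumptions rather than details of the original study.

```python
import random

def interface_conservation_p_value(scores, interface, trials=10000, seed=0):
    """Estimate P(random surface subset is at least as conserved as the interface).

    scores: dict mapping each surface residue id to a conservation score
            (higher means more conserved).
    interface: set of residue ids forming the interface (a subset of scores).
    """
    rng = random.Random(seed)
    residues = list(scores)
    k = len(interface)
    observed = sum(scores[r] for r in interface) / k
    at_least = 0
    for _ in range(trials):
        sample = rng.sample(residues, k)  # random subset of the same size
        if sum(scores[r] for r in sample) / k >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (trials + 1)
```

A value below 0.05 would correspond to the 5% significance level referred to in the abstract.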
Article
Empirical residue–residue pair potentials are used to screen possible complexes for protein–protein dockings. A correct docking is defined as a complex with not more than 2.5 Å root-mean-square distance from the known experimental structure. The complexes were generated by “ftdock” (Gabb et al. J Mol Biol 1997;272:106–120) that ranks using shape complementarity. The complexes studied were 5 enzyme-inhibitors and 2 antibody-antigens, starting from the unbound crystallographic coordinates, with a further 2 antibody-antigens where the antibody was from the bound crystallographic complex. The pair potential functions tested were derived both from observed intramolecular pairings in a database of nonhomologous protein domains, and from observed intermolecular pairings across the interfaces in sets of nonhomologous heterodimers and homodimers. Out of various alternate strategies, we found the optimal method used a mole-fraction calculated random model from the intramolecular pairings. For all the systems, a correct docking was placed within the top 12% of the pair potential score ranked complexes. A combined strategy was developed that incorporated “multidock,” a side-chain refinement algorithm (Jackson et al. J Mol Biol 1998;276:265–285). This placed a correct docking within the top 5 complexes for enzyme-inhibitor systems, and within the top 40 complexes for antibody–antigen systems. Proteins 1999;35:364–373. © 1999 Wiley-Liss, Inc.
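A residue-residue pair potential with a mole-fraction random model can be sketched as a log-odds score: the observed number of contacts between two residue types is compared with the number expected if contacts formed in proportion to the mole fractions of the two types. The sketch below is an illustration under stated assumptions, not the published parameterization: mole fractions are computed from the residues that appear in the contact list, a pseudocount of 1 is added, and lower scores are taken to be more favorable.

```python
import math
from collections import Counter

def pair_potential(contacts, pseudocount=1.0):
    """Derive log-odds pair scores from observed residue-type contacts.

    contacts: iterable of (res_a, res_b) residue-type pairs seen in contact.
    Returns a dict keyed by the sorted pair; negative means favorable.
    """
    pair_counts = Counter(tuple(sorted(p)) for p in contacts)
    # Mole fractions taken from residues appearing in contacts (an assumption).
    residue_counts = Counter(r for pair in pair_counts.elements() for r in pair)
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    types = sorted(residue_counts)
    potential = {}
    for i, a in enumerate(types):
        for b in types[i:]:
            x_a = residue_counts[a] / total_residues
            x_b = residue_counts[b] / total_residues
            # Unordered pairs of different types can form in two ways.
            expected = total_pairs * x_a * x_b * (1 if a == b else 2)
            observed = pair_counts[(a, b)]
            ratio = (observed + pseudocount) / (expected + pseudocount)
            potential[(a, b)] = -math.log(ratio)
    return potential

def score_interface(potential, contacts):
    """Sum pair scores over the contacts of one docked complex."""
    return sum(potential.get(tuple(sorted(p)), 0.0) for p in contacts)
```

Docked complexes could then be ranked by `score_interface`, with the lowest totals screened first.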
Article
We present a new computational method of docking pairs of proteins by using spherical polar Fourier correlations to accelerate the search for candidate low-energy conformations. Interaction energies are estimated using a hydrophobic excluded volume model derived from the notion of “overlapping surface skins,” augmented by a rigorous but “soft” model of electrostatic complementarity. This approach has several advantages over former three-dimensional grid-based fast Fourier transform (FFT) docking correlation methods even though there is no analogue to the FFT in a spherical polar representation. For example, a complete search over all six rigid-body degrees of freedom can be performed by rotating and translating only the initial expansion coefficients, many infeasible orientations may be eliminated rapidly using only low-resolution terms, and the correlations are easily localized around known binding epitopes when this knowledge is available. Typical execution times on a single processor workstation range from 2 hours for a global search (5 × 10⁸ trial orientations) to a few minutes for a local search (over 6 × 10⁷ orientations). The method is illustrated with several domain dimer and enzyme–inhibitor complexes and 20 large antibody–antigen complexes, using both the bound and (when available) unbound subunits. The correct conformation of the complex is frequently identified when docking bound subunits, and a good docking orientation is ranked within the top 20 in 11 out of 18 cases when starting from unbound subunits. Proteins 2000;39:178–194. © 2000 Wiley-Liss, Inc.
Article
The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.