
On the Use of Knowledge-Based Potentials for the Evaluation of Models of Protein-Protein, Protein-DNA, and Protein-RNA Interactions

March 2014
Advances in Protein Chemistry and Structural Biology 94:77-120

DOI:10.1016/B978-0-12-800168-4.00004-4

Source
PubMed

Authors:

Oriol Fornés

University of British Columbia

Javier García-García

University Pompeu Fabra

Jaume Bonet

University Pompeu Fabra

Baldomero Oliva

University Pompeu Fabra

Proteins are the bricks and mortar of cells, playing structural and functional roles. To perform their function, they interact with each other as well as with other biomolecules such as DNA or RNA. Therefore, to understand the function of a protein, we need to know its partners and the atomic details of its interactions (i.e., the structure of the complex). However, the number of protein interactions with an experimentally determined three-dimensional structure is small. Computational techniques such as homology modeling are therefore essential to fill this gap. Protein interactions can be modeled using the interactions of homologous proteins as templates, if the structure of the complex is known, or using docking methods. In both approaches, estimating the quality of the models is essential. There are several ways to address this problem. In this review, we focus on the use of knowledge-based potentials for the analysis of protein interactions. We describe the procedure to derive statistical potentials and split them into different energetic terms that can be used for different purposes. We extensively discuss the fields where knowledge-based potentials have been successfully applied to (1) model protein–protein, protein–DNA, and protein–RNA interactions and (2) predict binding sites (in the protein and in the DNA). Moreover, we provide ready-to-use resources for docking and benchmarking protein interactions.

Coverage of the prediction of binding sites versus its minimum PPV. The Y axes show the ratio of proteins with a PPV equal to or greater than a threshold (X axes). We have used ODA (Fernandez-Recio et al., 2005) with a minimum pyDockODA score of −10 (A), the prediction based on “BS-$E_{local}$” with a minimum score of 2 (B), and the binding sites predicted by both (C). The testing dataset contains 85 nonredundant proteins.

…

Examples of PWM logos. PWM logos for Foxo1, Nr2e1, and Tbx, as described in the DREAM5 challenge compared with the predictions produced by the statistical potential “$E_{S3DC}$” and the state-of-the-art methods 3DTF and PiDNA. Logos were created with the R software environment (Bembom, 2007; R Core Team, 2013).

…

Definition of different DNA parameters used for deriving split-statistical potentials. Distances between amino acids and DNA are represented as blue lines; solid when displaying the minimal distance and dashed otherwise. Internal distances in the DNA are shown in orange. Environment features of the DNA for a contact between an amino acid and a dinucleotide at position “i” (see details in text for each definition): groove in contact with the amino acid (A); distance between the amino acid and the dinucleotide (B); strand in contact with the amino acid (C); and DNA chemical group in contact with the amino acid (D). Structural images were created with the UCSF Chimera package (Fraenkel & Pabo, 1998; Pettersen et al., 2004).

…

Pipeline for modeling transcription factors. Step 1: sequence homology search. Step 2: filter results of step 1 by sequence identity and coverage of the protein–DNA interface. Step 3: optimization of the alignment. Step 4: model-building of the three-dimensional (Continued)

…

.2 Benchmark datasets for docking

…

From Oriol Fornes, Javier Garcia-Garcia, Jaume Bonet, Baldo Oliva, On the Use of Knowledge-Based Potentials for the Evaluation of Models of Protein–Protein, Protein–DNA, and Protein–RNA Interactions. In Rossen Donev, editor: Advances in Protein Chemistry and Structural Biology, Vol. 94, Burlington: Academic Press.

CHAPTER FOUR

On the Use of Knowledge-Based Potentials for the Evaluation of Models of Protein–Protein, Protein–DNA, and Protein–RNA Interactions

Oriol Fornes, Javier Garcia-Garcia, Jaume Bonet, Baldo Oliva

Structural Bioinformatics Lab. (GRIB), Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Barcelona, Catalunya, Spain

Corresponding author: e-mail address: baldo.oliva@upf.edu

Contents

1. Introduction
2. Knowledge-Based Potentials
2.1 Split-statistical potentials
3. Modeling of Protein Interactions Using Templates
3.1 Models of binary complexes
3.2 Models of multimeric complexes
4. Modeling Interactions of Proteins Using Docking
4.1 Protein–protein docking
4.2 Protein–nucleic acid docking
5. Prediction of Protein-Binding Regions
5.1 Identification of protein interfaces
5.2 Prediction of DNA/RNA-binding proteins
6. Characterization of Transcription Factor-Binding Sites
6.1 Application of knowledge-based potentials on DREAM5 targets
7. Adapting Split-Statistical Potentials for Protein–DNA Interactions
7.1 Application of split-statistical potentials on DREAM5 targets
8. Conclusions
Acknowledgments
References

1. INTRODUCTION

During the past decade, hundreds of sequenced genomes have come to light, producing a vast number of protein sequences. Unraveling the function of these proteins has therefore become one of the major challenges in biology. It is widely accepted that the function of a protein can be predicted from its structure (Watson, Laskowski, & Thornton, 2005). But proteins rarely act alone; instead, they form networks of physical interactions with other biomolecules (i.e., protein–protein, protein–DNA, and protein–RNA interactions). Thus, to better understand the function of a protein, it is also necessary to know with which partners it associates and how, down to the atomic level.

The number of proteins with an experimentally determined three-dimensional (3D) structure in the Protein Data Bank (PDB) (Berman et al., 2000) is very low in comparison to the number of known protein sequences, even for well-characterized organisms (Sharan, Ulitsky, & Shamir, 2007), and even lower in the case of protein binary complexes (Kirsanov et al., 2012; Mosca, Céol, Stein, Olivella, & Aloy, 2013). The disproportion between the number of solved 3D structures and protein sequences has encouraged the development of many strategies to model the structure of proteins from their sequence (Dunbrack, 2006; Ginalski, 2006). These strategies have become the basis for the modeling of protein interactions. Protein–protein interactions can be modeled by using complexes of homologous proteins with known structure as templates. This approach relies on the principle that, given a pair of interacting proteins, their homologs will also interact (interologs approach; Garcia-Garcia, Schleker, Klein-Seetharaman, & Oliva, 2012; Matthews et al., 2001), and it is assumed that they will do so in a similar fashion. Occasionally, the models of protein–protein interactions can be constructed by superimposition of the models of the unbound partners over the structure of a template complex. They can also be obtained by docking the structure of one of the two proteins onto the other (Vajda & Kozakov, 2009), after modeling their unbound structures if necessary. We recently reviewed in detail the modeling of tertiary and quaternary structures of proteins and their role in protein–protein interaction networks (Garcia-Garcia, Bonet, et al., 2012; Planas-Iglesias, Bonet, Feliu, Gursoy, & Oliva, 2012). Similarly, we can also obtain models of protein–DNA and protein–RNA interactions, which additionally require modeling the nucleotide sequences of interest (Feig, Karanicolas, & Brooks, 2004; Lu & Olson, 2008).

Paired with the modeling of 3D structures, the estimation of their quality has become crucial. In this particular context, several methods have been developed to score models based on energies. One approach to address this problem is based on the derivation of knowledge-based potentials (also referred to as statistical potentials or potentials of mean force) (Sippl, 1990). Knowledge-based potentials have been used to (1) discriminate whether or not a model has the correct fold (Panjkovich, Melo, & Marti-Renom, 2008; Shen & Sali, 2006); (2) detect localized errors in protein structures (Wiederstein & Sippl, 2007); (3) predict the stability of mutant proteins (Zhou & Zhou, 2002); (4) select the closest near-native models from a set of decoys (Aloy & Oliva, 2009; Ferrada & Melo, 2009); (5) model protein–protein interactions (Aloy & Russell, 2003; Lu, Lu, & Skolnick, 2003); (6) analyze the outcome of docking experiments, including protein–protein (analyzed in Moal, Torchala, Bates, & Fernández-Recio, 2013), protein–DNA (Robertson & Varani, 2007; Takeda, Corona, & Guo, 2013; Xu, Yang, Liang, & Zhou, 2009), and protein–RNA (Pérez-Cano, Solernou, Pons, & Fernández-Recio, 2010; Tuszynska & Bujnicki, 2011; Zheng, Robertson, & Varani, 2007); (7) infer the ability of proteins to bind DNA (Gao & Skolnick, 2008, 2009; Zhao, Yang, & Zhou, 2010) and RNA (Zhao, Yang, & Zhou, 2011); (8) recognize the binding regions in proteins (Feliu, Aloy, & Oliva, 2011; Pérez-Cano & Fernández-Recio, 2010); and (9) identify transcription factor-binding sites (Alamanova, Stegmaier, & Kel, 2010; Angarica, Pérez, Vasconcelos, Collado-Vides, & Contreras-Moreira, 2008; Chen, Chien, et al., 2012; Liu, Guo, Li, & Xu, 2008; Xu et al., 2009).

In the following sections, we review the use of knowledge-based potentials for the analysis of protein interactions. In Section 2, we introduce knowledge-based potentials and we split them into different energetic terms. Sections 3–6 are devoted to different fields where knowledge-based potentials have successfully been applied. Specifically, we focus on (1) modeling of protein interactions (including homology modeling and integrative modeling); (2) docking of protein interactions (including protein–protein, protein–DNA, and protein–RNA); (3) prediction of protein-binding regions; (4) characterization of transcription factor-binding sites; and (5) prediction of DNA-binding sites. In Section 7, we adapt the procedure of split-statistical potentials (Aloy & Oliva, 2009) to predict protein–DNA interactions and DNA-binding sites.

2. KNOWLEDGE-BASED POTENTIALS

A knowledge-based potential is an energy function derived from the analysis of known protein structures. There are many methods to obtain such potentials, including the quasi-chemical (Miyazawa & Jernigan, 1985) and the potential of mean force (PMF) approximations (Sippl, 1990). We have used the general definition of knowledge-based potential described in Aloy and Oliva (2009) (i.e., Eq. 4.1):

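$$
\begin{aligned}
PMF(a,b) &= PMF_{std}(d_{a,b}) - k_B T \log\left(\frac{P(a,b \mid d_{a,b})}{P(a)\,P(b)}\right) \\
PMF_{std}(d_{a,b}) &= k_B T \log\left(\frac{P(d_{a,b})}{weight_{ref}}\right)
\end{aligned}
\tag{4.1}
$$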

where $k_B$ is the Boltzmann constant, $T$ is the standard temperature, and $d_{a,b}$ is the pairwise distance between a pair of residues $a,b$, with $P(a)$ and $P(b)$ being their respective probabilities. $P(a,b \mid d_{a,b})$ is the conditional probability of finding residues $a,b$ at a maximum distance $d_{a,b}$, and $P(d_{a,b})$ is the probability of observing any pair of residues up to that distance. Finally, $weight_{ref}$ is the reference state function. The probabilities $P(*)$ are approximated from the observed frequencies of interactions in a nonredundant set of PDB structures. Moreover, the distance can be calculated as the minimum distance between any pair of heavy atoms or as the pairwise distance between two specific atoms (e.g., between Cβ atoms; Cα for glycine residues). Also, Aloy and Oliva (2009) showed that the reference state function can be neglected for the comparison of decoys (see further in Section 2.1).

The application of Eq. (4.1) over all interacting pairs of residues $a$ and $b$ in a protein structure results in an estimation of its quality given in terms of energy (i.e., Eq. 4.2):

$$
E = \sum_{a,b} PMF(a,b) \tag{4.2}
$$

It has to be noted that, while for protein folding residues $a$ and $b$ belong to the same protein (or single protein chain), for protein–protein interactions, residues $a$ and $b$ belong to a pair of interacting proteins “A” and “B” (or different protein chains), respectively.
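As an illustration of Eqs. (4.1) and (4.2), the following minimal Python sketch estimates the probabilities from a toy list of observed contacts and sums the pair terms over the contacts of a model. It is a sketch under stated assumptions, not the authors' implementation: energies are in units of $k_BT$, the reference weight defaults to 1, pairs are treated as ordered (residue of chain A, residue of chain B), residue frequencies are pooled from the contacts themselves, the sign convention is the one in which pairs seen more often than expected receive lower energies, and all function names are hypothetical.

```python
import math
from collections import Counter

KT = 1.0  # assumption: energies are reported in units of k_B * T


def pair_pmf(p_ab_given_d, p_a, p_b, p_d, weight_ref=1.0):
    """Pair term of Eq. (4.1) under the sign convention stated above."""
    pmf_std = KT * math.log(p_d / weight_ref)
    return pmf_std - KT * math.log(p_ab_given_d / (p_a * p_b))


def estimate_probabilities(contacts, all_pairs_total):
    """Estimate P(a,b|d), P(a) and P(d) from observed frequencies.

    contacts: ordered residue pairs (a, b) found within the distance cutoff.
    all_pairs_total: number of residue pairs at any distance in the dataset.
    """
    pair_counts = Counter(contacts)
    res_counts = Counter(res for pair in contacts for res in pair)
    n_contacts = len(contacts)
    n_res = sum(res_counts.values())
    p_pair_given_d = {p: c / n_contacts for p, c in pair_counts.items()}
    p_res = {r: c / n_res for r, c in res_counts.items()}
    p_d = n_contacts / all_pairs_total
    return p_pair_given_d, p_res, p_d


def total_energy(model_contacts, p_pair_given_d, p_res, p_d):
    """Eq. (4.2): sum of the pair terms over the contacts of one model.

    Pairs never seen in the training contacts raise KeyError; a real
    implementation would need pseudocounts or another fallback.
    """
    return sum(
        pair_pmf(p_pair_given_d[(a, b)], p_res[a], p_res[b], p_d)
        for a, b in model_contacts
    )


# Toy training set: four contacts out of eight residue pairs in total.
training = [("K", "D"), ("K", "D"), ("L", "L"), ("K", "L")]
p_pair, p_res, p_d = estimate_probabilities(training, all_pairs_total=8)
energy = total_energy([("K", "D"), ("K", "L")], p_pair, p_res, p_d)
```

In this toy set the lysine–aspartate pair is enriched relative to its residue frequencies, so it receives a lower (more favorable) energy than the lysine–leucine pair.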

2.1. Split-statistical potentials

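Aloy and Oliva (2009) demonstrated that, using the Bayes theorem, Eq. (4.1) can be decomposed into several energetic terms, one of them including the reference state. We have selected some of these terms as potentials of mean force to score the quality of decoys (i.e., Eq. 4.3):

$$
\begin{aligned}
PMF_{pair}(a,b) &= -k_B T \log\left(\frac{P(a,b \mid d_{a,b})}{P(a)\,P(b)\,P(d_{a,b})}\right) \\
PMF_{local}(a,b) &= k_B T \log\left(\frac{P(a \mid \theta_a)}{P(a)}\right) + k_B T \log\left(\frac{P(b \mid \theta_b)}{P(b)}\right) \\
PMF_{3D}(a,b) &= -k_B T \log\left(P(d_{a,b})\right) \\
PMF_{3DC}(a,b) &= -k_B T \log\left(\frac{P(\theta_a,\theta_b \mid d_{a,b})}{P(\theta_a,\theta_b)}\right) \\
PMF_{S3DC}(a,b) &= -k_B T \log\left(\frac{P(a,b \mid d_{a,b},\theta_a,\theta_b)\,P(\theta_a,\theta_b)}{P(a,b \mid \theta_a,\theta_b)\,P(\theta_a,\theta_b \mid d_{a,b})}\right)
\end{aligned}
\tag{4.3}
$$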

where $\theta_a$ and $\theta_b$ are the environments of a pair of residues $a,b$, as defined by their hydrophobicity (i.e., polar or nonpolar), degree of exposure (i.e., buried or exposed), and surrounding secondary structure (i.e., α-helix, β-sheet, or coil). As an example, $P(a,b \mid d_{a,b},\theta_a,\theta_b)$ is the conditional probability of finding residues $a,b$, in their respective environments $\theta_a$ and $\theta_b$, at a maximum distance $d_{a,b}$ (see Aloy & Oliva, 2009 for more details). The statistical potentials $E_{pair}$, $E_{local}$, $E_{3DC}$, and $E_{S3DC}$ are defined using Eq. (4.2), with the corresponding subscripts between “E_” and “PMF_”. We name these potentials “split-statistical potentials”.

The statistical potential $E_{S3DC}$ can be understood as a refinement of the residue-pair statistical potential $E_{pair}$. It takes into account not only the residues that interact but also their environments. The statistical potential $E_{3DC}$ depends only on the occurrence of interacting environments without considering the specific interacting residues. The score $E_{local}$ is distance independent, and it reflects the probability of placing a residue in a specific environment. The energy term $E_{3D}$ concerns only the distance at which pairs of residues interact, and it increases together with the number of interacting residue pairs.
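As a short worked illustration of how the environment-dependent terms relate, consider adding the S3DC and 3DC terms of Eq. (4.3). Assuming both terms carry the same negative-log sign convention, the environment–distance factors cancel:

$$
\begin{aligned}
PMF_{S3DC}(a,b) + PMF_{3DC}(a,b)
&= -k_B T \log\left(\frac{P(a,b \mid d_{a,b},\theta_a,\theta_b)\,P(\theta_a,\theta_b)}{P(a,b \mid \theta_a,\theta_b)\,P(\theta_a,\theta_b \mid d_{a,b})}\right)
 - k_B T \log\left(\frac{P(\theta_a,\theta_b \mid d_{a,b})}{P(\theta_a,\theta_b)}\right) \\
&= -k_B T \log\left(\frac{P(a,b \mid d_{a,b},\theta_a,\theta_b)}{P(a,b \mid \theta_a,\theta_b)}\right)
\end{aligned}
$$

Under that assumption, the sum measures how much the distance condition enriches the residue pair beyond what the two environments alone would predict.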

The statistical potentials described in Eq. (4.3) differ in order of magnitude, and their values cannot be used directly for the comparison of conformational decoys. Therefore, they are translated into Z-scores. The Z-score of an energy (or score within a distribution) is defined as the difference between the energy (i.e., $E$) and the average of the energies in the distribution (i.e., $\mu$), divided by the standard deviation of the distribution (i.e., $\sigma$). In general, the background used to calculate a Z-score is a random distribution, which in the case of folds or interactions is obtained by shuffling the residues of one or two sequences, respectively. The translation of energies into Z-scores neglects the $E_{3D}$ term because it is independent of the sequence (i.e., $E_{3D} = \mu$ for any distribution of shuffled sequences). Also, Aloy and Oliva (2009) demonstrated that the distribution of the Z-score of the reference state was similar to the random distribution and could be neglected too. Therefore, neither the energy $E_{3D}$ nor the reference state was considered when selecting the best model among conformational decoys.
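The Z-score translation described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' code: the energy function is any user-supplied callable that scores a sequence threaded on a fixed structure, the background is built by shuffling a single sequence, the population standard deviation is used, and the function names and the default of 100 shuffles are hypothetical.

```python
import random
import statistics


def z_score(energy, background):
    """Z = (E - mu) / sigma for an energy against a background distribution."""
    mu = statistics.fmean(background)
    sigma = statistics.pstdev(background)
    if sigma == 0.0:
        # A sequence-independent term (such as a purely distance-based one)
        # has no spread over shuffled sequences, so its Z-score is undefined.
        raise ValueError("background energies have zero variance")
    return (energy - mu) / sigma


def shuffled_background(sequence, energy_fn, n_shuffles=100, seed=0):
    """Energies of randomly shuffled versions of the sequence."""
    rng = random.Random(seed)
    residues = list(sequence)
    energies = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        energies.append(energy_fn("".join(residues)))
    return energies
```

A term whose value does not change when the residues are shuffled produces a background with zero variance, which is why such a term carries no information for ranking decoys by Z-score.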

In a recent work, Feliu et al. (2011) modified the split-statistical potentials of Eq. (4.3) for their application to protein–protein interactions. The frequencies of amino acid pairs were extracted from residues belonging to different chains in the interface of protein complexes from the 3DID database (Stein, Céol, & Aloy, 2011). In protein–protein interactions, the Z-score of the reference state was still irrelevant. Therefore, the energetic term that included the reference state was assumed to be irrelevant too when ranking decoys of interactions (i.e., docking poses). However, the score $E_{3D}$ was associated with the extent of the interacting interface (it is proportional to the number of residues involved in the interface), and it was still valuable in the analysis of protein-docking decoys (see further in Section 4.1).

3. MODELING OF PROTEIN INTERACTIONS USING TEMPLATES

The continuous increase of structural data on protein complexes in the PDB has been exploited for modeling the structure of protein–protein interactions as well as the interactions of proteins with other biomolecules (i.e., protein–DNA or protein–RNA interactions) based on homology. However, when structures of the interaction are not available, docking methods can be used (see further in Section 4). In recent years, new approaches have been developed to assemble large macromolecular complexes by combining different experimental data to apply restraints upon complex assembly.

3.1. Models of binary complexes

The most common way of modeling protein interactions is via comparative modeling (reviewed in Planas-Iglesias et al., 2012). This approach can only be applied as long as there is a homologous structure of the interaction. Then, applications such as MODELLER (Eswar et al., 2006) are able to directly model the interaction of interest. Nevertheless, homology modeling is limited when the available homologs are too remote to help assign the correct fold (Rost, 1999). Still, even distantly related proteins may use the same binding regions to interact (Gao & Skolnick, 2010; Tuncbag, Gursoy, Guney, Nussinov, & Keskin, 2008; Zhang, Petrey, Norel, & Honig, 2010), which has been exploited by different authors to model protein–protein interactions. For example, in M-TASSER (Chen & Skolnick, 2008), protein sequences are threaded against a monomer template library. All threading solutions belonging to the same dimer template are then identified. If both monomers share less than 30% sequence identity with their templates in the dimer, the threaded dimer is evaluated with statistical potentials (Lu et al., 2003) and, when necessary, discarded. Next, the tertiary structure of each protein is obtained by rearrangement of continuous template fragments (Zhang & Skolnick, 2004). Finally, the quaternary structure is assembled by superimposition of both protein structures over the dimer template.

Recently, three different methods have been proposed for model-building the structure of protein–protein interactions on a genome-wide scale. In PRISM (Tuncbag, Gursoy, Nussinov, & Keskin, 2011), the structures (or models) of two proteins are aligned against a set of known protein–protein interfaces (i.e., template set). If the two complementary sides of a template interface are structurally similar to the proteins (each side to a different protein), then the proteins are predicted to interact and the interaction is modeled using the binding site, as dictated by the template interface. All models produced with this approach are refined to account for flexible changes and finally ranked. In PrePPI (Zhang et al., 2012), the individual structures of the proteins are searched in the PDB or in a database of homology models (i.e., SkyBase (Lee et al., 2010) and ModBase (Pieper et al., 2011)). This step is followed by the identification of close and remote homologs of the two partners. Then, if a PDB structure contains the interaction between the homologs of each partner, it is used as template and the interaction is modeled by superimposition. In order to calculate the reliability of the model, five different empirical structure-based scores are assigned and combined using a Bayesian network, which scores the quality of the structural model of the interaction. Finally, in Interactome3D (Mosca, Céol, & Aloy, 2013), the interaction is modeled in a similar fashion to PrePPI, but the structural coverage of the approach is increased by using templates of interacting domains from 3DID (Mosca, Céol, Stein, et al., 2013). The resulting models are finally evaluated with InterPreTS (Aloy & Russell, 2003).

3.2. Models of multimeric complexes

Methods described in Section 3.1 are useful for complexes formed by few molecules. However, the assembly of large macromolecular complexes requires an integrative structural modeling approach. The main idea behind this methodology is to characterize the structural and topological features of the complex in order to reduce the number of plausible solutions. For example, the Integrative Modeling Platform (IMP) (Russel et al., 2012) has been used to describe the yeast nuclear pore complex (Alber et al., 2007) and the structure of chromatin at megabase scale (Baù et al., 2011). The assembly of a complex in IMP is a cyclic procedure involving four different steps (reviewed in detail in Planas-Iglesias et al., 2012):

(1) Collecting the information regarding the complex. This step includes collecting experimental data from SAXS profiles (Schneidman-Duhovny, Hammel, & Sali, 2011), proteomics data (Alber, Förster, Korkin, Topf, & Sali, 2008), EM images (Lasker, Phillips, et al., 2010), density maps (Lasker, Sali, & Wolfson, 2010), nuclear magnetic resonance (NMR) spectroscopy (Simon, Madl, Mackereth, Nilges, & Sattler, 2010), or even 5C data (Baù et al., 2011). It also involves including physicochemical information, such as molecular mechanics force fields (Brooks et al., 1983) and potentials of mean force or statistically derived potentials (Shen & Sali, 2006).

(2) Selecting a method to represent the data and using the information collected in the previous step, translating it into spatial restraints. IMP uses structures solved at different resolutions. High-resolution structures can be represented by atoms, but low-resolution structures are represented by groups of atoms, such as residues, motifs, or even domains. The translation of information into spatial restraints is used to test the consistency of the model.

(3) Constructing a model that is consistent with the aforementioned spatial restraints. The entire rotational and translational 3D space is searched in order to position and orient each individual structure inside the complex.

(4) Evaluating the modeled complex. In theory, if there is only one native state of the complex, we should obtain a single model satisfying all restraints. In contrast, if the data used to encode the restraints are insufficient, more than one possible solution, or none, can be obtained.
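The role of spatial restraints in steps (2) to (4) can be illustrated with a small Python sketch. It is a generic toy, not the IMP API: the complex is represented as coarse-grained beads with 3D coordinates, each restraint is assumed to be an upper bound on the distance between two beads, and the function names are hypothetical.

```python
import math


def restraint_violations(coords, restraints):
    """Return the restraints a model violates and by how much.

    coords: mapping bead name -> (x, y, z) position of a coarse-grained bead.
    restraints: list of (bead_i, bead_j, max_distance) upper-bound restraints.
    """
    violations = {}
    for bead_i, bead_j, max_distance in restraints:
        excess = math.dist(coords[bead_i], coords[bead_j]) - max_distance
        if excess > 0:
            violations[(bead_i, bead_j)] = excess
    return violations


def restraint_score(coords, restraints):
    """Sum of squared violations; zero means every restraint is satisfied."""
    return sum(v ** 2 for v in restraint_violations(coords, restraints).values())


# Toy model with three beads and two distance restraints.
model = {"A": (0.0, 0.0, 0.0), "B": (3.0, 4.0, 0.0), "C": (0.0, 0.0, 10.0)}
restraints = [("A", "B", 6.0), ("A", "C", 8.0)]
score = restraint_score(model, restraints)
```

Candidate models generated during the search can be compared by such a score; several models with a score of zero would indicate that the restraints are insufficient to single out one solution.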

4. MODELING INTERACTIONS OF PROTEINS USING DOCKING

In contrast to the previous methods, which require structural knowledge of the interaction, docking is used for modeling the structure of an interaction formed by two or more molecules (e.g., two proteins) when the structure of the interaction is not available but the structures of the individual molecules are known (or can be modeled). Docking addresses the problem of finding the best-fit orientation of one molecule with respect to the other. This idea was first introduced more than 30 years ago by Wodak and Janin (1978). Since then, docking algorithms have largely improved (summarized in Table 4.1). The simplest method of docking two structures is to treat them as rigid bodies, usually using the Fast Fourier Transform technique (e.g., MolFit (Katchalski-Katzir et al., 1992), FTDock (Gabb et al., 1997), PIPER (Kozakov et al., 2006), and ZDOCK (Mintseris et al., 2007)) or geometric matching (e.g., Hex (Ritchie & Kemp, 2000) and FRODOCK (Garzon et al., 2009)). Moreover, several methods have been developed that take into consideration the flexibility of proteins, including Monte Carlo-based methods (e.g., RosettaDock; Gray et al., 2003), the High Ambiguity Driven biomolecular DOCKing (HADDOCK) (Dominguez, Boelens, & Bonvin, 2003), and the use of normal modes describing the conformational changes undergone upon binding (e.g., SwarmDock; Moal & Bates, 2010). However, it has been shown that for approximately 65% of interactions, proteins undergo little or no conformational change when they associate, while for only 15% of interactions, proteins undergo flexible deformations (Stein, Rueda, Panjkovich, Orozco, & Aloy, 2011). As rigid-body docking is the first step of docking, prior to the introduction of flexibility, we focus this section on rigid-body docking, which accounts for at least 65% of protein–protein interactions.

Table 4.1 Docking methods

| Program | Algorithm | Evaluation | Server | References |
|---|---|---|---|---|
| Rigid-body docking methods | | | | |
| ClusPro | FFT | Geometric fit, van der Waals, atomic desolvation energy, electrostatics, and knowledge-based potentials | http://cluspro.bu.edu/login.php | Comeau, Gatchell, Vajda, and Camacho (2004a) |
| CS | GM | Atomic desolvation energy | | Shentu, Al Hasan, Bystroff, and Zaki (2008) |
| DOT2 | FFT | Electrostatics and atomic desolvation energies | | Roberts, Thompson, Pique, Perez, and Ten Eyck (2013) |
| FRODOCK | GM | van der Waals, electrostatics, and atomic desolvation energies | http://frodock.chaconlab.org | Garzon et al. (2009) |
| FTDock | FFT | Hydrogen bonding, electrostatics, and RPScore (Moont, Gabb, & Sternberg, 1999) | | Gabb, Jackson, and Sternberg (1997) |
| GRAMM-X | FFT | Lennard-Jones potential, evolutionary conservation, knowledge-based potentials, van der Waals, and atomic contact energy | http://vakser.bioinformatics.ku.edu/resources/gramm/grammx | Tovchigrechko and Vakser (2006) |
| Hex | GM | Geometric fit and electrostatics | http://hexserver.loria.fr | Macindoe, Mavridis, Venkatraman, Devignes, and Ritchie (2010) |
| LZerD | GM | Geometric fit and atomic desolvation energy | | Venkatraman, Yang, Sael, and Kihara (2009) |
| MolFit | FFT | | | Katchalski-Katzir et al. (1992) |
| PatchDock | GM | Geometric fit and atomic desolvation energy | http://bioinfo3d.cs.tau.ac.il/PatchDock | Schneidman-Duhovny, Inbar, Nussinov, and Wolfson (2005) |
| PIPER | FFT | Geometric fit, electrostatics, and atomic desolvation energy | | Kozakov, Brenke, Comeau, and Vajda (2006) |
| pyDOCK | FFT | Electrostatics, desolvation energies, ODA (Fernandez-Recio, Totrov, Skorodumov, & Abagyan, 2005), and SIPPER (Pons, Talavera, de la Cruz, Orozco, & Fernandez-Recio, 2011) | http://life.bsc.es/servlet/pydock/home | Jiménez-García, Pons, and Fernández-Recio (2013) |
| shDock | GM | Collision filtering | | Gu, Koehl, Hass, and Amenta (2012) |
| SP-dock | GM | Atomic desolvation energy, electrostatics, hydrophobicity, and Lennard-Jones potential | | Axenopoulos, Daras, Papadopoulos, and Houstis (2013) |
| ZDOCK | FFT | Linear combination of atomistic potentials, and ZRANK2 (Pierce & Weng, 2008) | http://zdock.umassmed.edu | Mintseris et al. (2007) |
| Flexible docking methods | | | | |
| 3D-Garden | MC | Lennard-Jones potential and electrostatics | http://www.sbg.bio.ic.ac.uk/3dgarden | Lesk and Sternberg (2008) |
| ATTRACT | EM | Hydrophobic and hydrophilic contacts | | Schneider, Saladin, Fiorucci, Prévost, and Zacharias (2012) |
| FireDock | EM | van der Waals, electrostatics, atomic desolvation energies, hydrogen and disulfide bonds, π-stacking and aliphatic interactions, rotamer probabilities, etc. | http://bioinfo3d.cs.tau.ac.il/FireDock | Mashiach, Schneidman-Duhovny, Andrusier, Nussinov, and Wolfson (2008) |
| HADDOCK | MCS | van der Waals and electrostatics | http://haddock.chem.uu.nl | De Vries, van Dijk, and Bonvin (2010) |
| RosettaDock | MCS | van der Waals, hydrogen bonds, rotamer, knowledge-based potentials, electrostatics, and atomic solvation energies | http://antibody.graylab.jhu.edu | Lyskov and Gray (2008) |
| SwarmDock | NM | van der Waals and electrostatics | http://bmm.cancerresearchuk.org/SwarmDock | Torchala, Moal, Chaleil, Fernandez-Recio, and Bates (2013) |

EM, energy minimization; FFT, Fast Fourier Transform; GM, geometric matching; MC, marching cubes; MCS, Monte Carlo simulation; NM, normal modes.
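To make the idea of a rigid-body translational search concrete, the following Python sketch scans all integer shifts of a ligand grid over a receptor grid and keeps the shift with the best shape-complementarity score. It is a brute-force toy in two dimensions under stated assumptions, not the implementation of any program in Table 4.1: FFT-based programs evaluate the same kind of grid correlation far more efficiently and also scan rotations, and the weights used here (1 for surface cells, a large negative penalty for the receptor core) and the function names are illustrative choices.

```python
def correlation_score(receptor, ligand, shift):
    """Grid correlation between the receptor and the ligand moved by `shift`.

    receptor, ligand: mapping (x, y) grid cell -> weight; absent cells count 0.
    """
    dx, dy = shift
    return sum(
        receptor.get((x + dx, y + dy), 0.0) * weight
        for (x, y), weight in ligand.items()
    )


def best_translation(receptor, ligand, max_shift):
    """Exhaustively scan integer shifts and return the highest-scoring one."""
    shifts = [
        (dx, dy)
        for dx in range(-max_shift, max_shift + 1)
        for dy in range(-max_shift, max_shift + 1)
    ]
    return max(shifts, key=lambda s: correlation_score(receptor, ligand, s))


# Receptor: two core cells (penalized) surrounded by three surface cells.
receptor = {(0, 0): -15.0, (1, 0): -15.0, (2, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
# Ligand: two adjacent cells of unit weight.
ligand = {(0, 0): 1.0, (1, 0): 1.0}
pose = best_translation(receptor, ligand, max_shift=2)
```

In this toy, the best pose places both ligand cells on the receptor surface layer, while poses that overlap the receptor core are heavily penalized.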

A typical docking procedure between two molecules involves several steps (Vajda & Kozakov, 2009). It begins with a rigid-body docking search over the entire rotational and translational 3D space for the orientation and position of one structure (i.e., the ligand, usually the smaller structure) with respect to the other (i.e., the target or receptor). The resulting conformational predictions (i.e., docking poses or decoys) are then ranked using scoring functions, with the objective of assigning the highest scores to the poses most similar to the native structure. These poses are named closest-to-native or near-native structures. The definition of a near-native solution relies on small structural differences between a decoy and the 3D structure of the binary complex (i.e., the native conformation). Several criteria can be used to calculate these structural differences, but the most common measure, as established in the Critical Assessment of Predicted Interactions (CAPRI) (Janin et al., 2003), is the root mean square deviation (RMSD) between the decoy and the native conformation. However, the selection of residues for the comparison can vary: (1) when the whole structure of the receptor is used as reference to superimpose the poses, the RMSD shows the deviation in the location of the ligand (ligand-RMSD); and (2) when all residues in the interface of the native structure are selected, the RMSD shows the different disposition of the interface (I-RMSD). In CAPRI, a near-native prediction is achieved if the I-RMSD and the ligand-RMSD are smaller than 2 and 5 Å, respectively. Currently, this implies the prediction of more than 30% of the native residue–residue pairwise contacts and at least 50% of correctly identified contact residues (Janin et al., 2003). The best docking poses are then refined, allowing for conformational changes of the two unbound structures upon binding (Dobbins, Lesk, & Sternberg, 2008; Shen, Paschalidis, Vakili, & Vajda, 2008). Nevertheless, as observed in CAPRI, there are still some difficulties concerning the use of these methods (Janin, 2010; Lensink & Wodak, 2010). On the one hand, programs devoted to rigid-body docking do not simulate the conformational changes that can occur during complex formation. On the other hand, each available docking method is highly dependent on its scoring function, and none of them can produce a single correct solution among all the predictions (Moal et al., 2013).
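The RMSD-based definition of a near-native pose can be illustrated with a minimal sketch. It assumes that the decoy and the native complex are already superimposed on the receptor (as is the case for rigid-body poses generated around a fixed receptor), so that the ligand-RMSD reduces to a coordinate RMSD over matching ligand atoms. The function names `rmsd` and `is_near_native` are hypothetical, and the cutoffs are simply the 2 and 5 Å values quoted above.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered lists of (x, y, z) coordinates.

    No superposition is performed here: both sets are assumed to be
    expressed in the same (receptor-superimposed) reference frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def is_near_native(i_rmsd, ligand_rmsd, i_cutoff=2.0, ligand_cutoff=5.0):
    """Apply the I-RMSD and ligand-RMSD cutoffs (in Å) quoted in the text."""
    return i_rmsd < i_cutoff and ligand_rmsd < ligand_cutoff


# Toy example: a two-atom ligand displaced by 3 Å along x, receptor kept fixed.
native_ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
decoy_ligand = [(3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
ligand_rmsd = rmsd(native_ligand, decoy_ligand)
```

Computing the I-RMSD would additionally require an optimal superposition of the interface residues (e.g., with the Kabsch algorithm), which is omitted here for brevity.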


4.1. Protein–protein docking

Rigid-docking methods yield a large number of predictions (from 10,000 to more than 50,000), including many false positives. Thus, an important step is to identify those docking poses that are closer to the native structure (i.e., near-native) before any refinement takes place. At this point, the number of selected conformations typically ranges between 10 and 2000. There are two non-exclusive strategies to perform such a selection. The first strategy consists of reranking the docking conformations with a scoring function (e.g., CHARMM (Brooks et al., 1983), AMBER (Cornell et al., 1995), FOLD-X (Guerois, Nielsen, & Serrano, 2002), or ZRANK (Pierce & Weng, 2007, 2008)). The second strategy relies on clustering similar solutions by means of I-RMSD (Comeau, Gatchell, Vajda, & Camacho, 2004b) or ligand-RMSD (Ritchie & Kemp, 2000) in order to reduce redundant solutions and detect energetically favorable regions on the surface of the receptor (Moal & Bates, 2010).

4.1.1 Benchmarking

In order to assess the ability of docking approaches to distinguish between near-native and non-near-native structures, several benchmarks have been created (see Table 4.2). These datasets usually comprise a nonredundant set of real interactions for which the structures of the complex and (in most cases) of the unbound molecules are available. Benchmark targets are classified into three categories of difficulty based on the best I-RMSD obtained with the unbound conformations of the two proteins: easy, medium, and hard (hard cases usually involve large conformational changes between the bound and the unbound forms of the molecules).

4.1.2 Application of split-statistical potentials to rank docking decoys

In a recent work (Feliu et al., 2011), split-statistical potentials performed better than scoring functions encoding atomistic energy terms when applied to rank protein–protein docking poses from targets of the hard category of difficulty of the protein-docking benchmark version 3.0 (Hwang, Pierce, Mintseris, Janin, & Weng, 2008). Furthermore, the analysis over the whole benchmark revealed that “E_pair” and “E_S3DC” provided a fair amount of nonoverlapping results. Based on this observation, Feliu et al. (2011) defined a new ranking strategy, “MixRank”. In this strategy, they first considered the lists of decoys ranked by each of the two statistical potentials separately, and then selected the top-scored decoy from each list alternately. In order to avoid redundant predictions, they ignored decoys with less than 5 Å ligand-RMSD from any previous selection, which provided a better selection of near-native decoys (Feliu & Oliva, 2010). “MixRank” outperformed, for the medium and hard targets of the benchmark, other ranking methods such as RPScore (Moont et al., 1999), which is another statistical potential, or ZRANK (Mintseris et al., 2007), which is an atomistic-detailed scoring function. The main reason behind this result was the use of a rigid-body docking method (i.e., FTDock). Atomistic-detailed scoring functions, such as ZRANK, require an accurate model of the interaction to correctly rank the poses, which implies flexible docking, while coarse-grained potentials, such as “E_pair” and “E_S3DC”, are less affected by the quality of the model. Recently, Moal et al. (2013) presented an evaluation of 115 different scoring functions for ranking docking poses. Interestingly, “MixRank” and “E_S3DC” performed among the best 40 approaches in the analysis of docking decoys generated from the protein–protein docking benchmark version 4.0 (Hwang et al., 2010) with a flexible-docking approach (Moal & Bates, 2010). Still, the best results were obtained by the newest scoring functions, ZRANK2 (Pierce & Weng, 2008) and SIPPER (Pons et al., 2011), and by other atomistic potentials.

Table 4.2 Benchmark datasets for docking

| References | Interaction type | Description | Benchmark link |
| --- | --- | --- | --- |
| Hwang, Vreven, Janin, and Weng (2010) | Protein–protein | 176 complexes (121 easy, 30 medium, 25 hard) | http://zlab.umassmed.edu/benchmark/ |
| van Dijk and Bonvin (2008) | Protein–DNA | 47 complexes (13 easy, 22 medium, 12 hard) | http://haddock.science.uu.nl/dna/benchmark.html |
| Kim, Corona, Hong, and Guo (2011) | Protein–DNA | 38 complexes for rigid (21 easy, 17 hard) and flexible docking (18 easy, 19 hard) | http://bioinfozen.uncc.edu/tf-dna-benchmarks/ |
| Barik, Nithin, Manasa, and Bahadur (2012) | Protein–RNA | 45 complexes | http://www.facweb.iitkgp.ernet.in/rbahadur/benchmark.html |
| Pérez-Cano, Jiménez-García, and Fernández-Recio (2012) | Protein–RNA | 106 complexes (35 by homology modeling; 64 easy, 24 medium, 18 hard) | http://life.bsc.es/pid/protein-rna-benchmark/ |
| Huang and Zou (2013) | Protein–RNA | 72 complexes (49 easy, 12 medium, 7 hard) | http://zoulab.dalton.missouri.edu/RNAbenchmark/ |

Benchmarks for protein–protein, protein–DNA, and protein–RNA docking.
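The alternating selection behind “MixRank” (Section 4.1.2) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of Feliu et al. (2011): the function name `mixrank`, its parameters, and the callable `ligand_rmsd(a, b)` returning the ligand-RMSD between two decoys are all hypothetical. The only elements taken from the text are the alternation between the two ranked lists and the 5 Å redundancy cutoff.

```python
def mixrank(rank_a, rank_b, ligand_rmsd, cutoff=5.0, n_select=10):
    """Pick decoys alternately from two ranked lists (best decoy first).

    A decoy is skipped when its ligand-RMSD to any previously selected
    decoy is below `cutoff` (in Å), which also skips exact repeats.
    """
    lists = [list(rank_a), list(rank_b)]
    positions = [0, 0]      # next unread index in each list
    selected = []
    turn = 0                # which list is read next
    while len(selected) < n_select and any(
        positions[i] < len(lists[i]) for i in (0, 1)
    ):
        current = lists[turn]
        # Walk down the current list until a non-redundant decoy is found.
        while positions[turn] < len(current):
            decoy = current[positions[turn]]
            positions[turn] += 1
            if all(ligand_rmsd(decoy, s) >= cutoff for s in selected):
                selected.append(decoy)
                break
        turn = 1 - turn     # alternate between the two potentials
    return selected


# Toy example: decoys placed on a line, "ligand-RMSD" is their separation.
position = {"a": 0.0, "b": 1.0, "c": 10.0, "d": 20.0, "e": 21.0}


def toy_rmsd(x, y):
    return abs(position[x] - position[y])


picked = mixrank(["a", "b", "c", "d"], ["c", "e", "a", "b"], toy_rmsd)
```

In the toy example, decoy "b" is dropped because it lies within 5 Å of the already selected "a", and "e" is dropped because it lies within 5 Å of "d".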

4.2. Protein–nucleic acid docking

While the field of protein–protein docking is advancing fast, the progress of docking nucleic acids onto proteins lags behind. The flexibility of nucleic acids, and the difficulty of recognizing their interaction surface, has limited the number of docking studies involving proteins and nucleic acids (DNA (Knegtel, Antoon, Rullmann, Boelens, & Kaptein, 1994; Poulain, Saladin, Hartmann, & Prévost, 2008; Parisien, Freed, & Sosnick, 2012; van Dijk & Bonvin, 2010; van Dijk, Visscher, Kastritis, & Bonvin, 2013) and RNA (Pérez-Cano et al., 2010)). Similarly, there are only a few knowledge-based potentials specifically intended to rank protein–nucleic acid docking solutions.

Regarding the field of protein–DNA docking, Robertson and Varani

(2007) and Xu et al. (2009) designed two different all-atom statistical poten-

tials that showed similar results in identifying near-native structures from a

set of decoys generated with FTDock. Nevertheless, as shown in the previ-

ous section, atomistic-detailed potentials require more accurate conforma-

tions to correctly rank docking poses, while residue–residue potentials are

coarse-grained and less sensitive to small conformational changes, which

allows them to capture the dynamic nature of protein–DNA interactions

more accurately (Poulain et al., 2008). In this context, Takeda et al.

(2013) derived a residue-pair potential that accommodated the interaction

angles between amino acids and nucleotides. Their approach also showed

better performance than atomistic potentials in rigid-body docking between

protein and DNA.

With respect to protein–RNA docking, Zheng et al. (2007) adapted the statistical potential for scoring protein–DNA interactions (Robertson & Varani, 2007) to protein–RNA interactions. Their potential performed similarly to the more complex scoring function for protein–RNA interactions of ROSETTA (Chen, Kortemme, Robertson, Baker, & Varani, 2004). In addition, Tuszynska and Bujnicki (2011) built two statistical potentials dependent on the interaction distance and angles of the contact site of the nucleotide with the amino acids of the protein that penalized steric clashes occurring during docking.


5. PREDICTION OF PROTEIN-BINDING REGIONS

One of the major challenges to understand protein interactions is the

identification of the specific binding regions (i.e., interfaces). In the previous

section, we have seen that docking methods try to find the best possible

fitting between two or more molecules by exploring the whole rotational

and translational 3D space. Therefore, these methods benefit from the

knowledge about the interacting interfaces, which saves computational time

and eliminates many potentially wrong solutions. In particular, for protein–

DNA and protein–RNA interactions, the problem is two-sided: at the side

of the protein and at the side of the nucleic acid. In this section, we will focus

on the interface at the side of the protein, either for the interaction with

other proteins or for the interaction with nucleic acids. Several approaches have been developed for the prediction of protein-binding regions, but in the case of protein–DNA/RNA binding, the problem has been associated with whether the protein will interact with the nucleic acid or not. In this section, we address the two problems separately: first, the prediction of binding sites, and second, the prediction of proteins that bind nucleic acids.

5.1. Identification of protein interfaces

The most straightforward methods to experimentally define the interacting

region of a protein are based on the determination of its 3D structure (i.e.,

X-ray crystallography and NMR spectroscopy). Other experimental

approaches such as deletion experiments, alanine-scanning mutations,

yeast two-hybrid, or protein footprinting can be used to determine which

domains are involved in the interaction without the requirement of

structure (reviewed in Garcia-Garcia, Bonet, et al., 2012). Alternatively,

computational tools provide a significant advantage in terms of time- and

cost-effectiveness. We have split these computational tools according to

their input requirements into sequence-based and structure-based methods.

5.1.1 Methods based on sequence

It is known that protein interfaces share specific features that distinguish them from the rest of the protein (e.g., there is higher conservation of residues in interface regions due to evolutionary constraints; Valdar & Thornton, 2001). In addition, protein–protein interaction interfaces have been shown to bear specific physicochemical properties due to different amino acid composition propensities (Jones & Thornton, 1997). Moreover, as the conservation of residues is strongly dependent on their structural and functional importance, the degree of conservation has been used not only to predict binding sites but also to infer functional annotation. This is the case of ConSurf (Ashkenazy, Erez, Martz, Pupko, & Ben-Tal, 2010), a method that estimates the evolutionary rate of each protein residue from multiple sequence alignments using an empirical Bayesian or a maximum likelihood approach. Another method, FINDSITE (Brylinski & Skolnick, 2008), uses a different strategy based on binding-site similarity among superimposed groups of template structures identified by threading, which allows for the analysis of groups with low similarity. The combination of FINDSITE with databases such as DrugBank (Knox et al., 2011) and ChEMBL (Gaulton et al., 2011) has been useful in high-throughput virtual ligand screening (Zhou & Skolnick, 2013). A recent method, PIPE-Sites (Amos-Binks et al., 2011), exploits protein–protein interaction networks to detect reoccurring polypeptide sequences in order to infer specific binding sites. Finally, PSIFR (Pandit et al., 2010) combines different methodologies in a single server, including structure-based prediction tools such as TASSER (Zhang & Skolnick, 2004) and functional inference tools such as FINDSITE, among others.

The observed amino acid conservation in protein–DNA interfaces (Luscombe & Thornton, 2002) has also been exploited by many authors to predict nucleic acid-binding residues of a protein with different machine learning approaches. For example, BindN (Wang & Brown, 2006) predicts DNA- and RNA-binding residues using a support vector machine approach based on biochemical features of nucleic acid-binding amino acids, such as side chain pKa value, hydrophobicity index, and molecular mass. An evolution of the previous method, BindN+ (Wang, Huang, Yang, & Yang, 2010), incorporates evolutionary information as well. Similarly, DP-Bind (Hwang, Gou, & Kuznetsov, 2007) relies on support vector machines, kernel logistic regression, and penalized logistic regression based on amino acid composition and evolutionary profiles. Another approach, NAPS (Carson, Langlois, & Lu, 2010), combines a decision tree algorithm with bootstrap aggregation and cost-sensitive learning. Finally, metaDBSite (Si, Zhang, Lin, Schroeder, & Huang, 2011) predicts DNA-binding residues by integrating the predictions of six different methods (including BindN and DP-Bind).

5.1.2 Methods based on structure

Methods based on structure use features extracted from known 3D interfaces to predict protein-binding regions. In particular, Fernandez-Recio et al. (2005) used the Optimal Docking Area (ODA) of a protein based on atomic solvation parameters. This method looks for favorable energy changes when the residues involved in the interface become buried upon binding. In addition, a few methods for predicting protein–DNA/RNA-binding regions are based on structure too. For instance, DISPLAR (Tjong & Zhou, 2007) uses neural networks trained on known structures of protein–DNA interactions to predict the residues that contact DNA. The inputs to the neural network include position-specific sequence profiles and solvent accessibilities of each residue and its spatial neighbors. DNABINDPROT (Ozbek, Soner, Erman, & Haliloglu, 2010) exploits Gaussian network models to predict DNA-binding residues, based on the fluctuations of residues in high-frequency modes. In DR_bind (Chen, Wright, & Lim, 2012), the identification of DNA-binding residues is based on electrostatics, sequence conservation, and structural geometry. Regarding the prediction of RNA-binding sites, an evolution of ODA, the Optimal Protein-RNA Area (OPRA) (Pérez-Cano & Fernández-Recio, 2010), uses statistical potentials derived from the differential propensities of amino acids at protein–RNA interfaces, weighted by their accessible surface area, to predict RNA-binding regions in proteins. Furthermore, OPRA was used in protein–RNA docking and successfully selected near-native conformations of protein–RNA interactions by simply using the correct prediction of the protein residues involved in the interaction (Pérez-Cano et al., 2010).

5.1.3 Application of split-statistical potentials to predict protein-binding sites

The specific properties exhibited by protein interfaces are present in the split-statistical potentials derived from known interacting domains (Feliu et al., 2011). In fact, the statistical potential “E_local” is based on the probability of an amino acid to be in a certain environment, as defined by its hydrophobicity, degree of exposure, and secondary structure (see Section 2.1). In order to show the ability of split-statistical potentials to identify protein interfaces, we have tested both ODA and the potential “E_local” on the unbound structures retrieved from the protein docking benchmark version 3.0 (Hwang et al., 2008). The ODA predictions were obtained using the pyDock software (Cheng, Blundell, & Fernandez-Recio, 2007). In the case of “E_local”, the prediction of the binding site (i.e., “BS-E_local”) was obtained by scoring each residue on the protein surface and ranking the residues into a list. The score of a surface residue was calculated by averaging the Z-scores of “E_local” of the residues within a radius of 15 Å, as defined by the distances between their Cβ atoms (Cα for glycines). Then, binding regions were defined iteratively, starting from the top-ranked residue in the list. The first binding site was defined by the surface residues within a radius of 15 Å around the top-ranked residue. Residues belonging to a binding site were removed from the list, and the iteration was repeated until the next residue in the list had a negative score or there were no more residues left. The score of a binding region was defined as the sum of the scores of its residues.
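The iterative definition of binding regions described above can be sketched as follows. This is a simplified illustration, not the code used for the analysis. It assumes that one representative coordinate per surface residue is given (Cβ, or Cα for glycine), that the Z-scores are oriented so that higher values are more favorable, and that each new region only collects residues still present in the list. The function names and the toy data are hypothetical.

```python
import math


def residue_scores(coords, zscores, radius=15.0):
    """Average the Z-scores of all residues within `radius` Å of each residue."""
    scores = {}
    for res, point in coords.items():
        nearby = [zscores[other] for other, q in coords.items()
                  if math.dist(point, q) <= radius]   # includes the residue itself
        scores[res] = sum(nearby) / len(nearby)
    return scores


def binding_sites(coords, scores, radius=15.0):
    """Grow binding regions around the top-ranked residues, best first.

    Returns a list of (member residues, region score) tuples; iteration stops
    when the next residue has a negative score or no residues are left.
    """
    remaining = sorted(scores, key=scores.get, reverse=True)
    sites = []
    while remaining and scores[remaining[0]] >= 0:
        center = remaining[0]
        members = [r for r in remaining
                   if math.dist(coords[center], coords[r]) <= radius]
        sites.append((members, sum(scores[r] for r in members)))
        remaining = [r for r in remaining if r not in members]
    return sites


# Toy surface with four residues placed along the x axis (coordinates in Å).
toy_coords = {"A": (0.0, 0.0, 0.0), "B": (10.0, 0.0, 0.0),
              "C": (40.0, 0.0, 0.0), "D": (80.0, 0.0, 0.0)}
toy_z = {"A": 2.0, "B": 1.0, "C": 0.5, "D": -1.0}
toy_scores = residue_scores(toy_coords, toy_z)
toy_sites = binding_sites(toy_coords, toy_scores)
```

In this toy case, residues A and B form the first region, C forms a second one, and D is never used because its score is negative.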

In Fig. 4.1, we show the performance of ODA, “BS-E_local”, and their combination (i.e., residues predicted by both methods to be in the binding site), in terms of the percentage of proteins with a minimum positive predictive value (PPV) of the predicted residues to be involved in the real binding site (see details in the legend). Results were compared with a background distribution of random predictions with a similar distribution of binding sites along the sequence (i.e., preserving the topology) as the predictions produced by each method. On the one hand, it is noteworthy that the combination of ODA and “BS-E_local” yielded predictions that reached PPVs higher than 75% for about 40% of proteins of the benchmark. On the other hand, each method could be applied to more than 80% of proteins of the benchmark, but they achieved PPVs higher than 75% for less than 40% of the proteins. Besides, the individual performances of ODA and “BS-E_local” were not strikingly different from random predictions, while the combination of both methods differed considerably from the distribution of topologically similar predictions (i.e., random predictions), thus being more significant.

Figure 4.1 Coverage of the prediction of binding sites versus its minimum PPV. The Y axes show the ratio of proteins with a PPV equal to or greater than a threshold (X axes). We have used ODA (Fernandez-Recio et al., 2005) with a minimum pyDockODA score of −10 (A), the prediction based on “BS-E_local” with a minimum score of 2 (B), and the binding sites predicted by both (C). The testing dataset contains 85 nonredundant proteins extracted from the docking benchmark 3.0 for which we know the real binding region. PPV is defined as the proportion of correctly predicted residues for each protein over the total number of predicted residues. The binding interface for a protein is defined as the set of residues found to be closer than 12 Å to any other interacting protein reported in the PDB database (Berman et al., 2000), which includes the interacting partner in the benchmark. In order to validate the quality of the prediction, we have calculated the background distribution of obtaining the same PPV thresholds by a random selection of the same number of residues as the actual prediction with ODA (A), “BS-E_local” (B), or both (C). The background distribution is shown in boxplots, and it is calculated using sliding windows of the size of each fragment of predicted residues. This definition allows us to compare predictions with similar topology. A horizontal dashed line indicates the applicability of each method (i.e., proportion of proteins with at least one residue predicted in the binding site).
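The PPV and the coverage plotted in Fig. 4.1 follow directly from their definitions, as the short sketch below illustrates. The function names are hypothetical, the residue identifiers are invented for the example, and the combined prediction is taken to be the intersection of the residues predicted by the two methods, as stated in the text.

```python
def ppv(predicted, interface):
    """Correctly predicted residues divided by all predicted residues.

    Returns None when nothing was predicted (method not applicable).
    """
    predicted = set(predicted)
    if not predicted:
        return None
    return len(predicted & set(interface)) / len(predicted)


def coverage_at(ppv_values, threshold):
    """Fraction of all proteins whose PPV is at least `threshold`."""
    hits = sum(1 for v in ppv_values if v is not None and v >= threshold)
    return hits / len(ppv_values)


# Toy protein: residues are identified by their sequence number.
interface = {10, 11, 12, 13}
oda_prediction = {10, 11, 40, 41}
bs_prediction = {10, 11, 12, 50}
combined = oda_prediction & bs_prediction   # residues predicted by both

toy_ppvs = [ppv(oda_prediction, interface),
            ppv(bs_prediction, interface),
            ppv(combined, interface),
            ppv(set(), interface)]
```

Counting proteins without any prediction in the denominator of `coverage_at` is one possible convention. It makes the curve reflect the applicability of a method as well as its precision.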

5.2. Prediction of DNA/RNA-binding proteins

DNA- and RNA-binding proteins can be discriminated from others just from their amino acid sequences using different features, such as amino acid composition (Ahmad, Gromiha, & Sarai, 2004; Yu, Cao, Cai, Shi, & Li, 2006) or evolutionary profiles (Kumar, Gromiha, & Raghava, 2007, 2011; Nimrod, Schushan, Szilágyi, Leslie, & Ben-Tal, 2010). Also, the ability of a protein to bind nucleic acids can be predicted using statistical potentials. For example, DBD-Hunter (Gao & Skolnick, 2008) is a Web server for predicting DNA-binding proteins that combines structural comparisons and evaluation with statistical potentials. Briefly, it scans a given protein structure against a template library composed of 179 protein–DNA complex structures using a structural alignment program (Zhang & Skolnick, 2005). All templates that produce a good structural alignment with the query protein are then evaluated with statistical potentials. Specifically, the statistical potential is applied to score all protein–DNA contacts within a distance of 4.5 Å. The potential also considers whether the contact occurs through the phosphate, sugar, pyrimidine, or imidazole groups. This approach performed better than classical sequence homology-based approaches (i.e., PSI-BLAST; Altschul et al., 1997). An improved version of the previous method, DBD-Threader (Gao & Skolnick, 2009), has the advantage that it only requires the sequence of a protein as input. The sequence is then threaded against the previous template library and, for the best solutions, the interaction score between the threaded sequence and the template DNA is calculated. The exact same procedure of DBD-Hunter, but using an all-atom statistical potential (Xu et al., 2009), has also been applied to predict DNA- (Zhao et al., 2010) as well as RNA-binding proteins (Zhao et al., 2011).


6. CHARACTERIZATION OF TRANSCRIPTION FACTOR-BINDING SITES

In the previous section, we have focused on identifying the binding

regions of proteins. However, in protein–DNA/RNA interactions, the

nucleic acid also contains specific regions that are recognized by the protein.

In particular, transcription factors (TFs) can promote or restrain gene tran-

scription by binding to specific nucleotide sequences (i.e., binding sites)

distributed along the genome. Binding sites are often represented with a

position weight matrix (PWM) reflecting the observed degeneracy among

the recognition sites of TFs. PWMs have been exploited by many methods

to search for novel targets of TFs (reviewed in Bulyk, 2003). Therefore, the

identification of TF-binding sites is an important step towards the under-

standing of many biological processes. During the past years, several exper-

imental methods have emerged with the objective to characterize

TF-binding sites (reviewed in Xie, Hu, Qian, Blackshaw, & Zhu, 2011).

Nevertheless, their application is laborious and expensive and, as a result,

they have only been applied to a small fraction of human proteins

(Hu et al., 2009). As an alternative, computational tools can be employed

to predict TF-binding sites. A well-established procedure consists in

searching for over-represented DNA sequences in the promoter regions

of genes regulated by a TF with a motif discovery algorithm (analyzed in

Das & Dai, 2007), but the success of these approaches depends on the avail-

ability of enough sequences for pattern discovery, mainly derived from

ChIP-seq, ChIP-exo, and protein-binding microarrays (Grau, Posch,

Grosse, & Keilwagen, 2013).

Another successful strategy currently employed is the analysis of TF–DNA complex structures with statistical potentials. Briefly, the TF is confronted with different DNA sequences and the binding energies of the resulting complexes are analyzed. Those sequences with the best energy are considered to be bound by the TF and are incorporated into a PWM. For example, Angarica et al. (2008) created an algorithm that, given a TF–DNA complex, mutated all nucleotide positions one by one using the 3DNA package (Lu & Olson, 2008), until all possibilities were covered (i.e., A, C, G, and T). The mutated sequences were then scored with a knowledge-based potential and the 50 best oligonucleotides were used to construct a PWM. In another work, Liu et al. (2008) developed a method based on protein–DNA docking coupled with threading of DNA sequences. They were able to predict 50% of experimentally determined sites for the cAMP regulatory protein (CRP) in the top 1% among all 639,232 possible solutions. They also made a de novo prediction by modeling the ferric uptake regulator in complex with DNA, which showed similar results to CRP. Later on, Xu et al. (2009) calculated the PWMs for different TFs by decomposing the binding energies of the FIRE potential into individual contributions of each base. The FIRE potential was first described by Zhou and Zhou (2002), and it was used mostly in homology modeling. Afterwards, FIRE was readjusted so that it could be applied to predict protein–protein and protein–DNA interactions (Zhang, Liu, Zhu, & Zhou, 2005). More recently, Alamanova et al. (2010) used an all-atom statistical potential (Robertson & Varani, 2007) in combination with the MMTSB tool set (Feig et al., 2004) to recover the PWMs of various members of two widely studied families of TFs, p53 and NF-κB. In particular, they were able to create very accurate PWMs for the p53 tetramer and the p50 dimer as well as for the p50p65 and p50RelB heterodimers. They also obtained very good results with p63 and p73 dimers built by homology modeling using the p53 DNA-binding domain as template. Finally, Chen, Chien, et al. (2012) established a procedure to predict a PWM when no protein–DNA complex is available. They superimposed the unbound structure of a TF over the closest homolog TF structure in complex with DNA. Then, the PWM was estimated as in the work of Xu et al. (2009).
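The step in which the best-scoring sequences are incorporated into a PWM can be illustrated with a minimal sketch. It assumes that every candidate DNA sequence has already been scored with some statistical potential and that lower energies are better. The function names, the toy energies, and the choice of a plain frequency matrix (without pseudocounts or log-odds transformation) are assumptions made for illustration only.

```python
def top_sequences(scored, n=50):
    """Return the n sequences with the lowest (most favorable) energy."""
    return [seq for seq, _ in sorted(scored, key=lambda item: item[1])[:n]]


def frequency_matrix(sequences, alphabet="ACGT"):
    """Per-position base frequencies of a set of equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for i in range(length):
        column = [seq[i] for seq in sequences]
        matrix.append({base: column.count(base) / len(sequences)
                       for base in alphabet})
    return matrix


# Toy example: four trinucleotides with hypothetical binding energies.
scored_sequences = [("ACG", -3.0), ("ACT", -2.5), ("TCG", -1.0), ("GGG", 5.0)]
best = top_sequences(scored_sequences, n=3)
toy_matrix = frequency_matrix(best)
```

With real data, `n` would be set to the number of oligonucleotides retained by the method at hand (50 in the example of Angarica et al., 2008).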

Although knowledge-based potentials have been a good alternative to infer TF-binding sites, their application still has some limitations. One of them is the lack of templates due to the small number of TF–DNA complex structures available in the PDB. To avoid any bias, statistical potentials are usually derived from a nonredundant dataset of structures. This redundancy is generally removed on the TF side of the complex. Yet, TFs can recognize different binding sites, and in addition, members of the same family of TFs can bind to distinct DNA sequences (Luscombe & Thornton, 2002). For this reason, the removal of redundancy can generate statistical potentials suffering from low counts and, at the same time, low diversity of binding patterns. Another problem arises because statistical potentials are applied under the assumption that the contributions of the different DNA base pairs to the binding energy of the complex are independent from each other, which is not true (Benos, Bulyk, & Stormo, 2002). Recently, AlQuraishi and McAdams (2013) addressed the coverage problem by combining TF–DNA structures with experimentally determined PWMs. The inclusion of PWM data adapted the statistical potential to the varying binding preferences of TFs for different binding sites. Still, they highlighted that the use of PWMs cannot account for interposition dependencies among base pairs.
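The independence assumption mentioned above can be made concrete with a small sketch. Under an additive model, the energy of a sequence is the sum of one term per position, so the effect of a double mutation is exactly the sum of the effects of the two single mutations. The per-position energies below are hypothetical values chosen only to illustrate this property.

```python
def additive_energy(position_energy, sequence):
    """Binding energy under the independence (additive) assumption."""
    return sum(position_energy[i][base] for i, base in enumerate(sequence))


# Hypothetical per-position contributions for a two-base site.
position_energy = [
    {"A": -1.0, "C": 0.5, "G": 0.25, "T": 0.0},
    {"A": 0.0, "C": -2.0, "G": 0.5, "T": 0.25},
]

wild_type = additive_energy(position_energy, "AC")
single_1 = additive_energy(position_energy, "GC") - wild_type   # mutate position 1
single_2 = additive_energy(position_energy, "AT") - wild_type   # mutate position 2
double = additive_energy(position_energy, "GT") - wild_type     # mutate both
```

Any experimentally observed deviation of the double-mutant effect from `single_1 + single_2` corresponds to an interposition dependency that neither an additive potential nor a PWM can represent.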

6.1. Application of knowledge-based potentials on DREAM5 targets

In order to evaluate the real capability of statistical potentials in PWM prediction, we have tested two methods available online, 3DTF (Gabdoulline, Eckweiler, Kel, & Stegmaier, 2012) and PiDNA (Lin & Chen, 2013), on 83 mouse TFs from the DREAM5 TF–DNA Motif Recognition Challenge (Weirauch et al., 2013). These two servers only require a TF–DNA complex structure in PDB format as input and return the predicted PWM of the TF as output.

6.1.1 Modeling TF–DNA complexes

Since there were no TF–DNA complex structures available in the PDB for

the majority of the DREAM5 targets, we developed a novel modeling pro-

tocol that allowed us to obtain a TF–DNA model for a total of 71 DREAM5

targets. An overview of the procedure is shown in Fig. 4.2. In step 1, for each

TF target, we searched the best template in a database of TF–DNA com-

plexes using BLAST (Altschul et al., 1997). The database was obtained by

selecting from the PDB all TF–DNA complex structures annotated in the

TFinDit depository (Turner, Kim, & Guo, 2012) that, according to

3DNA (Lu & Olson, 2008), contained a double-stranded DNA of at least

eight base pairs. Then, we identified all dimers in the database by grouping

any two protein chains from the same PDB that (1) had at least one common

contact with the DNA and (2) had more than five residue–residue contacts

between them as to form a binary complex (Mosca, Ce

´ol, & Aloy, 2013). In

step 2, BLAST hits were filtered according to two criteria: (1) enough per-

centage of sequence identity and (2) no gaps or insertions in the region of the

interface. With respect to the percentage of identity, based on a recent work

where we observed that TFs sharing little sequence identity can still bind to

the same genes (Gitter et al., 2009), we included distantly related sequences

according to Rost’s sequence identity curve (Rost, 1999), using parameters

adjusted to ensure a 99% precision rate (i.e., n¼5). In step 3, the template

sequence that passed the filter and had the best BLAST e-value was realigned

with the TF using matcher, from the EMBOSS package (Rice, Longden, &

Bleasby, 2000). In step 4, the alignment was used to create a structural model

of the TF with MODELLER (Eswar et al., 2006). Models were created

applying 3D restraints between Caatoms. This is the pairwise distance from

the Caatom of each residue to the Caatoms of any residues within a radius

100 Oriol Fornes et al.

Author's personal copy

Figure 4.2 Pipeline for modeling transcription factors. Step 1: sequence homology search.

Step 2: filter results of step 1 by sequence identity and coverage of the protein–DNA inter-

face. Step 3: optimization of the alignment. Step 4: model-building of the three-dimensional

(Continued)

101Statistical Potentials for Protein Interactions

Author's personal copy

of 15 A

˚conserved between the template and the model. In step 5, the final

TF–DNA complex was obtained by superimposition of the protein model

on the template using PyMOL (Schrödinger, 2010). Additionally, for TFs

from the bHLH and bZIP families, since they recognize DNA as homo-

or heterodimers, we modeled the dimer: if the two monomers were found

among the unfiltered hits in step 2, the dimer was obtained as before (i.e.,

steps 3–5), but using both template hits; otherwise, if only one monomer

could be modeled, it was superimposed as in step 5 on both template chains

of the dimer. Table 4.3 shows the 71 DREAM5 targets that could be

modeled following this procedure, the percentage of sequence identity,

and coverage of the pairwise alignments between the TFs and their tem-

plates, and the resulting RMSD of the superimposition.
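The identity filter of step 2 can be sketched in a few lines. This is only an illustration: it assumes the commonly cited form of the length-dependent curve of Rost (1999), p(L) = n + 480·L^(−0.32·(1 + e^(−L/1000))), with n = 5 as stated above. The function names and the example hits are hypothetical, and the second criterion (no gaps or insertions at the interface) is not shown.

```python
import math


def rost_identity_threshold(aligned_length: int, n: float = 5.0) -> float:
    """Minimum percentage of identity required for a given alignment length.

    Assumed form of the curve (Rost, 1999):
        p(L) = n + 480 * L ** (-0.32 * (1 + exp(-L / 1000)))
    where n shifts the curve upward (n = 5 in the text).
    """
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    exponent = -0.32 * (1.0 + math.exp(-aligned_length / 1000.0))
    # Very short alignments would need more than 100% identity; cap at 100.
    return min(100.0, n + 480.0 * aligned_length ** exponent)


def passes_identity_filter(percent_identity: float, aligned_length: int,
                           n: float = 5.0) -> bool:
    """True if the hit lies on or above the curve."""
    return percent_identity >= rost_identity_threshold(aligned_length, n)


# Hypothetical BLAST hits: (name, % identity, aligned length).
hits = [("hit_A", 36.0, 250), ("hit_B", 22.0, 80)]
kept = [name for name, pid, length in hits
        if passes_identity_filter(pid, length)]
```

Under this assumed curve, a long alignment is accepted at a lower identity than a short one, which is what allows distantly related templates to be kept.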

6.1.2 Analysis of PWM predictions

A first analysis revealed that PiDNA is very sensitive to the format of input files. It occasionally failed even though, in all the models we had produced, the DNA molecule had at least eight base pairs in the correct format (according to 3DNA). In contrast, 3DTF could interpret all except one model, but it produced uniform PWMs for 46 TFs (i.e., PWMs with null capacity of discrimination). As a result of this analysis, the applicability of PiDNA (28/71) was slightly better than that of 3DTF (24/71). Furthermore, out of the 13 different families of TFs taken from the DREAM5 challenge that could be modeled, 3DTF and PiDNA could only make predictions for seven of them. Table 4.3

shows the quality of the predictions by means of comparing the PWMs produced by 3DTF and PiDNA with the real PWMs, using Tomtom (Gupta, Stamatoyannopoulos, Bailey, & Noble, 2007), as distributed in the MEME package (Bailey et al., 2009). Tomtom calculates the similarity between a pair of PWMs by means of a P-value. Using a P-value threshold of 10⁻³, PiDNA correctly predicted the PWM for 10 TFs, and 3DTF for 7 (five of which were common to both). Besides, we observed that PWM predictions deteriorated together with the alignment between the target and the template. One possible reason is that both 3DTF and PiDNA

Figure 4.2—Cont'd structure of the TF. Step 5: superimposition of the model over the template. If the TF works as a homodimer and only one monomer can be modeled using a heterodimer as template, the model is superimposed on each chain of the template to construct the homodimer. Structural images were created with the UCSF Chimera package (Fraenkel & Pabo, 1998; Glover & Harrison, 1995; Pettersen et al., 2004).


Table 4.3 PWM predictions for targets of the DREAM5 challenge

TF Family PDB Chain %ID %Cov RMSD 3DTF PiDNA E_S3DC
Egr2 C2H2 ZF 1p47 A 94 100 0.10 1.8 × 10⁻²
Esr1 NR 1hcq A 100 98 0.02 2.1 × 10⁻⁴
Esrrb NR 3dzy A 36 99 1.08 - 1.5 × 10⁻³ 3.2 × 10⁻³
Esrrg NR 3dzy A 36 99 0.34 1.7 × 10⁻³ 1.4 × 10⁻⁴
Foxc2 Forkhead 1vtn C 72 96 0.01 - 8.4 × 10⁻³
Foxo1 Forkhead 3co6 C 100 100 1.59 1.4 × 10⁻⁴ 1.9 × 10⁻⁴
Foxo3 Forkhead 2uzk A 100 98 0.04 - 7.7 × 10⁻⁶ 6.6 × 10⁻³
Foxo4 Forkhead 2uzk A 83 88 0.04 - 4.5 × 10⁻⁴ 4.9 × 10⁻³
Foxo6 Forkhead 3co6 C 91 100 1.55 3.9 × 10⁻³ 3.1 × 10⁻³
Gata4 GATA 4hc7 A 86 97 1.11 - 2.9 × 10⁻³
Hmga2 AT hook 2eze A 80 80 0.35 8.2 × 10⁻⁴
Klf12 C2H2 ZF 2wbu A 78 97 0.06 3.4 × 10⁻⁴ 1.1 × 10⁻³
Klf8 C2H2 ZF 2wbu A 75 97 0.09 7.7 × 10⁻⁵ 1.1 × 10⁻⁴
Nr2e1 NR 3e00 A 57 23 0.01 1.8 × 10⁻⁴
Nr2f1 NR 3dzy A 45 99 2.33 - 3.0 × 10⁻³
Nr2f6 NR 3e00 A 39 54 4.61 - 3.3 × 10⁻⁷
Pou3f1 Hom 2xsd C 100 100 0.07 - 1.9 × 10⁻³
Sox6 Sox 3f27 D 56 98 0.04 - 3.9 × 10⁻³
Sp1 C2H2 ZF 2wbu A 57 97 0.08 - 4.8 × 10⁻³ 2.6 × 10⁻⁴
Tbx1 T-box 4a04 B 100 98 0.02 9.3 × 10⁻⁷ 9.6 × 10⁻⁶ 4.5 × 10⁻³
Tbx20 T-box 4a04 A 66 99 0.02 7.7 × 10⁻⁶ 2.2 × 10⁻⁵ 7.7 × 10⁻⁴
Tbx4 T-box 2x6v A 93 99 0.01 8.2 × 10⁻⁷ 1.3 × 10⁻⁶
Tbx5 T-box 2x6v A 100 100 0.01 1.1 × 10⁻⁶ 2.2 × 10⁻⁶
Tcf3 bHLH 2ql2 C 100 0.09 3.12 2.2 × 10⁻³
Tcfec bHLH 4ati B 100 0.18 0.10 1.5 × 10⁻³
Zfp202 C2H2 ZF 2i13 A 51 100 0.38 1.1 × 10⁻²

Transcription factors (TF) from the DREAM5 challenge and their families are shown in the first columns. Families “NR” and “Hom” stand for “nuclear receptor” and “homeodomain”, respectively. PDB codes and chains of the templates used to model the TFs are shown in the next columns. This is followed by the quality of the model shown by means of the percentage of sequence identity (%ID) and template coverage (%Cov) of the sequence alignment, and the RMSD of the superimposition. For dimers, the information regarding each monomer can be found in separate lines. Asterisks indicate that the homodimer was built by superimposing the model of one chain to both chains of the template heterodimer. The significance of similarity between the predicted and the real PWMs is shown with the P-value for 3DTF and PiDNA, and the statistical potential “E_S3DC”. A hyphen indicates that the P-value is not significant and the cell is left empty when the method failed to produce a PWM.

Note: Only TFs with significant predictions are shown.

rely on all-atom statistical potentials, which are sensitive to the wrong orientation of amino acid side chains that could occur upon modeling.
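A uniform PWM, as reported above for many 3DTF predictions, is one whose columns carry no information. The sketch below shows one standard way to check this, using the Shannon information content per column. It is an illustration only, not necessarily how the uniform PWMs were identified in this analysis, and the function names and example matrices are hypothetical.

```python
import math


def column_information(column):
    """Information content (bits) of one PWM column ordered as A, C, G, T."""
    total = sum(column)
    probs = [value / total for value in column]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA


def is_uniform_pwm(pwm, tolerance=1e-6):
    """True if every column is (almost) flat, i.e., has no discriminative power."""
    return all(column_information(column) <= tolerance for column in pwm)


# Hypothetical matrices: one flat, one with an informative first column.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 6
specific = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
```

A column with equal probabilities for the four nucleotides yields 0 bits, whereas a column that admits a single nucleotide yields the maximum of 2 bits.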

7. ADAPTING SPLIT-STATISTICAL POTENTIALS FOR PROTEIN–DNA INTERACTIONS

As shown in Section 6.1, much improvement is required in the area of

TF-binding site prediction based on structure (i.e., via statistical potentials).

In this section, we propose a series of changes to the previously described

split-statistical potentials for protein folding (Aloy & Oliva, 2009) and

protein–protein interactions (Feliu et al., 2011) in order to adapt them to

protein–DNA interactions.

The application of split-statistical potentials to protein–DNA interactions requires the definition of an environment for nucleotides. Moreover, in order to address the additivity problem (Benos et al., 2002), we have described statistical potentials for dinucleotides (i.e., two consecutive nucleotides along the DNA sequence). Therefore, the DNA environment of a dinucleotide is defined by its constituting bases (i.e., any combination of two purines and pyrimidines) and three features regarding the interaction between the amino acid and the dinucleotide: (1) the strand (i.e., forward or reverse) that is closer to the amino acid; (2) the DNA groove (i.e., major or minor) in which, or close to which, the amino acid is located; and (3) the closest chemical group of the dinucleotide (i.e., nucleobase or deoxyribose phosphate) to the amino acid (see Fig. 4.3 and Section 7.1.1 for more details).

The definition of environments yields several residue–environment combinations. For amino acids, we consider 20 residues and 6 different environments as before (i.e., helix, coil, or strand, and being buried or exposed). This produces a total of 120 combinations of amino acids and environments. In contrast, we consider 16 dinucleotides (i.e., 4² different combinations of two nucleotides) and 8 environments: 2 for the closest strand, 2 for the closest DNA groove, and 2 for the closest chemical group of the dinucleotide. These definitions produce a total of 128 dinucleotide–environment combinations.
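The counts above can be checked by enumerating the combinations explicitly. The labels used below (e.g., "backbone" for the deoxyribose phosphate) are chosen for illustration and are not taken from any published implementation.

```python
from itertools import product

# Protein side: 20 residues x 3 secondary structures x 2 exposure states.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SECONDARY_STRUCTURE = ("helix", "coil", "strand")
EXPOSURE = ("buried", "exposed")

# DNA side: 16 dinucleotides x 2 strands x 2 grooves x 2 chemical groups.
NUCLEOTIDES = "ACGT"
STRANDS = ("forward", "reverse")
GROOVES = ("major", "minor")
CHEMICAL_GROUPS = ("nucleobase", "backbone")

protein_states = list(product(AMINO_ACIDS, SECONDARY_STRUCTURE, EXPOSURE))
dinucleotides = ["".join(pair) for pair in product(NUCLEOTIDES, repeat=2)]
dna_states = list(product(dinucleotides, STRANDS, GROOVES, CHEMICAL_GROUPS))
```

Each element of `protein_states` and `dna_states` is one residue–environment combination, so the two lists have 120 and 128 entries, respectively.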

Given a particular interaction between an amino acid “a” and a dinucleotide “mn” (where “m” and “n” can be any nucleotide), we define the statistical potentials “E_pair”, “E_local”, “E_3DC”, and “E_S3DC” as in Section 2.1 by replacing “b” with “mn” in Eqs. (4.2) and (4.3). The contribution of the reference state and the “E” potential are ignored, as are the contributions of the “E_local” terms. On the one hand, the “E_local” contribution of DNA is not considered because, as long as it is accessible, any nucleotide sequence can be bound by a TF (Urnov, Rebar, Holmes, Zhang, & Gregory, 2010) and, as a result, the environment conditions of the base pairs are not relevant. On the other hand, given a TF, the “E_local” term dependent on the protein is always the same when discriminating among different DNA-binding sites and thus, it is irrelevant too. Therefore, we have selected the statistical potential “E_S3DC” to evaluate the prediction of DNA-binding sites for the targets of the DREAM5 challenge.

7.1. Application of split-statistical potentials on DREAM5 targets

As a pilot test, we have applied these split-statistical potentials to predict the PWMs of the 71 DREAM5 targets modeled in Section 6.1.

7.1.1 Split-statistical potentials for protein–DNA interactions

We derived the potentials from a nonredundant set of templates of the TFinDit repository (Turner et al., 2012) (see Section 6.1.1). Specifically, templates were split into chains and redundancy was removed so that any two chains shared less than 35% of protein–DNA contacts. A contact was defined between an amino acid and a dinucleotide if the Cβ atom of the amino acid (Cα for glycines) was at 15 Å or less from the center of the dinucleotide and its complementary bases (i.e., the geometrical center as defined by the four phosphate atoms of the two nucleotides and its associated partners in the complementary DNA strand; see Fig. 4.3B). Figure 4.3 shows how the environmental features used in the description of the statistical potential are calculated. We used 3DNA (Lu & Olson, 2008) to define which DNA residues constituted the reference strand (i.e., forward) and which the complementary (i.e., reverse). Moreover, for calculating the potential, we referred to “mn” as the pair of nucleotides from the reference strand of the dinucleotide. Also, we used the distances between the Cβ atom of the amino acid and the phosphate atoms of each dinucleotide to decide which of the two DNA strands was the closest (see Fig. 4.3C).
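The contact criterion and the choice of the closest strand can be sketched as follows. This is a minimal illustration, not the code used in this work: the function names and toy coordinates are hypothetical, and taking the minimum Cβ–phosphate distance per strand is an assumption about how "closest" is decided.

```python
import math


def centroid(points):
    """Geometrical center of a list of (x, y, z) points."""
    count = len(points)
    return tuple(sum(p[k] for p in points) / count for k in range(3))


def is_contact(cb, forward_p, reverse_p, cutoff=15.0):
    """Contact if the Cβ atom (Cα for glycine) lies within `cutoff` Å of the
    center defined by the four phosphates (two per strand)."""
    center = centroid(list(forward_p) + list(reverse_p))
    return math.dist(cb, center) <= cutoff


def closest_strand(cb, forward_p, reverse_p):
    """Assumption: the closest strand is the one holding the nearest phosphate."""
    d_forward = min(math.dist(cb, p) for p in forward_p)
    d_reverse = min(math.dist(cb, p) for p in reverse_p)
    return "forward" if d_forward <= d_reverse else "reverse"


# Toy coordinates (Å), not taken from any real structure.
forward_p = [(9.0, 0.0, 0.0), (9.0, 0.0, 3.4)]
reverse_p = [(-9.0, 0.0, 0.0), (-9.0, 0.0, 3.4)]
near_cb = (12.0, 0.0, 1.7)
far_cb = (30.0, 0.0, 1.7)
```

In the toy example, `near_cb` is 12 Å from the center and therefore in contact, through the forward strand, while `far_cb` lies beyond the 15 Å cutoff.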

In order to identify the closest DNA groove (see Fig. 4.3A), we adapted a definition of groove widths (El Hassan & Calladine, 1998): First, we selected the closest phosphate from each strand to the Cβ atom of the amino acid; let this be at position “i” for strand S and at position “j” for strand S′. Second, if i < j, we calculated the distances between the phosphate atom at position “i” in S and the phosphate atoms at positions “i+3” (i.e., D_{i+3}) and “i+4” (i.e., D_{i+4}) in S′. Finally, the amino acid was assigned to the major or to the minor groove according to the comparison between D_{i+3} and D_{i+4}. Additionally, if in the second step i > j, instead of calculating the distances to positions “i+3” and “i+4” in S′, we used the distances to “i−3” and “i−4” in S′ and applied the same criterion to select the DNA groove where the amino acid was located.

The interaction between the amino acid and the DNA could be either

with the backbone of the DNA (i.e., any atom of the deoxyribose phos-

phate) or with the nucleobase (i.e., any atom of the nitrogenous base). This

was defined by the minimum distance between the atoms of the amino acid

and the atoms of the dinucleotide. If the closest atom of the nucleotide was

Figure 4.3 Definition of different DNA parameters used for deriving split-statistical potentials. Distances between amino acids and DNA are represented in blue lines; solid when displaying the minimal distance and dashed otherwise. Internal distances in the DNA are shown in orange. Environment features of the DNA for a contact between an amino acid and a dinucleotide at position “i” (see details in text for each definition): groove in contact with the amino acid (A); distance between the amino acid and the dinucleotide (B); strand in contact with the amino acid (C); and DNA chemical group in contact with the amino acid (D). Structural images were created with the UCSF Chimera package (Fraenkel & Pabo, 1998; Pettersen et al., 2004).


any atom of the phosphate group or the deoxyribose, the interaction was

with the backbone; otherwise, it was with the nucleobase (see Fig. 4.3D).
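The backbone/nucleobase assignment can be sketched as a nearest-atom lookup. The example below relies on standard PDB atom names, in which sugar atoms carry a prime and phosphate oxygens are named OP1/OP2 (O1P/O2P in older files). The function name and the input layout are hypothetical, and hydrogen atoms are assumed to be absent.

```python
import math

# Atoms of the deoxyribose phosphate; every other DNA heavy atom is
# treated as part of the nitrogenous base.
BACKBONE_ATOMS = {
    "P", "OP1", "OP2", "O1P", "O2P",                         # phosphate group
    "O5'", "C5'", "C4'", "O4'", "C3'", "O3'", "C2'", "C1'",  # deoxyribose
}


def closest_chemical_group(aa_atoms, dna_atoms):
    """aa_atoms: iterable of (x, y, z) for the amino acid atoms.
    dna_atoms: iterable of (atom_name, (x, y, z)) for the dinucleotide."""
    best_name, best_dist = None, float("inf")
    for name, xyz in dna_atoms:
        for aa_xyz in aa_atoms:
            distance = math.dist(aa_xyz, xyz)
            if distance < best_dist:
                best_name, best_dist = name, distance
    if best_name is None:
        raise ValueError("no atoms given")
    return "backbone" if best_name in BACKBONE_ATOMS else "nucleobase"


# Toy coordinates (Å), not taken from any real structure.
residue = [(0.0, 0.0, 0.0)]
base_first = [("N7", (3.0, 0.0, 0.0)), ("OP1", (5.0, 0.0, 0.0))]
phosphate_first = [("N7", (6.0, 0.0, 0.0)), ("OP1", (5.0, 0.0, 0.0))]
```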

Finally, to make the potentials independent of the arbitrary designation of forward and reverse strand as defined by 3DNA, we also considered, for each protein–DNA contact, the complementary contact (i.e., the contact that would have been created if the reference strand was the complementary). For example, the complementary contact of a certain amino acid with two adenosines, through the forward strand, the major groove and the backbone, would be with two thymidines, through the reverse strand, the major groove and the backbone. This straightforwardly enlarges the knowledge base of interactions and naturally increases the number of amino acid–dinucleotide pairs in the structural database.
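The complementary contact can be generated mechanically from each observed contact, as in the sketch below. The tuple layout and names are hypothetical. For dinucleotides other than the homopolymeric example in the text, the sketch assumes that the dinucleotide is read 5′→3′ on the new reference strand, so that it becomes the reverse complement of the original one.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
OTHER_STRAND = {"forward": "reverse", "reverse": "forward"}


def complementary_contact(contact):
    """contact = (amino_acid, dinucleotide, strand, groove, chemical_group)."""
    amino_acid, dinucleotide, strand, groove, group = contact
    # Assumption: reading the opposite strand 5'->3' gives the reverse
    # complement of the dinucleotide; groove and chemical group are unchanged.
    flipped = "".join(COMPLEMENT[base] for base in reversed(dinucleotide))
    return (amino_acid, flipped, OTHER_STRAND[strand], groove, group)


original = ("ARG", "AA", "forward", "major", "backbone")
mirrored = complementary_contact(original)
```

Applying the function twice returns the original contact, so counting both a contact and its complement does not depend on which strand 3DNA labels as the reference.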

7.1.2 PWM prediction

The PWM of each TF was calculated by adapting the procedure of Xu et al.

(2009) to account for the interaction of amino acids with dinucleotides. We

used the scores of the interaction of the protein with a dinucleotide to cal-

culate the probability of a single nucleotide position in the PWM:

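\[
P(a_i) = \frac{\sum_{m_{i-1}} \exp\left(-\mathrm{PMF}(a, m_{i-1} a_i)\right) + \sum_{m_{i+1}} \exp\left(-\mathrm{PMF}(a, a_i m_{i+1})\right)}{\sum_{n_i} \left[ \sum_{m_{i-1}} \exp\left(-\mathrm{PMF}(a, m_{i-1} n_i)\right) + \sum_{m_{i+1}} \exp\left(-\mathrm{PMF}(a, n_i m_{i+1})\right) \right]}
\]

where “P(a_i)” is the probability of nucleotide “a” at position “i”.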
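The calculation of the PWM from dinucleotide scores can be sketched as follows. This is a minimal illustration under stated assumptions: weights are taken as exp(−PMF), so that lower scores are more favorable, and at the first and last positions only the one available dinucleotide contributes. The scoring function `toy_pmf` is hypothetical and merely stands in for the score of the protein with a dinucleotide placed at a given position.

```python
import math

NUCLEOTIDES = "ACGT"


def pwm_from_dinucleotide_scores(pmf, length):
    """pmf(k, dinuc) -> score of the protein with dinucleotide `dinuc`
    placed at positions (k, k + 1); lower scores are more favorable."""
    if length < 2:
        raise ValueError("at least two positions are needed")

    def weight(i, nt):
        total = 0.0
        if i > 0:           # dinucleotide (i - 1, i): m_{i-1} followed by nt
            total += sum(math.exp(-pmf(i - 1, m + nt)) for m in NUCLEOTIDES)
        if i < length - 1:  # dinucleotide (i, i + 1): nt followed by m_{i+1}
            total += sum(math.exp(-pmf(i, nt + m)) for m in NUCLEOTIDES)
        return total

    pwm = []
    for i in range(length):
        weights = {nt: weight(i, nt) for nt in NUCLEOTIDES}
        norm = sum(weights.values())  # denominator: sum over n_i
        pwm.append({nt: w / norm for nt, w in weights.items()})
    return pwm


def toy_pmf(k, dinuc):
    # Hypothetical scores: only "GT" at positions (1, 2) is favorable.
    return -2.0 if (k == 1 and dinuc == "GT") else 0.0


pwm = pwm_from_dinucleotide_scores(toy_pmf, length=4)
flat = pwm_from_dinucleotide_scores(lambda k, d: 0.0, length=4)
```

With the toy scores, positions 1 and 2 favor G and T, respectively, while a scoring function that returns the same value for every dinucleotide yields a flat PWM.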

The PWMs produced for each modeled TF target were analyzed as in Section 6.1.2 (see Table 4.3). In contrast to 3DTF and PiDNA, we could apply the “E_S3DC” score to all modeled DREAM5 targets, which implies that we covered 13 out of 15 families of TFs (six more families than combining both 3DTF and PiDNA). Moreover, we obtained significant results for five different TFs, three of which could not be retrieved with either 3DTF or PiDNA (see Table 4.3). In Fig. 4.4, we compare the logos produced by “E_S3DC” with the logos produced by 3DTF and PiDNA for three specific DREAM5 targets (Foxo1, Nr2e1, and Tbx20). As observed in Table 4.3, PiDNA produced significant logos for all three targets, while “E_S3DC” and 3DTF each predicted significant logos for two TFs: “E_S3DC” for Foxo1 and Tbx20, and 3DTF for Nr2e1 and Tbx20. Also, “E_S3DC” predicted a logo for Nr2e1 but it was not significant because it failed to predict two out of four nucleotides of the central motif “GTCA” (it could only predict “GT”). The full comparison of “E_S3DC” with 3DTF and PiDNA can be found in Table 4.3 and it shows a reasonable improvement not only in terms of applicability but also in terms of specificity by means of significant matches of the real PWMs.

8. CONCLUSIONS

We have reviewed the use of knowledge-based potentials as a tool for

the analysis of protein–protein, protein–DNA, and protein–RNA interac-

tions. We have explored the general definition of knowledge-based poten-

tials and described the procedure to split them into different energetic terms.

We have extensively discussed the application of statistical potentials in (1)

the evaluation of protein modeling, including homology and integrative

modeling; (2) the evaluation of protein–protein and protein–nucleic acids

docking; (3) the prediction of protein-binding regions; and (4) the charac-

terization of TF-binding sites. Finally, we have provided several resources

available online for docking, ranking scoring functions of interactions,

and benchmark databases.

We have shown that modeling of protein interactions is still limited by

the lack of 3D data. Still, this problem can be addressed using docking

approaches. We have also shown that docking methods benefit from the

prediction of binding sites. In this context, we have proposed two pilots

for predicting binding regions, one for proteins and another for DNA, based

Figure 4.4 Examples of PWM logos. PWM logos for Foxo1, Nr2e1, and Tbx20, as described in the DREAM5 challenge compared with the predictions produced by the statistical potential “E_S3DC” and the state-of-the-art methods 3DTF and PiDNA. Logos were created with the R software environment (Bembom, 2007; R Core Team, 2013).


on split-statistical potentials. On the one hand, we have tested our approach for predicting binding sites in proteins, “BS-E_local”, and we have compared it with another state-of-the-art methodology. This test revealed that, while neither method alone yields significant predictions, their combination improves the significance of the binding regions predicted for a protein. We have been able to achieve PPVs higher than 80% for almost one quarter of the proteins tested in our benchmark (for half of them, the methods produced different results, and for half of the proteins for which this was applied, the result was successful). On the other hand, we have proposed a modification of the split-statistical potentials for protein–DNA interactions. We have applied them to predict DNA-binding sites by modeling the structure of the TF and constructing an artificial PWM logo. Our results were comparable to state-of-the-art methods, such as 3DTF and PiDNA; additionally, we extended the application to several targets of the DREAM5 challenge for which 3DTF and PiDNA could not be applied or did not produce significant results.

We have fathomed the main features involved in TF-binding sites and in the modeling of protein–protein and protein–DNA complexes. In conclusion, despite all the advances in the area, there is still ample room for improvement in the exploitation of statistical potentials, especially in the field of protein–DNA interactions.

ACKNOWLEDGMENTS

O. F. and B. O. acknowledge the support of FEDER BIO2011-22568 grant from the Spanish Ministry of Science and Innovation (MICINN). J.G.G. acknowledges support by “Departament d’Educació i Universitats de la Generalitat de Catalunya i del Fons Social Europeu” through FI fellowships. J.B. is supported by BIO08-0206 grant from MICINN. We are very grateful to Dr. Jun-tao Guo (UNCC) for providing us a comprehensive list of PDB codes for all transcription factors from the TFinDit repository. We are also thankful to Dr. Fernández-Recio (BSC) for providing us the latest version of pyDockODA.

REFERENCES

Ahmad, S., Gromiha, M. M., & Sarai, A. (2004). Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics, 20(4), 477–486. http://dx.doi.org/10.1093/bioinformatics/btg432.

Alamanova, D., Stegmaier, P., & Kel, A. (2010). Creating PWMs of transcription factors using 3D structure-based computation of protein-DNA free binding energies. BMC Bioinformatics, 11(1), 225. http://dx.doi.org/10.1186/1471-2105-11-225.

Alber, F., Dokudovskaya, S., Veenhoff, L. M., Zhang, W., Kipper, J., & Devos, D. (2007). Determining the architectures of macromolecular assemblies. Nature, 450(7170), 683–694. http://dx.doi.org/10.1038/nature06404.


Alber, F., Förster, F., Korkin, D., Topf, M., & Sali, A. (2008). Integrating diverse data for structure determination of macromolecular assemblies. Annual Review of Biochemistry, 77(1), 443–477. http://dx.doi.org/10.1146/annurev.biochem.77.060407.135530.

Aloy, P., & Oliva, B. (2009). Splitting statistical potentials into meaningful scoring functions: Testing the prediction of near-native structures from decoy conformations. BMC Structural Biology, 9(1), 71. http://dx.doi.org/10.1186/1472-6807-9-71.

Aloy, P., & Russell, R. B. (2003). InterPreTS: Protein interaction prediction through tertiary structure. Bioinformatics, 19(1), 161–162. http://dx.doi.org/10.1093/bioinformatics/19.1.161.

AlQuraishi, M., & McAdams, H. H. (2013). Three enhancements to the inference of statistical protein-DNA potentials. Proteins: Structure, Function, and Bioinformatics, 81(3), 426–442. http://dx.doi.org/10.1002/prot.24201.

Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17), 3389–3402. http://dx.doi.org/10.1093/nar/25.17.3389.

Amos-Binks, A., Patulea, C., Pitre, S., Schoenrock, A., Gui, Y., & Green, J. R. (2011). Binding site prediction for protein-protein interactions and novel motif discovery using re-occurring polypeptide sequences. BMC Bioinformatics, 12(1), 225. http://dx.doi.org/10.1186/1471-2105-12-225.

Angarica, V. E., Pérez, A. G., Vasconcelos, A. T., Collado-Vides, J., & Contreras-Moreira, B. (2008). Prediction of TF target sites based on atomistic models of protein-DNA complexes. BMC Bioinformatics, 9(1), 436. http://dx.doi.org/10.1186/1471-2105-9-436.

Ashkenazy, H., Erez, E., Martz, E., Pupko, T., & Ben-Tal, N. (2010). ConSurf 2010: Calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucleic Acids Research, 38(Suppl. 2), W529–W533. http://dx.doi.org/10.1093/nar/gkq399.

Axenopoulos, A., Daras, P., Papadopoulos, G. E., & Houstis, E. N. (2013). SP-dock: Protein-protein docking using shape and physicochemical complementarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(1), 135–150. http://dx.doi.org/10.1109/TCBB.2012.149.

Bailey, T. L., Boden, M., Buske, F. A., Frith, M., Grant, C. E., & Clementi, L. (2009). MEME Suite: Tools for motif discovery and searching. Nucleic Acids Research, 37(Suppl. 2), W202–W208. http://dx.doi.org/10.1093/nar/gkp335.

Barik, A., Nithin, C., Manasa, P., & Bahadur, R. P. (2012). A protein-RNA docking benchmark (I): Nonredundant cases. Proteins: Structure, Function, and Bioinformatics, 80(7), 1866–1871. http://dx.doi.org/10.1002/prot.24083.

Baù, D., Sanyal, A., Lajoie, B. R., Capriotti, E., Byron, M., & Lawrence, J. B. (2011). The three-dimensional folding of the α-globin gene domain reveals formation of chromatin globules. Nature Structural & Molecular Biology, 18(1), 107–114. http://dx.doi.org/10.1038/nsmb.1936.

Bembom, O. (2007). seqLogo: Sequence logos for DNA sequence alignments.

Benos, P. V., Bulyk, M. L., & Stormo, G. D. (2002). Additivity in protein-DNA interactions: How good an approximation is it? Nucleic Acids Research, 30(20), 4442–4451. http://dx.doi.org/10.1093/nar/gkf578.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., & Weissig, H. (2000). The Protein Data Bank. Nucleic Acids Research, 28(1), 235–242. http://dx.doi.org/10.1093/nar/28.1.235.

Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S., & Karplus, M. (1983). CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry, 4(2), 187–217. http://dx.doi.org/10.1002/jcc.540040211.


Brylinski, M., & Skolnick, J. (2008). A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation. Proceedings of the National Academy of Sciences of the United States of America, 105(1), 129–134. http://dx.doi.org/10.1073/pnas.0707684105.

Bulyk, M. L. (2003). Computational prediction of transcription-factor binding site locations. Genome Biology, 5(1), 201. http://dx.doi.org/10.1186/gb-2003-5-1-201.

Carson, M. B., Langlois, R., & Lu, H. (2010). NAPS: A residue-level nucleic acid-binding prediction server. Nucleic Acids Research, 38(Suppl. 2), W431–W435. http://dx.doi.org/10.1093/nar/gkq361.

Chen, C.-Y., Chien, T.-Y., Lin, C.-K., Lin, C.-W., Weng, Y.-Z., & Chang, D. (2012). Predicting target DNA sequences of DNA-binding proteins based on unbound structures. PLoS One, 7(2), e30446. http://dx.doi.org/10.1371/journal.pone.0030446.

Chen, Y., Kortemme, T., Robertson, T., Baker, D., & Varani, G. (2004). A new hydrogen-bonding potential for the design of protein-RNA interactions predicts specific contacts and discriminates decoys. Nucleic Acids Research, 32(17), 5147–5162. http://dx.doi.org/10.1093/nar/gkh785.

Chen, H., & Skolnick, J. (2008). M-TASSER: An algorithm for protein quaternary structure prediction. Biophysical Journal, 94(3), 918–928. http://dx.doi.org/10.1529/biophysj.107.114280.

Chen, Y. C., Wright, J. D., & Lim, C. (2012). DR_bind: A web server for predicting DNA-binding residues from the protein structure based on electrostatics, evolution and geometry. Nucleic Acids Research, 40(W1), W249–W256. http://dx.doi.org/10.1093/nar/gks481.

Cheng, T. M.-K., Blundell, T. L., & Fernandez-Recio, J. (2007). pyDock: Electrostatics and desolvation for effective scoring of rigid-body protein-protein docking. Proteins: Structure, Function, and Bioinformatics, 68(2), 503–515. http://dx.doi.org/10.1002/prot.21419.

Comeau, S. R., Gatchell, D. W., Vajda, S., & Camacho, C. J. (2004a). ClusPro: A fully automated algorithm for protein-protein docking. Nucleic Acids Research, 32(Suppl. 2), W96–W99. http://dx.doi.org/10.1093/nar/gkh354.

Comeau, S. R., Gatchell, D. W., Vajda, S., & Camacho, C. J. (2004b). ClusPro: An automated docking and discrimination method for the prediction of protein complexes. Bioinformatics, 20(1), 45–50. http://dx.doi.org/10.1093/bioinformatics/btg371.

Cornell, W. D., Cieplak, P., Bayly, C. I., Gould, I. R., Merz, K. M., & Ferguson, D. M. (1995). A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. Journal of the American Chemical Society, 117(19), 5179–5197. http://dx.doi.org/10.1021/ja00124a002.

Das, M. K., & Dai, H.-K. (2007). A survey of DNA motif finding algorithms. BMC Bioinformatics, 8(Suppl. 7), S21. http://dx.doi.org/10.1186/1471-2105-8-S7-S21.

De Vries, S. J., van Dijk, M., & Bonvin, A. M. J. J. (2010). The HADDOCK web server for data-driven biomolecular docking. Nature Protocols, 5(5), 883–897. http://dx.doi.org/10.1038/nprot.2010.32.

Dobbins, S. E., Lesk, V. I., & Sternberg, M. J. E. (2008). Insights into protein flexibility: The relationship between normal modes and conformational change upon protein-protein docking. Proceedings of the National Academy of Sciences of the United States of America, 105(30), 10390–10395. http://dx.doi.org/10.1073/pnas.0802496105.

Dominguez, C., Boelens, R., & Bonvin, A. M. J. J. (2003). HADDOCK: A protein-protein docking approach based on biochemical or biophysical information. Journal of the American Chemical Society, 125(7), 1731–1737. http://dx.doi.org/10.1021/ja026939x.

Dunbrack, R. L., Jr. (2006). Sequence comparison and protein structure prediction. Current Opinion in Structural Biology, 16(3), 374–384. http://dx.doi.org/10.1016/j.sbi.2006.05.006.

El Hassan, M., & Calladine, C. (1998). Two distinct modes of protein-induced bending in DNA. Journal of Molecular Biology, 282(2), 331–343. http://dx.doi.org/10.1006/jmbi.1998.1994.


Eswar, N., Webb, B., Marti-Renom, M. A., Madhusudhan, M., Eramian, D., Shen, M.-Y., et al. (2006). Comparative Protein Structure Modeling Using Modeller. Current Protocols in Bioinformatics, 15, 5.6.1–5.6.30.

Feig, M., Karanicolas, J., & Brooks, C. L., III (2004). MMTSB Tool Set: Enhanced sampling and multiscale modeling methods for applications in structural biology. Journal of Molecular Graphics and Modelling, 22(5), 377–395. http://dx.doi.org/10.1016/j.jmgm.2003.12.005.

Feliu, E., Aloy, P., & Oliva, B. (2011). On the analysis of protein-protein interactions via knowledge-based potentials for the prediction of protein-protein docking. Protein Science, 20(3), 529–541. http://dx.doi.org/10.1002/pro.585.

Feliu, E., & Oliva, B. (2010). How different from random are docking predictions when ranked by scoring functions? Proteins: Structure, Function, and Bioinformatics, 78(16), 3376–3385. http://dx.doi.org/10.1002/prot.22844.

Fernandez-Recio, J., Totrov, M., Skorodumov, C., & Abagyan, R. (2005). Optimal docking area: A new method for predicting protein-protein interaction sites. Proteins: Structure, Function, and Bioinformatics, 58(1), 134–143. http://dx.doi.org/10.1002/prot.20285.

Ferrada, E., & Melo, F. (2009). Effective knowledge-based potentials. Protein Science, 18(7), 1469–1485. http://dx.doi.org/10.1002/pro.166.

Fraenkel, E., & Pabo, C. O. (1998). Comparison of X-ray and NMR structures for the Antennapedia homeodomain-DNA complex. Nature Structural & Molecular Biology, 5(8), 692–697. http://dx.doi.org/10.1038/1382.

Gabb, H. A., Jackson, R. M., & Sternberg, M. J. E. (1997). Modelling protein docking using shape complementarity, electrostatics and biochemical information. Journal of Molecular Biology, 272(1), 106–120. http://dx.doi.org/10.1006/jmbi.1997.1203.

Gabdoulline, R., Eckweiler, D., Kel, A., & Stegmaier, P. (2012). 3DTF: A web server for predicting transcription factor PWMs using 3D structure-based energy calculations. Nucleic Acids Research, 40(W1), W180–W185. http://dx.doi.org/10.1093/nar/gks551.

Gao, M., & Skolnick, J. (2008). DBD-Hunter: A knowledge-based method for the prediction of DNA-protein interactions. Nucleic Acids Research, 36(12), 3978–3992. http://dx.doi.org/10.1093/nar/gkn332.

Gao, M., & Skolnick, J. (2009). A threading-based method for the prediction of DNA-binding proteins with application to the human genome. PLoS Computational Biology, 5(11), e1000567. http://dx.doi.org/10.1371/journal.pcbi.1000567.

Gao, M., & Skolnick, J. (2010). Structural space of protein-protein interfaces is degenerate, close to complete, and highly connected. Proceedings of the National Academy of Sciences of the United States of America, 107(52), 22517–22522. http://dx.doi.org/10.1073/pnas.1012820107.

Garcia-Garcia, J., Bonet, J., Guney, E., Fornes, O., Planas, J., & Oliva, B. (2012). Networks of protein-protein interactions: From uncertainty to molecular details. Molecular Informatics, 31(5), 342–362. http://dx.doi.org/10.1002/minf.201200005.

Garcia-Garcia, J., Schleker, S., Klein-Seetharaman, J., & Oliva, B. (2012). BIPS: BIANA Interolog Prediction Server. A tool for protein-protein interaction inference. Nucleic Acids Research, 40(W1), W147–W151. http://dx.doi.org/10.1093/nar/gks553.

Garzon, J. I., López-Blanco, J. R., Pons, C., Kovacs, J., Abagyan, R., Fernandez-Recio, J., et al. (2009). FRODOCK: A new approach for fast rotational protein-protein docking. Bioinformatics, 25(19), 2544–2551. http://dx.doi.org/10.1093/bioinformatics/btp447.

Gaulton, A., Bellis, L. J., Bento, A. P., Chambers, J., Davies, M., & Hersey, A. (2011). ChEMBL: A large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(D1), D1100–D1107. http://dx.doi.org/10.1093/nar/gkr777.

Ginalski, K. (2006). Comparative modeling for protein structure prediction. Current Opinion in Structural Biology, 16(2), 172–177. http://dx.doi.org/10.1016/j.sbi.2006.02.003.


Gitter, A., Siegfried, Z., Klutstein, M., Fornes, O., Oliva, B., Simon, I., et al. (2009). Backup in gene regulatory networks explains differences between binding and knockout results. Molecular Systems Biology, 5(1), 276. http://dx.doi.org/10.1038/msb.2009.33.

Glover, J. N. M., & Harrison, S. C. (1995). Crystal structure of the heterodimeric bZIP transcription factor c-Fos-c-Jun bound to DNA. Nature, 373(6511), 257–261. http://dx.doi.org/10.1038/373257a0.

Grau, J., Posch, S., Grosse, I., & Keilwagen, J. (2013). A general approach for discriminative de novo motif discovery from high-throughput data. Nucleic Acids Research, 41(21), e197. http://dx.doi.org/10.1093/nar/gkt831.

Gray, J. J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C. A., et al. (2003). Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. Journal of Molecular Biology, 331(1), 281–299. http://dx.doi.org/10.1016/S0022-2836(03)00670-3.

Gu, S., Koehl, P., Hass, J., & Amenta, N. (2012). Surface-histogram: A new shape descriptor for protein-protein docking. Proteins: Structure, Function, and Bioinformatics, 80(1), 221–238. http://dx.doi.org/10.1002/prot.23192.

Guerois, R., Nielsen, J. E., & Serrano, L. (2002). Predicting changes in the stability of proteins and protein complexes: A study of more than 1000 mutations. Journal of Molecular Biology, 320(2), 369–387. http://dx.doi.org/10.1016/S0022-2836(02)00442-4.

Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L., & Noble, W. S. (2007). Quantifying similarity between motifs. Genome Biology, 8(2), R24. http://dx.doi.org/10.1186/gb-2007-8-2-r24.

Hu, S., Xie, Z., Onishi, A., Yu, X., Jiang, L., & Lin, J. (2009). Profiling the human protein-DNA interactome reveals ERK2 as a transcriptional repressor of interferon signaling. Cell, 139(3), 610–622. http://dx.doi.org/10.1016/j.cell.2009.08.037.

Huang, S.-Y., & Zou, X. (2013). A nonredundant structure dataset for benchmarking protein-RNA computational docking. Journal of Computational Chemistry, 34(4), 311–318. http://dx.doi.org/10.1002/jcc.23149.

Hwang, S., Gou, Z., & Kuznetsov, I. B. (2007). DP-Bind: A web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics, 23(5), 634–636. http://dx.doi.org/10.1093/bioinformatics/btl672.

Hwang, H., Pierce, B., Mintseris, J., Janin, J., & Weng, Z. (2008). Protein-protein docking benchmark version 3.0. Proteins: Structure, Function, and Bioinformatics, 73(3), 705–709. http://dx.doi.org/10.1002/prot.22106.

Hwang, H., Vreven, T., Janin, J., & Weng, Z. (2010). Protein-protein docking benchmark version 4.0. Proteins: Structure, Function, and Bioinformatics, 78(15), 3111–3114. http://dx.doi.org/10.1002/prot.22830.

Janin, J. (2010). Protein-protein docking tested in blind predictions: The CAPRI experiment. Molecular BioSystems, 6(12), 2351–2362. http://dx.doi.org/10.1039/C005060C.

Janin, J., Henrick, K., Moult, J., Eyck, L. T., Sternberg, M. J. E., & Vajda, S. (2003). CAPRI: A Critical Assessment of PRedicted Interactions. Proteins: Structure, Function, and Bioinformatics, 52(1), 2–9. http://dx.doi.org/10.1002/prot.10381.

Jiménez-García, B., Pons, C., & Fernández-Recio, J. (2013). pyDockWEB: A web server for rigid-body protein-protein docking using electrostatics and desolvation scoring. Bioinformatics, 29(13), 1698–1699. http://dx.doi.org/10.1093/bioinformatics/btt262.

Jones, S., & Thornton, J. M. (1997). Analysis of protein-protein interaction sites using surface patches. Journal of Molecular Biology, 272(1), 121–132. http://dx.doi.org/10.1006/jmbi.1997.1234.

Katchalski-Katzir, E., Shariv, I., Eisenstein, M., Friesem, A. A., Aflalo, C., & Vakser, I. A. (1992). Molecular surface recognition: Determination of geometric fit between proteins and their ligands by correlation techniques. Proceedings of the National Academy of Sciences of the United States of America, 89(6), 2195–2199.

Kim, R., Corona, R. I., Hong, B., & Guo, J. (2011). Benchmarks for flexible and rigid transcription factor-DNA docking. BMC Structural Biology, 11(1), 45. http://dx.doi.org/10.1186/1472-6807-11-45.

Kirsanov, D. D., Zanegina, O. N., Aksianov, E. A., Spirin, S. A., Karyagina, A. S., & Alexeevski, A. V. (2012). NPIDB: Nucleic acid-protein interaction database. Nucleic Acids Research, 41(D1), D517–D523. http://dx.doi.org/10.1093/nar/gks1199.

Knegtel, R. M. A., Antoon, J., Rullmann, C., Boelens, R., & Kaptein, R. (1994). MONTY: A Monte Carlo approach to protein-DNA recognition. Journal of Molecular Biology, 235(1), 318–324. http://dx.doi.org/10.1016/S0022-2836(05)80035-X.

Knox, C., Law, V., Jewison, T., Liu, P., Ly, S., & Frolkis, A. (2011). DrugBank 3.0: A comprehensive resource for “Omics” research on drugs. Nucleic Acids Research, 39(Suppl. 1), D1035–D1041. http://dx.doi.org/10.1093/nar/gkq1126.

Kozakov, D., Brenke, R., Comeau, S. R., & Vajda, S. (2006). PIPER: An FFT-based protein docking program with pairwise potentials. Proteins: Structure, Function, and Bioinformatics, 65(2), 392–406. http://dx.doi.org/10.1002/prot.21117.

Kumar, M., Gromiha, M. M., & Raghava, G. P. (2007). Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinformatics, 8(1), 463. http://dx.doi.org/10.1186/1471-2105-8-463.

Kumar, M., Gromiha, M. M., & Raghava, G. P. S. (2011). SVM based prediction of RNA-binding proteins using binding residues and evolutionary information. Journal of Molecular Recognition, 24(2), 303–313. http://dx.doi.org/10.1002/jmr.1061.

Lasker, K., Phillips, J. L., Russel, D., Velázquez-Muriel, J., Schneidman-Duhovny, D., & Tjioe, E. (2010). Integrative structure modeling of macromolecular assemblies from proteomics data. Molecular & Cellular Proteomics, 9(8), 1689–1702. http://dx.doi.org/10.1074/mcp.R110.000067.

Lasker, K., Sali, A., & Wolfson, H. J. (2010). Determining macromolecular assembly structures by molecular docking and fitting into an electron density map. Proteins: Structure, Function, and Bioinformatics, 78(15), 3205–3211. http://dx.doi.org/10.1002/prot.22845.

Lee, H., Li, Z., Silkov, A., Fischer, M., Petrey, D., Honig, B., et al. (2010). High-throughput computational structure-based characterization of protein families: START domains and implications for structural genomics. Journal of Structural and Functional Genomics, 11(1), 51–59. http://dx.doi.org/10.1007/s10969-010-9086-7.

Lensink, M. F., & Wodak, S. J. (2010). Docking and scoring protein interactions: CAPRI 2009. Proteins: Structure, Function, and Bioinformatics, 78(15), 3073–3084. http://dx.doi.org/10.1002/prot.22818.

Lesk, V. I., & Sternberg, M. J. E. (2008). 3D-Garden: A system for modelling protein-protein complexes based on conformational refinement of ensembles generated with the marching cubes algorithm. Bioinformatics, 24(9), 1137–1144. http://dx.doi.org/10.1093/bioinformatics/btn093.

Lin, C.-K., & Chen, C.-Y. (2013). PiDNA: Predicting protein-DNA interactions with structural models. Nucleic Acids Research, 41(W1), W523–W530. http://dx.doi.org/10.1093/nar/gkt388.

Liu, Z., Guo, J.-T., Li, T., & Xu, Y. (2008). Structure-based prediction of transcription factor binding sites using a protein-DNA docking approach. Proteins: Structure, Function, and Bioinformatics, 72(4), 1114–1124. http://dx.doi.org/10.1002/prot.22002.

Lu, H., Lu, L., & Skolnick, J. (2003). Development of unified statistical potentials describing protein-protein interactions. Biophysical Journal, 84(3), 1895–1901. http://dx.doi.org/10.1016/S0006-3495(03)74997-2.

Lu, X.-J., & Olson, W. K. (2008). 3DNA: A versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nature Protocols, 3(7), 1213–1227. http://dx.doi.org/10.1038/nprot.2008.104.

Luscombe, N. M., & Thornton, J. M. (2002). Protein-DNA interactions: Amino acid conservation and the effects of mutations on binding specificity. Journal of Molecular Biology, 320(5), 991–1009. http://dx.doi.org/10.1016/S0022-2836(02)00571-5.

Lyskov, S., & Gray, J. J. (2008). The RosettaDock server for local protein-protein docking. Nucleic Acids Research, 36(Suppl. 2), W233–W238. http://dx.doi.org/10.1093/nar/gkn216.

Macindoe, G., Mavridis, L., Venkatraman, V., Devignes, M.-D., & Ritchie, D. W. (2010). HexServer: An FFT-based protein docking server powered by graphics processors. Nucleic Acids Research, 38(Suppl. 2), W445–W449. http://dx.doi.org/10.1093/nar/gkq311.

Mashiach, E., Schneidman-Duhovny, D., Andrusier, N., Nussinov, R., & Wolfson, H. J. (2008). FireDock: A web server for fast interaction refinement in molecular docking. Nucleic Acids Research, 36(Suppl. 2), W229–W232. http://dx.doi.org/10.1093/nar/gkn186.

Matthews, L. R., Vaglio, P., Reboul, J., Ge, H., Davis, B. P., & Garrels, J. (2001). Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”. Genome Research, 11(12), 2120–2126. http://dx.doi.org/10.1101/gr.205301.

Mintseris, J., Pierce, B., Wiehe, K., Anderson, Robert, Chen, R., & Weng, Z. (2007). Integrating statistical pair potentials into protein complex prediction. Proteins: Structure, Function, and Bioinformatics, 69(3), 511–520. http://dx.doi.org/10.1002/prot.21502.

Miyazawa, S., & Jernigan, R. L. (1985). Estimation of effective interresidue contact energies from protein crystal structures: Quasi-chemical approximation. Macromolecules, 18(3), 534–552. http://dx.doi.org/10.1021/ma00145a039.

Moal, I. H., & Bates, P. A. (2010). SwarmDock and the use of normal modes in protein-protein docking. International Journal of Molecular Sciences, 11(10), 3623–3648. http://dx.doi.org/10.3390/ijms11103623.

Moal, I. H., Torchala, M., Bates, P. A., & Fernández-Recio, J. (2013). The scoring of poses in protein-protein docking: Current capabilities and future directions. BMC Bioinformatics, 14(1), 286. http://dx.doi.org/10.1186/1471-2105-14-286.

Moont, G., Gabb, H. A., & Sternberg, M. J. E. (1999). Use of pair potentials across protein interfaces in screening predicted docked complexes. Proteins: Structure, Function, and Bioinformatics, 35(3), 364–373. http://dx.doi.org/10.1002/(SICI)1097-0134(19990515)35:3<364::AID-PROT11>3.0.CO;2-4.

Mosca, R., Céol, A., & Aloy, P. (2013). Interactome3D: Adding structural details to protein networks. Nature Methods, 10(1), 47–53. http://dx.doi.org/10.1038/nmeth.2289.

Mosca, R., Céol, A., Stein, A., Olivella, R., & Aloy, P. (2013). 3did: A catalog of domain-based interactions of known three-dimensional structure. Nucleic Acids Research, 42(D1), D374–D379. http://dx.doi.org/10.1093/nar/gkt887.

Nimrod, G., Schushan, M., Szilágyi, A., Leslie, C., & Ben-Tal, N. (2010). iDBPs: A web server for the identification of DNA binding proteins. Bioinformatics, 26(5), 692–693. http://dx.doi.org/10.1093/bioinformatics/btq019.

Ozbek, P., Soner, S., Erman, B., & Haliloglu, T. (2010). DNABINDPROT: Fluctuation-based predictor of DNA-binding residues within a network of interacting residues. Nucleic Acids Research, 38(Suppl. 2), W417–W423. http://dx.doi.org/10.1093/nar/gkq396.

Pandit, S. B., Brylinski, M., Zhou, H., Gao, M., Arakaki, A. K., & Skolnick, J. (2010). PSiFR: An integrated resource for prediction of protein structure and function. Bioinformatics, 26(5), 687–688. http://dx.doi.org/10.1093/bioinformatics/btq006.

Panjkovich, A., Melo, F., & Marti-Renom, M. A. (2008). Evolutionary potentials: Structure specific knowledge-based potentials exploiting the evolutionary record of sequence homologs. Genome Biology, 9(4), R68. http://dx.doi.org/10.1186/gb-2008-9-4-r68.

Parisien, M., Freed, K. F., & Sosnick, T. R. (2012). On docking, scoring and assessing protein-DNA complexes in a rigid-body framework. PLoS One, 7(2), e32647. http://dx.doi.org/10.1371/journal.pone.0032647.

Pérez-Cano, L., & Fernández-Recio, J. (2010). Optimal protein-RNA area, OPRA: A propensity-based method to identify RNA-binding sites on proteins. Proteins: Structure, Function, and Bioinformatics, 78(1), 25–35. http://dx.doi.org/10.1002/prot.22527.

Pérez-Cano, L., Jiménez-García, B., & Fernández-Recio, J. (2012). A protein-RNA docking benchmark (II): Extended set from experimental and homology modeling data. Proteins: Structure, Function, and Bioinformatics, 80(7), 1872–1882. http://dx.doi.org/10.1002/prot.24075.

Pérez-Cano, L., Solernou, A., Pons, C., & Fernández-Recio, J. (2010). Structural prediction of protein-RNA interaction by computational docking with propensity-based statistical potentials. Pacific Symposium on Biocomputing, 15, 269–280.

Pettersen, E. F., Goddard, T. D., Huang, C. C., Couch, G. S., Greenblatt, D. M., Meng, E. C., et al. (2004). UCSF chimera: A visualization system for exploratory research and analysis. Journal of Computational Chemistry, 25(13), 1605–1612. http://dx.doi.org/10.1002/jcc.20084.

Pieper, U., Webb, B. M., Barkan, D. T., Schneidman-Duhovny, D., Schlessinger, A., & Braberg, H. (2011). ModBase, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Research, 39(Suppl. 1), D465–D474. http://dx.doi.org/10.1093/nar/gkq1091.

Pierce, B., & Weng, Z. (2007). ZRANK: Reranking protein docking predictions with an optimized energy function. Proteins: Structure, Function, and Bioinformatics, 67(4), 1078–1086. http://dx.doi.org/10.1002/prot.21373.

Pierce, B., & Weng, Z. (2008). A combination of rescoring and refinement significantly improves protein docking performance. Proteins: Structure, Function, and Bioinformatics, 72(1), 270–279. http://dx.doi.org/10.1002/prot.21920.

Planas-Iglesias, J., Bonet, J., Marín-López, M. A., Feliu, E., Gursoy, A., & Oliva, B. (2012). Structural bioinformatics of proteins: Predicting the tertiary and quaternary structure of proteins from sequence. In W. Cai (Ed.), Protein-protein interactions: Computational and experimental tools. http://www.intechopen.com/books/protein-protein-interactions-computational-and-experimental-tools/structural-bioinformatics-of-proteins-predicting-the-tertiary-and-quaternary-structure-of-proteins-f.

Pons, C., Talavera, D., de la Cruz, X., Orozco, M., & Fernandez-Recio, J. (2011). Scoring by intermolecular pairwise propensities of exposed residues (SIPPER): A new efficient potential for protein-protein docking. Journal of Chemical Information and Modeling, 51(2), 370–377. http://dx.doi.org/10.1021/ci100353e.

Poulain, P., Saladin, A., Hartmann, B., & Prévost, C. (2008). Insights on protein-DNA recognition by coarse grain modelling. Journal of Computational Chemistry, 29(15), 2582–2592. http://dx.doi.org/10.1002/jcc.21014.

R Core Team. (2013). R: A language and environment for statistical computing. Vienna, Austria.

Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 16(6), 276–277. http://dx.doi.org/10.1016/S0168-9525(00)02024-2.

Ritchie, D. W., & Kemp, G. J. L. (2000). Protein docking using spherical polar Fourier correlations. Proteins: Structure, Function, and Bioinformatics, 39(2), 178–194. http://dx.doi.org/10.1002/(SICI)1097-0134(20000501)39:2<178::AID-PROT8>3.0.CO;2-6.

Roberts, V. A., Thompson, E. E., Pique, M. E., Perez, M. S., & Ten Eyck, L. F. (2013). DOT2: Macromolecular docking with improved biophysical models. Journal of Computational Chemistry, 34(20), 1743–1758. http://dx.doi.org/10.1002/jcc.23304.

Robertson, T. A., & Varani, G. (2007). An all-atom, distance-dependent scoring function for the prediction of protein–DNA interactions from structure. Proteins: Structure, Function, and Bioinformatics, 66(2), 359–374. http://dx.doi.org/10.1002/prot.21162.

Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12(2), 85–94. http://dx.doi.org/10.1093/protein/12.2.85.

Russel, D., Lasker, K., Webb, B., Velázquez-Muriel, J., Tjioe, E., & Schneidman-Duhovny, D. (2012). Putting the pieces together: Integrative modeling platform software for structure determination of macromolecular assemblies. PLoS Biology, 10(1), e1001244. http://dx.doi.org/10.1371/journal.pbio.1001244.

Schneider, S., Saladin, A., Fiorucci, S., Prévost, C., & Zacharias, M. (2012). ATTRACT and PTOOLS: Open source programs for protein-protein docking. In R. Baron (Ed.), Computational drug discovery and design (pp. 221–232). New York: Springer. http://link.springer.com/protocol/10.1007/978-1-61779-465-0_15.

Schneidman-Duhovny, D., Hammel, M., & Sali, A. (2011). Macromolecular docking restrained by a small angle X-ray scattering profile. Journal of Structural Biology, 173(3), 461–471. http://dx.doi.org/10.1016/j.jsb.2010.09.023.

Schneidman-Duhovny, D., Inbar, Y., Nussinov, R., & Wolfson, H. J. (2005). PatchDock and SymmDock: Servers for rigid and symmetric docking. Nucleic Acids Research, 33(Suppl. 2), W363–W367. http://dx.doi.org/10.1093/nar/gki481.

Schrödinger, L. (2010). The PyMOL molecular graphics system (Version 1.3r1).

Sharan, R., Ulitsky, I., & Shamir, R. (2007). Network-based prediction of protein function. Molecular Systems Biology, 3(1). http://dx.doi.org/10.1038/msb4100129.

Shen, Y., Paschalidis, I. C., Vakili, P., & Vajda, S. (2008). Protein docking by the underestimation of free energy funnels in the space of encounter complexes. PLoS Computational Biology, 4(10), e1000191. http://dx.doi.org/10.1371/journal.pcbi.1000191.

Shen, M., & Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Science, 15(11), 2507–2524. http://dx.doi.org/10.1110/ps.062416606.

Shentu, Z., Al Hasan, M., Bystroff, C., & Zaki, M. J. (2008). Context shapes: Efficient complementary shape matching for protein-protein docking. Proteins: Structure, Function, and Bioinformatics, 70(3), 1056–1073. http://dx.doi.org/10.1002/prot.21600.

Si, J., Zhang, Z., Lin, B., Schroeder, M., & Huang, B. (2011). MetaDBSite: A meta approach to improve protein DNA-binding sites prediction (Report No. Suppl. 1) (p. S7). BioMed Central Ltd. http://www.biomedcentral.com/1752-0509/5/S1/S7/abstract.

Simon, B., Madl, T., Mackereth, C. D., Nilges, M., & Sattler, M. (2010). An efficient protocol for NMR-spectroscopy-based structure determination of protein complexes in solution. Angewandte Chemie, International Edition, 49(11), 1967–1970. http://dx.doi.org/10.1002/anie.200906147.

Sippl, M. J. (1990). Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. Journal of Molecular Biology, 213(4), 859–883.

Stein, A., Céol, A., & Aloy, P. (2011). 3did: Identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Research, 39(Suppl. 1), D718–D723. http://dx.doi.org/10.1093/nar/gkq962.

Stein, A., Rueda, M., Panjkovich, A., Orozco, M., & Aloy, P. (2011). A systematic study of the energetics involved in structural changes upon association and connectivity in protein interaction networks. Structure, 19(6), 881–889. http://dx.doi.org/10.1016/j.str.2011.03.009.

Takeda, T., Corona, R. I., & Guo, J. (2013). A knowledge-based orientation potential for transcription factor-DNA docking. Bioinformatics, 29(3), 322–330. http://dx.doi.org/10.1093/bioinformatics/bts699.

Tjong, H., & Zhou, H.-X. (2007). DISPLAR: An accurate method for predicting DNA-binding sites on protein surfaces. Nucleic Acids Research, 35(5), 1465–1477. http://dx.doi.org/10.1093/nar/gkm008.

Torchala, M., Moal, I. H., Chaleil, R. A. G., Fernandez-Recio, J., & Bates, P. A. (2013). SwarmDock: A server for flexible protein-protein docking. Bioinformatics, 29(6), 807–809. http://dx.doi.org/10.1093/bioinformatics/btt038.

Tovchigrechko, A., & Vakser, I. A. (2006). GRAMM-X public web server for protein-protein docking. Nucleic Acids Research, 34(Web Server issue), W310–W314. http://dx.doi.org/10.1093/nar/gkl206.

Tuncbag, N., Gursoy, A., Guney, E., Nussinov, R., & Keskin, O. (2008). Architectures and functional coverage of protein-protein interfaces. Journal of Molecular Biology, 381(3), 785–802. http://dx.doi.org/10.1016/j.jmb.2008.04.071.

Tuncbag, N., Gursoy, A., Nussinov, R., & Keskin, O. (2011). Predicting protein-protein interactions on a proteome scale by matching evolutionary and structural similarities at interfaces using PRISM. Nature Protocols, 6(9), 1341–1354. http://dx.doi.org/10.1038/nprot.2011.367.

Turner, D., Kim, R., & Guo, J. (2012). TFinDit: Transcription factor-DNA interaction data depository. BMC Bioinformatics, 13(1), 220. http://dx.doi.org/10.1186/1471-2105-13-220.

Tuszynska, I., & Bujnicki, J. M. (2011). DARS-RNP and QUASI-RNP: New statistical potentials for protein-RNA docking. BMC Bioinformatics, 12(1), 348. http://dx.doi.org/10.1186/1471-2105-12-348.

Urnov, F. D., Rebar, E. J., Holmes, M. C., Zhang, H. S., & Gregory, P. D. (2010). Genome editing with engineered zinc finger nucleases. Nature Reviews Genetics, 11(9), 636–646. http://dx.doi.org/10.1038/nrg2842.

Vajda, S., & Kozakov, D. (2009). Convergence and combination of methods in protein-protein docking. Current Opinion in Structural Biology, 19(2), 164–170. http://dx.doi.org/10.1016/j.sbi.2009.02.008.

Valdar, W. S. J., & Thornton, J. M. (2001). Protein-protein interfaces: Analysis of amino acid conservation in homodimers. Proteins: Structure, Function, and Bioinformatics, 42(1), 108–124. http://dx.doi.org/10.1002/1097-0134(20010101)42:1<108::AID-PROT110>3.0.CO;2-O.

van Dijk, M., & Bonvin, A. M. J. J. (2008). A protein-DNA docking benchmark. Nucleic Acids Research, 36(14), e88. http://dx.doi.org/10.1093/nar/gkn386.

van Dijk, M., & Bonvin, A. M. J. J. (2010). Pushing the limits of what is achievable in protein-DNA docking: Benchmarking HADDOCK’s performance. Nucleic Acids Research, 38(17), 5634–5647. http://dx.doi.org/10.1093/nar/gkq222.

van Dijk, M., Visscher, K. M., Kastritis, P. L., & Bonvin, A. M. J. J. (2013). Solvated protein-DNA docking using HADDOCK. Journal of Biomolecular NMR, 56(1), 51–63. http://dx.doi.org/10.1007/s10858-013-9734-x.

Venkatraman, V., Yang, Y. D., Sael, L., & Kihara, D. (2009). Protein-protein docking using region-based 3D Zernike descriptors. BMC Bioinformatics, 10(1), 407. http://dx.doi.org/10.1186/1471-2105-10-407.

Wang, L., & Brown, S. J. (2006). BindN: A web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Research, 34(Suppl. 2), W243–W248. http://dx.doi.org/10.1093/nar/gkl298.

Wang, L., Huang, C., Yang, M. Q., & Yang, J. Y. (2010). BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Systems Biology, 4(Suppl. 1), S3. http://dx.doi.org/10.1186/1752-0509-4-S1-S3.

Watson, J. D., Laskowski, R. A., & Thornton, J. M. (2005). Predicting protein function from sequence and structural data. Current Opinion in Structural Biology, 15(3), 275–284. http://dx.doi.org/10.1016/j.sbi.2005.04.003.

Weirauch, M. T., Cote, A., Norel, R., Annala, M., Zhao, Y., & Riley, T. R. (2013). Evaluation of methods for modeling transcription factor sequence specificity. Nature Biotechnology, 31(2), 126–134. http://dx.doi.org/10.1038/nbt.2486.

Wiederstein, M., & Sippl, M. J. (2007). ProSA-web: Interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Research, 35(Suppl. 2), W407–W410. http://dx.doi.org/10.1093/nar/gkm290.

Wodak, S. J., & Janin, J. (1978). Computer analysis of protein-protein interaction. Journal of Molecular Biology, 124(2), 323–342. http://dx.doi.org/10.1016/0022-2836(78)90302-9.

Xie, Z., Hu, S., Qian, J., Blackshaw, S., & Zhu, H. (2011). Systematic characterization of protein-DNA interactions. Cellular and Molecular Life Sciences, 68(10), 1657–1668. http://dx.doi.org/10.1007/s00018-010-0617-y.

Xu, B., Yang, Y., Liang, H., & Zhou, Y. (2009). An all-atom knowledge-based energy function for protein-DNA threading, docking decoy discrimination, and prediction of transcription-factor binding profiles. Proteins: Structure, Function, and Bioinformatics, 76(3), 718–730. http://dx.doi.org/10.1002/prot.22384.

Yu, X., Cao, J., Cai, Y., Shi, T., & Li, Y. (2006). Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. Journal of Theoretical Biology, 240(2), 175–184. http://dx.doi.org/10.1016/j.jtbi.2005.09.018.

Zhang, C., Liu, S., Zhu, Q., & Zhou, Y. (2005). A knowledge-based energy function for protein-ligand, protein-protein, and protein-DNA complexes. Journal of Medicinal Chemistry, 48(7), 2325–2335. http://dx.doi.org/10.1021/jm049314d.

Zhang, Q. C., Petrey, D., Deng, L., Qiang, L., Shi, Y., & Thu, C. A. (2012). Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature, 490(7421), 556–560. http://dx.doi.org/10.1038/nature11503.

Zhang, Q. C., Petrey, D., Norel, R., & Honig, B. H. (2010). Protein interface conservation across structure space. Proceedings of the National Academy of Sciences of the United States of America, 107(24), 10896–10901. http://dx.doi.org/10.1073/pnas.1005894107.

Zhang, Y., & Skolnick, J. (2004). Automated structure prediction of weakly homologous proteins on a genomic scale. Proceedings of the National Academy of Sciences of the United States of America, 101(20), 7594–7599. http://dx.doi.org/10.1073/pnas.0305695101.

Zhang, Y., & Skolnick, J. (2005). TM-align: A protein structure alignment algorithm based on the TM-score. Nucleic Acids Research, 33(7), 2302–2309. http://dx.doi.org/10.1093/nar/gki524.

Zhao, H., Yang, Y., & Zhou, Y. (2010). Structure-based prediction of DNA-binding proteins by structural alignment and a volume-fraction corrected DFIRE-based energy function. Bioinformatics, 26(15), 1857–1863. http://dx.doi.org/10.1093/bioinformatics/btq295.

Zhao, H., Yang, Y., & Zhou, Y. (2011). Structure-based prediction of RNA-binding domains and RNA-binding sites and application to structural genomics targets. Nucleic Acids Research, 39(8), 3017–3025. http://dx.doi.org/10.1093/nar/gkq1266.

Zheng, S., Robertson, T. A., & Varani, G. (2007). A knowledge-based potential function predicts the specificity and relative binding energy of RNA-binding proteins. FEBS Journal, 274(24), 6378–6391. http://dx.doi.org/10.1111/j.1742-4658.2007.06155.x.

Zhou, H., & Skolnick, J. (2013). FINDSITEcomb: A threading/structure-based, proteomic-scale virtual ligand screening approach. Journal of Chemical Information and Modeling, 53(1), 230–240. http://dx.doi.org/10.1021/ci300510n.

Zhou, H., & Zhou, Y. (2002). Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Science, 11(11), 2714–2726. http://dx.doi.org/10.1110/ps.0217002.

ModCRE: a structure homology-modeling approach to predict TF binding in cis-regulatory elements

Preprint

Full-text available

Apr 2022

Transcription factor (TF) binding is a key component of genomic regulation. There are numerous high-throughput experimental methods to characterize TF-DNA binding specificities. Their application, however, is both laborious and expensive, which makes profiling all TFs challenging. For instance, the binding preferences of ~25% human TFs remain unknown; they neither have been determined experimentally nor inferred computationally. Here, we introduce ModCRE, a web server implementing a structure homology-modelling approach to predict TF motifs and automatically model higher-order TF regulatory complexes. Starting from a TF sequence or structure, ModCRE predicts a set of motifs for that TF. The predicted motifs are then used to scan the DNA for occurrences of each of them, and the best matches are either profiled with a binding score or collected for their subsequent modeling into a higher-order regulatory complex with DNA, as well as other TFs and co-factors. Moreover, we demonstrate that incorporating high-throughput TF binding data, such as from protein binding microarrays, addresses the protein-DNA structure scarcity problem for deriving statistical potentials. In turn, these statistical potentials are proven to be capable predictors of TF motifs. We also show the conditional advantage of using ModCRE over a nearest-neighbor approach for predicting TF binding sites as well as an improvement in prediction accuracy when using a rank-enrichment selection system. Finally, as case examples, we apply ModCRE to model the interferon beta enhanceosome and the complex of SOX2 and 11 with a nucleosome.

Deciphering the RRM-RNA recognition code: A computational analysis

Article

Full-text available

Jan 2023
PLOS COMPUT BIOL

RNA recognition motifs (RRM) are the most prevalent class of RNA binding domains in eukaryotes. Their RNA binding preferences have been investigated for almost two decades, and even though some RRM domains are now very well described, their RNA recognition code has remained elusive. An increasing number of experimental structures of RRM-RNA complexes has become available in recent years. Here, we perform an in-depth computational analysis to derive an RNA recognition code for canonical RRMs. We present and validate a computational scoring method to estimate the binding between an RRM and a single stranded RNA, based on structural data from a carefully curated multiple sequence alignment, which can predict RRM binding RNA sequence motifs based on the RRM protein sequence. Given the importance and prevalence of RRMs in humans and other species, this tool could help design RNA binding motifs with uses in medical or synthetic biology applications, leading towards the de novo design of RRMs with specific RNA recognition.

Emergence of a unique SARS‐CoV‐2 Delta sub‐cluster harboring a constellation of co‐appearing non‐Spike mutations

Article

Full-text available

Dec 2022

Accumulation of diverse mutations across the structural and non-structural genes is leading to rapid evolution of SARS-CoV-2, altering its pathogenicity. We performed whole genome sequencing of 239 SARS-CoV-2 RNA samples collected from both adult and pediatric patients across eastern India (West Bengal), during the second pandemic wave in India (April-May 2021). In addition to several common spike mutations within the Delta variant, a unique constellation of 8 co-appearing non-spike mutations was identified, which revealed a high degree of positive mutual correlation. Our results also demonstrated the dynamics of SARS-CoV-2 variants among unvaccinated pediatric patients. 41.4% of our studied Delta strains harbored this signature set of 8 co-appearing non-spike mutations and phylogenetically out-clustered other Delta sub-lineages like 21J, 21A or 21I. This is the first report from eastern India that portrayed a landscape of co-appearing mutations in the non-Spike proteins, which might have led to the evolution of a distinct Delta sub-cluster. Accumulation of such mutations in SARS-CoV-2 may lead to the emergence of “vaccine-evading variants”. Hence, monitoring of such non-Spike mutations will be significant in the formulation of any future vaccines against those SARS-CoV-2 variants that might evade the current vaccine-induced immunity, among both the pediatric and adult populations. This article is protected by copyright. All rights reserved.

Structure-based learning to predict and model protein–DNA interactions and transcription-factor co-operativity in cis -regulatory elements

Article

Jun 2024

Transcription factor (TF) binding is a key component of genomic regulation. There are numerous high-throughput experimental methods to characterize TF–DNA binding specificities. Their application, however, is both laborious and expensive, which makes profiling all TFs challenging. For instance, the binding preferences of ∼25% human TFs remain unknown; they neither have been determined experimentally nor inferred computationally. We introduce a structure-based learning approach to predict the binding preferences of TFs and the automated modelling of TF regulatory complexes. We show the advantage of using our approach over the classical nearest-neighbor prediction in the limits of remote homology. Starting from a TF sequence or structure, we predict binding preferences in the form of motifs that are then used to scan a DNA sequence for occurrences. The best matches are either profiled with a binding score or collected for their subsequent modeling into a higher-order regulatory complex with DNA. Co-operativity is modelled by: (i) the co-localization of TFs and (ii) the structural modeling of protein–protein interactions between TFs and with co-factors. We have applied our approach to automatically model the interferon-β enhanceosome and the pioneering complexes of OCT4, SOX2 (or SOX11) and KLF4 with a nucleosome, which are compared with the experimentally known structures.

MR2CPPIS: Accurate prediction of protein-protein interaction sites based on multi-scale Res2Net with coordinate attention mechanism

Article

May 2024
COMPUT BIOL MED

Protein-Protein Interaction Site Prediction Based on Attention Mechanism and Convolutional Neural Networks

Article

Oct 2023

Proteins usually perform their cellular functions by interacting with other proteins. Accurate identification of protein-protein interaction sites (PPIs) from sequence is important for designing new drugs and developing novel therapeutics. Many computational models for PPIs prediction have been developed because experimental methods are slow and expensive. Most models employ a sliding window approach in which local neighbors are concatenated to represent a target residue. However, those neighbors are not distinguished by pairwise information between a neighbor and the target. In this study, we propose a novel PPIs prediction model, AttCNNPPISP, which combines an attention mechanism and convolutional neural networks (CNNs). The attention mechanism dynamically captures the pairwise correlation of each neighbor-target pair within a sliding window, and therefore gives a better description of the local environment of the target residue. The CNNs then take this local representation as input to make predictions. Experiments are conducted on several public benchmark datasets. Compared with the state-of-the-art models, AttCNNPPISP improves the prediction performance. The experimental results also demonstrate that the attention mechanism is effective in constructing comprehensive context information for the target residue.

PCPI: Prediction of circRNA and Protein Interaction Using Machine Learning Method

Chapter

Oct 2023

Circular RNA (circRNA) is an RNA molecule that differs from linear RNA in having a covalently closed loop structure. CircRNAs can act as miRNA sponges and can interact with RNA-binding proteins. Previous studies have revealed that circRNAs play an important role in the development of different diseases. The biological functions of circRNAs can be investigated with the help of circRNA-protein interactions. Due to scarce circRNA data, long circRNA sequences and the sparsely distributed binding sites on circRNAs, far fewer efforts have been made to study circRNA-protein interactions than interactions between linear RNA and protein. With the increase in experimental data on circRNA, machine learning methods have recently been widely used for predicting circRNA-protein interactions. The existing methods use either the RNA sequence or the protein sequence for predicting the binding sites. In this paper, we present a new method, PCPI (Predicting CircRNA and Protein Interaction), to predict the interaction between circRNA and protein using a support vector machine (SVM) classifier. We have used both the RNA and protein sequences to predict their interaction. The circRNA sequences were converted into pseudo peptide sequences based on codon translation. The pseudo peptide and the protein sequences were classified based on dipole moments and the volume of the side chains. The 3-mers of the classified sequences were used as features for training the model. Several machine learning models were used for classification. Comparing their performances, we selected the SVM classifier for predicting circRNA-protein interaction. Our method achieved 93% prediction accuracy.

A CNN-LSTM Ensemble Model for Predicting Protein-Protein Interaction Binding Sites

Article

Aug 2023
IEEE ACM T COMPUT BI

Proteins commonly perform biological functions through protein-protein interactions (PPIs). The knowledge of PPI sites is imperative for the understanding of protein functions, disease mechanisms, and drug design. Traditional biological experimental methods for studying PPI sites still have considerable drawbacks, including long experimental time and high labor costs. Therefore, many computational methods have been proposed for predicting PPI sites. However, achieving high prediction performance and overcoming severe data imbalance remain challenging issues. In this paper, we propose a new sequence-based deep learning model called CLPPIS (standing for CNN-LSTM ensemble based PPI Sites prediction). CLPPIS consists of CNN and LSTM components, which can capture spatial features and sequential features simultaneously. Further, it utilizes a novel feature group as input, which has 7 physicochemical, biophysical, and statistical properties. Besides, it adopts a batch-weighted loss function to reduce the interference of imbalanced data. Our work suggests that the integration of protein spatial features and sequential features provides important information for PPI sites prediction. Evaluation on three public benchmark datasets shows that our CLPPIS model significantly outperforms existing state-of-the-art methods.

Protein-Protein Interaction Sites Prediction Using Batch Normalization Based CNNs and Oversampling Method Borderline-SMOTE

Article

Jan 2023

The recognition of protein-protein interaction sites (PPIs) is beneficial for the interpretation of protein functions and the development of new drugs. Traditional biological experiments to identify PPI sites are expensive and inefficient, which has motivated various computational methods to predict PPIs. However, the accurate prediction of PPI sites remains a big challenge because of the sample imbalance issue. In this work, we design a novel model that combines convolutional neural networks (CNNs) with Batch Normalization to predict PPI sites, and employ the oversampling technique Borderline-SMOTE to address the sample imbalance issue. In particular, to better characterize the amino acid residues on the protein chains, we employ a sliding window approach for feature extraction of target residues and their contextual residues. We verify the effectiveness of our method by comparing it with the existing state-of-the-art schemes. On three public datasets, our method achieves accuracies of 88.6%, 89.9%, and 86.7%, respectively, all improved over the existing schemes. Moreover, the ablation experiment results suggest that Batch Normalization can greatly improve the generalization and the prediction stability of our model.

MSE-CapsPPISP: Spatial Hierarchical Protein-Protein Interaction Sites Prediction Using Squeeze-and-Excitation Capsule Networks

Conference Paper

Dec 2022

The scoring of poses in protein-protein docking: Current capabilities and future directions

Article

Oct 2013
BMC BIOINFORMATICS

Protein-protein docking, which aims to predict the structure of a protein-protein complex from its unbound components, remains an unresolved challenge in structural bioinformatics. An important step is the ranking of docked poses using a scoring function, for which many methods have been developed. There is a need to explore the differences and commonalities of these methods with each other, as well as with functions developed in the fields of molecular dynamics and homology modelling. We present an evaluation of 115 scoring functions on an unbound docking decoy benchmark covering 118 complexes for which a near-native solution can be found, yielding top 10 success rates of up to 58%. Hierarchical clustering is performed, so as to group together functions which identify near-natives in similar subsets of complexes. Three set theoretic approaches are used to identify pairs of scoring functions capable of correctly scoring different complexes. This shows that functions in different clusters capture different aspects of binding and are likely to work together synergistically. All functions designed specifically for docking perform well, indicating that functions are transferable between sampling methods. We also identify promising methods from the field of homology modelling. Further, differential success rates by docking difficulty and solution quality suggest a need for flexibility-dependent scoring. Investigating pairs of scoring functions, the set theoretic measures identify known scoring strategies as well as a number of novel approaches, indicating promising augmentations of traditional scoring methods. Such augmentation and parameter combination strategies are discussed in the context of the learning-to-rank paradigm.

Protein-protein interfaces: Analysis of amino acid conservation in homodimers

Article

Jan 2001

Evolutionary information derived from the large number of available protein sequences and structures could powerfully guide both analysis and prediction of protein–protein interfaces. To test the relevance of this information, we assess the conservation of residues at protein–protein interfaces compared with other residues on the protein surface. Six homodimer families are analyzed: alkaline phosphatase, enolase, glutathione S-transferase, copper-zinc superoxide dismutase, Streptomyces subtilisin inhibitor, and triose phosphate isomerase. For each family, random simulation is used to calculate the probability (P value) that the level of conservation observed at the interface occurred by chance. The results show that interface conservation is higher than expected by chance and usually statistically significant at the 5% level or better. The effect on the P values of using different definitions of the interface and of excluding active site residues is discussed. Proteins 2001;42:108–124. © 2000 Wiley-Liss, Inc.

ATTRACT and PTOOLS: Open Source Programs for Protein–Protein Docking

Chapter

Nov 2012

Use of pair potentials across protein interfaces in screening predicted docked complexes

Article

May 1999
PROTEINS

Empirical residue–residue pair potentials are used to screen possible complexes for protein–protein dockings. A correct docking is defined as a complex with not more than 2.5 Å root-mean-square distance from the known experimental structure. The complexes were generated by “ftdock” (Gabb et al. J Mol Biol 1997;272:106–120) that ranks using shape complementarity. The complexes studied were 5 enzyme-inhibitors and 2 antibody-antigens, starting from the unbound crystallographic coordinates, with a further 2 antibody-antigens where the antibody was from the bound crystallographic complex. The pair potential functions tested were derived both from observed intramolecular pairings in a database of nonhomologous protein domains, and from observed intermolecular pairings across the interfaces in sets of nonhomologous heterodimers and homodimers. Out of various alternate strategies, we found the optimal method used a mole-fraction calculated random model from the intramolecular pairings. For all the systems, a correct docking was placed within the top 12% of the pair potential score ranked complexes. A combined strategy was developed that incorporated “multidock,” a side-chain refinement algorithm (Jackson et al. J Mol Biol 1998;276:265–285). This placed a correct docking within the top 5 complexes for enzyme-inhibitor systems, and within the top 40 complexes for antibody–antigen systems. Proteins 1999;35:364–373. © 1999 Wiley-Liss, Inc.

Protein docking using spherical polar Fourier correlations

Article

May 2000
PROTEINS

We present a new computational method of docking pairs of proteins by using spherical polar Fourier correlations to accelerate the search for candidate low-energy conformations. Interaction energies are estimated using a hydrophobic excluded volume model derived from the notion of “overlapping surface skins,” augmented by a rigorous but “soft” model of electrostatic complementarity. This approach has several advantages over former three-dimensional grid-based fast Fourier transform (FFT) docking correlation methods even though there is no analogue to the FFT in a spherical polar representation. For example, a complete search over all six rigid-body degrees of freedom can be performed by rotating and translating only the initial expansion coefficients, many infeasible orientations may be eliminated rapidly using only low-resolution terms, and the correlations are easily localized around known binding epitopes when this knowledge is available. Typical execution times on a single processor workstation range from 2 hours for a global search (5 × 10⁸ trial orientations) to a few minutes for a local search (over 6 × 10⁷ orientations). The method is illustrated with several domain dimer and enzyme–inhibitor complexes and 20 large antibody–antigen complexes, using both the bound and (when available) unbound subunits. The correct conformation of the complex is frequently identified when docking bound subunits, and a good docking orientation is ranked within the top 20 in 11 out of 18 cases when starting from unbound subunits. Proteins 2000;39:178–194. © 2000 Wiley-Liss, Inc.

UNIT 2.9: Comparative protein structure modeling using MODELLER

Article

Jan 2007

UCSF Chimera—A visualization system for exploratory research and analysis

Article

Jan 2004

DrugBank 3.0: a comprehensive resource for 'omics' research on drugs

Article

Jan 2011
NUCLEIC ACIDS RES

The Protein Data Bank

Article

Jan 2000

Helen Berman

The Protein Data Bank (PDB; http://www.rcsb.org/pdb/) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.
