Constructing Markov State Models to elucidate the functional conformational changes of complex biomolecules

October 2017
Wiley interdisciplinary reviews: Computational Molecular Science. 6(1):e1343

DOI:10.1002/wcms.1343

Authors:

Wei Wang

The Hong Kong University of Science and Technology

Lizhe Zhu

The Chinese University of Hong Kong - Shenzhen

The function of complex biomolecular machines relies heavily on their conformational changes. Investigating these functional conformational changes is therefore essential for understanding the corresponding biological processes and promoting bioengineering applications and rational drug design. Constructing Markov State Models (MSMs) based on large‐scale molecular dynamics simulations has emerged as a powerful approach to model functional conformational changes of the biomolecular system with sufficient resolution in both time and space. However, the rapid development of theory and algorithms for constructing MSMs has made it difficult for nonexperts to understand and apply the MSM framework, necessitating a comprehensive guidance toward its theory and practical usage. In this study, we introduce the MSM theory of conformational dynamics based on the projection operator scheme. We further propose a general protocol of constructing MSM to investigate functional conformational changes, which integrates the state‐of‐the‐art techniques for building and optimizing initial pathways, performing adaptive sampling and constructing MSMs. We anticipate this protocol to be widely applied and useful in guiding nonexperts to study the functional conformational changes of large biomolecular systems via the MSM framework. We also discuss the current limitations of MSMs and some alternative methods to alleviate them. WIREs Comput Mol Sci 2018, 8:e1343. doi: 10.1002/wcms.1343 This article is categorized under: Structure and Mechanism > Computational Biochemistry and Biophysics Theoretical and Physical Chemistry > Statistical Mechanics

The quality of putative path is important for the adaptive sampling scheme. (a) The translocation process of RNA Polymerase II: Isomap representation of the initial paths generated by the Climber algorithm and the samples from the final Markov State Model (MSM) (colored dots). The MSM samples clearly deviate from the initial paths, indicating the necessity of path optimization before the adaptive sampling and MSM construction (Figure adapted with permission from Ref 37. Copyright 2014 National Academy of Sciences, USA). (b) The initial path can be optimized via the string method, as exemplified by the study of activation pathway of c-Src kinase: the initial targeted molecular dynamics (MD) path can be optimized using limited amount of sampling (Figure adapted with permission from Ref 79. Copyright 2009 Elsevier).

An example of choosing the input structural features for the time-lagged independent component analysis (tICA) analysis. In all subgraphs (a)-(f), the left panel demonstrates the set of atoms (blue) among which the pair-wise distances are selected as input features for the tICA analysis; the right panel plots the implied timescales (ITS) of the 1000 state MSM built by k-centers clustering on the slowest four tICs shown in the left panel. The correlation lag time for tICA is 40 ns. The Markov State Model (MSM) lag time is 8 ns. The error bars of ITS of the MSMs are calculated by 100 times of bootstrapping experiments on all molecular dynamics (MD) trajectories. The distance set (f) is chosen as the optimal one, because the top MSM ITS is the highest among all sets yet with sufficiently less number of input distances (Figure adapted with permission from Ref 2. Copyright 2016 Nature Publishing Group).

Markov state model identifies key intermediate states along activation pathways of c-Src kinase. (a) Crystal structures of inactive (left) and active (right) states of c-Src. The differences lie in the activation loop (A-loop; red), C-helix (orange), and switching of electrostatic network among Lys295, Glu310, Arg409, and Tyr416. (b) Two intermediate states are identified on the potential of mean force calculated based on the stationary population of a 2000-state microstate Markov State Model (MSM) over two reaction coordinates: root-mean-square distance (RMSD) of A-loop residues and difference of distance between residue pairs E310-R409 and K295-E310. (c) The variation of four structural metrics along a long trajectory, synthesized from the MSM via the kinetic Monte Carlo scheme, provides a rough estimate of the timescale of the activation and deactivation processes. Here inactive state, active state, intermediated state I 1 and I 2 are shown in magenta, blue, green, and black, respectively (Figure adapted with permission from Ref 4. Copyright 2014 Nature Publishing Group).

Comparison of different kinetic lumping methods for 1-residue alanine dipeptide, 35-residue villin headpiece, and 263-residue β-lactamase systems. (a) Crystal structures of alanine dipeptide (all-atom), villin (ribbon), and β-lactamase (ribbon). (b) Bayes factor of five lumping methods (less negative means better model). (c) Metastability values of the five lumping methods (larger value means better model) (Figure reprinted with permission from Ref 114. Copyright 2013 AIP Publishing LLC).

Advanced Review

Constructing Markov State Models to elucidate the functional conformational changes of complex biomolecules

Wei Wang,1,2 Siqin Cao,1† Lizhe Zhu1,2† and Xuhui Huang1,2,3,4


How to cite this article:

WIREs Comput Mol Sci 2017, e1343. doi: 10.1002/wcms.1343

†These authors contributed equally to this work.

*Correspondence to: xuhuihuang@ust.hk

Department of Chemistry, The Hong Kong University of Science and Technology, Kowloon, Hong Kong

Center of Systems Biology and Human Health, The Hong Kong University of Science and Technology, Kowloon, Hong Kong

Hong Kong Branch of Chinese National Engineering Research Center for Tissue Restoration & Reconstruction, The Hong Kong University of Science and Technology, Kowloon, Hong Kong

HKUST-Shenzhen Research Institute, Shenzhen, China

Conflict of interest: The authors have declared no conflicts of interest for this article.

INTRODUCTION

Conformational changes of complex biomolecules are indispensable features of their function. Investigating the functional conformational changes of biomacromolecules is thus essential not only for revealing the mechanisms of the corresponding biological processes,1–3 but also for rational drug design4–6 and various biotechnological applications. However, direct investigation of functional dynamics by experiments remains challenging, because it is still difficult for current experimental techniques7–12 to reach atomic resolution in both space and time. Molecular dynamics (MD) simulation has therefore emerged as a powerful approach to complement experiments, since it can simulate the motions of all atoms in the biomolecular system with a time resolution as fine as femtoseconds. Nonetheless, it is still difficult to study complex biomolecular systems directly through brute-force MD simulations. This is because large-scale conformational changes of large biomolecules, e.g., RNA Polymerase II (more than 300,000 atoms in explicit water), typically occur on submillisecond timescales, beyond the affordable length of MD simulations. This timescale gap can be bridged either by accelerating MD via advanced hardware13–16 or by adopting advanced sampling17–20 and analysis techniques such as replica exchange MD (REMD), metadynamics, transition path sampling, milestoning, accelerated MD, and Markov State Models (MSMs).26–28

MSM approaches represent a powerful theoretical framework that has been widely applied in the past decade to study protein folding29–33 and functional conformational dynamics1–3,34–44 of many biomolecular systems. In an MSM, a statistical model is built to reach the timescales involved in the functional conformational changes between known structures, based on an ensemble of short trajectories initiated from different regions of the free energy landscape of the system. This feature of MSMs allows highly parallelized sampling and a systematic and statistical description of the system under study.

As MSMs have become increasingly popular, new algorithms for constructing and validating MSMs have also been continuously developed in recent years, necessitating a comprehensive review of these new advances.45–52 Since most existing reviews about MSMs are theory- and algorithm-oriented, here we aim to provide a systematic guide toward their practical usage, particularly in the context of studying functional conformational changes of biomolecules. After a brief introduction to the basic theory of MSMs, we suggest a detailed protocol for nonexperts who plan to apply MSMs to investigate functional conformational changes in biomolecular systems. All commonly used methods in our protocol and their representative cutting-edge applications are reviewed. Finally, we discuss the limitations of MSMs, as well as alternative methods and future developments that may alleviate these limitations.

THE BASIC THEORY OF MSMs

From a microscopic point of view, the kinetics of any system can be precisely predicted by Liouville’s equation. However, for biomolecules whose dynamics span multiple timescales, this equation is too complex for practical use. A natural solution is then to adopt a macroscopic perspective: focusing on the slow degrees of freedom (DOFs) that dominate the dynamics while ignoring the less relevant fast DOFs. Constructing MSMs is one popular approach to achieve this. In this section, we adopt a projection operator scheme proposed by Zwanzig and Mori to derive the basic equations of the MSM.

Liouville’s equation for the phase space distribution reads $\partial\rho(\Gamma; t)/\partial t = \mathcal{L}\rho(\Gamma; t)$, where the Liouville operator $\mathcal{L}$ contains all the information of the dynamic system and $\Gamma = (x; p) = (x_1, \ldots, x_N, p_1, \ldots, p_N)$. In a discrete time sequence, $t = n\tau$, the evolution of the distribution function follows

$$\rho(\Gamma; t+\tau) = e^{\mathcal{L}\tau}\rho(\Gamma; t) \qquad (1)$$

Here the propagator $e^{\mathcal{L}\tau}$ has to obey the detailed balance condition under equilibrium conditions, $\langle f_i(\Gamma)|e^{\mathcal{L}\tau}|f_j(\Gamma)\rangle_{\rho(\Gamma;\mathrm{eq})} = \langle f_j(\Gamma)|e^{\mathcal{L}\tau}|f_i(\Gamma)\rangle_{\rho(\Gamma;\mathrm{eq})}$, meaning that the transition from j to i equals the transition from i to j under the ensemble average taken with the equilibrium distribution function $\rho(\Gamma;\mathrm{eq})$.

Although $e^{\mathcal{L}\tau}$ is defined in a high-dimensional space to describe the complete dynamics of the system, there exist separations of timescales in the dynamics underlying functional conformational changes, and elucidating the slowest dynamic modes of $e^{\mathcal{L}\tau}$ is often sufficient to understand the overall mechanisms of these conformational changes. One can, therefore, project the full dynamics onto a reduced space of slow dynamics $|\chi\rangle$ via the Mori–Zwanzig projection operator, forming a kinetic network model of the original full dynamics.

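The Mori–Zwanzig projection operator reads $\mathbb{P} := \sum_{j} |\rho(\Gamma;\mathrm{eq})\,\chi_j(x)\rangle\,\pi_j^{-1}\,\langle\chi_j(x)|$, where the sum runs over the states and $|\chi\rangle$ consists of the indicator functions that define the state of each region of the configuration space (i.e., $\chi_i(x) = 1$ (or 0) implies that the configuration space region x belongs (or does not belong) to state i). $\pi_j$ is the stationary population of state j. The kinetics can then be projected onto the reduced space $|\chi\rangle$ and satisfies the Nakajima–Zwanzig equation,

$$\frac{\partial}{\partial t}\mathbb{P}\rho(\Gamma; t) = \mathbb{P}\mathcal{L}\mathbb{P}\rho(\Gamma; t) + \int_0^t dt'\,\mathbb{P}\mathcal{L}\,e^{\mathbb{Q}\mathcal{L}(t-t')}\,\mathbb{Q}\mathcal{L}\mathbb{P}\rho(\Gamma; t') + \mathbb{P}\mathcal{L}\,e^{\mathbb{Q}\mathcal{L}t}\,\mathbb{Q}\rho(\Gamma; 0) \qquad (2)$$

where the non-Markovian term (second term on the right) is the result of the fast kinetics of the system, which is related to $\mathbb{Q} = 1 - \mathbb{P}$.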

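Advanced Review wires.wiley.com/compmolsci

When time is discretized as $t = n\tau$, the kinetics in the reduced space becomes

$$\mathbb{P}\rho(\Gamma; n\tau) = \left(\mathbb{P}e^{\mathcal{L}\tau}\right)^{n}\mathbb{P}\rho(\Gamma; 0) + \mathbb{P}e^{\mathcal{L}\tau}\sum_{m=1}^{n-1}\mathcal{P}\left[\left(\mathbb{P}e^{\mathcal{L}\tau}\right)^{n-m-1}, \left(\mathbb{Q}e^{\mathcal{L}\tau}\right)^{m}\right]\mathbb{P}\rho(\Gamma; 0) + \mathbb{P}e^{n\mathcal{L}\tau}\mathbb{Q}\rho(\Gamma; 0) \qquad (3)$$

where $\mathcal{P}$ sums up all the permutations of the $n-m-1$ terms of $\mathbb{P}e^{\mathcal{L}\tau}$ and the $m$ terms of $\mathbb{Q}e^{\mathcal{L}\tau}$. By assuming a separation between the fast and slow kinetics such that $\mathbb{P}e^{\mathcal{L}\tau}\mathbb{Q} \approx 0$ at a lag time $\tau_T$, a master equation of the states can be derived, also known as the MSM:

$$p^{\mathsf{T}}(t+\tau_T) \approx p^{\mathsf{T}}(t)\,T(\tau_T) \qquad (4)$$

Here $T_{mk}(\tau) = \pi_m^{-1}\langle\chi_k(x)|e^{\mathcal{L}\tau}|\chi_m(x)\rangle_{\rho(\Gamma;\mathrm{eq})}$ is the probability of transition from state m to state k over the lag time $\tau$. The resulting matrix $T(\tau_T)$ is the transition probability matrix (TPM). $p_k(t) := \langle\chi_k(x)|\rho(\Gamma; t)\rangle$ is the probability of the system being in state k. Due to the property of the equilibrium state, T should satisfy the detailed balance condition $\pi_m T_{mk}(\tau) = \pi_k T_{km}(\tau)$. We note that the MSM can also be derived from the framework of the variational principle.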
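
As a minimal numerical sketch of Eq. (4) and of the detailed balance condition, the snippet below propagates a state distribution with a small transition probability matrix. The 3-state matrix, the helper names `stationary_distribution` and `propagate`, and the number of steps are illustrative assumptions and are not taken from any study discussed here.

```python
import numpy as np

# Hypothetical 3-state transition probability matrix T(tau); each row sums to 1.
T = np.array([
    [0.90, 0.10, 0.00],
    [0.05, 0.90, 0.05],
    [0.00, 0.10, 0.90],
])

def stationary_distribution(T):
    """Left eigenvector of T with the largest eigenvalue, normalized to sum to 1."""
    evals, evecs = np.linalg.eig(T.T)
    idx = np.argmax(evals.real)
    pi = np.abs(evecs[:, idx].real)
    return pi / pi.sum()

def propagate(p0, T, n_steps):
    """Apply the row-vector update p(t + tau) = p(t) T(tau) n_steps times."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_steps):
        p = p @ T
    return p

pi = stationary_distribution(T)
flux = pi[:, None] * T                      # flux[m, k] = pi_m * T_mk
p_late = propagate([1.0, 0.0, 0.0], T, 500)

print("stationary populations:", pi)
print("detailed balance holds:", np.allclose(flux, flux.T))
print("relaxed distribution:  ", p_late)
```

Starting from a distribution concentrated in one state, repeated application of the TPM relaxes the populations toward the stationary distribution, and the symmetric flux matrix confirms that this particular toy matrix is reversible.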

As the kinetics of a system can be modeled by many different MSMs, one often needs to assess the quality of an MSM and select the one that best represents the original kinetics. One commonly used approach for such quality assessment is to apply the variational principle, which states that, for any given trial function $f(\Gamma)$,

$$\lambda_i \ge \hat{\lambda}_i = \frac{\langle f(\Gamma)|e^{\mathcal{L}\tau}|f(\Gamma)\rangle_{\rho(\Gamma;\mathrm{eq})}}{\langle f(\Gamma)|f(\Gamma)\rangle_{\rho(\Gamma;\mathrm{eq})}} \qquad (5)$$

where $\lambda_i$ is an eigenvalue of $e^{\mathcal{L}\tau}$ and $\hat{\lambda}_i$ is its estimate obtained from the trial function; equality holds only when $f(\Gamma)$ is exactly the corresponding eigenfunction of $e^{\mathcal{L}\tau}$. In other words, a good MSM should preserve the top eigenvalues of $e^{\mathcal{L}\tau}$ as closely as possible. In practice, this can be quantified by the sum of the top eigenvalues [called the generalized matrix Rayleigh quotient (GMRQ)58,59]. In fact, apart from MSMs, the variational principle can also be used to understand many methods, such as the time-lagged (or time-structure based) independent component analysis (tICA, TICA)60–62 and the core-set MSM.

PROTOCOL OF BUILDING MSMs TO ELUCIDATE FUNCTIONAL CONFORMATIONAL CHANGES

The field of MSMs has experienced a rapid development in the past decade, including advancement in both post analysis and sampling strategies. In this section, we propose a general protocol, particularly in the context of studying the functional dynamics of biomolecules, to build MSMs through these state-of-the-art techniques.

Overview of Our Protocol

The complete protocol consists of three stages: (1) preparation (Figure 1(a)–(c)), (2) adaptive sampling and the construction of a microstate MSM (Figure 1(d)–(g)), and (3) constructing a macrostate MSM and elucidating the kinetics of the system (Figure 1(h)). Stage 1 is divided into three substeps: preparing initial structures (Figure 1(a)), generating initial pathway(s) (Figure 1(b)), and optimizing the path(s) (Figure 1(c)). Stage 2 is an iterative stage that performs adaptive sampling along the optimized path until a good kinetic model is obtained. Each iteration comprises MD simulations (Figure 1(d)), feature selection to find the reduced space that can capture the slowest transitions (Figure 1(e)), splitting of the phase space into a state space (Figure 1(f)), and the building, validation, and error estimation of the microstate MSM (Figure 1(g)). Stage 3 builds a macrostate MSM (Figure 1(h)) by kinetic lumping of the microstates obtained in Stage 2 and predicts the slowest kinetics of the system based on the validated microstate MSM. The various dimensionality reduction, clustering, and MSM construction tools mentioned in Figure 1 can be found in open source packages such as MSMBuilder64–66 (http://msmbuilder.org), PyEMMA (http://emma-project.org/), htmd (https://www.htmd.org), and HK_DataMiner (https://github.com/liusong299/HK_DataMiner).

Methods

Finding the Minimum Free Energy Path(s) between Functional States

Our protocol starts with preparing an initial path (Figure 1(b)) connecting the known structures (obtained from, e.g., X-ray crystallography, cryo-electron microscopy,9,70 and nuclear magnetic resonance spectroscopy). This step is specially tailored for studying the functional conformational changes of biomolecular systems. One of the major advantages of the MSM framework is parallelized sampling, i.e., the model is built on an ensemble of unbiased MD trajectories initiated from different regions of the conformational space of the system. Accordingly, an initial sampling scheme is required to provide the seeding structures for the unbiased sampling. When investigating protein folding, we can obtain the initial sampling using techniques such as REMD,21,71,72 which enhances sampling in a global manner. For studying protein functional conformational changes, however, it is more suitable to enhance the sampling in a local manner, because globally enhanced sampling is likely to introduce unwanted artifacts such as the unfolding of protein secondary structures.

Various methods are available to generate the initial path(s). For example, the Climber algorithm37,73 can progressively drive the system toward the target structure on the potential energy surface, via a self-adjusting restraint potential proportional to the deviation of inter-residue distances between the target structure and the structure obtained in the last step. The initial path can then be obtained by solvating the conformations along the Climber path. Climber has been successfully applied to investigate the translocation and backtracking processes of RNA Polymerase II (Pol II). Alternatively, one may first solvate the system and then perform steered MD (SMD) or targeted MD (TMD) to drive the system from one crystal structure to the other. Steered MD has been applied to study the pyrophosphate ion release in the yeast Pol II or bacterial RNA Polymerase.34,35 Other methods such as Caver have been used to study NTP entry routes in the RNA Polymerase II elongation complex. Metadynamics may also be performed to generate the initial path if a low-dimensional collective variable (CV) space can be defined a priori. The recently developed FAST algorithm is also a valuable tool for this task. We anticipate that coarse-grained MD (CGMD) simulations may also serve as a good approach to obtain the initial path after proper atomistic reconstruction.

Due to the presence of the bias potential, the prepared initial path is often unable to correctly cover the transition state. Further optimization of the path is necessary to ensure the statistical significance of the putative path and the subsequent unbiased sampling (Figure 1(c)). Such optimization can be achieved either by extensive short unbiased samplings (Figure 2(a)) or more systematically via standard path-searching methods, such as the string method (Figure 2(b)).

FIGURE 1 | Suggested protocol for constructing Markov State Models (MSMs) to investigate the functional conformational changes. The workflow consists of three stages: (a)–(c) generating the minimum free energy path(s) among the known functional states; (d)–(g) adaptive sampling and microstate MSM construction/validation; (h) elucidating the slowest kinetics of the system via the validated microstate MSM and interpreting the mechanism by lumping the microstate MSM into a macrostate MSM. (a) Find the known functional states from experimental structures or molecular modeling; (b) build a preliminary transition path between the known states via morphing (e.g., the Climber algorithm) or biased molecular dynamics (MD) simulation (e.g., steered MD, targeted MD); (c) optimize the preliminary path to locate the closest minimum free energy path via string method or extensive MD sampling; (d) initiate an ensemble of short unbiased MD simulations from the representative conformations along the optimized path; (e) select kinetically slow reaction coordinates using time-lagged independent component analysis (tICA); (f) partition the collected samples into microstates based on their geometric proximity in the reduced tIC space; (g) build and validate the microstate MSM and perform further unbiased sampling seeded by the representative structures of each microstate if the local equilibrium is not reached in the microstate MSM; and (h) predict kinetic properties of the system via the microstate MSM and build the macrostate MSM via kinetic lumping for mechanism visualization and interpretation.

Panel labels: (a) initial and final structures (X-ray, Cryo-EM, NMR); (b) initial path (Climber, SMD, TMD); (c) path optimization (string method, MD sampling); (d) MD simulations (seeding, MD); (e) feature selection (tICA); (f) splitting (k-centers, k-medoids, k-means, APLoD, APM); (g) microstate MSM and validation (msmbuilder, pyEMMA, htmd; C.K. test; ITS (μs) versus lag time (ns)); (h) lumping and macrostate MSM (Spec. Clus., PCCA, PCCA+, MPP, BACE, HNEG).

We recommend adopting path-searching methods such as the string method80–83 for this optimization step, because they contain a standard protocol for convergence checks and ensure the presence of the transition state in the optimized path. All path-searching methods aim to locate the minimum free-energy path (MFEP) closest to a given initial path. Typically, the path is defined in a preselected space composed of a number of CVs. For example, in the most established method, the string method, local sampling is performed in a small CV volume around the path nodes to allow a gradual downhill update of the path. Other methods such as path-metadynamics and the fast tomographic methods may also be applied.
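
The downhill path update can be illustrated with a simplified, zero-temperature variant of the string method on a toy two-dimensional potential. This is a sketch under strong assumptions: the analytic gradient of the hypothetical potential $V(x, y) = (x^2 - 1)^2 + 5y^2$ stands in for the mean forces that would be estimated from local sampling around each path node in CV space, and the function names, step size, and iteration count are illustrative choices.

```python
import numpy as np

def grad_V(path):
    """Gradient of the toy potential V(x, y) = (x^2 - 1)^2 + 5 y^2 at each node."""
    x, y = path[:, 0], path[:, 1]
    return np.stack([4.0 * x * (x**2 - 1.0), 10.0 * y], axis=1)

def reparametrize(path):
    """Redistribute the nodes at equal arc length along the current path."""
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s_new = np.linspace(0.0, s[-1], len(path))
    cols = [np.interp(s_new, s, path[:, k]) for k in range(path.shape[1])]
    return np.stack(cols, axis=1)

def string_method(path, step=0.01, n_iter=2000):
    """Alternate a small downhill move of every node with reparametrization."""
    path = path.copy()
    for _ in range(n_iter):
        path = path - step * grad_V(path)
        path = reparametrize(path)
    return path

# Initial guess: an arc between the two minima at (-1, 0) and (1, 0).
x0 = np.linspace(-1.0, 1.0, 21)
initial = np.stack([x0, 0.6 * np.sin(np.pi * (x0 + 1.0) / 2.0)], axis=1)
final = string_method(initial)

print("largest |y| on the initial path:  ", np.abs(initial[:, 1]).max())
print("largest |y| on the optimized path:", np.abs(final[:, 1]).max())
```

The reparametrization step is what keeps the nodes from sliding into the two minima, so the string stays spread over the barrier region while it relaxes onto the minimum energy path of the toy potential.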

Nevertheless, the automation level and overall efficiency of existing path-searching methods are still limited. For example, the amount of sampling required by the string method may become comparable to that of the subsequent unbiased sampling, because the local sampling adopted can make the downhill path update too gradual. The choice of the CV space for existing methods is also challenging, especially when no prior knowledge of the system is available. Therefore, we expect new methods to be developed to alleviate these issues in the future.

FIGURE 2 | The quality of putative path is important for the adaptive sampling scheme. (a) The translocation process of RNA Polymerase II: Isomap representation of the initial paths generated by the Climber algorithm and the samples from the final Markov State Model (MSM) (colored dots). The MSM samples clearly deviate from the initial paths, indicating the necessity of path optimization before the adaptive sampling and MSM construction (Figure adapted with permission from Ref 37. Copyright 2014 National Academy of Sciences, USA). (b) The initial path can be optimized via the string method, as exemplified by the study of activation pathway of c-Src kinase: the initial targeted molecular dynamics (MD) path can be optimized using limited amount of sampling (Figure adapted with permission from Ref 79. Copyright 2009 Elsevier). Axes in (a): X (Isomap), Y (Isomap), Z (BH & TL RMSD).

Selecting Kinetically Slow Variables for State Decomposition

After sufficient samples are collected, an MSM can be constructed to model the slowest DOFs in the system. This requires a proper decomposition of the conformational space for defining the states in the MSM. Traditionally, such state decomposition is performed by applying clustering methods to the high-dimensional structures according to a chosen distance metric, e.g., the root-mean-square distance (RMSD) among the conformations. Typically, when using the RMSD metric, the conformations are first aligned to a reference structure based on a subset of atoms. The RMSD is then computed on another subset of atoms, e.g., heavy atoms relevant to the process under study, chosen based on the root-mean-square fluctuations (RMSF) of the atoms. However, the resulting state definition is highly sensitive to the choice of atom sets and to noise in the samples.
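
The two-subset RMSD described above can be sketched as follows: a rigid-body fit (here the Kabsch algorithm, one common choice) is computed on the alignment atoms and the RMSD is then evaluated on a different atom subset. The function name `aligned_rmsd`, the index sets, and the synthetic coordinates are hypothetical and serve only as an illustration.

```python
import numpy as np

def aligned_rmsd(frame, ref, align_idx, rmsd_idx):
    """Fit `frame` onto `ref` using the atoms in align_idx (Kabsch algorithm),
    then return the RMSD over the atoms in rmsd_idx."""
    mob_center = frame[align_idx].mean(axis=0)
    ref_center = ref[align_idx].mean(axis=0)
    P = frame[align_idx] - mob_center
    Q = ref[align_idx] - ref_center
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # avoid improper rotations
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    fitted = (frame - mob_center) @ R.T + ref_center
    diff = fitted[rmsd_idx] - ref[rmsd_idx]
    return np.sqrt((diff ** 2).sum(axis=1).mean())

# Synthetic test: 20 "atoms"; atoms 0-9 are used for alignment, 10-19 for the RMSD.
rng = np.random.default_rng(0)
ref = rng.normal(size=(20, 3))
align_idx = np.arange(0, 10)
rmsd_idx = np.arange(10, 20)

theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
moved = ref.copy()
moved[15] += np.array([2.0, 0.0, 0.0])          # displace one atom outside the alignment set
frame = moved @ rot.T + np.array([5.0, -3.0, 1.0])

rmsd = aligned_rmsd(frame, ref, align_idx, rmsd_idx)
print("RMSD over the second atom subset:", rmsd)
```

Because the displaced atom is not part of the alignment set, the fit removes the rigid-body motion exactly and the reported RMSD reflects only the local displacement, which illustrates why the result depends on how both atom sets are chosen.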

To address these issues during state decomposition, it has become increasingly popular to apply methods that can automatically extract the major features (or reaction coordinates) and preserve the slowest kinetics of the system before the MSM construction. For example, in several early studies of allostery and protein folding, principal component analysis (PCA) has been applied to extract the dimensions that maximize the variance in the samples. However, these principal dimensions are not necessarily kinetically slow. More recent recipes for this task are the variational approach and tICA.60–62 Unlike PCA, tICA focuses on the time correlation between features and thus gives statistically independent components that can reproduce the slowest dynamics.

As shown in Figure 1, we recommend using tICA for feature selection. tICA can be understood as an application of the variational principle on a basis set of a few input features (e.g., distances between atoms). It aims to find linear combinations of the input features, known as tICs, that generate the best estimate of the eigenvalues of the propagator $e^{\mathcal{L}\tau}$. First, the input features are transformed into mean-free features ($d_i(x(n\tau))$). Then, the time-lagged correlation matrix with a lag time $\tau = N_\tau\,dt$ (dt is the time interval between snapshots of the trajectories, and $N_T$ is the number of snapshots in a trajectory)

$$C_{ij}(\tau) = \sum_{\mathrm{trajs}} \frac{1}{N_T - N_\tau}\sum_{l=1}^{N_T - N_\tau} d_i(x(l))\,d_j(x(l + N_\tau)) \qquad (6)$$

and the covariance matrix

$$S_{ij} = \sum_{\mathrm{trajs}} \frac{1}{N_T}\sum_{l=1}^{N_T} d_i(x(l))\,d_j(x(l)) \qquad (7)$$

are calculated based on the MD trajectories. Finally, one solves the generalized eigenvalue problem $C(\tau)V = SV\Lambda$ to get the coefficients $V$ for the linear combinations that define the tICs. Due to finite sampling, $C(\tau)$ may not be symmetric (i.e., the estimated dynamics may not be time-reversible) and may produce physically invalid results. Therefore, one typically symmetrizes $C(\tau)$ by averaging it with its transpose, $(C(\tau) + C^{\mathsf{T}}(\tau))/2$, to account for the nonreversibility. More recent developments of tICA include kernel-tICA, hTICA, and variationally optimized diffusion maps. Interested readers can refer to reviews91,92 for more discussion on reaction coordinates. We anticipate new techniques to be developed for automatic and smart choice of the characteristic features of the conformational dynamics.
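
The tICA recipe above (mean-free features, time-lagged correlation matrix, covariance matrix, symmetrization, generalized eigenvalue problem) can be sketched with NumPy alone. The function name `tica`, the per-trajectory normalization of the sums, and the synthetic two-feature data are assumptions made for illustration; production work would normally rely on a dedicated package.

```python
import numpy as np

def tica(trajs, lag, n_components=2):
    """Minimal tICA sketch. trajs: list of (n_frames, n_features) arrays;
    lag: lag time N_tau in frames. Returns eigenvalues, coefficients V, feature mean."""
    mean = np.concatenate(trajs).mean(axis=0)
    n_feat = mean.shape[0]
    C = np.zeros((n_feat, n_feat))
    S = np.zeros((n_feat, n_feat))
    for X in trajs:
        d = X - mean                               # mean-free features
        n_T = d.shape[0]
        if n_T <= lag:
            continue                               # too short to contribute
        C += d[:-lag].T @ d[lag:] / (n_T - lag)    # time-lagged correlation
        S += d.T @ d / n_T                         # instantaneous covariance
    C = 0.5 * (C + C.T)                            # symmetrize
    # Solve C V = S V Lambda by whitening with the Cholesky factor of S.
    L = np.linalg.cholesky(S)
    L_inv = np.linalg.inv(L)
    evals, W = np.linalg.eigh(L_inv @ C @ L_inv.T)
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], L_inv.T @ W[:, order], mean

# Synthetic data: one slow (autocorrelated) signal and one fast (white-noise)
# signal, linearly mixed into two observed features.
rng = np.random.default_rng(0)
n_frames = 20000
slow = np.zeros(n_frames)
for t in range(1, n_frames):
    slow[t] = 0.99 * slow[t - 1] + rng.normal()
fast = rng.normal(size=n_frames)
mix = np.array([[1.0, 0.5], [0.3, 1.0]])
X = np.stack([slow, fast], axis=1) @ mix.T

evals, V, mean = tica([X], lag=10, n_components=2)
tic1 = (X - mean) @ V[:, 0]
print("tICA eigenvalues:", evals)
print("|correlation| of tIC 1 with the slow signal:",
      abs(np.corrcoef(tic1, slow)[0, 1]))
```

In this toy case the leading tIC recovers the slow signal from the mixed features even though neither input feature is the slow coordinate itself, which is the property that makes tICs useful as a space for the subsequent clustering.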

Although the tICs generated by tICA are linear approximations of the slowest reaction coordinates and thus may not correspond exactly to the slowest motions identified by the final constructed MSM, they are particularly useful for achieving an optimal state decomposition with minimal statistical error. By performing clustering on the reduced space spanned by the top tICs, we can then construct microstate MSMs and choose the best MSM according to the GMRQ or the slowest implied timescales (ITSs; see Eq. (11)). For simplicity, we refer to this step as tICA–MSM.

In practice, the performance of tICA–MSM depends on the choice of various parameters, such as the input features, the tICA correlation lag time, the number of selected tICs, and so on. It is recommended to apply cross-validation or bootstrapping techniques to account for the statistical errors and avoid overfitting (see Constructing and Validating Microstate MSMs section for a detailed discussion) when selecting these parameters. Here we illustrate how to choose these parameters via an example from our study of backtracking of RNA Polymerase II. As shown in Figure 3, we scanned the pairwise distances within different sets of atoms as input features for tICA. We selected the sets for which the largest ITS of the corresponding MSM reached its maximum (Figure 3(d)–(f)). Among the chosen sets, we picked the smallest one (i.e., the one with the fewest pairwise distances, Figure 3(f)).
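
Building the input features for such a scan amounts to computing all pairwise distances within a chosen atom set for every frame. The sketch below assumes that coordinates are available as a plain array of shape (frames, atoms, 3) and ignores periodic boundary conditions; the function name `pairwise_distance_features` and the atom indices are hypothetical.

```python
import numpy as np
from itertools import combinations

def pairwise_distance_features(coords, atom_indices):
    """coords: (n_frames, n_atoms, 3) array. Returns the (n_frames, n_pairs)
    matrix of distances between all pairs of the selected atoms, and the pairs."""
    pairs = np.array(list(combinations(atom_indices, 2)))
    diff = coords[:, pairs[:, 0], :] - coords[:, pairs[:, 1], :]
    return np.linalg.norm(diff, axis=-1), pairs

# Hypothetical trajectory: 5 frames, 10 atoms; scan two candidate atom sets.
rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 10, 3))
candidate_sets = {"small": [0, 3, 7, 9], "large": [0, 1, 3, 5, 7, 9]}

features = {}
for name, atoms in candidate_sets.items():
    feats, pairs = pairwise_distance_features(coords, atoms)
    features[name] = feats
    print(name, "set:", len(atoms), "atoms ->", feats.shape[1], "distances")
```

Each candidate feature matrix would then be passed through tICA, clustering, and MSM estimation, and the candidate sets compared by their slowest implied timescale as described above.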

Partitioning the Conformational Space for Defining Microstates

With the subspace of tICs identified, one performs the ‘splitting-and-lumping’ scheme on the subspace to construct MSMs. The splitting step clusters the sampled conformations into hundreds or thousands of nonoverlapping microstates based on a distance metric. For visualization and interpretation of mechanisms, the subsequent lumping step groups the microstates into several macrostates based on the kinetic proximity among them.

For the distance metric used in the splitting step, one can simply apply the Euclidean distance, other $L_p$ distances, or the kinetic distance that is weighted by tICA eigenvalues. To perform the clustering, center-based methods (k-means, k-centers,95,96 and k-medoids) are widely used to partition the subspace into Voronoi cells. These algorithms, however, often require the number of clusters as input, which is difficult to choose a priori. They may also perform poorly when the metastable regions in the free energy landscape are not convex. Adaptive splitting methods recently developed in our group, including APM and APLoD, could be helpful in solving these two issues. Incorporating both the geometric information and the correlation between microstates in the iteration, APM can effectively tackle multibody systems with heterogeneous timescales, such as protein-ligand binding systems. To achieve a similar goal, APLoD makes use of the local density of each conformation to identify the local density peaks as cluster centers. Other alternative methods include Ward’s method98,99 that utilizes the hierarchical structure of the distance matrix, and robust density-based clustering100 that partitions the conformations with different local free energy based on geometric proximity. k-Means, k-medoids, k-centers, Ward, and APM can be found in http://www.msmbuilder.org. APLoD can be found in https://github.com/liusong299/HK_DataMiner. Robust density-based clustering can be found in http://www.moldyn.uni-freiburg.de/software/software.html.
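
As an illustration of a center-based splitting step, the sketch below implements a basic greedy k-centers scheme with the Euclidean distance: each new center is the point farthest from all existing centers, and every point is assigned to its nearest center. The function name `k_centers`, the random choice of the first center, and the synthetic two-dimensional data are assumptions; the packages listed above provide tested implementations.

```python
import numpy as np

def k_centers(X, n_clusters, seed=0):
    """Greedy k-centers clustering. Returns the indices of the cluster centers,
    the cluster label of every point, and each point's distance to its center."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]          # first center: a random point
    dist = np.linalg.norm(X - X[centers[0]], axis=1)
    labels = np.zeros(len(X), dtype=int)
    for c in range(1, n_clusters):
        new = int(np.argmax(dist))                 # farthest point becomes a center
        centers.append(new)
        d_new = np.linalg.norm(X - X[new], axis=1)
        closer = d_new < dist
        labels[closer] = c                         # reassign points that are closer
        dist = np.where(closer, d_new, dist)
    return np.array(centers), labels, dist

# Synthetic "tIC space": three well-separated groups of 50 points each.
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.concatenate([c + 0.5 * rng.normal(size=(50, 2)) for c in blob_centers])

centers, labels, dist = k_centers(X, n_clusters=3)
print("cluster sizes:", np.bincount(labels))
print("largest distance to a cluster center:", dist.max())
```

The number of clusters still has to be supplied by the user, which is exactly the limitation of center-based methods noted above.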

Constructing and Validating Microstate MSMs

To build a microstate MSM on the discretized time

sequences, we choose a lag time τ, count the number

of transitions among the microstates and obtain a

transition count matrix (TCM). In the limit of inﬁnite

sampling, the TCM C

(τ) is counted by the follow-

ing formula,

Cmk τðÞ=X

trajs

NT−NτX

NT−Nτ

l=1

χmxlðÞðÞχkxl+Nτ

ðÞðÞð8Þ

Here C

(τ) represents the total number of transi-

tions from state mto state kover the lag time τ=

dt, where dt is the time interval between snapshots

of the trajectories. The TPM, which represents the transition probability between two states, is then obtained by taking the row normalization of the TCM:

$$T(\tau) = D_n^{-1}\,C(\tau) \qquad (9)$$

Here $D_n$ is a diagonal matrix whose diagonal entries are the total number of transitions from each state.
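The counting of Eq. (8) and the row normalization described above can be sketched in a few lines of NumPy. This is an illustrative sketch under simple assumptions (sliding-window counting, a lag of at least one frame, every state visited at least once); the names `count_matrix` and `row_normalize` and the toy discrete trajectories are hypothetical.

```python
import numpy as np

def count_matrix(dtrajs, n_states, lag):
    """C[m, k] = number of (x_l in m, x_{l+lag} in k) pairs, summed over trajectories."""
    C = np.zeros((n_states, n_states))
    for d in dtrajs:
        d = np.asarray(d)
        np.add.at(C, (d[:-lag], d[lag:]), 1.0)   # lag is given in frames (N_tau >= 1)
    return C

def row_normalize(C):
    """Divide each row by its total outgoing count (assumes no empty rows)."""
    return C / C.sum(axis=1, keepdims=True)

# Toy discretized trajectories over two microstates.
dtrajs = [[0, 0, 1, 1, 0], [1, 1, 0]]
C = count_matrix(dtrajs, n_states=2, lag=1)
T = row_normalize(C)
```

Each row of `T` sums to one, so `T[m, k]` can be read as the estimated probability of finding the system in state k one lag time after it was in state m.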

We can then ﬁnd the eigenvectors and eigen-

values of the TPM T(τ) to model the original kinetics

of the system. The eigenvectors ($\psi_i$) are related to the collective transition modes between states, and eigenvalues ($\lambda_i$) are related to the relaxation timescale of

the corresponding transition process as shown in

Eq. (11). More interpretation of the eigenvectors can

be found in Prinz et al.

As discussed above, it is necessary for the tran-

sition matrices to fulﬁll the detailed balance condi-

tion. Yet due to the statistical error or insufﬁciency

of the sampling, the transitions from one state to

another are usually not equal to the reverse ones.

The simplest way to impose detailed balance is to

directly symmetrize the TCM by adding its transpose

$$C_{\mathrm{sym}}(\tau) = \frac{C(\tau) + C^{T}(\tau)}{2} \qquad (10)$$

before normalization. As the direct symmetrization is not a good approximation when the samplings are statistically biased, a more accurate method is the maximum likelihood estimator (MLE) with the detailed balance constraint imposed [28,101]. In particular, it is suggested to perform MLE on the maximal ergodic subgraph of the TCM [65,102].

FIGURE 3 | An example of choosing the input structural features for the time-lagged independent component analysis (tICA) analysis. In all subgraphs (a)–(f), the left panel demonstrates the set of atoms (blue) among which the pair-wise distances are selected as input features for the tICA analysis; the right panel plots the implied timescales (ITS) of the 1000 state MSM built by k-centers clustering on the slowest four tICs shown in the left panel. The correlation lag time for tICA is 40 ns. The Markov State Model (MSM) lag time is 8 ns. The error bars of ITS of the MSMs are calculated by 100 times of bootstrapping experiments on all molecular dynamics (MD) trajectories. The distance set (f) is chosen as the optimal one, because the top MSM ITS is the highest among all sets yet with sufficiently less number of input distances (Figure adapted with permission
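The direct symmetrization of Eq. (10) can be illustrated with a small numerical sketch. The 3 × 3 count matrix below is invented for illustration; the check at the end confirms that the transition matrix built from the symmetrized counts satisfies detailed balance with respect to its stationary distribution. The MLE with a detailed balance constraint is not shown here.

```python
import numpy as np

# Hypothetical transition counts with unequal forward and backward counts.
C = np.array([[90.0, 10.0,  0.0],
              [ 6.0, 80.0,  4.0],
              [ 0.0,  8.0, 50.0]])

C_sym = 0.5 * (C + C.T)                        # Eq. (10): average with the transpose
T = C_sym / C_sym.sum(axis=1, keepdims=True)   # row normalization
pi = C_sym.sum(axis=1) / C_sym.sum()           # stationary distribution of T
flux = pi[:, None] * T                         # pi_i * T_ij, symmetric under detailed balance
```

Because `flux` equals `C_sym` divided by the total count, it is symmetric by construction, which is exactly the detailed balance condition.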

Generally, the microstate model obtained at lag time $\tau$ is not guaranteed to be Markovian, unless the lag time is long enough, i.e., $\tau \geq \tau_T$ ($\tau_T$ is called the Markovian lag time). A straightforward way to examine the Markovianity is to compute the ITS defined as

$$t_i(m\tau) = -\frac{m\tau}{\log \lambda_i(m\tau)} \qquad (11)$$

where $\lambda_i(m\tau)$ is the $i$th eigenvalue of $T(m\tau)$. When the model is Markovian, $t_i(m\tau)$ will be a constant value of $-\tau/\log \lambda_i(\tau)$. Accordingly, we choose the minimum time $\tau_T$ for the ITS to be invariant to the lag time (Figure 1(g)) as the Markovian lag time to build the MSM.
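A short sketch of Eq. (11) follows. It uses a toy two-state transition matrix that is Markovian by construction, so that $T(m\tau)$ can be obtained as a matrix power and the ITS come out flat; with real data, $T(m\tau)$ would instead be re-estimated from the trajectories at each lag time. The function name `implied_timescales` is hypothetical, and the lag time is measured in units of $\tau$.

```python
import numpy as np

def implied_timescales(T, lag, n_its=1):
    """t_i = -lag / log(lambda_i), skipping the stationary eigenvalue lambda = 1."""
    evals = np.sort(np.real(np.linalg.eigvals(T)))[::-1]
    return -lag / np.log(evals[1:n_its + 1])

T1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])          # toy TPM at lag tau = 1; second eigenvalue is 0.7

its_by_lag = {}
for m in (1, 2, 4):
    Tm = np.linalg.matrix_power(T1, m)            # exact T(m * tau) for a Markovian model
    its_by_lag[m] = implied_timescales(Tm, lag=m)
```

For this toy model every entry of `its_by_lag` equals $-1/\log 0.7$, which is the plateau one looks for when choosing the Markovian lag time.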

Subsequently, we use the Chapman–Kolmogorov equation (C.K.) to validate the Markovianity of the model in a stricter way (Figure 1(g)). In this test, the probability distribution predicted by the MSM ($[T(\tau_T)]^m$) should be consistent with the distribution counted from the trajectories ($T(m\tau_T)$) after several time steps $m\tau_T$ if the model is Markovian:

$$T(m\tau_T) = [T(\tau_T)]^{m} \qquad (12)$$

Any deviation from equality in this C.K. equation has two sources: the 'discretization error' and the 'statistical error' [28,48]. The discretization error is the systematic deviation of the MSM-predicted kinetics from that of the propagator, due to the neglect of the terms contributed by the fast, irrelevant kinetics related to ℚ (see Eq. (3)). The discretization error can

be, in theory, quantiﬁed by the deviation of eigen-

values or eigenvectors of the TPM from that of the

propagator. However, because the full kinetics is

unknown a priori, it is more practical to choose the MSM that produces the largest top eigenvalues of $T(\tau)$ (i.e., GMRQ). The statistical error, caused by the limited sampling, can be estimated via a number of techniques, e.g., the formula proposed in Prinz et al. or the Bayesian estimation method [103] that samples the transition probability matrices via Markov Chain Monte Carlo (MCMC) according to a Dirichlet form posterior distribution.

FIGURE 4 | Markov state model identifies key intermediate states along activation pathways of c-Src kinase. (a) Crystal structures of inactive (left) and active (right) states of c-Src. The differences lie in the activation loop (A-loop; red), C-helix (orange), and switching of electrostatic network among Lys295, Glu310, Arg409, and Tyr416. (b) Two intermediate states are identified on the potential of mean force calculated based on the stationary population of a 2000-state microstate Markov State Model (MSM) over two reaction coordinates: root-mean-square distance (RMSD) of A-loop residues and difference of distance between residue pairs E310-R409 and K295-E310. (c) The variation of four structural metrics along a long trajectory, synthesized from the MSM via the kinetic Monte Carlo scheme, provides a rough estimate of the timescale of the activation and deactivation processes. Here inactive state, active state, intermediate states I1 and I2 are shown in magenta, blue, green, and black,
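The Chapman–Kolmogorov test of Eq. (12) can be sketched on synthetic data as follows. A discrete trajectory is drawn from a known two-state transition matrix (an assumption made so that the model is Markovian by construction), the TPM is estimated at lag 1 and at lag m, and the m-th power of the former is compared with the latter. All names and parameter values are illustrative.

```python
import numpy as np

def estimate_T(dtraj, n_states, lag):
    """Row-normalized sliding-window count matrix at the given lag (in frames)."""
    C = np.zeros((n_states, n_states))
    np.add.at(C, (dtraj[:-lag], dtraj[lag:]), 1.0)
    return C / C.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
T_true = np.array([[0.95, 0.05],
                   [0.10, 0.90]])
dtraj = np.zeros(50000, dtype=int)
u = rng.random(len(dtraj))
for t in range(1, len(dtraj)):
    # Move to (or stay in) state 1 with probability T_true[current, 1].
    dtraj[t] = int(u[t] < T_true[dtraj[t - 1], 1])

lag, m = 1, 5
predicted = np.linalg.matrix_power(estimate_T(dtraj, 2, lag), m)   # [T(tau)]^m
observed = estimate_T(dtraj, 2, m * lag)                           # T(m * tau)
ck_error = np.abs(predicted - observed).max()
```

Here `ck_error` only reflects statistical noise, because the generating process is Markovian at the level of the two states; for a real discretization, a systematic gap between `predicted` and `observed` would point to discretization error.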

In practice, it is difﬁcult to reduce these two

types of errors simultaneously. Increasing the number

of states or lag time can reduce the discretization

error, but would lead to fewer transition counts

among the states and thus larger statistical error, and

vice versa. Therefore, it is recommended to perform

the splitting step via different clustering methods and

then select the best model based on the GMRQ. In order to achieve a balance between the discretization error and statistical error, cross validation and bootstrapping techniques [104] can also be performed.
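A bootstrapping estimate of the statistical error can be sketched as below: whole trajectories are resampled with replacement, the MSM is rebuilt from each resampled set, and the spread of the slowest implied timescale is reported. The toy two-state generator, the number of trajectories, and the helper names `simulate` and `slowest_its` are assumptions for illustration.

```python
import numpy as np

def simulate(T, n_steps, rng):
    """Draw one discrete trajectory from a two-state transition matrix T."""
    traj = np.zeros(n_steps, dtype=int)
    u = rng.random(n_steps)
    for t in range(1, n_steps):
        traj[t] = int(u[t] < T[traj[t - 1], 1])
    return traj

def slowest_its(dtrajs, lag=1, n_states=2):
    """Slowest implied timescale of the MSM built from the given trajectories."""
    C = np.zeros((n_states, n_states))
    for d in dtrajs:
        np.add.at(C, (d[:-lag], d[lag:]), 1.0)
    T = C / C.sum(axis=1, keepdims=True)
    lam2 = np.sort(np.real(np.linalg.eigvals(T)))[-2]
    return -lag / np.log(lam2)

rng = np.random.default_rng(0)
T_true = np.array([[0.95, 0.05],
                   [0.10, 0.90]])               # slowest ITS is -1 / log(0.85)
dtrajs = [simulate(T_true, 2000, rng) for _ in range(20)]

samples = []
for _ in range(100):
    picks = rng.integers(len(dtrajs), size=len(dtrajs))   # resample with replacement
    samples.append(slowest_its([dtrajs[i] for i in picks]))
its_mean, its_err = np.mean(samples), np.std(samples)
```

Resampling whole trajectories rather than individual frames keeps the time correlation inside each trajectory intact.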

Performing Adaptive Sampling to Enhance

Data Connectivity

As high-quality MSMs require a sufﬁcient number of

the transition counts in the available MD trajectories,

one round of MD sampling is often insufﬁcient.

Therefore, it is necessary to perform adaptive

sampling

39,78,105–108

to encourage the occurrence of

transitions. This is achieved by selecting representa-

tive structures of the less populated microstates as

seeds for the next round of simulations. A rough way

to decide when this iterative sampling procedure

should end is to check if all data are connected on

the potential of mean force (PMF) on some reaction

coordinates. More strictly, the adaptive sampling

should be terminated when the kinetic properties

computed from the microstate MSM become invari-

ant for a few rounds.
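The seed-selection step of adaptive sampling can be sketched as follows: microstates are ranked by population, and one frame is drawn at random from each of the least populated states to start the next round. This is a simple count-based sketch, not a specific published adaptive sampling scheme; `pick_seeds` and the toy discrete trajectories are hypothetical.

```python
import numpy as np

def pick_seeds(dtrajs, n_states, n_seeds, seed=0):
    """Return (state, trajectory index, frame index) for the least populated visited states."""
    rng = np.random.default_rng(seed)
    labels = np.concatenate(dtrajs)
    traj_id = np.concatenate([np.full(len(d), i) for i, d in enumerate(dtrajs)])
    frame_id = np.concatenate([np.arange(len(d)) for d in dtrajs])
    counts = np.bincount(labels, minlength=n_states)
    visited = np.flatnonzero(counts > 0)                   # unvisited states have no frames
    least = visited[np.argsort(counts[visited], kind="stable")[:n_seeds]]
    seeds = []
    for s in least:
        k = rng.choice(np.flatnonzero(labels == s))        # random representative frame
        seeds.append((int(s), int(traj_id[k]), int(frame_id[k])))
    return seeds

dtrajs = [np.array([0, 0, 0, 1, 0]), np.array([0, 2, 2, 0, 0])]
seeds = pick_seeds(dtrajs, n_states=3, n_seeds=2)
```

In this toy case state 1 (one frame) and state 2 (two frames) are selected ahead of the heavily sampled state 0.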

Calculating the Kinetic Properties from the

Validated Microstate MSM

The obtained microstate TPM can be used to calcu-

late the essential kinetic properties of the system,

such as the mean ﬁrst passage time (MFPT) for the

system to make a transition from one microstate to

another and the ensemble of transition pathways

among different states. Elucidation of major transition pathways and their associated fluxes between two states can be achieved by the transition path theory (TPT) [29,109–111].

In TPT, once the initial

and ﬁnal states are deﬁned, net ﬂuxes between all

pairs of states are computed based on transition

probabilities, committor probabilities as well as equi-

librium populations of states, and transition path-

ways can then be identiﬁed from the net ﬂux matrix.

TPT has been implemented to investigate various

binding processes,

43,112

polymerase systems,

self-

assembly processes,

113

and so forth. MFPT, which

quantiﬁes the averaged time for a state to ﬁrst make

a transition to another state, can also be used to

characterize the kinetics of the system.
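As a worked illustration of the MFPT, the sketch below solves the standard linear system $m_i = \tau + \sum_{j \notin B} T_{ij} m_j$ for all states $i$ outside the target set $B$, with $m_i = 0$ inside $B$. This is one common way to obtain the MFPT from a TPM and is not claimed to be the procedure used in the cited studies; `mfpt_to_target` and the two-state matrix are illustrative.

```python
import numpy as np

def mfpt_to_target(T, target, lag=1.0):
    """Mean first passage time from every state to the target set, in units of lag."""
    n = T.shape[0]
    target = set(target)
    others = np.array([i for i in range(n) if i not in target])
    A = np.eye(len(others)) - T[np.ix_(others, others)]    # (I - T restricted to non-target)
    m = np.zeros(n)
    m[others] = np.linalg.solve(A, lag * np.ones(len(others)))
    return m

T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
m = mfpt_to_target(T, target=[1])
```

For this two-state example the system leaves state 0 with probability 0.1 per step, so the MFPT from state 0 to state 1 is 1/0.1 = 10 lag times.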

In fact, the average values of any dynamic

observables can also be estimated from the TPM. A

simple way to achieve this is to generate a long Mar-

kov chain based on the TPM and randomly select

one conﬁguration from each microstate to represent

the state. Here we discuss one example of application

to illustrate the construction of microstate MSM.

Application Example: Elucidating the

Activation Pathway of c-Src Kinase

Src kinases are key signaling proteins. Deregulated

activation of these kinases can lead to aberrant sig-

naling and therefore cause excessive cell growth and

differentiation. By constructing MSMs along the acti-

vation pathway of the c-Src kinase, Roux and Pande

groups successfully discovered two intermediate

states along the activation pathway that can be

employed for allosteric drug design.

They generated

an initial path by targeted MD (Figure 1(b)) between

the inactive and active form of the kinase (Figure 1

(a)) and then applied the string method with swarms

of trajectories to optimize the path (Figure 1(c),

Figure 2(b)). Adaptive sampling was then performed

for two rounds along the optimized activation path-

way, amounting to a total sampling of 500 μs. These

samples were clustered based on the RMSD metric

deﬁned on a subset of heavy atoms, resulting in a

2000-microstate MSM with the lag time of 5 ns. The

MSM revealed two intermediate states along the path

(Figure 4(b)). The activation process was visualized through a long trajectory which was synthesized from the microstate MSM using a kinetic Monte Carlo scheme (Figure 4(c)). Finally, the 2000-state MSM

predicted the MFPT of the activation and deactiva-

tion process to be 106 and 21 μs, respectively.

Lumping Microstates to Macrostates to Aid

Mechanism Interpretation

A microstate MSM typically contains hundreds or

thousands of states to ensure that the model can be

Markovian at a sufﬁciently small lag time. This helps

shorten the necessary length of MD simulations used

to build the MSM. However, the large number of

states also hinders the visualization of the conforma-

tional changes and subsequent human appreciation

of the underlying mechanism. Therefore, we perform

'kinetic lumping' that merges the microstates into

macrostates to reduce the number of states. This

lumping step is based on the kinetic proximity

between microstates, i.e., the interstate transitions

between metastable states should be much slower

than the intrastate transitions.

Among the recently developed lumping

methods, the spectral based methods like Perron clus-

ter cluster analysis (PCCA),

115

Robust Perron cluster

analysis (PCCA+),

116,117

and spectral clustering

118,119

make use of the leading eigenvectors of the microstate

TPM to do clustering. Among these three methods,

PCCA performs the bi-partitioning based on the sign

structure of top eigenvectors, spectral clustering per-

forms k-means clustering on the subspace spanned by

top eigenvectors, and PCCA+ performs a fuzzy clus-

tering as a robust improvement around the transition

region compared to PCCA. In addition, Chodera

et al.

120

invented a Monte Carlo-simulated annealing

(MCSA) algorithm to optimize the PCCA results

based on metastability ($\sum_i M_{ii}$, where $M$ is the TPM

of the macrostate model). In spite of their wide appli-

cation, these spectral based methods are sensitive to

the poorly sampled regions in the dataset.

114
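The idea of lumping by the sign structure of the leading eigenvectors can be sketched for the simplest case of two macrostates. The code below is a simplified bipartition in the spirit of PCCA, not the full PCCA or PCCA+ algorithm; the 4 × 4 transition matrix with two weakly coupled blocks is invented for illustration.

```python
import numpy as np

# Toy microstate TPM: states {0, 1} and {2, 3} interconvert quickly within each
# pair and only rarely between pairs.
T = np.array([[0.89, 0.10, 0.01, 0.00],
              [0.10, 0.89, 0.00, 0.01],
              [0.01, 0.00, 0.89, 0.10],
              [0.00, 0.01, 0.10, 0.89]])

evals, evecs = np.linalg.eig(T)            # right eigenvectors: T v = lambda v
order = np.argsort(-np.real(evals))        # sort eigenvalues in descending order
psi2 = np.real(evecs[:, order[1]])         # eigenvector of the slowest relaxation
lumping = (psi2 > 0).astype(int)           # split microstates by the sign of psi2
```

The second eigenvector changes sign between the two blocks, so microstates 0 and 1 end up in one macrostate and microstates 2 and 3 in the other. Which block receives label 0 depends on the arbitrary sign of the eigenvector.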

Several alternative methods have been proposed

to address the sensitivity over poor sampling, includ-

ing Bayesian agglomerative clustering engine

(BACE),

121

the most probable paths algorithm

(MPP),

122

and super-level-set hierarchical clustering

algorithm (SHC).

123

In particular, the Hierarchical

Nyström Extension Graph (HNEG) algorithm devel-

oped by Yao et al.

124

reduces the noise dependence by

putting more emphasis on the more populated states.

Another recently developed method called the renor-

malization group clustering (RGC)

125

provides the

optimal microstate representation of the kinetics of

the system via minimizing the error induced by

projection.

It is worth noting that the choice of the number

of macrostates is still an open question. In spectral

based methods, one typically decides the number of

metastable states based on the eigenvalue gap of

microstate TPM, while in BACE, such number is

decided by the gap of BACE Bayes factors. In

HNEG, this number is automatically decided by the

extensive graph, which could be an advantage over

other methods.

Given a number of macrostate MSMs lumped

through different methods, it is also necessary to

choose the model that best represents the system.

Popular criteria for such choices include the metastability ($\sum_i M_{ii}$) and the Bayes factor [126]. The Bayes factor quantifies the likelihood to produce the microstate chains given a certain lumping. Based on these two criteria, Bowman et al. [114] compared the performance of several lumping methods (Figure 5).

FIGURE 5 | Comparison of different kinetic lumping methods for 1-residue alanine dipeptide, 35-residue villin headpiece, and 263-residue β-lactamase systems. (a) Crystal structures of alanine dipeptide (all-atom), villin (ribbon), and β-lactamase (ribbon). (b) Bayes factor of five lumping methods (less negative means better model). (c) Metastability values of the five lumping methods (larger value means better model)

Source codes of PCCA, PCCA+, BACE, and

MCSA are available in http://www.msmbuilder.org;

MPP can be found in http://www.moldyn.uni-frei-

burg.de/software/software.html; and spectral cluster-

ing is available in http://scikit-learn.org.
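The metastability criterion $\sum_i M_{ii}$ can be evaluated for competing lumpings with a few lines of NumPy. In this sketch the macrostate TPM $M$ is obtained by summing microstate counts within each macrostate and row normalizing, which is one simple choice assumed here for illustration; the count matrix and the function name `metastability` are hypothetical.

```python
import numpy as np

def metastability(C_micro, lumping):
    """Trace of the macrostate TPM obtained by summing microstate counts."""
    lumping = np.asarray(lumping)
    n_macro = lumping.max() + 1
    L = np.eye(n_macro)[lumping]                   # membership matrix (n_micro x n_macro)
    C_macro = L.T @ C_micro @ L                    # counts between macrostates
    M = C_macro / C_macro.sum(axis=1, keepdims=True)
    return np.trace(M)

C_micro = np.array([[89.0, 10.0,  1.0,  0.0],
                    [10.0, 89.0,  0.0,  1.0],
                    [ 1.0,  0.0, 89.0, 10.0],
                    [ 0.0,  1.0, 10.0, 89.0]])

good = metastability(C_micro, [0, 0, 1, 1])   # lumps the kinetically close states
bad = metastability(C_micro, [0, 1, 0, 1])    # mixes the two metastable blocks
```

The kinetically sensible lumping gives 1.98 and the mixed one gives 1.80, in line with the rule that a larger metastability indicates a better model.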

Calculating the Kinetic Properties

at Macrostate Level

Despite its usefulness in visualization, the lag time

necessary for a macrostate model to become Mar-

kovian is often beyond the length of the MD trajecto-

ries. Therefore, the kinetic properties are typically

computed from the microstate MSM. For example,

to calculate the MFPT from one metastable state to

another, one may count the ﬁrst passage events in the

trajectories that are mapped from the microstate

Markov chains (either directly obtained from MD

simulations or synthesized using MCMC based on

the microstate MSM) based on the lumping results.
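The counting procedure described above can be sketched as follows: a long microstate chain is synthesized from the TPM, mapped onto macrostates through the lumping, and the first passage events between two macrostates are counted. The transition matrix, the lumping, and the function names are illustrative assumptions; in the toy model each microstate leaves its macrostate with probability 0.01 per step, so the expected MFPT is 100 lag times.

```python
import numpy as np

def synthesize(T, n_steps, start=0, seed=0):
    """Generate a microstate Markov chain from the TPM by inverse-CDF sampling."""
    rng = np.random.default_rng(seed)
    cdf = np.cumsum(T, axis=1)
    u = rng.random(n_steps)
    traj = np.empty(n_steps, dtype=int)
    traj[0] = start
    for t in range(1, n_steps):
        nxt = int(np.searchsorted(cdf[traj[t - 1]], u[t], side="right"))
        traj[t] = min(nxt, len(T) - 1)             # guard against round-off in the CDF
    return traj

def mean_first_passage(macro_traj, source, target, lag=1.0):
    """Average time from entering `source` until first reaching `target`."""
    times, t_enter = [], None
    for t, s in enumerate(macro_traj):
        if s == source and t_enter is None:
            t_enter = t                            # clock starts on first entry
        elif s == target and t_enter is not None:
            times.append((t - t_enter) * lag)      # first arrival at the target
            t_enter = None
    return np.mean(times), len(times)

T = np.array([[0.89, 0.10, 0.01, 0.00],
              [0.10, 0.89, 0.00, 0.01],
              [0.01, 0.00, 0.89, 0.10],
              [0.00, 0.01, 0.10, 0.89]])
lumping = np.array([0, 0, 1, 1])                   # microstate -> macrostate map

macro_traj = lumping[synthesize(T, 200000)]
mfpt, n_events = mean_first_passage(macro_traj, source=0, target=1)
```

The kinetics are still generated by the microstate model; the lumping is used only to label which macrostate each frame belongs to.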

Application Example: Study of the

Backtracking Mechanism of RNA Polymerase II

RNA Polymerases are the core enzymes for DNA

transcription in the central dogma of biology. To

understand the proofreading mechanism in DNA

transcription, Da et al.

studied, via MSMs, the back-

tracking process of RNA Polymerase II. To construct

MSMs, the frayed and backtracked state with rG:dG

mismatched base pair were ﬁrst prepared from corre-

sponding crystal structures. The pretranslocation

structure was built from the backtracked structure

(Figure 1(a)). The Climber algorithm was then used

to generate two initial paths: pretranslocation → frayed → backtracked and the reverse (Figure 1(b)).

Next, four rounds of MD simulations were per-

formed via adaptive sampling, resulting in 480 trajec-

tories (each 100 ns) for analysis (Figure 1(d)–(g)).

Subsequently, the MD samples were further divided

into 800 states by k-centers clustering using RMSD

of heavy atoms as the distance metric. A lag time of

8 ns was then chosen to build the microstate MSM

as it passed the C.K. test. Four macrostates were

obtained from the microstate TPM by PCCA+ lump-

ing for visualization. The 800-state microstate MSM

was used to calculate all kinetic properties.

Apart from the three known states (pretranslo-

cation, frayed, and backtracked), another kinetically

important intermediate state (S3) was identiﬁed

(Figure 6(a)). Backtracking was found to occur in a

stepwise fashion: ﬁrst, S1 quickly evolves to S2 at

submicrosecond timescale, promoted by the bending

of bridge helix (Figure 6(b)); then, S2 equilibrates

with S3 at microsecond timescale; ﬁnally, S3 back-

tracks to S4 at the timescale of 10

microseconds,

being the rate limiting step in the whole process.

DISCUSSION AND FUTURE

PERSPECTIVE

Notwithstanding the popularity and advancement of

the MSM framework, several limitations still remain,

including the issue of non-Markovianity for macro-

state MSMs, recrossing when counting transitions,

and rare events when sampling the protein–ligand

dissociation. In this section, we review the alternative

methods recently developed to overcome these issues,

including the Hidden Markov Model (HMM)

127–129

and the Hummer & Szabo scheme

130

that address

the non-Markovian problem of macrostate MSMs,

the core-set MSM

that tackles the recrossing issue,

and the transition-based reweighting analysis method

(TRAM)

131–133

that handles slow transitions for the

protein–ligand dissociation.

The Markovianity is central to MSMs. Apart

from our recommendation using the microstate

MSM to compute kinetic properties and the macro-

state model for interpretation, several new break-

throughs have been made to construct macrostate

MSM at the non-Markovian regime based on avail-

able microstate MSM. Very recently, Hummer and

Szabo

130

applied the Projection operator scheme to

construct a macrostate MSM that reasonably

approximates the kinetics of the aggregated states at

both short and long time limits. We anticipate that

further improvement can be made to calculate quan-

tities like MFPT based on the reconstructed macro-

state transition matrix.

In addition, Noé et al.

127

and McGibbon

et al.

128

have proposed the ‘Hidden Markov Models’

(HMM) to avoid constructing a Markovian model

explicitly. In HMM, the sequences of observables are

assumed to be generated by hidden Markov chains

(or 'path') according to certain emission probabilities. The emission probability is the probability for each hidden state to produce a certain observable, conceptually similar to the 'membership function' that is used in PCCA+. The complete likelihood function is

composed of two parts: the probability to produce

certain hidden path according to the TPM between

the hidden states and the probability to generate the

observables according to the emission probabilities.

Because the likelihood function is highly complex,

one adopts the forward–backward algorithm to

estimate the optimal hidden variables (TPM of hid-

den states, emission probabilities, and stationary

population of hidden states). The kinetic properties

of the system can then be computed from the TPM

of the HMM. The membership probabilities of each

observed state are accordingly computed from emis-

sion probability. HMM has been successfully applied

to study the dynamics of Ubiquitin,

128

and the recog-

nition process between ribonuclease barnase and its

inhibitor barstar.

For these two cases, where a clear

separation of timescale exists, the HMM demon-

strated better performance over MSMs. However,

the interpretation of the HMM is sometimes not

straightforward as there exists overlap between dif-

ferent states.
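To illustrate how the two parts of the HMM likelihood combine, the sketch below implements the scaled forward pass, which sums the joint probability of hidden paths and observations over all hidden paths. It only evaluates the likelihood for given parameters and does not perform the parameter estimation itself; the matrices `A` (TPM of hidden states) and `B` (emission probabilities) and all numerical values are invented for illustration.

```python
import numpy as np

def hmm_log_likelihood(obs, pi0, A, B):
    """Scaled forward algorithm. A[h, h']: hidden TPM; B[h, o]: emission probability."""
    alpha = pi0 * B[:, obs[0]]                 # joint probability of first observation
    loglik = 0.0
    for o in obs[1:]:
        c = alpha.sum()
        loglik += np.log(c)                    # accumulate the scaling factors
        alpha = ((alpha / c) @ A) * B[:, o]    # propagate, then weight by emission
    return loglik + np.log(alpha.sum())

pi0 = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B_fuzzy = np.array([[0.8, 0.2],
                    [0.3, 0.7]])               # overlapping hidden states

ll_fuzzy = hmm_log_likelihood([0, 0, 1], pi0, A, B_fuzzy)
ll_sharp = hmm_log_likelihood([0, 0, 1], pi0, A, np.eye(2))
```

With identity emissions each observation fixes the hidden state, so `ll_sharp` reduces to the log of 0.5 × 0.9 × 0.1, the probability of the path 0 → 0 → 1 under an ordinary Markov chain.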

Very often, the timescales of transition events in

an MSM are underestimated, because of the existence

of frequent recrossing events near the boundaries of

the states. In one trajectory, the system can leave a

metastable state, cross the state boundary and then

return to the previous state before visiting the core

region of the other state. This will result in the overes-

timation of transitions between states. One intuitive

but naïve solution is to deﬁne more states around the

transition region,

which will, however, increase the

statistical error in the transition counts. Alternatively,

one can focus on the core region of the metastable

states and use the milestoning processes to estimate

the transition probability among the cores, as imple-

mented in the approach of ‘core-set MSM.’

The challenging and open question here is how to locate the core sets with strong metastability. For this, Lemke and Keller [134] have proposed several density-based clustering methods to find the core sets.

FIGURE 6 | The backtracking process of RNA Polymerase II (Pol II) revealed by MSMs. (a) The stepwise process occurs among four metastable states. The equilibrium population of the states and MFPT among them are labeled. These values are calculated based on ultra-long macrostate chains that are simulated by an 800-state microstate MSM after bootstrapping the original 480 molecular dynamics (MD) trajectories. (b) A cartoon model of the backtracking mechanism. In S1 → S2, the motion of the RNA 3′-end nucleotide is triggered by the bending of Bridge Helix (BH). In S2 → S3, the BH residue Y836 stacks with DNA transition nucleotide and Rpb2 residue Y769 stacks with RNA 3′-end nucleotide through their aromatic rings. In S3 → S4, the movement of the RNA:DNA hybrid finally delivers the Pol II to the backtracked state (Figure adapted with
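The effect of restricting attention to core regions can be sketched with a simple transition-based assignment: frames outside every core inherit the label of the last core visited, so brief excursions across a state boundary are not counted as transitions. This is a minimal sketch of the idea only, not the core-set MSM estimator itself; the label -1 for frames outside all cores and the function name `coreset_assign` are assumptions.

```python
import numpy as np

def coreset_assign(core_labels):
    """Replace out-of-core frames (-1) by the index of the last visited core."""
    out = np.empty(len(core_labels), dtype=int)
    last = -1                                   # stays -1 until the first core is reached
    for t, c in enumerate(core_labels):
        if c >= 0:
            last = c
        out[t] = last
    return out

# Toy trajectory: the system leaves core 0 twice without reaching core 1,
# then commits to core 1, and finally returns to core 0.
raw = [0, 0, -1, 0, -1, -1, 1, 1, -1, 1, -1, 0]
assigned = coreset_assign(raw)
n_transitions = int(np.count_nonzero(np.diff(assigned)))
```

Only two core-to-core transitions are counted here, although the trajectory leaves a core five times.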

Another limitation of MSMs is that the transition events should be reversibly sampled by unbiased MD trajectories under one ensemble (thermodynamic state). This is, however, rather difficult if the transition of interest is a rare event. Although adaptive sampling (see our protocol in Figure 1) can alleviate this issue, direct sampling of extremely slow transitions in one or both of the forward and backward directions is still beyond reach for many systems. This is particularly relevant in protein-ligand recognition, where the association is easy to sample, while the dissociation could occur on timescales of seconds or even longer. To tackle this rare

event issue, the Noé group has proposed the

transition-based reweighting analysis method

(TRAM),

131–133

where the kinetic network is con-

structed with trajectories sampled at multiple thermo-

dynamic states. Using the same conﬁguration state

partitioning for all thermodynamic states, a complete

likelihood function (TRAM likelihood) is formulated

to consider all transition events and bias potential of

each ensemble. The unbiased TPM can then be

obtained by solving the maximum likelihood problem

for the multiensemble Markov Models (MEMMs).

TRAM has been successfully applied to study

the recognition processes between serine protease

trypsin and its inhibitor benzamidine (Figure 7).

131

Here the unbinding process occurs at approximately

1 ms. To sample such rare events, 49.1 μs unbiased

trajectories and 459 umbrella sampling simulations

were performed. tICA–MSM was performed on the

unbiased trajectories, where the nearest neighbor

heavy-atom contacts between benzamidine and all

trypsin residues were selected as input features for

tICA. k-Means clustering on the joint space of

umbrella coordinate and tICs yielded 500-state

decomposition for all thermodynamic states. Multi-

ple MSMs were constructed based on the partition-

ing and all the trajectories. Based on the ﬁnal

unbiased TPM, the transition rate for the dissociation process was reported to be 1170 s⁻¹.

FIGURE 7 | The multiensemble Markov Model (MEMM) for the protein-ligand binding of trypsin-benzamidine. (a) The coarse-grained kinetic network of the MEMM. All transition rates between macrostates are labeled in ms⁻¹. (b) The efficiency of transition-based reweighting analysis method (TRAM) and Markov State Model (MSM) in computing unbinding kinetics $k_{\mathrm{off}}$ (Figure adapted with permission from Ref 131. Copyright 2016 National Academy of Sciences, USA).

CONCLUSION

Constructing MSMs for studying functional conformational changes of complex biomolecules has become increasingly popular in the past decade, thanks to the rapid development of the underlying theory, validation tools and sampling strategy. As a systematic review of the new advancement in MSM construction, we proposed a protocol that integrates the state-of-the-art techniques to guide beginners who want to study the functional conformational changes via the MSM framework.

ACKNOWLEDGMENTS

The authors would like to thank Dr. Fu Kit Sheong for fruitful discussions. This work was supported by the

Hong Kong Research Grant Council (grant numbers HKUST C6009-15G, 16305817, 16302214, 16304215,

16318816, AoE/P-705/16, M-HKUST601/13, F-HKUST605/15, and T13–607/12R), King Abdullah University

of Science and Technology (KAUST) Ofﬁce of Sponsored Research (OSR) (OSR-2016-CRG5-3007), and Inno-

vation and Technology Commission (ITCPD/17-9 and ITC-CNERC14SC01). W.W. acknowledges support

from the Hong Kong Ph.D. Fellowship Scheme 2014/15 (PF13-14699). X.H. is the Padma Harilela Associate

Professor of Science.

REFERENCES

1. Kohlhoff KJ, Shukla D, Lawrenz M, Bowman GR,

Konerding DE, Belov D, Altman RB, Pande

VS. Cloud-based simulations on Google Exacycle

reveal ligand modulation of GPCR activation path-

ways. Nat Chem 2014, 6:15–21.

2. Da L-T, Pardo-Avila F, Xu L, Silva D-A, Zhang L,

Gao X, Wang D, Huang X. Bridge helix bending pro-

motes RNA polymerase II backtracking through a

critical and conserved threonine residue. Nat Com-

mun 2016, 7:ncomms11244.

3. Jiang H, Sheong FK, Zhu L, Gao X, Bernauer J,

Huang X. Markov state models reveal a two-step mech-

anism of miRNA loading into the human Argonaute

protein: selective binding followed by structural re-

arrangement. PLoS Comput Biol 2015, 11:e1004404.

4. Shukla D, Meng Y, Roux B, Pande VS. Activation

pathway of Src kinase reveals intermediate states as

targets for drug design. Nat Commun 2014, 5:

ncomms4397.

5. Bowman GR, Bolin ER, Hart KM, Maguire BC,

Marqusee S. Discovery of multiple hidden allosteric sites

by combining Markov state models and experiments.

Proc Natl Acad Sci USA 2015, 112:2734–2739.

6. Wagner JR, Lee CT, Durrant JD, Malmstrom RD,

Feher VA, Amaro RE. Emerging computational

methods for the rational discovery of allosteric drugs.

Chem Rev 2016, 116:6370–6390.

7. Kendrew JC, Bodo G, Dintzis HM, Parrish RG,

Wyckoff H, Phillips DC. A three-dimensional model

of the myoglobin molecule obtained by x-ray analysis.

Nature 1958, 181:662–666.

8. Wüthrich K. Protein structure determination in solu-

tion by NMR spectroscopy. J Biol Chem 1990,

265:22059–22062.

9. Nogales E, Scheres SHW. Cryo-EM: a unique tool for

the visualization of macromolecular complexity. Mol

Cell 2015, 58:677–689.

10. Callender R, Dyer RB. Probing protein dynamics

using temperature jump relaxation spectroscopy. Curr

Opin Struct Biol 2002, 12:628–633.

11. Clore GM, Tang C, Iwahara J. Elucidating transient

macromolecular interactions using paramagnetic

relaxation enhancement. Curr Opin Struct Biol 2007,

17:603–616.

12. Ha T, Ting AY, Liang J, Caldwell WB, Deniz AA,

Chemla DS, Schultz PG, Weiss S. Single-molecule

ﬂuorescence spectroscopy of enzyme conformational

dynamics and cleavage mechanism. Proc Natl Acad

Sci USA 1999, 96:893–898.

13. Shaw DE, Deneroff MM, Dror RO, Kuskin JS,

Larson RH, Salmon JK, Young C, Batson B, Bowers

KJ, Chao JC, et al. Anton, a special-purpose machine

for molecular dynamics simulation. Commun ACM

2008, 51:91–97.

14. Eastman P, Friedrichs MS, Chodera JD, Radmer RJ,

Bruns CM, Ku JP, Beauchamp KA, Lane TJ, Wang L-

P, Shukla D, et al. OpenMM 4: a reusable, extensible,

hardware independent library for high performance

molecular simulation. J Chem Theory Comput 2013,

9:461–469.

15. Salomon-Ferrer R, Götz AW, Poole D, Le Grand S,

Walker RC. Routine microsecond molecular dynam-

ics simulations with AMBER on GPUs. 2. Explicit

solvent particle mesh Ewald. J Chem Theory Comput.

2013, 9:3878–3888.

16. Fitch BG, Germain RS, Mendell M, Pitera J, Pitman

M, Rayshubskiy A, Sham Y, Suits F, Swope W, Ward

TJC, et al. Blue matter, an application framework for

molecular simulation on blue gene. J Parallel Distrib

Comput 2003, 63:759–773.

17. Maximova T, Moffatt R, Ma B, Nussinov R,

Shehu A. Principles and overview of sampling

methods for modeling macromolecular structure and

dynamics. PLoS Comput Biol 2016, 12:e1004619.

18. Mitsutake A, Sugita Y, Okamoto Y. Generalized-

ensemble algorithms for molecular simulations of bio-

polymers. Biopolym-Pept Sci Sect 2001, 60:96–123.

19. Zheng L, Chen M, Yang W. Random walk in orthog-

onal space to achieve efﬁcient free-energy simulation

of complex systems. Proc Natl Acad Sci USA 2008,

105:20227–20232.

20. Gao YQ. An integrate-over-temperature approach for

enhanced sampling. J Chem Phys 2008, 128:64105.

21. Zhang BW, Dai W, Gallicchio E, He P, Xia J, Tan Z,

Levy RM. Simulating replica exchange: Markov state

models, proposal schemes, and the inﬁnite swapping

limit. J Phys Chem B 2016, 120:8289–8301.

22. Barducci A, Bonomi M, Parrinello M. Metadynamics.

Wiley Interdiscip Rev Comput Mol Sci 2011, 1:826–843.

23. Bolhuis PG, Chandler D, Dellago C, Geissler PL.

Transition path sampling: throwing ropes over Rough

Mountain passes, in the dark. Annu Rev Phys Chem

2002, 53:291–318.

24. Bello-Rivas JM, Elber R. Exact milestoning. J Chem

Phys 2015, 142:94102.

25. Hamelberg D, Mongan J, McCammon JA. Acceler-

ated molecular dynamics: a promising and efﬁcient

simulation method for biomolecules. J Chem Phys

2004, 120:11919–11929.

26. Bowman GR, Pande VS, Noé F. An Introduction to

Markov State Models and their Application to Long

Timescale Molecular Simulation, vol. 797. Nether-

lands: Springer Science & Business Media; 2014.

27. Da L-T, Sheong FK, Silva D-A, Huang X. Application

of Markov state models to simulate long timescale

dynamics of biological macromolecules. In: Han K,

Zhang X, Yang M, eds. Protein Conformational

Dynamics [Internet]. Advances in Experimental Med-

icine and Biology. Switzerland: Springer International

Publishing; 2014, 29–66 Available at: http://link.

springer.com/chapter/10.1007/978-3-319-02970-2_2.

(Accessed June 25, 2017).

28. Prinz J-H, Wu H, Sarich M, Keller B, Senne M, Held M,

Chodera JD, Schütte C, Noé F. Markov models of molecular kinetics: generation and validation. J Chem Phys 2011, 134:174105.

29. Noé F, Schütte C, Vanden-Eijnden E, Reich L,

Weikl TR. Constructing the equilibrium ensemble of

folding pathways from short off-equilibrium simula-

tions. Proc Natl Acad Sci USA 2009,

106:19011–19016.

30. Beauchamp KA, McGibbon R, Lin Y-S, Pande VS.

Simple few-state models reveal hidden complexity in

protein folding. Proc Natl Acad Sci USA 2012,

109:17807–17813.

31. Lane TJ, Shukla D, Beauchamp KA, Pande VS. To

milliseconds and beyond: challenges in the simulation

of protein folding. Curr Opin Struct Biol 2013,

23:58–65.

32. Voelz VA, Jäger M, Yao S, Chen Y, Zhu L,

Waldauer SA, Bowman GR, Friedrichs M, Bakajin O,

Lapidus LJ, et al. Slow unfolded-state structuring in

acyl-CoA binding protein folding revealed by simula-

tion and experiment. J Am Chem Soc 2012, 134:

12565–12577.

33. Buchete N-V, Hummer G. Coarse master equations

for peptide folding dynamics. J Phys Chem B 2008,

112:6057–6069.

34. Da L-T, Avila FP, Wang D, Huang X. A two-state

model for the dynamics of the pyrophosphate ion

release in bacterial RNA polymerase. PLoS Comput

Biol 2013, 9:e1003020.

35. Da L-T, E C, Duan B, Zhang C, Zhou X, Yu J. A

jump-from-cavity pyrophosphate ion release assisted

by a key lysine residue in T7 RNA polymerase tran-

scription elongation. PLoS Comput Biol 2015, 11:

e1004624.

36. Da L-T, Wang D, Huang X. Dynamics of pyrophos-

phate ion release and its coupled trigger loop motion

from closed to open state in RNA polymerase II. J

Am Chem Soc 2012, 134:2399–2406.

37. Silva D-A, Weiss DR, Avila FP, Da L-T, Levitt M,

Wang D, Huang X. Millisecond dynamics of RNA

polymerase II translocation at atomic resolution. Proc

Natl Acad Sci USA 2014, 111:7665–7670.

38. Weber M, Bujotzek A, Haag R. Quantifying the

rebinding effect in multivalent chemical ligand-

receptor systems. J Chem Phys 2012 Aug,

137:54111.

39. Plattner N, Doerr S, De Fabritiis G, Noé F. Complete

protein–protein association kinetics in atomic detail

revealed by molecular dynamics simulations and

Markov modelling. Nat Chem 2017, 9:1005–1011.

40. Vanatta DK, Shukla D, Lawrenz M, Pande VS. A net-

work of molecular switches controls the activation of

the two-component response regulator NtrC. Nat

Commun 2015, 6:ncomms8283.

41. Silva D-A, Bowman GR, Sosa-Peinado A, Huang X.

A role for both conformational selection and induced

ﬁt in ligand binding by the LAO protein. PLoS Com-

put Biol 2011, 7:e1002054.

42. Malmstrom RD, Kornev AP, Taylor SS, Amaro RE.

Allostery through the computational microscope:

cAMP activation of a canonical signalling domain.

Nat Commun 2015 Jul, 6:ncomms8588.

43. Lawrenz M, Shukla D, Pande VS. Cloud computing

approaches for prediction of ligand binding poses and

pathways. Sci Rep 2015, 5:srep07918.

44. Buch I, Giorgino T, Fabritiis GD. Complete recon-

struction of an enzyme-inhibitor binding process by

WIREs Computational Molecular Science Constructing MSMs

molecular dynamics simulations. Proc Natl Acad Sci

USA 2011, 108:10184–10189.

45. Zhang L, Jiang H, Sheong FK, Pardo-Avila F,

Cheung PP-H, Huang X. Constructing kinetic net-

work models to elucidate mechanisms of functional

conformational changes of enzymes and their recogni-

tion with ligands. Methods Enzymol 2016,

578:343–371.

46. Zhang L, Pardo-Avila F, Unarta IC, Cheung PP-H,

Wang G, Wang D, Huang X. Elucidation of the

dynamics of transcription elongation by RNA poly-

merase II using kinetic network models. Acc Chem

Res 2016, 49:687–694.

47. Zhu L, Sheong FK, Zeng X, Huang X. Elucidating

conformational dynamics of multi-body systems by

constructing Markov state models. Phys Chem Chem

Phys 2016, 18:30228–30235.

48. Schütte C, Sarich M. A critical appraisal of Markov

state models. Eur Phys J Spec Top 2015,

224:2445–2462.

49. Shukla D, Hernández CX, Weber JK, Pande VS. Mar-

kov state models provide insights into dynamic mod-

ulation of protein function. Acc Chem Res 2015,

48:414–422.

50. Schwantes CR, McGibbon RT, Pande VS. Perspec-

tive: Markov models for long-timescale biomolecular

dynamics. J Chem Phys 2014 Sep, 141:90901.

51. Chodera JD, Noé F. Markov state models of biomo-

lecular conformational dynamics. Curr Opin Struct

Biol 2014, 25:135–144.

52. Pande VS, Beauchamp K, Bowman GR. Everything

you wanted to know about Markov state models but

were afraid to ask. Methods 2010, 52:99–105.

53. Zwanzig R. Nonequilibrium Statistical Mechanics.

Oxford, UK: Oxford University Press; 2001.

54. Zwanzig R. From classical dynamics to continuous

time random walks. J Stat Phys 1983, 30:255–262.

55. Mori H. Transport, collective motion, and Brownian

motion. Prog Theor Phys 1965, 33:423–455.

56. Zwanzig R. Ensemble method in the theory of irre-

versibility. J Chem Phys 1960, 33:1338–1341.

57. Nüske F, Keller BG, Pérez-Hernández G, Mey ASJS,

Noé F. Variational approach to molecular kinetics. J

Chem Theory Comput. 2014, 10:1739–1752.

58. Husic BE, McGibbon RT, Sultan MM, Pande VS.

Optimized parameter selection reveals trends in Mar-

kov state models for protein folding. J Chem Phys

2016, 145:194103.

59. McGibbon RT, Pande VS. Variational cross-

validation of slow dynamical modes in molecular

kinetics. J Chem Phys 2015, 142:124105.

60. Schwantes CR, Pande VS. Improvements in Markov

state model construction reveal many non-native

interactions in the folding of NTL9. J Chem Theory

Comput. 2013, 9:2000–2009.

61. Pérez-Hernández G, Paul F, Giorgino T, De

Fabritiis G, Noé F. Identiﬁcation of slow molecular

order parameters for Markov model construction. J

Chem Phys 2013, 139:15102.

62. Naritomi Y, Fuchigami S. Slow dynamics of a protein

backbone in molecular dynamics simulation revealed

by time-structure based independent component anal-

ysis. J Chem Phys 2013, 139:215102.

63. Schütte C, Noé F, Lu J, Sarich M, Vanden-Eijnden E.

Markov state models based on milestoning. J Chem

Phys 2011 May 24, 134:204105.

64. Harrigan MP, Sultan MM, Hernandez CX, Husic BE,

Eastman P, Schwantes CR, Beauchamp KA, McGibbon

RT, Pande VS. MSMBuilder: statistical models for bio-

molecular dynamics. bioRxiv 2016;84020.

65. Beauchamp KA, Bowman GR, Lane TJ, Maibaum L,

Haque IS, Pande VS. MSMBuilder2: modeling conforma-

tional dynamics on the picosecond to millisecond scale. J

Chem Theory Comput. 2011, 7:3412–3419.

66. Bowman GR, Huang X, Pande VS. Using generalized

ensemble simulations and Markov state models to iden-

tify conformational states. Methods 2009, 49:197–201.

67. Scherer MK, Trendelkamp-Schroer B, Paul F, Pérez-

Hernández G, Hoffmann M, Plattner N, Wehmeyer C,

Prinz J-H, Noé F. PyEMMA 2: a software package for

estimation, validation, and analysis of Markov models.

J Chem Theory Comput 2015, 11:5525–5542.

68. Harvey MJ, De Fabritiis G. High-throughput molecu-

lar dynamics: the powerful new tool for drug discov-

ery. Drug Discov Today 2012, 17:1059–1062.

69. Liu S, Zhu L, Sheong FK, Wang W, Huang X. Adap-

tive partitioning by local density-peaks: an efﬁcient

density-based clustering algorithm for analyzing

molecular dynamics trajectories. J Comput Chem

2017, 38:152–160.

70. Frank J. Three-Dimensional Electron Microscopy of

Macromolecular Assemblies: Visualization of Biologi-

cal Molecules in Their Native State. Oxford, UK:

Oxford University Press; 2006.

71. Rhee YM, Pande VS. Multiplexed-replica exchange

molecular dynamics method for protein folding simu-

lation. Biophys J 2003, 84:775–786.

72. Sugita Y, Okamoto Y. Replica-exchange molecular

dynamics method for protein folding. Chem Phys Lett

1999, 314:141–151.

73. Weiss DR, Levitt M. Can morphing methods predict

intermediate structures? J Mol Biol 2009,

385:665–674.

74. Isralewitz B, Gao M, Schulten K. Steered molecular

dynamics and mechanical functions of proteins. Curr

Opin Struct Biol 2001, 11:224–230.

75. Schlitter J, Engels M, Krüger P. Targeted molecular

dynamics: a new approach for searching pathways of

conformational transitions. JMolGraph1994,

12:84–89.

Advanced Review wires.wiley.com/compmolsci

76. Chovancova E, Pavelka A, Benes P, Strnad O,

Brezovsky J, Kozlikova B, Gora A, Sustr V, Klvana

M, Medek P, et al. CAVER 3.0: a tool for the analy-

sis of transport pathways in dynamic protein struc-

tures. PLOS Comput Biol 2012, 8:e1002708.

77. Zhang L, Silva D-A, Pardo-Avila F, Wang D,

Huang X. Structural model of RNA polymerase II

elongation complex with complete transcription bub-

ble reveals NTP entry routes. PLoS Comput Biol

2015, 11:e1004354.

78. Zimmerman MI, Bowman GR. FAST conformational

searches by balancing exploration/exploitation trade-

offs. J Chem Theory Comput 2015, 11:5747–5757.

79. Gan W, Yang S, Roux B. Atomistic view of the confor-

mational activation of Src kinase using the string method

with swarms-of-trajectories. Biophys J 2009, 97:L8–L10.

80. Maragliano L, Fischer A, Vanden-Eijnden E,

Ciccotti G. String method in collective variables: min-

imum free energy paths and isocommittor surfaces. J

Chem Phys 2006, 125:24106.

81. Pan AC, Roux B. Building Markov state models

along pathways to determine free energies and rates

of transitions. J Chem Phys 2008, 129:64107.

82. Pan AC, Sezer D, Roux B. Finding transition path-

ways using the string method with swarms of trajec-

tories. J Phys Chem B 2008, 112:3432–3440.

83. Maragliano L, Vanden-Eijnden E. On-the-ﬂy string

method for minimum free energy paths calculation.

Chem Phys Lett. 2007, 446:182–190.

84. Díaz Leines G, Ensing B. Path ﬁnding on high-

dimensional free energy landscapes. Phys Rev Lett

2012, 109:20601.

85. Pelt DM, Batenburg KJ. Fast tomographic reconstruc-

tion from limited data using artiﬁcial neural net-

works. IEEE Trans Image Process 2013,

22:5238–5251.

86. Theobald DL. Rapid calculation of RMSDs using a

quaternion-based characteristic polynomial. Acta

Crystallogr A 2005, 61:478–480.

87. Ernst M, Sittel F, Stock G. Contact- and distance-

based principal component analysis of protein

dynamics. J Chem Phys 2015, 143:244114.

88. Schwantes CR, Pande VS. Modeling molecular kinet-

ics with tICA and the kernel trick. J Chem Theory

Comput 2015, 11:600–608.

89. Pérez-Hernández G, Noé F. Hierarchical time-lagged

independent component analysis: computing slow modes

and reaction coordinates for large molecular systems. J

Chem Theory Comput 2016, 12:6118–6129.

90. Boninsegna L, Gobbo G, Noé F, Clementi C. Investi-

gating molecular kinetics by variationally optimized

diffusion maps. J Chem Theory Comput 2015,

11:5947–5960.

91. Rohrdanz MA, Zheng W, Clementi C. Discovering

mountain passes via torchlight: methods for the

deﬁnition of reaction coordinates and pathways in

complex macromolecular reactions. Annu Rev Phys

Chem 2013, 64:295–316.

92. Noé F, Clementi C. Collective variables for the study of

long-time kinetics from molecular trajectories: theory and

methods. Curr Opin Struct Biol 2017, 43:141–147.

93. Noé F, Clementi C. Kinetic distance and kinetic maps

from molecular dynamics simulation. J Chem Theory

Comput 2015, 11:5002–5011.

94. Sculley D. Web-scale K-means clustering. In: Proceed-

ings of the 19th International Conference on World

Wide Web [Internet] (WWW ‘10). New York: ACM;

2010, 1177–1178. Available at: http://doi.acm.org/

10.1145/1772690.1772862

95. Gonzalez TF. Clustering to minimize the maximum inter-

cluster distance. Theor Comput Sci 1985, 38:293–306.

96. Zhao Y, Sheong FK, Sun J, Sander P, Huang X. A

fast parallel clustering algorithm for molecular simu-

lation trajectories. J Comput Chem 2013, 34:95–104.

97. Sheong FK, Silva D-A, Meng L, Zhao Y, Huang X.

Automatic state partitioning for multibody systems

(APM): an efﬁcient algorithm for constructing Mar-

kov state models to elucidate conformational dynam-

ics of multibody systems. J Chem Theory Comput

2015, 11:17–27.

98. Ward JH. Hierarchical grouping to optimize an

objective function. J Am Stat Assoc 1963,

58:236–244.

99. Husic BE, Pande VS. Ward clustering improves cross-

validated Markov state models of protein folding. J

Chem Theory Comput 2017, 13:963–967.

100. Sittel F, Stock G. Robust density-based clustering to

identify metastable conformational states of proteins.

J Chem Theory Comput 2016, 12:2426–2435.

101. Bowman GR, Beauchamp KA, Boxer G, Pande VS.

Progress and challenges in the automated construc-

tion of Markov state models for full protein systems.

J Chem Phys 2009, 131:124101.

102. Scalco R, Caﬂisch A. Equilibrium distribution from

distributed computing (simulations of protein fold-

ing). J Phys Chem B 2011 May, 115:6358–6365.

103. Metzner P, Noé F, Schütte C. Estimating the sampling

error: distribution of transition matrices and func-

tions of transition matrices for given trajectory data.

Phys Rev E 2009, 80:21106.

104. Efron B, Tibshirani RJ. An Introduction to the Boot-

strap. New York: Chapman & Hall; 1994.

105. Bowman GR, Ensign DL, Pande VS. Enhanced

modeling via network theory: adaptive sampling of

Markov state models. J Chem Theory Comput 2010,

6:787–794.

106. Doerr S, Harvey MJ, Noé F, De Fabritiis G. HTMD:

high-throughput molecular dynamics for molecular

discovery. J Chem Theory Comput 2016,

12:1845–1852.

WIREs Computational Molecular Science Constructing MSMs

107. Hinrichs NS, Pande VS. Calculation of the distribu-

tion of eigenvalues and eigenvectors in Markovian

state models for molecular dynamics. J Chem Phys

2007, 126:244101.

108. Voelz VA, Elman B, Razavi AM, Zhou G. Surprisal

metrics for quantifying perturbed conformational

dynamics in Markov state models. J Chem Theory

Comput 2014, 10:5716–5728.

109. Weinan E, Vanden-Eijnden E. Transition-path theory

and path-ﬁnding algorithms for the study of rare

events. Annu Rev Phys Chem 2010, 61:391–420.

110. Metzner P, Schütte C, Vanden-Eijnden E. Transition

path theory for Markov jump processes. Multiscale

Model Simul 2009, 7:1192–1219.

111. Meng L, Sheong FK, Zeng X, Zhu L, Huang X. Path

lumping: an efﬁcient algorithm to identify metastable

path channels for conformational dynamics of multi-

body systems. J Chem Phys 2017, 147:44112.

112. Zhou G, Pantelopulos GA, Mukherjee S, Voelz VA.

Bridging microscopic and macroscopic mechanisms

of p53-MDM2 binding using molecular simulations

and kinetic network models. bioRxiv 2016;86272.

113. Zheng X, Zhu L, Zeng X, Meng L, Zhang L, Wang

D, Huang X. Kinetics-controlled amphiphile self-

assembly processes. J Phys Chem Lett 2017,

8:1798–1803.

114. Bowman GR, Meng L, Huang X. Quantitative com-

parison of alternative methods for coarse-graining

biological networks. J Chem Phys 2013, 139:121905.

115. Deuﬂhard P, Huisinga W, Fischer A, Schütte C. Iden-

tiﬁcation of almost invariant aggregates in reversible

nearly uncoupled Markov chains. Linear Algebra

Appl 2000, 315:39–59.

116. Deuﬂhard P, Weber M. Robust Perron cluster analy-

sis in conformation dynamics. Linear Algebra Appl

2005, 398:161–184.

117. Röblitz S, Weber M. Fuzzy spectral clustering by

PCCA+: application to Markov state models and data

classiﬁcation. Adv Data Anal Classif 2013,

7:147–179.

118. Shi J, Malik J. Normalized cuts and image segmenta-

tion. IEEE Trans Pattern Anal Mach Intell 2000,

22:888–905.

119. Ng AY, Jordan MI, Weiss Y. On spectral clustering:

analysis and an algorithm. In: Proceedings of the

14th International Conference on Neural Information

Processing Systems: Natural and Synthetic [Internet]

(NIPS’01). Cambridge, MA: MIT Press; 2001,

849–856. Available at: http://dl.acm.org/citation.cfm?

id=2980539.2980649

120. Chodera JD, Singhal N, Pande VS, Dill KA,

Swope WC. Automatic discovery of metastable states

for the construction of Markov models of macromo-

lecular conformational dynamics. J Chem Phys 2007,

126:155101.

121. Bowman GR. Improved coarse-graining of Markov

state models via explicit consideration of statistical

uncertainty. J Chem Phys 2012, 137:134111.

122. Jain A, Stock G. Identifying metastable states of folding

proteins. J Chem Theory Comput 2012, 8:3810–3819.

123. Huang X, Yao Y, Bowman GR, Sun J, Guibas LJ,

Carlsson G, Pande VS. Constructing multi-resolution

Markov state models (MSMS) to elucidate RNA hair-

pin folding mechanisms. In: Biocomputing 2010

[Internet]. World Scientiﬁc; 2009, 228–239. Available

at: http://www.worldscientiﬁc.com/doi/abs/10.1142/

9789814295291_0025

124. Yao Y, Cui RZ, Bowman GR, Silva D-A, Sun J,

Huang X. Hierarchical Nyström methods for con-

structing Markov state models for conformational

dynamics. J Chem Phys 2013, 138:174106.

125. Orioli S, Faccioli P. Dimensional reduction of Mar-

kov state models from renormalization group theory.

J Chem Phys 2016, 145:124120.

126. Bacallado S, Chodera JD, Pande V. Bayesian compar-

ison of Markov models of molecular dynamics with

detailed balance constraint. J Chem Phys 2009,

131:45106.

127. Noé F, Wu H, Prinz J-H, Plattner N. Projected and

hidden Markov models for calculating kinetics and

metastable states of complex molecules. J Chem Phys

2013, 139:184114.

128. McGibbon R, Ramsundar B, Sultan M, Kiss G,

Pande V. Understanding protein dynamics with L1-

regularized reversible hidden Markov models. In:

International Conference on Machine Learning;

2014, 1197–1205.

129. Shukla S, Shamsi Z, Moffett A, Selvam B, Shukla D.

Application of hidden Markov models in biomolecu-

lar simulations. In: Westhead DR, Vijayabaskar MS,

eds. Hidden Markov Models [Internet]. Methods in

Molecular Biology. New York: Springer; 2017,

29–41. https://doi.org/10.1007/978-1-4939-6753-7_3.

130. Hummer G, Szabo A. Optimal dimensionality reduc-

tion of multistate kinetic and Markov-state models. J

Phys Chem B 2015, 119:9029–9037.

131. Wu H, Paul F, Wehmeyer C, Noé F. Multiensemble

Markov models of molecular thermodynamics and kinet-

ics. Proc Natl Acad Sci USA 2016, 113:E3221–E3230.

132. Mey ASJS, Wu H, Noé F. xTRAM: estimating equi-

librium expectations from time-correlated simulation

data at multiple thermodynamic states. Phys Rev X

2014, 4:41018.

133. Wu H, Mey ASJS, Rosta E, Noé F. Statistically opti-

mal analysis of state-discretized trajectory data from

multiple thermodynamic states. J Chem Phys 2014,

141:214106.

134. Lemke O, Keller BG. Density-based cluster algo-

rithms for the identiﬁcation of core sets. J Chem Phys

2016, 145:164104.

Advanced Review wires.wiley.com/compmolsci

Log-periodic oscillations as real-time signatures of hierarchical dynamics in proteins

Article

Feb 2024

The time-dependent relaxation of a dynamical system may exhibit a power-law behavior that is superimposed by log-periodic oscillations. D. Sornette [Phys. Rep. 297, 239 (1998)] showed that this behavior can be explained by a discrete scale invariance of the system, which is associated with discrete and equidistant timescales on a logarithmic scale. Examples include such diverse fields as financial crashes, random diffusion, and quantum topological materials. Recent time-resolved experiments and molecular dynamics simulations suggest that discrete scale invariance may also apply to hierarchical dynamics in proteins, where several fast local conformational changes are a prerequisite for a slow global transition to occur. Employing entropy-based timescale analysis and Markov state modeling to a simple one-dimensional hierarchical model and biomolecular simulation data, it is found that hierarchical systems quite generally give rise to logarithmically spaced discrete timescales. By introducing a one-dimensional reaction coordinate that collectively accounts for the hierarchically coupled degrees of freedom, the free energy landscape exhibits a characteristic staircase shape with two metastable end states, which causes the log-periodic time evolution of the system. The period of the log-oscillations reflects the effective roughness of the energy landscape and can, in simple cases, be interpreted in terms of the barriers of the staircase landscape.

A kinetic model reveals the critical gating motifs for donor-substrate loading into Actinobacillus pleuropneumoniae N-glycosyltransferase

Article

Apr 2024
PHYS CHEM CHEM PHYS

Through constructing a kinetic model based on extensive all-atom molecular dynamics simulations, the key structural motifs in ApNGT Q469A responsible for mediating the donor-substrate loading are pinpointed.

The Arabidopsis AtSWEET13 transporter discriminates sugars by selective facial and positional substrate recognition

Article

Full-text available

Jun 2024

Transporters are targeted by endogenous metabolites and exogenous molecules to reach cellular destinations, but it is generally not understood how different substrate classes exploit the same transporter’s mechanism. Any disclosure of plasticity in transporter mechanism when treated with different substrates becomes critical for developing general selectivity principles in membrane transport catalysis. Using extensive molecular dynamics simulations with an enhanced sampling approach, we select the Arabidopsis sugar transporter AtSWEET13 as a model system to identify the basis for glucose versus sucrose molecular recognition and transport. Here we find that AtSWEET13 chemical selectivity originates from a conserved substrate facial selectivity demonstrated when committing alternate access, despite mono-/di-saccharides experiencing differing degrees of conformational and positional freedom throughout other stages of transport. However, substrate interactions with structural hallmarks associated with known functional annotations can help reinforce selective preferences in molecular transport.

Diffusive dynamics of a model protein chain in solution

Article

Feb 2024

A Markov state model is a powerful tool that can be used to track the evolution of populations of configurations in an atomistic representation of a protein. For a coarse-grained linear chain model with discontinuous interactions, the transition rates among states that appear in the Markov model when the monomer dynamics is diffusive can be determined by computing the relative entropy of states and their mean first passage times, quantities that are unchanged by the specification of the energies of the relevant states. In this paper, we verify the folding dynamics described by a diffusive linear chain model of the crambin protein in three distinct solvent systems, each differing in complexity: a hard-sphere solvent, a solvent undergoing multi-particle collision dynamics, and an implicit solvent model. The predicted transition rates among configurations agree quantitatively with those observed in explicit molecular dynamics simulations for all three solvent models. These results suggest that the local monomer–monomer interactions provide sufficient friction for the monomer dynamics to be diffusive on timescales relevant to changes in conformation. Factors such as structural ordering and dynamic hydrodynamic effects appear to have minimal influence on transition rates within the studied solvent densities.

Information Bottleneck Approach for Markov Model Construction

Article

Jun 2024
J CHEM THEORY COMPUT

Advanced computational approaches to understand protein aggregation

Article

Apr 2024

Protein aggregation is a widespread phenomenon implicated in debilitating diseases like Alzheimer's, Parkinson's, and cataracts, presenting complex hurdles for the field of molecular biology. In this review, we explore the evolving realm of computational methods and bioinformatics tools that have revolutionized our comprehension of protein aggregation. Beginning with a discussion of the multifaceted challenges associated with understanding this process and emphasizing the critical need for precise predictive tools, we highlight how computational techniques have become indispensable for understanding protein aggregation. We focus on molecular simulations, notably molecular dynamics (MD) simulations, spanning from atomistic to coarse-grained levels, which have emerged as pivotal tools in unraveling the complex dynamics governing protein aggregation in diseases such as cataracts, Alzheimer's, and Parkinson's. MD simulations provide microscopic insights into protein interactions and the subtleties of aggregation pathways, with advanced techniques like replica exchange molecular dynamics, Metadynamics (MetaD), and umbrella sampling enhancing our understanding by probing intricate energy landscapes and transition states. We delve into specific applications of MD simulations, elucidating the chaperone mechanism underlying cataract formation using Markov state modeling and the intricate pathways and interactions driving the toxic aggregate formation in Alzheimer's and Parkinson's disease. Transitioning we highlight how computational techniques, including bioinformatics, sequence analysis, structural data, machine learning algorithms, and artificial intelligence have become indispensable for predicting protein aggregation propensity and locating aggregation-prone regions within protein sequences. 
Throughout our exploration, we underscore the symbiotic relationship between computational approaches and empirical data, which has paved the way for potential therapeutic strategies against protein aggregation-related diseases. In conclusion, this review offers a comprehensive overview of advanced computational methodologies and bioinformatics tools that have catalyzed breakthroughs in unraveling the molecular basis of protein aggregation, with significant implications for clinical interventions, standing at the intersection of computational biology and experimental research.

Tutorial on how to build non-Markovian dynamic models from molecular dynamics simulations for studying protein conformational changes

Article

Mar 2024

Protein conformational changes play crucial roles in their biological functions. In recent years, the Markov State Model (MSM) constructed from extensive Molecular Dynamics (MD) simulations has emerged as a powerful tool for modeling complex protein conformational changes. In MSMs, dynamics are modeled as a sequence of Markovian transitions among metastable conformational states at discrete time intervals (called lag time). A major challenge for MSMs is that the lag time must be long enough to allow transitions among states to become memoryless (or Markovian). However, this lag time is constrained by the length of individual MD simulations available to track these transitions. To address this challenge, we have recently developed Generalized Master Equation (GME)-based approaches, encoding non-Markovian dynamics using a time-dependent memory kernel. In this Tutorial, we introduce the theory behind two recently developed GME-based non-Markovian dynamic models: the quasi-Markov State Model (qMSM) and the Integrative Generalized Master Equation (IGME). We subsequently outline the procedures for constructing these models and provide a step-by-step tutorial on applying qMSM and IGME to study two peptide systems: alanine dipeptide and villin headpiece. This Tutorial is available at https://github.com/xuhuihuang/GME_tutorials. The protocols detailed in this Tutorial aim to be accessible for non-experts interested in studying the biomolecular dynamics using these non-Markovian dynamic models.

Revealing the conformational dynamics of UDP-GlcNAc recognition by O-GlcNAc transferase via Markov state model

Article

Nov 2023

Unlocking the potential of RNAi as a therapeutic strategy against infectious viruses: an in-silico study

Article

Nov 2023

RNA interference is an upcoming methodology being designed to specifically target viral infections. The current study suggests a strategy to design probable small interfering RNAs (siRNA) for targeting the viral genome of SARS-CoV-2, as a case study. siRNAs were designed against the targets from a highly conserved region of the spike gene of SARS-CoV-2 having no significant matches within the human genome. Four targets/viral RNAs (vRNA) with high predicted inhibition values were selected for further evaluation. The predicted siRNAs were examined for their properties and stability using molecular dynamics (MD) simulations. Further, to understand the RNA-Induced Silencing Complex (RISC) mechanism of the predicted siRNA targets of SARS-CoV-2, the human argonaute (Ago2) protein in complex with the four siRNA-vRNA duplexes was built. MD simulations of apo-Ago2, four selected siRNA-vRNA duplexes and four Ago2 bound to these siRNA-vRNA duplexes were carried out for 1 μs each. Amongst the four duplex-bound Ago2 simulation systems, the siRNA-vRNA3 duplex showed stable base pairing in the seed region, favourable and strong interactions with functionally important residues of Ago2 protein through the simulation length. Therefore, the designed siRNA3 molecule may act as an effective therapeutic agent against the SARS-CoV-2. The reported in-silico strategy may be beneficial for the identification and designing of probable siRNAs against any viral genome in RNAi therapeutics. However, the experimental validation of these molecules would be required for proving their use as therapeutics.

Computational Studies of Enzyme Motions

Chapter

Oct 2020

Qiang Cui

Protein structure determination in solution by NMR spectroscopy.

Article

Full-text available

Dec 1990

Kurt Wüthrich

The introduction of nuclear magnetic resonance (NMR) spectroscopy as a second method for protein structure determination at atomic resolution, in addition to x-ray diffraction in single crystals, has already led to a significant increase in the number of known protein structures. The NMR method provides data that are in many ways complementary to those obtained from x-ray crystallography and thus promises to widen our view of protein molecules, giving a clearer insight into the relation between structure and function.

Path lumping: An efficient algorithm to identify metastable path channels for conformational dynamics of multi-body systems

Article

Full-text available

Jul 2017

Constructing Markov state models from large-scale molecular dynamics simulation trajectories is a promising approach to dissect the kinetic mechanisms of complex chemical and biological processes. Combined with transition path theory, Markov state models can be applied to identify all pathways connecting any conformational states of interest. However, the identified pathways can be too complex to comprehend, especially for multi-body processes where numerous parallel pathways with comparable flux probability often coexist. Here, we have developed a path lumping method to group these parallel pathways into metastable path channels for analysis. We define the similarity between two pathways as the intercrossing flux between them and then apply the spectral clustering algorithm to lump these pathways into groups. We demonstrate the power of our method by applying it to two systems: a 2D-potential consisting of four metastable energy channels and the hydrophobic collapse process of two hydrophobic molecules. In both cases, our algorithm successfully reveals the metastable path channels. We expect this path lumping algorithm to be a promising tool for revealing unprecedented insights into the kinetic mechanisms of complex multi-body processes.

Kinetics-Controlled Amphiphile Self-Assembly Processes

Article

Full-text available

Apr 2017

Amphiphiles self-assembly is an essential bottom-up approach of fabricating advanced functional materials. Self-assembled materials with desired structures are often obtained through thermodynamic control. Here, we demonstrate that the selection of kinetic pathways can lead to drastically different self-assembled structures, underlining the significance of kinetic control in self-assembly. By constructing kinetic network models from large-scale molecular dynamics simulations, we show that two largely similar amphiphiles PYR and PYN prefer distinct kinetic assembly pathways. While PYR prefers an incremental growth mechanism and forms a nanotube, PYN favors a hopping growth pathway leading to a vesicle. Such preference was found to originate from the subtle difference in the distributions of hydrophobic and hydrophilic groups in their chemical structures, which leads to different rates of the adhesion process among the aggregating micelles. Our results are in good agreement with experimental results, and accentuates the role of kinetics in the rational design of amphiphiles self-assembly.

Bridging microscopic and macroscopic mechanisms of p53-MDM2 binding using molecular simulations and kinetic network models

Preprint

Dec 2016

Under normal cellular conditions, the tumor suppressor protein p53 is kept at low levels in part due to ubiquitination by MDM2, a process initiated by binding of MDM2 to the intrinsically disordered transactivation domain (TAD) of p53. Although many experimental and simulation studies suggest that disordered domains such as p53 TAD bind their targets nonspecifically before folding to a tightly-associated conformation, the molecular details are unclear. Toward a detailed prediction of binding mechanism, pathways and rates, we have performed large-scale unbiased all-atom simulations of p53-MDM2 binding. Markov State Models (MSMs) constructed from the trajectory data predict p53 TAD peptide binding pathways and on-rates in good agreement with experiment. The MSM reveals that two key bound intermediates, each with a non-native arrangement of hydrophobic residues in the MDM2 binding cleft, control the overall on-rate. Using microscopic rate information from the MSM, we parameterize a simple four-state kinetic model to (1) determine that induced-fit pathways dominate the binding flux over a large range of concentrations, and (2) predict how modulation of residual p53 helicity affects binding, in good agreement with experiment. These results suggest new ways in which microscopic models of bound-state ensembles can be used to understand biological function on a macroscopic scale. Author summary: Many cell signaling pathways involve protein-protein interactions in which an intrinsically disordered peptide folds upon binding its target. Determining the molecular mechanisms that control these binding rates is important for understanding how such systems are regulated. In this paper, we show how extensive all-atom simulations combined with kinetic network models provide a detailed mechanistic understanding of how tumor suppressor protein p53 binds to MDM2, an important target of new cancer therapeutics. A simple four-state model parameterized from the simulations shows a binding-then-folding mechanism, and recapitulates experiments in which residual helicity boosts binding. This work goes beyond previous simulations of small-molecule binding, to achieve pathways and binding rates for a large peptide, in good agreement with experiment.

An Introduction to the Bootstrap

Book

May 1994

Bridging Microscopic and Macroscopic Mechanisms of p53-MDM2 Binding with Kinetic Network Models

Article

Aug 2017

Under normal cellular conditions, the tumor suppressor protein p53 is kept at low levels in part due to ubiquitination by MDM2, a process initiated by binding of MDM2 to the intrinsically disordered transactivation domain (TAD) of p53. Many experimental and simulation studies suggest that disordered domains such as p53 TAD bind their targets nonspecifically before folding to a tightly associated conformation, but the microscopic details are unclear. Toward a detailed prediction of binding mechanisms, pathways, and rates, we have performed large-scale unbiased all-atom simulations of p53-MDM2 binding. Markov state models (MSMs) constructed from the trajectory data predict p53 TAD binding pathways and on-rates in good agreement with experiment. The MSM reveals that two key bound intermediates, each with a nonnative arrangement of hydrophobic residues in the MDM2 binding cleft, control the overall on-rate. Using microscopic rate information from the MSM, we parameterize a simple four-state kinetic model to 1) determine that induced-fit pathways dominate the binding flux over a large range of concentrations, and 2) predict how modulation of residual p53 helicity affects binding, in good agreement with experiment. These results suggest new ways in which microscopic models of peptide binding, coupled with simple few-state binding flux models, can be used to understand biological function in physiological contexts.

Complete protein–protein association kinetics in atomic detail revealed by molecular dynamics simulations and Markov modelling

Article

Jun 2017

Protein–protein association is fundamental to many life processes. However, a microscopic model describing the structures and kinetics during association and dissociation is lacking on account of the long lifetimes of associated states, which have prevented efficient sampling by direct molecular dynamics (MD) simulations. Here we demonstrate protein–protein association and dissociation in atomistic resolution for the ribonuclease barnase and its inhibitor barstar by combining adaptive high-throughput MD simulations and hidden Markov modelling. The model reveals experimentally consistent intermediate structures, energetics and kinetics on timescales from microseconds to hours. A variety of flexibly attached intermediates and misbound states funnel down to a transition state and a native basin consisting of the loosely bound near-native state and the tightly bound crystallographic state. These results offer a deeper level of insight into macromolecular recognition and our approach opens the door for understanding and manipulating a wide range of macromolecular association processes.

Application of Hidden Markov Models in Biomolecular Simulations

Chapter

Feb 2017
Meth Mol Biol

Hidden Markov models (HMMs) provide a framework to analyze large trajectories of biomolecular simulation datasets. HMMs decompose the conformational space of a biological molecule into a finite number of states that interconvert among each other with certain rates. HMMs simplify long timescale trajectories for human comprehension, and allow comparison of simulations with experimental data. In this chapter, we provide an overview of building HMMs for analyzing biomolecular simulation datasets. We demonstrate the procedure for building a Hidden Markov model for a Met-enkephalin peptide simulation dataset and compare the timescales of the process.

Collective variables for the study of long-time kinetics from molecular trajectories: theory and methods

Article

Apr 2017

Collective variables are an important concept for studying high-dimensional dynamical systems, such as the molecular dynamics of macromolecules, liquids, or polymers, in particular for defining relevant metastable states and state transitions or phase transitions. Over the past decade, a rigorous mathematical theory has been formulated to define optimal collective variables to characterize slow dynamical processes. Here we review recent developments, including a variational principle to find optimal approximations to slow collective variables from simulation data, and algorithms such as the time-lagged independent component analysis. Using these concepts, a distance metric can be defined that quantifies how slowly molecular conformations interconvert. Extensions and open questions are discussed.

Ward Clustering Improves Cross-Validated Markov State Models of Protein Folding

Article

Feb 2017

Markov state models (MSMs) are a powerful framework for analyzing protein dynamics. MSMs require the decomposition of conformation space into states via clustering, which can be cross-validated when a prediction method is available for the clustering method. We present an algorithm for predicting cluster assignments of new data points with Ward's minimum variance method. We then show that clustering with Ward's method produces better or equivalent cross-validated MSMs for protein folding than other clustering algorithms.
