
Constructing Markov State Models to elucidate the functional conformational changes of complex biomolecules

Authors:
  • The Chinese University of Hong Kong - Shenzhen

Abstract and Figures

The function of complex biomolecular machines relies heavily on their conformational changes. Investigating these functional conformational changes is therefore essential for understanding the corresponding biological processes and promoting bioengineering applications and rational drug design. Constructing Markov State Models (MSMs) based on large‐scale molecular dynamics simulations has emerged as a powerful approach to model functional conformational changes of the biomolecular system with sufficient resolution in both time and space. However, the rapid development of theory and algorithms for constructing MSMs has made it difficult for nonexperts to understand and apply the MSM framework, necessitating a comprehensive guidance toward its theory and practical usage. In this study, we introduce the MSM theory of conformational dynamics based on the projection operator scheme. We further propose a general protocol of constructing MSM to investigate functional conformational changes, which integrates the state‐of‐the‐art techniques for building and optimizing initial pathways, performing adaptive sampling and constructing MSMs. We anticipate this protocol to be widely applied and useful in guiding nonexperts to study the functional conformational changes of large biomolecular systems via the MSM framework. We also discuss the current limitations of MSMs and some alternative methods to alleviate them. WIREs Comput Mol Sci 2018, 8:e1343. doi: 10.1002/wcms.1343 This article is categorized under: Structure and Mechanism > Computational Biochemistry and Biophysics Theoretical and Physical Chemistry > Statistical Mechanics
Advanced Review (wires.wiley.com/compmolsci)
Constructing Markov State Models to elucidate the functional conformational changes of complex biomolecules
Wei Wang,1,2 Siqin Cao,1 Lizhe Zhu1,2 and Xuhui Huang1,2,3,4*
© 2017 Wiley Periodicals, Inc.
How to cite this article:
WIREs Comput Mol Sci 2017, e1343. doi: 10.1002/wcms.1343
These authors contributed equally to this work.
*Correspondence to: xuhuihuang@ust.hk
1 Department of Chemistry, The Hong Kong University of Science and Technology, Kowloon, Hong Kong
2 Center of Systems Biology and Human Health, The Hong Kong University of Science and Technology, Kowloon, Hong Kong
3 Hong Kong Branch of Chinese National Engineering Research Center for Tissue Restoration & Reconstruction, The Hong Kong University of Science and Technology, Kowloon, Hong Kong
4 HKUST-Shenzhen Research Institute, Shenzhen, China
Conflict of interest: The authors have declared no conflicts of interest for this article.

INTRODUCTION
Conformational changes of complex biomolecules are indispensable features of their function. Investigating the functional conformational changes of biomacromolecules is thus essential not only for revealing the mechanisms of the corresponding biological processes,[1–3] but also for rational drug design[4–6] and various biotechnological applications. However, direct investigation of functional dynamics by experiment remains challenging, because it is still difficult for current experimental techniques[7–12] to reach atomic resolution in both space and time. Molecular dynamics (MD) simulation has therefore emerged as a powerful approach to complement experiments, since it can simulate the motions of all atoms in the biomolecular system on timescales as short as femtoseconds.
Nonetheless, it is still difficult to study complex biomolecular systems directly through brute-force MD simulations. This is because large-scale conformational changes of large biomolecules, e.g., RNA Polymerase II (more than 300,000 atoms in explicit water),[2] typically occur on submillisecond timescales, beyond the affordable length of MD simulations. This timescale gap can be bridged either by accelerating MD via advanced hardware[13–16] or by adopting advanced sampling[17–20] and analysis techniques such as replica exchange MD (REMD),[21] metadynamics,[22] transition path sampling,[23] milestoning,[24] accelerated MD,[25] and Markov State Models (MSMs).[26–28]
MSM approaches represent a powerful theoretical framework that has been widely applied in the past decade to study protein folding[29–33] and functional conformational dynamics[1–3,34–44] of many biomolecular systems. In an MSM, statistical models are built to reach the timescales involved in the functional conformational changes between known structures, based on an ensemble of short trajectories initiated from different regions of the free energy landscape of the system. This feature of MSMs allows highly parallelized sampling and a systematic and statistical description of the system under study.
As MSMs have become increasingly popular, new algorithms for constructing and validating MSMs have been continuously developed in recent years, necessitating a comprehensive review of these advances.[45–52] Since most existing reviews of MSMs are theory- and algorithm-oriented, here we aim to provide a systematic guide to their practical use, particularly in the context of studying functional conformational changes of biomolecules. After a brief introduction to the basic theory of MSMs, we suggest a detailed protocol for nonexperts who plan to apply MSMs to investigate functional conformational changes in biomolecular systems. All commonly used methods in our protocol and their representative cutting-edge applications are reviewed. Finally, we discuss the limitations of MSMs, as well as alternative methods and future developments that may alleviate these limitations.
THE BASIC THEORY OF MSMs
From a microscopic point of view, the kinetics of any system can be precisely predicted by Liouville's equation. However, for biomolecules whose dynamics span multiple timescales, this equation is too complex for practical use. A natural solution is then to adopt a macroscopic perspective, focusing on the slow degrees of freedom (DOFs) that dominate the dynamics while ignoring the less relevant fast DOFs. Constructing MSMs is one popular approach to achieve this. In this section, we adopt a projection operator scheme[53] proposed by Zwanzig[54] and Mori[55] to derive the basic equations of the MSM.
Liouville's equation for the phase space distribution reads $\partial\rho(\Gamma;t)/\partial t = \mathcal{L}\rho(\Gamma;t)$, where the Liouville operator $\mathcal{L}$ contains all the information of the dynamic system and $\Gamma = (x; p) = (x_1, x_2, \ldots, x_{3N}; p_1, p_2, \ldots, p_{3N})$. In a discrete time sequence, $t = n\tau$, the evolution of the distribution function follows

$$\rho(\Gamma; t+\tau) = e^{\mathcal{L}\tau}\rho(\Gamma; t) \tag{1}$$

Here the propagator $e^{\mathcal{L}\tau}$ has to obey the detailed balance condition under equilibrium conditions, $\langle f_i(\Gamma)|e^{\mathcal{L}\tau}|f_j(\Gamma)\rangle_{\rho(\Gamma;\mathrm{eq})} = \langle f_j(\Gamma)|e^{\mathcal{L}\tau}|f_i(\Gamma)\rangle_{\rho(\Gamma;\mathrm{eq})}$, meaning that the transition from j to i equals the transition from i to j under the ensemble average taken with the equilibrium distribution function $\rho(\Gamma;\mathrm{eq})$.
Although $e^{\mathcal{L}\tau}$ is defined in a high-dimensional space to describe the complete dynamics of the system, the dynamics underlying functional conformational changes exhibit a separation of timescales, and elucidating the slowest dynamic modes of $e^{\mathcal{L}\tau}$ is often sufficient to understand the overall mechanisms of these conformational changes. One can, therefore, project the full dynamics onto a reduced space of slow dynamics $|\chi\rangle$ via the Mori–Zwanzig projection operator, forming a kinetic network model of the original full dynamics.
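The Mori–Zwanzig projection operator reads $\mathbb{P} = \sum_{j=1}^{n} |\rho(\Gamma;\mathrm{eq})\chi_j(x)\rangle\, \pi_j^{-1}\, \langle\chi_j(x)|$, where $|\chi\rangle$ consists of the indicator functions that define the state of each region of the configuration space (i.e., $\chi_i(x) = 1$ (or 0) implies that the configuration space region x belongs (or does not belong) to state i), and $\pi_j$ is the stationary population of state j. The kinetics can then be projected to the reduced space $|\chi\rangle$ and satisfies the Nakajima–Zwanzig equation,[56]

$$\frac{\partial}{\partial t}\rho(\Gamma;t) = \mathbb{P}\mathcal{L}\mathbb{P}\rho(\Gamma;t) + \int_0^t dt'\, \mathbb{P}\mathcal{L}e^{\mathbb{Q}\mathcal{L}(t-t')}\mathbb{Q}\mathcal{L}\mathbb{P}\rho(\Gamma;t') + \mathbb{P}\mathcal{L}e^{\mathbb{Q}\mathcal{L}t}\rho(\Gamma;0) \tag{2}$$

where the non-Markovian term (the second term on the right) results from the fast kinetics of the system, which is associated with $\mathbb{Q} = 1 - \mathbb{P}$.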
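When time is discretized as $t = n\tau$, the kinetics in the reduced space becomes

$$\rho(\Gamma; n\tau) = \left(\mathbb{P}e^{\mathcal{L}\tau}\right)^{n}\rho(\Gamma;0) + e^{\mathbb{Q}\mathcal{L}\tau}\left(\sum_{m=1}^{n-1}\hat{P}\left[\left(e^{\mathbb{Q}\mathcal{L}\tau}\right)^{n-m-1}, \left(\mathbb{P}e^{\mathcal{L}\tau}\right)^{m}\right]\right)\rho(\Gamma;0) + e^{\mathbb{Q}\mathcal{L}n\tau}\rho(\Gamma;0) \tag{3}$$

where $\hat{P}$ sums over all permutations of the $n-m-1$ factors of $e^{\mathbb{Q}\mathcal{L}\tau}$ and the $m$ factors of $\mathbb{P}e^{\mathcal{L}\tau}$. By assuming a separation between the fast and slow kinetics such that $e^{\mathbb{Q}\mathcal{L}\tau} \approx 0$ at a lag time $\tau_T$, a master equation of the states can be derived, also known as the MSM:

$$p^{T}(t+\tau_T) \approx p^{T}(t)\,T(\tau_T) \tag{4}$$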
Here $T_{mk}(\tau) = \pi_m^{-1}\langle\chi_k(x)|e^{\mathcal{L}\tau}|\chi_m(x)\rangle_{\rho(\Gamma;\mathrm{eq})}$ is the probability of transition from state m to state k over the lag time τ. The resulting matrix $T(\tau_T)$ is the transition probability matrix (TPM). $p_k(t) = \langle\chi_k(x)|\rho(\Gamma;t)\rangle$ is the probability of finding the system in state k. Due to the properties of the equilibrium state, T should satisfy the detailed balance condition $\pi_i T_{ij}(\tau) = \pi_j T_{ji}(\tau)$. We note that the MSM can also be derived from the framework of the variational principle.[57]
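As a minimal sketch of how the master equation (Eq. (4)) and the detailed balance condition act on a discrete model, the following toy example propagates a row vector of state populations with a 3-state TPM. The matrix, its stationary populations, and the function names are invented for illustration and are not taken from any system discussed in this review.

```python
# Toy 3-state transition probability matrix (rows sum to 1); values are illustrative only.
T = [
    [0.90, 0.10, 0.00],
    [0.05, 0.90, 0.05],
    [0.00, 0.10, 0.90],
]
# Stationary populations of this particular T (pi T = pi).
pi = [0.25, 0.50, 0.25]


def propagate(p, T):
    """One step of Eq. (4): p^T(t + tau) = p^T(t) T(tau), with p treated as a row vector."""
    n = len(T)
    return [sum(p[m] * T[m][k] for m in range(n)) for k in range(n)]


def satisfies_detailed_balance(pi, T, tol=1e-12):
    """Check pi_i T_ij == pi_j T_ji for every pair of states."""
    n = len(T)
    return all(
        abs(pi[i] * T[i][j] - pi[j] * T[j][i]) <= tol
        for i in range(n)
        for j in range(n)
    )


# Start with all population in state 0 and relax toward equilibrium.
p = [1.0, 0.0, 0.0]
for _ in range(500):
    p = propagate(p, T)

print("populations after 500 lag times:", p)
print("detailed balance satisfied:", satisfies_detailed_balance(pi, T))
```

Because T(m, k) is read as the probability of going from state m to state k, the populations multiply T from the left, which is why a row-vector convention is used here.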
As the kinetics of a system can be modeled by many different MSMs, one often needs to assess the quality of an MSM and select the one that best represents the original kinetics. One commonly used approach for such quality assessment is to apply the variational principle, which states that, for any given trial function $f(\Gamma)$,

$$\lambda_i \ge \hat{\lambda}_i = \frac{\langle f(\Gamma)|e^{\mathcal{L}\tau}|f(\Gamma)\rangle_{\rho(\Gamma;\mathrm{eq})}}{\langle f(\Gamma)|f(\Gamma)\rangle_{\rho(\Gamma;\mathrm{eq})}} \tag{5}$$

where $\lambda_i$ is an eigenvalue of $e^{\mathcal{L}\tau}$, and equality holds only when $f(\Gamma)$ is exactly the corresponding eigenvector of $e^{\mathcal{L}\tau}$. In other words, a good MSM should preserve the largest eigenvalues of $e^{\mathcal{L}\tau}$. In practice, this can be assessed through the sum of the top eigenvalues (called the generalized matrix Rayleigh quotient, GMRQ[58,59]). In fact, apart from MSMs, the variational principle can also be used to understand many other methods, such as time-lagged (or time-structure based) independent component analysis (tICA, TICA)[60–62] and the core-set MSM.[63]
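The variational bound of Eq. (5) can be illustrated on a small discrete analogue, where a reversible transition matrix stands in for the propagator and the equilibrium-weighted inner product becomes a sum weighted by the stationary populations. The sketch below is only an illustration under these assumptions: the symmetric 3-state matrix, the trial vectors, and the helper names are hypothetical, and the trial function is chosen orthogonal to the constant (stationary) eigenvector so that the bound applies to the second eigenvalue.

```python
# Symmetric toy transition matrix: uniform stationary distribution, reversible by construction.
a = 0.1
T = [
    [1 - a, a, 0.0],
    [a, 1 - 2 * a, a],
    [0.0, a, 1 - a],
]
pi = [1 / 3, 1 / 3, 1 / 3]


def inner(f, g):
    """Equilibrium-weighted inner product <f|g>_pi."""
    return sum(w * x * y for w, x, y in zip(pi, f, g))


def apply_T(f):
    """Action of the transition matrix on a function of the states."""
    return [sum(T[i][j] * f[j] for j in range(3)) for i in range(3)]


def rayleigh(f):
    """Discrete analogue of the right-hand side of Eq. (5)."""
    return inner(f, apply_T(f)) / inner(f, f)


lambda_2 = 1 - a              # exact second eigenvalue of this T
exact = [1.0, 0.0, -1.0]      # its eigenvector
trial = [1.0, 0.2, -1.2]      # imperfect trial function, orthogonal to the constant vector

print("exact eigenvector :", rayleigh(exact))
print("trial function    :", rayleigh(trial))
```

The estimate from the imperfect trial function falls below the true eigenvalue, which is why scores built from sums of estimated eigenvalues can be used to rank competing models.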
PROTOCOL OF BUILDING MSMs TO ELUCIDATE FUNCTIONAL CONFORMATIONAL CHANGES
The field of MSMs has developed rapidly in the past decade, with advances in both post-analysis and sampling strategies. In this section, we propose a general protocol for building MSMs with these state-of-the-art techniques, particularly in the context of studying the functional dynamics of biomolecules.
Overview of Our Protocol
The complete protocol consists of three stages: (1) preparation (Figure 1(a)–(c)), (2) adaptive sampling and the construction of the microstate MSM (Figure 1(d)–(g)), and (3) constructing the macrostate MSM and elucidating the kinetics of the system (Figure 1(h)). Stage 1 is divided into three substeps: preparing initial structures (Figure 1(a)), generating initial pathway(s) (Figure 1(b)), and path optimization (Figure 1(c)). Stage 2 is an iterative stage that performs adaptive sampling along the optimized path until a good kinetic model is obtained; each iteration comprises MD simulations (Figure 1(d)), feature selection to find the reduced space that captures the slowest transitions (Figure 1(e)), splitting of the phase space into a state space (Figure 1(f)), and the building, validation, and error estimation of the microstate MSM (Figure 1(g)). Stage 3 builds a macrostate MSM (Figure 1(h)) by kinetic lumping of the microstates obtained in Stage 2 and predicts the slowest kinetics of the system based on the validated microstate MSM. The various dimensionality reduction, clustering, and MSM construction tools mentioned in Figure 1 can be found in open source packages such as MSMBuilder[64–66] (http://msmbuilder.org), PyEMMA[67] (http://emma-project.org/), htmd[68] (https://www.htmd.org), and HK_DataMiner[69] (https://github.com/liusong299/HK_DataMiner).
Methods
Finding the Minimum Free Energy Path(s) between Functional States
Our protocol starts with preparing an initial path (Figure 1(b)) connecting the known structures (obtained from, e.g., X-ray crystallography,[7] cryo-electron microscopy,[9,70] and nuclear magnetic resonance spectroscopy[8]). This step is specifically tailored for studying the functional conformational changes of biomolecular systems. One of the major advantages of the MSM framework is parallelized sampling, i.e., the model is built on an ensemble of unbiased MD trajectories initiated from different regions of the conformational space of the system. Accordingly, an initial sampling scheme is required to provide the seeding structures for the unbiased sampling. When investigating protein folding, we can obtain the initial sampling using techniques such as REMD,[21,71,72] which enhances sampling in a global manner. For studying protein functional conformational changes, however, it is more suitable to enhance the sampling in a local manner, because globally enhanced sampling is likely to introduce unwanted artifacts such as the unfolding of protein secondary structures.
Various methods are available to generate the initial path(s). For example, the Climber algorithm[37,73] can progressively drive the system toward the target structure on the potential energy surface, via a self-adjusting restraint potential proportional to the deviation of the inter-residue distances between the target structure and the structure obtained in the last step. The initial path can then be obtained by solvating the conformations along the Climber path. Climber has been successfully applied to investigate the translocation[37] and backtracking[2] processes of RNA Polymerase II (Pol II). Alternatively, one may first solvate the system and then perform steered MD (SMD)[74] or targeted MD (TMD)[75] to drive the system from one crystal structure to the other. Steered MD has been applied to study pyrophosphate ion release in the yeast Pol II[36] or bacterial RNA Polymerase.[34,35] Other methods such as Caver[76] have been used to study NTP entry routes in the RNA Polymerase II elongation complex.[77] Metadynamics[22] may also be performed to generate the initial path if a low-dimensional collective variable (CV) space can be defined a priori. The recently developed FAST algorithm[78] is also a valuable tool for this task. We anticipate that coarse-grained MD (CGMD) simulations may also serve as a good approach to obtain the initial path after proper atomistic reconstruction.
Due to the presence of the bias potential, the prepared initial path is often unable to correctly cover the transition state. Further optimization of the path is necessary to ensure the statistical significance of the putative path and the subsequent unbiased sampling (Figure 1(c)). Such optimization can be achieved either by extensive short unbiased samplings (Figure 2(a)) or more systematically via standard path-searching methods, such as the string method (Figure 2(b)).

[Figure 1 panels: (a) Initial and final structures (X-ray, Cryo-EM, NMR); (b) Initial path (Climber, SMD, TMD); (c) Path optimization (MD sampling, string method); (d) MD simulations (Seeding, MD); (e) Feature selection (tICA); (f) Splitting (k-centers, k-medoids, k-means, APLoD, APM); (g) Microstate MSM and validation (msmbuilder, pyEMMA, htmd; ITS (μs) versus lag time (ns); C.K. test); (h) Lumping and macrostate MSM (Spec. Clus., PCCA, PCCA+, MPP, BACE, HNEG).]
FIGURE 1 | Suggested protocol for constructing Markov State Models (MSMs) to investigate the functional conformational changes. The workflow consists of three stages: (a)–(c) generating the minimum free energy path(s) among the known functional states; (d)–(g) adaptive sampling and microstate MSM construction/validation; (h) elucidating the slowest kinetics of the system via the validated microstate MSM and interpreting the mechanism by lumping the microstate MSM into a macrostate MSM. (a) Find the known functional states from experimental structures or molecular modeling; (b) build a preliminary transition path between the known states via morphing (e.g., the Climber algorithm) or biased molecular dynamics (MD) simulation (e.g., steered MD, targeted MD); (c) optimize the preliminary path to locate the closest minimum free energy path via the string method or extensive MD sampling; (d) initiate an ensemble of short unbiased MD simulations from the representative conformations along the optimized path; (e) select kinetically slow reaction coordinates using time-lagged independent component analysis (tICA); (f) partition the collected samples into microstates based on their geometric proximity in the reduced tIC space; (g) build and validate the microstate MSM and perform further unbiased sampling seeded by the representative structures of each microstate if local equilibrium is not reached in the microstate MSM; and (h) predict kinetic properties of the system via the microstate MSM and build the macrostate MSM via kinetic lumping for mechanism visualization and interpretation.
We recommend adopting path-searching methods such as the string method[80–83] for this optimization step, because they provide a standard protocol for checking convergence and ensure the presence of the transition state in the optimized path. All path-searching methods aim to locate the minimum free-energy path (MFEP) closest to a given initial path. Typically, the path is defined in a preselected space composed of a number of CVs. For example, in the most established method, the string method, local sampling is performed in a small CV volume around the path nodes to allow a gradual downhill update of the path. Other methods such as path-metadynamics[84] and the fast tomographic[85] methods may also be applied.
Nevertheless, the automation level and overall efficiency of existing path-searching methods are still limited. For example, the amount of sampling required by the string method may become comparable to that of the subsequent unbiased sampling, because the local sampling adopted can make the downhill path update too gradual. Choosing the CV space for existing methods is also challenging, especially when no prior knowledge of the system is available. Therefore, we expect new methods to be developed to alleviate these issues in the future.
Selecting Kinetically Slow Variables for State Decomposition
After sufficient samples are collected, an MSM can be constructed to model the slowest DOFs in the system. This requires a proper decomposition of the conformational space for defining the states in the MSM. Traditionally, such state decomposition is performed by applying clustering methods to the high-dimensional structures according to a chosen distance metric, e.g., the root-mean-square distance (RMSD) among the conformations.[86] Typically, when using the RMSD metric, the conformations are first aligned to a reference structure based on a subset of atoms. The RMSD is then computed on another subset of atoms, e.g., heavy atoms relevant to the process under study, chosen based on the root-mean-square fluctuations (RMSF) of the atoms. However, the resulting state definition is highly sensitive to the choice of atom sets and to noise in the samples.

[Figure 2 panels: (a) RNA Polymerase II, pretranslocation and posttranslocation states; axes X (Isomap), Y (Isomap), Z (BH & TL RMSD); initial paths Pre → Post (Climber) and Post → Pre (Climber). (b) c-Src kinase, inactive and active states; labels A-loop opening and helix rotation; paths Hck TMD and String.]
FIGURE 2 | The quality of the putative path is important for the adaptive sampling scheme. (a) The translocation process of RNA Polymerase II: Isomap representation of the initial paths generated by the Climber algorithm and the samples from the final Markov State Model (MSM) (colored dots). The MSM samples clearly deviate from the initial paths, indicating the necessity of path optimization before the adaptive sampling and MSM construction (Figure adapted with permission from Ref 37. Copyright 2014 National Academy of Sciences, USA). (b) The initial path can be optimized via the string method, as exemplified by the study of the activation pathway of c-Src kinase: the initial targeted molecular dynamics (MD) path can be optimized using a limited amount of sampling (Figure adapted with permission from Ref 79. Copyright 2009 Elsevier).
To address these issues during state decomposition, it has become increasingly popular to apply methods that can automatically extract the major features (or reaction coordinates) and preserve the slowest kinetics of the system before MSM construction. For example, in several early studies of allostery and protein folding,[87] principal component analysis (PCA) was applied to extract the dimensions that maximize the variance in the samples. However, these principal dimensions are not necessarily kinetically slow. More recent recipes for this task are the variational approach[57] and tICA.[60–62] Unlike PCA, tICA focuses on the time correlation between features and thus gives statistically independent components that can reproduce the slowest dynamics.
As shown in Figure 1, we recommend using tICA for feature selection. tICA can be understood as an application of the variational principle to a basis set of a few input features (e.g., distances between atoms). It aims to find linear combinations of the input features, known as tICs, that give the best estimate of the eigenvalues of the propagator $e^{\mathcal{L}\tau}$. First, the input features are transformed into mean-free features $d_i(x(n\tau))$. Then, the time-lagged correlation matrix with a lag time $\tau = N_\tau\,dt$ (dt is the time interval between snapshots of the trajectories)

$$C_{ij}(\tau) = \sum_{\mathrm{trajs}} \frac{1}{N_T - N_\tau} \sum_{l=1}^{N_T - N_\tau} d_i(x(l))\, d_j(x(l+N_\tau)) \tag{6}$$

and the covariance matrix

$$S_{ij} = \sum_{\mathrm{trajs}} \frac{1}{N_T} \sum_{l=1}^{N_T} d_i(x(l))\, d_j(x(l)) \tag{7}$$

are calculated from the MD trajectories, with $N_T$ the number of snapshots in each trajectory. Finally, one solves the generalized eigenvalue problem $C(\tau)V = SV\Lambda$ to get the coefficients V of the linear combinations that define the tICs. Due to finite sampling, $C_{ij}(\tau)$ may not be symmetric, as time reversibility would require, and may produce physically invalid results. Therefore, one typically symmetrizes it by averaging it with its transpose, $(C_{ij}(\tau) + C_{ji}(\tau))/2$. More recent developments of tICA include kernel-tICA,[88] hTICA,[89] and variationally optimized diffusion maps.[90] Interested readers can refer to reviews[91,92] for more discussion on reaction coordinates. We anticipate that new techniques will be developed for the automatic and smart choice of the characteristic features of the conformational dynamics.
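The tICA recipe above (mean-free features, Eqs. (6) and (7), symmetrization, and the generalized eigenvalue problem) can be sketched in a few lines. To stay free of external dependencies, this illustration is restricted to exactly two input features, for which the generalized eigenvalue problem reduces to a quadratic equation; production analyses would instead use the packages listed in the protocol overview. The function name and the sinusoidal toy trajectory are hypothetical and serve only to make the slow and fast components easy to recognize.

```python
import math


def tica_two_features(trajs, lag):
    """tICA for trajectories of 2 features; returns [(eigenvalue, (v1, v2)), ...], slowest first.

    Assumes the two features are not collinear and the eigenvalues are not degenerate.
    """
    # Mean-free features: subtract the mean over all frames of all trajectories.
    n_frames = sum(len(t) for t in trajs)
    mean = [sum(x[i] for t in trajs for x in t) / n_frames for i in range(2)]
    C = [[0.0, 0.0], [0.0, 0.0]]  # time-lagged correlation matrix, Eq. (6)
    S = [[0.0, 0.0], [0.0, 0.0]]  # covariance matrix, Eq. (7)
    for t in trajs:
        d = [(x[0] - mean[0], x[1] - mean[1]) for x in t]
        n = len(d)
        for i in range(2):
            for j in range(2):
                C[i][j] += sum(d[l][i] * d[l + lag][j] for l in range(n - lag)) / (n - lag)
                S[i][j] += sum(d[l][i] * d[l][j] for l in range(n)) / n
    # Symmetrize C by averaging it with its transpose.
    c11, c22 = C[0][0], C[1][1]
    c12 = 0.5 * (C[0][1] + C[1][0])
    s11, s12, s22 = S[0][0], S[0][1], S[1][1]
    # Solve det(C - lam * S) = 0, a quadratic in lam for the 2 x 2 case.
    qa = s11 * s22 - s12 * s12
    qb = -(c11 * s22 + c22 * s11 - 2.0 * c12 * s12)
    qc = c11 * c22 - c12 * c12
    root = math.sqrt(qb * qb - 4.0 * qa * qc)
    tics = []
    for lam in ((-qb + root) / (2.0 * qa), (-qb - root) / (2.0 * qa)):
        # Eigenvector from the first row of (C - lam * S) v = 0.
        v = (-(c12 - lam * s12), c11 - lam * s11)
        norm = math.hypot(v[0], v[1])
        tics.append((lam, (v[0] / norm, v[1] / norm)))
    return tics


# Toy trajectory: two features that mix a slow and a fast oscillation.
n_frames, lag = 1400, 5
traj = []
for t in range(n_frames):
    slow = math.sin(2.0 * math.pi * t / 200)
    fast = math.sin(2.0 * math.pi * t / 7)
    traj.append((slow + 0.5 * fast, slow - fast))

tics = tica_two_features([traj], lag)
for lam, vec in tics:
    print("eigenvalue %.3f, coefficients (%.3f, %.3f)" % (lam, vec[0], vec[1]))
```

In this toy case the leading tIC has coefficients in the ratio 2:1, the combination that cancels the fast oscillation, and its eigenvalue is close to the autocorrelation of the slow oscillation at the chosen lag.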
Although the tICs generated by tICA are linear approximations of the slowest reaction coordinates and thus may not correspond exactly to the slowest motions identified by the final constructed MSM,[93] they are particularly useful for achieving an optimal state decomposition with minimal statistical error.[58] By performing clustering in the reduced space spanned by the top tICs, we can then construct microstate MSMs and choose the best MSM according to the GMRQ or the slowest implied timescales (ITSs; see Eq. (11)). For simplicity, we refer to this step as tICA–MSM.
In practice, the performance of tICA–MSM depends on the choice of various parameters, such as the input features, the tICA correlation time, the number of selected tICs, and so on. It is recommended to apply cross-validation or bootstrapping techniques to account for statistical errors and avoid overfitting (see the Constructing and Validating Microstate MSMs section for a detailed discussion) when selecting these parameters. Here we illustrate how to choose these parameters via an example from our study of the backtracking of RNA Polymerase II.[2] As shown in Figure 3, we scanned the pairwise distances between different sets of atom pairs as input features for tICA. We selected the sets for which the largest ITS of the corresponding MSM reached its maximum (Figure 3(d)–(f)). Among the chosen sets of atom pairs, we picked the one with the smallest size (i.e., the one with the fewest pairwise distances, Figure 3(f)).
Partitioning the Conformational Space for Defining Microstates
With the subspace of tICs identified, one performs the "splitting-and-lumping" scheme[66] on the subspace to construct MSMs. The splitting step clusters the sampled conformations into hundreds or thousands of nonoverlapping microstates based on a distance metric. For visualization and interpretation of mechanisms, the subsequent lumping step groups the microstates into several macrostates based on the kinetic proximity among them.
For the distance metric used in the splitting step, one can simply apply the Euclidean distance, other $L_p$ distances, or the kinetic distance[93] that is weighted by tICA eigenvalues. To perform the clustering, center-based methods (k-means,[94] k-centers,[95,96] and k-medoids[65]) are widely used to partition the subspace into Voronoi cells. These algorithms, however, often require the number of clusters as input, which is difficult to choose a priori. They may also perform poorly when the metastable regions in the free energy landscape are not convex. Adaptive splitting methods recently developed in our group, including APM[97] and APLoD,[69] could be helpful in solving these two issues. By incorporating both the geometric information and the correlation between microstates in its iterations, APM can effectively tackle multibody systems with heterogeneous timescales, such as protein-ligand binding systems. To achieve a similar goal, APLoD makes use of the local density of each conformation to identify the local density peaks as cluster centers. Other alternative methods include Ward's method,[98,99] which utilizes the hierarchical structure of the distance matrix, and robust density-based clustering,[100] which partitions conformations with different local free energies based on geometric proximity. k-Means, k-medoids, k-centers, Ward, and APM can be found at http://www.msmbuilder.org. APLoD can be found at https://github.com/liusong299/HK_DataMiner. Robust density-based clustering can be found at http://www.moldyn.uni-freiburg.de/software/software.html.
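To make the splitting step concrete, the following is a minimal sketch of a greedy k-centers clustering with the Euclidean distance, one of the center-based methods mentioned above. It is written from the general farthest-point idea rather than copied from any of the cited packages, and the function name, the choice of the first center, and the toy points in a two-dimensional tIC space are assumptions made for illustration.

```python
import math


def k_centers(points, k, first=0):
    """Greedy k-centers: repeatedly promote the point farthest from all current centers.

    Returns the indices of the chosen centers and, for every point, the index
    (into the list of centers) of its nearest center, i.e. its Voronoi cell.
    """
    centers = [first]
    dist = [math.dist(p, points[first]) for p in points]  # distance to nearest center
    assign = [0] * len(points)
    while len(centers) < k:
        new = max(range(len(points)), key=dist.__getitem__)
        centers.append(new)
        for i, p in enumerate(points):
            d = math.dist(p, points[new])
            if d < dist[i]:
                dist[i] = d
                assign[i] = len(centers) - 1
    return centers, assign


# Toy conformations projected onto two tICs: three well-separated clumps.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
]
centers, assign = k_centers(points, k=3)
print("center indices:", centers)
print("microstate of each point:", assign)
```

As noted in the text, the number of clusters k must be supplied by the user, which is one of the limitations that adaptive methods such as APM and APLoD aim to address.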
Constructing and Validating Microstate MSMs
To build a microstate MSM on the discretized time sequences, we choose a lag time τ, count the number of transitions among the microstates, and obtain a transition count matrix (TCM). In the limit of infinite sampling, the TCM $C_{mk}(\tau)$ is given by

$$C_{mk}(\tau) = \sum_{\mathrm{trajs}} \frac{1}{N_T - N_\tau} \sum_{l=1}^{N_T - N_\tau} \chi_m(x(l))\, \chi_k(x(l+N_\tau)) \tag{8}$$

Here $C_{mk}(\tau)$ counts the transitions from state m to state k over the lag time $\tau = N_\tau\,dt$, where dt is the time interval between snapshots of the trajectories. The TPM, which represents the transition probabilities between states, is then obtained by row-normalizing the TCM:

$$T(\tau) = D_n^{-1}\,C(\tau) \tag{9}$$

Here $D_n$ is a diagonal matrix whose diagonal entries are the total number of transitions out of each state.
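The counting and row normalization of Eqs. (8) and (9) can be sketched as follows for trajectories that have already been discretized into microstate indices. This sketch accumulates raw counts; for trajectories of equal length, the $1/(N_T - N_\tau)$ prefactor of Eq. (8) is a common constant that cancels on row normalization. The function names and the two short toy trajectories are hypothetical.

```python
def count_matrix(dtrajs, n_states, lag):
    """Raw transition counts C[m][k]: number of m -> k transitions separated by `lag` frames."""
    C = [[0] * n_states for _ in range(n_states)]
    for dtraj in dtrajs:
        for l in range(len(dtraj) - lag):
            C[dtraj[l]][dtraj[l + lag]] += 1
    return C


def row_normalize(C):
    """Eq. (9): divide each row by the total number of transitions out of that state."""
    T = []
    for row in C:
        total = sum(row)
        # A state that is never left in the data gets an all-zero row here.
        T.append([c / total if total else 0.0 for c in row])
    return T


# Two toy discretized trajectories over three microstates.
dtrajs = [
    [0, 0, 1, 1, 0, 0, 1, 2, 2, 1],
    [2, 2, 1, 0, 0, 1],
]
C = count_matrix(dtrajs, n_states=3, lag=1)
T = row_normalize(C)
print("count matrix:", C)
print("transition probability matrix:", T)
```

Note that transitions are never counted across trajectory boundaries, which is what allows an MSM to be assembled from many short, independent simulations.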
We can then find the eigenvectors and eigenvalues of the TPM $T(\tau)$ to model the original kinetics of the system. The eigenvectors ($\psi_i$) are related to the collective transition modes between states, and the eigenvalues ($\lambda_i$) are related to the relaxation timescales of the corresponding transition processes, as shown in Eq. (11). Further interpretation of the eigenvectors can be found in Prinz et al.[28]
As discussed above, it is necessary for the tran-
sition matrices to fulll the detailed balance condi-
tion. Yet due to the statistical error or insufciency
of the sampling, the transitions from one state to
another are usually not equal to the reverse ones.
The simplest way to impose detailed balance is to
directly symmetrize the TCM by adding its transpose
Csym τðÞ=CτðÞ+CTτðÞ
!$
=2ð10Þ
100
1
0.01
0.0001
100
1
0.01
0.0001
100
1
0.01
0.0001
49 distances 495 distances
895 distances2115 distances 695 distances
1200 distances
(a) (b) (c)
(d) (e) (f)
100
1
0.01
0.0001
100
1
0.01
0.0001
100
1
0.01
0.0001
0 20 40 60 80 0 20 40 60 80 0 20 40 60 80
0 20 40 60 80
0 20 40 60 80
0 20 40 60 80
FIGURE 3 | An example of choosing the input structural features for time-lagged independent component analysis (tICA). In all subgraphs (a)-(f), the left panel shows the set of atoms (blue) among which the pairwise distances are selected as input features for tICA; the right panel plots the implied timescales (ITS) of the 1000-state MSM built by k-centers clustering on the slowest four tICs of the distance set shown in the left panel. The correlation lag time for tICA is 40 ns. The Markov State Model (MSM) lag time is 8 ns. The error bars of the ITS are calculated from 100 bootstrapping runs over all molecular dynamics (MD) trajectories. The distance set (f) is chosen as the optimal one, because its top MSM ITS is the highest among all sets while the number of input distances remains sufficiently small (Figure adapted with permission from Ref 2. Copyright 2016 Nature Publishing Group).
WIREs Computational Molecular Science Constructing MSMs
© 2017 Wiley Periodicals, Inc. 7 of 18
before normalization.65 Because direct symmetrization is not a good approximation when the sampling is statistically biased, a more accurate method is the maximum likelihood estimator (MLE) with the detailed balance constraint imposed.28,101 In particular, it is suggested to perform the MLE on the maximal ergodic subgraph of the TCM.65,102
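The direct symmetrization of Eq. (10) can be checked numerically. The sketch below uses a made-up count matrix and verifies that the resulting TPM satisfies detailed balance with respect to the populations implied by the symmetrized counts. It does not implement the reversible MLE, which requires an iterative solver.

```python
import numpy as np

# Hypothetical count matrix with unequal forward and backward counts
C = np.array([[90., 12., 0.],
              [8., 50., 5.],
              [1., 3., 30.]])

C_sym = 0.5 * (C + C.T)                          # Eq. (10)
T = C_sym / C_sym.sum(axis=1, keepdims=True)     # row normalization
pi = C_sym.sum(axis=1) / C_sym.sum()             # populations implied by C_sym

# Detailed balance: pi_i T_ij equals pi_j T_ji for every pair of states
flux = pi[:, None] * T
detailed_balance_holds = np.allclose(flux, flux.T)
```

Because the symmetrized counts weight forward and backward transitions equally regardless of how the trajectories were seeded, the populations `pi` obtained this way can be biased when the sampling is far from equilibrium, which is the motivation for the MLE mentioned above.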
Generally, the microstate model obtained at lag time $\tau$ is not guaranteed to be Markovian unless the lag time is long enough, i.e., $\tau \ge \tau_T$ ($\tau_T$ is called the Markovian lag time). A straightforward way to examine the Markovianity is to compute the ITS, defined as

$$t_i(m\tau) = -\frac{m\tau}{\log \lambda_i(m\tau)} \qquad (11)$$

where $\lambda_i(m\tau)$ is the $i$th eigenvalue of $T(m\tau)$. When the model is Markovian, $t_i(m\tau)$ takes the constant value $-\tau/\log \lambda_i(\tau)$. Accordingly, we choose the smallest lag time $\tau_T$ at which the ITS become invariant to the lag time (Figure 1(g)) as the Markovian lag time to build the MSM.
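As an illustration of Eq. (11), the following sketch computes the ITS at several lag times for a synthetic two-state chain whose exact timescale is known. The helper names (`estimate_tpm`, `implied_timescales`), the toy transition matrix, and the use of the simple symmetrization of Eq. (10) are assumptions made for this example only.

```python
import numpy as np

def estimate_tpm(dtraj, n_states, lag):
    """Symmetrized count matrix (Eq. 10), row-normalized to a TPM."""
    C = np.zeros((n_states, n_states))
    np.add.at(C, (dtraj[:-lag], dtraj[lag:]), 1.0)
    C = 0.5 * (C + C.T)
    return C / C.sum(axis=1, keepdims=True)

def implied_timescales(dtraj, n_states, lags, dt=1.0, k=1):
    """ITS of the k slowest processes for each lag (in frames), Eq. (11)."""
    its = []
    for lag in lags:
        T = estimate_tpm(dtraj, n_states, lag)
        lam = np.sort(np.linalg.eigvals(T).real)[::-1]
        its.append(-lag * dt / np.log(lam[1:k + 1]))
    return np.array(its)

# Synthetic two-state Markov chain; its second eigenvalue is 0.85
rng = np.random.default_rng(0)
T_true = np.array([[0.95, 0.05], [0.10, 0.90]])
u = rng.random(100_000)
dtraj = np.zeros(len(u), dtype=int)
for t in range(1, len(u)):
    dtraj[t] = int(u[t] < T_true[dtraj[t - 1], 1])

its = implied_timescales(dtraj, n_states=2, lags=[1, 2, 5, 10])
exact = -1.0 / np.log(0.85)  # about 6.15 frames
```

Because this toy chain is Markovian at every lag, the estimated ITS stay close to `exact` for all lags; for MD data, the ITS typically rise with the lag time before levelling off at $\tau_T$.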
Subsequently, we use the Chapman-Kolmogorov (C.K.) equation to validate the Markovianity of the model in a stricter way (Figure 1(g)). In this test, the probability distribution predicted by the MSM ($T^m(\tau_T)$) should be consistent with the distribution counted from the trajectories ($T(m\tau_T)$) after several time steps $m\tau_T$ if the model is Markovian:

$$T(m\tau_T) = \left[T(\tau_T)\right]^m \qquad (12)$$
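A simple numerical form of the test in Eq. (12) compares the matrix power of the TPM estimated at one lag with the TPM estimated directly at the multiplied lag. The sketch below is one possible realization on synthetic data. The function names and the choice of the largest absolute element-wise deviation as the error measure are assumptions, not part of the original protocol.

```python
import numpy as np

def estimate_tpm(dtraj, n_states, lag):
    """Row-normalized transition counts at the given lag (in frames)."""
    C = np.zeros((n_states, n_states))
    np.add.at(C, (dtraj[:-lag], dtraj[lag:]), 1.0)
    return C / C.sum(axis=1, keepdims=True)

def ck_test(dtraj, n_states, lag, multiples):
    """Largest absolute deviation between [T(lag)]^m and T(m*lag) for each m."""
    T_lag = estimate_tpm(dtraj, n_states, lag)
    deviations = {}
    for m in multiples:
        predicted = np.linalg.matrix_power(T_lag, m)        # right side of Eq. (12)
        observed = estimate_tpm(dtraj, n_states, m * lag)   # left side of Eq. (12)
        deviations[m] = float(np.abs(predicted - observed).max())
    return deviations

# Synthetic Markovian data from a toy two-state chain
rng = np.random.default_rng(1)
T_true = np.array([[0.95, 0.05], [0.10, 0.90]])
u = rng.random(100_000)
dtraj = np.zeros(len(u), dtype=int)
for t in range(1, len(u)):
    dtraj[t] = int(u[t] < T_true[dtraj[t - 1], 1])

deviations = ck_test(dtraj, n_states=2, lag=2, multiples=[2, 3, 5])
```

For this Markovian toy chain the deviations are of the size of the statistical noise; systematic deviations that grow with `m` would indicate that the chosen lag time is too short or the state decomposition too coarse.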
The inequality in this CK equation, if present, has two sources: the "discretization error" and the "statistical error".28,48 The discretization error is the systematic deviation of the MSM-predicted kinetics from that of the propagator, due to the neglect of the terms contributed by the fast, irrelevant kinetics (see Eq. (3)). The discretization error can, in theory, be quantified by the deviation of the eigenvalues or eigenvectors of the TPM from those of the propagator. However, because the full kinetics is unknown a priori, it is more practical to choose the MSM that produces the largest top eigenvalues of $T(\tau_T)$ (i.e., GMRQ58). The statistical error, caused by the limited sampling, can be estimated via a number of techniques, e.g., the formula proposed in Prinz et al.28 or the Bayesian estimation method103 that
[Figure 4 image: (a) inactive and active c-Src structures with the N-terminal domain, C-terminal domain, C-helix, A-loop, ATP, Mg2+, and residues K295, E310, D404, R409, and Y416 labeled, annotated "A-loop unfolds" and "C-helix moves inwards"; (b) map over the RMSD of the A-loop (Å) and d(E310-R409) - d(K295-E310) (Å), with regions labeled Inactive (I), Intermediate (I1 and I2), and Active (A); (c) time traces over 0-100 µs of DFG RMSD from active (Å), E310-R409 distance (Å), K295-E310 distance (Å), and A-loop RMSD from inactive (Å).]
FIGURE 4 | Markov state model identifies key intermediate states along the activation pathways of c-Src kinase. (a) Crystal structures of the inactive (left) and active (right) states of c-Src. The differences lie in the activation loop (A-loop; red), the C-helix (orange), and the switching of the electrostatic network among Lys295, Glu310, Arg409, and Tyr416. (b) Two intermediate states are identified on the potential of mean force calculated from the stationary populations of a 2000-state microstate Markov State Model (MSM) over two reaction coordinates: the root-mean-square deviation (RMSD) of the A-loop residues and the difference between the distances of the residue pairs E310-R409 and K295-E310. (c) The variation of four structural metrics along a long trajectory, synthesized from the MSM via the kinetic Monte Carlo scheme, provides a rough estimate of the timescales of the activation and deactivation processes. Here the inactive state, the active state, and the intermediate states $I_1$ and $I_2$ are shown in magenta, blue, green, and black, respectively (Figure adapted with permission from Ref 4. Copyright 2014 Nature Publishing Group).
Advanced Review wires.wiley.com/compmolsci
8 of 18 © 2017 Wiley Periodicals, Inc.
samples the transition probability matrices via Markov chain Monte Carlo (MCMC) according to a Dirichlet-form posterior distribution.
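To give a feeling for such Bayesian error estimates, the sketch below uses a deliberate simplification. Without the detailed balance constraint and with a uniform prior, each row of the TPM has an independent Dirichlet posterior that can be sampled directly, so no MCMC is needed. The function name `sample_tpms`, the prior, and the toy counts are assumptions; the reversible samplers described in the cited work are more involved.

```python
import numpy as np

def sample_tpms(C, n_samples, prior=1.0, seed=0):
    """Draw TPMs row by row from Dirichlet(counts + prior) posteriors."""
    rng = np.random.default_rng(seed)
    return np.stack([
        np.vstack([rng.dirichlet(row + prior) for row in C])
        for _ in range(n_samples)
    ])

# Hypothetical two-state count matrix
C = np.array([[90., 10.],
              [5., 45.]])
samples = sample_tpms(C, n_samples=2000)

# Posterior spread of the slowest implied timescale (in units of the lag time)
lam2 = np.array([np.sort(np.linalg.eigvals(T).real)[-2] for T in samples])
its_samples = -1.0 / np.log(lam2)
lower, upper = np.percentile(its_samples, [2.5, 97.5])
```

The interval `(lower, upper)` plays the role of an error bar on the implied timescale. It narrows as the transition counts grow.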
In practice, it is difficult to reduce these two types of errors simultaneously. Increasing the number of states or the lag time can reduce the discretization error, but leads to fewer transition counts among the states and thus a larger statistical error, and vice versa. Therefore, it is recommended to perform the splitting step via different clustering methods and then select the best model based on the GMRQ.58 To balance the discretization error against the statistical error, cross validation59 and bootstrapping techniques104 can also be performed.
Performing Adaptive Sampling to Enhance
Data Connectivity
As high-quality MSMs require a sufficient number of transition counts in the available MD trajectories, one round of MD sampling is often insufficient. Therefore, it is necessary to perform adaptive sampling39,78,105-108 to encourage the occurrence of transitions. This is achieved by selecting representative structures of the less populated microstates as seeds for the next round of simulations. A rough way to decide when this iterative sampling procedure should end is to check whether all data are connected on the potential of mean force (PMF) projected onto some reaction coordinates. More strictly, the adaptive sampling should be terminated when the kinetic properties computed from the microstate MSM remain unchanged over a few rounds.
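The seed selection step can be sketched as follows. This is only one of many possible selection rules: the function name `select_seeds` is hypothetical, and a randomly chosen member frame stands in here for the "representative structure" of a microstate, whereas a cluster center would be another common choice.

```python
import numpy as np

def select_seeds(dtrajs, n_states, n_seeds, seed=0):
    """Return (state, trajectory index, frame index) for the least-populated visited states."""
    rng = np.random.default_rng(seed)
    labels = np.concatenate(dtrajs)
    traj_id = np.concatenate([np.full(len(d), k) for k, d in enumerate(dtrajs)])
    frame_id = np.concatenate([np.arange(len(d)) for d in dtrajs])
    counts = np.bincount(labels, minlength=n_states)
    visited = np.flatnonzero(counts > 0)
    # Visited states ordered from least to most populated
    targets = visited[np.argsort(counts[visited], kind="stable")][:n_seeds]
    seeds = []
    for s in targets:
        pick = rng.choice(np.flatnonzero(labels == s))  # random member frame
        seeds.append((int(s), int(traj_id[pick]), int(frame_id[pick])))
    return seeds

# Two toy discretized trajectories over four microstates
dtrajs = [np.array([0, 0, 0, 1, 1, 2]), np.array([0, 0, 3, 0, 1])]
seeds = select_seeds(dtrajs, n_states=4, n_seeds=2)
```

In this toy example, states 2 and 3 are each visited once, so their single frames are returned as the seeds for the next round.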
Calculating the Kinetic Properties from the
Validated Microstate MSM
The obtained microstate TPM can be used to calculate the essential kinetic properties of the system, such as the mean first passage time (MFPT) for the system to make a transition from one microstate to another and the ensemble of transition pathways among different states. The major transition pathways between two states and their associated fluxes can be elucidated by transition path theory (TPT).29,109-111 In TPT, once the initial and final states are defined, the net fluxes between all pairs of states are computed from the transition probabilities, the committor probabilities, and the equilibrium populations of the states, and the transition pathways can then be identified from the net flux matrix. TPT has been applied to investigate various binding processes,43,112 polymerase systems,2 self-assembly processes,113 and so forth. The MFPT, which quantifies the average time for a state to first make a transition to another state, can also be used to characterize the kinetics of the system.
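The quantities named above can be obtained from a TPM by solving small linear systems. The sketch below is a minimal illustration on a three-state toy chain. The function names are assumptions, the backward committor is taken as one minus the forward committor (which presumes detailed balance), and pathway decomposition of the net flux matrix is not shown.

```python
import numpy as np

def stationary_distribution(T):
    """Left eigenvector of T with eigenvalue 1, normalized to sum to one."""
    w, v = np.linalg.eig(T.T)
    pi = v[:, np.argmax(w.real)].real
    return pi / pi.sum()

def forward_committor(T, A, B):
    """Probability of reaching B before A: q = 0 on A, q = 1 on B."""
    n = T.shape[0]
    q = np.zeros(n)
    q[B] = 1.0
    inter = [i for i in range(n) if i not in A and i not in B]
    if inter:
        lhs = np.eye(len(inter)) - T[np.ix_(inter, inter)]
        rhs = T[np.ix_(inter, B)].sum(axis=1)
        q[inter] = np.linalg.solve(lhs, rhs)
    return q

def net_flux(T, A, B):
    """Net reactive flux matrix from A to B (reversible dynamics assumed)."""
    pi = stationary_distribution(T)
    q = forward_committor(T, A, B)
    f = pi[:, None] * (1.0 - q)[:, None] * T * q[None, :]
    np.fill_diagonal(f, 0.0)
    return np.maximum(f - f.T, 0.0)

def mfpt(T, target, lag=1.0):
    """Mean first passage time from every state to the target set."""
    n = T.shape[0]
    rest = [i for i in range(n) if i not in target]
    m = np.zeros(n)
    lhs = np.eye(len(rest)) - T[np.ix_(rest, rest)]
    m[rest] = lag * np.linalg.solve(lhs, np.ones(len(rest)))
    return m

# Toy linear chain 0 <-> 1 <-> 2
T = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
q = forward_committor(T, A=[0], B=[2])
flux = net_flux(T, A=[0], B=[2])
times = mfpt(T, target=[2])
```

For this symmetric chain the middle state has a committor of 0.5, the net flux is the same along both steps of the single pathway, and the MFPT from state 0 to state 2 is 30 lag times.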
In fact, the average value of any dynamic observable can also be estimated from the TPM. A simple way to achieve this is to generate a long Markov chain based on the TPM and randomly select one configuration from each visited microstate to represent that state. Here we discuss one application example to illustrate the construction of a microstate MSM.
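The chain-based estimate of an observable can be sketched as below. The TPM, the per-state observable values (standing in for a quantity measured on the member conformations of each microstate), and the function names are hypothetical. The chain is propagated with a standard inverse-CDF draw per step.

```python
import numpy as np

def simulate_chain(T, n_steps, start=0, seed=0):
    """Generate a discrete Markov chain from the TPM by inverse-CDF sampling."""
    rng = np.random.default_rng(seed)
    cdf = np.cumsum(T, axis=1)
    u = rng.random(n_steps)
    chain = np.empty(n_steps, dtype=int)
    state = start
    for t in range(n_steps):
        chain[t] = state
        nxt = int(np.searchsorted(cdf[state], u[t], side="right"))
        state = min(nxt, T.shape[0] - 1)  # guard against round-off in the CDF
    return chain

def chain_average(chain, obs_by_state, seed=1):
    """Average an observable, drawing one random member conformation per visit."""
    rng = np.random.default_rng(seed)
    values = np.empty(len(chain))
    for s, obs in enumerate(obs_by_state):
        mask = chain == s
        values[mask] = rng.choice(obs, size=int(mask.sum()))
    return float(values.mean())

T = np.array([[0.95, 0.05], [0.10, 0.90]])          # toy TPM
obs_by_state = [np.array([0.8, 1.0, 1.2]),           # observable values in state 0
                np.array([3.6, 4.0, 4.4])]           # observable values in state 1
chain = simulate_chain(T, n_steps=50_000)
average = chain_average(chain, obs_by_state)
```

The stationary populations of this toy TPM are 2/3 and 1/3, so the chain average approaches 2/3 × 1.0 + 1/3 × 4.0 = 2.0 as the chain grows.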
Application Example: Elucidating the
Activation Pathway of c-Src Kinase
Src kinases are key signaling proteins. Deregulated activation of these kinases can lead to aberrant signaling and therefore cause excessive cell growth and differentiation. By constructing MSMs along the activation pathway of the c-Src kinase, the Roux and Pande groups discovered two intermediate states along the activation pathway that can be employed for allosteric drug design.4 They generated an initial path by targeted MD (Figure 1(b)) between the inactive and active forms of the kinase (Figure 1(a)) and then applied the string method with swarms of trajectories to optimize the path (Figure 1(c), Figure 2(b)). Adaptive sampling was then performed for two rounds along the optimized activation pathway, amounting to a total sampling of 500 μs. These samples were clustered based on the RMSD metric defined on a subset of heavy atoms, resulting in a 2000-microstate MSM with a lag time of 5 ns. The MSM revealed two intermediate states along the path (Figure 4(b)). The activation process was visualized through a long trajectory synthesized from the microstate MSM using the kinetic Monte Carlo scheme (Figure 4(c)). Finally, the 2000-state MSM predicted the MFPTs of the activation and deactivation processes to be 106 and 21 μs, respectively.
Lumping Microstates to Macrostates to Aid
Mechanism Interpretation
A microstate MSM typically contains hundreds or thousands of states to ensure that the model can be Markovian at a sufficiently small lag time. This helps shorten the length of the MD simulations needed to build the MSM. However, the large number of states also hinders the visualization of the conformational changes and subsequent human appreciation of the underlying mechanism. Therefore, we perform "kinetic lumping", which merges the microstates into macrostates to reduce the number of states. This lumping step is based on the kinetic proximity between microstates, i.e., the interstate transitions between metastable states should be much slower than the intrastate transitions.
Among the recently developed lumping methods, the spectral-based methods such as Perron cluster cluster analysis (PCCA),115 Robust Perron cluster analysis (PCCA+),116,117 and spectral clustering118,119 use the leading eigenvectors of the microstate TPM for clustering. Among these three methods, PCCA performs bi-partitioning based on the sign structure of the top eigenvectors, spectral clustering performs k-means clustering on the subspace spanned by the top eigenvectors, and PCCA+ performs a fuzzy clustering that improves on the robustness of PCCA around the transition region. In addition, Chodera et al.120 invented a Monte Carlo simulated annealing (MCSA) algorithm to optimize the PCCA results based on the metastability ($\sum_i M_{ii}$, where $M$ is the TPM of the macrostate model). In spite of their wide application, these spectral-based methods are sensitive to the poorly sampled regions in the dataset.114
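The idea behind sign-structure lumping and the metastability criterion can be illustrated on a four-state toy TPM with two obvious basins. The sketch below performs only a single bi-partition on the second eigenvector and is not a full PCCA or PCCA+ implementation. All function names and the toy matrix are assumptions.

```python
import numpy as np

def stationary_distribution(T):
    """Left eigenvector of T with eigenvalue 1, normalized to sum to one."""
    w, v = np.linalg.eig(T.T)
    pi = v[:, np.argmax(w.real)].real
    return pi / pi.sum()

def sign_bipartition(T):
    """Split microstates into two groups by the sign of the second right eigenvector."""
    w, v = np.linalg.eig(T)
    order = np.argsort(-w.real)
    psi2 = v[:, order[1]].real
    return (psi2 > 0).astype(int)

def lump(T, membership):
    """Macrostate TPM: M_IJ = sum_{i in I, j in J} pi_i T_ij / sum_{i in I} pi_i."""
    pi = stationary_distribution(T)
    chi = np.eye(membership.max() + 1)[membership]  # crisp assignment matrix
    M = chi.T @ (pi[:, None] * T) @ chi
    return M / M.sum(axis=1, keepdims=True)

def metastability(M):
    """Sum of the self-transition probabilities of the macrostate model."""
    return float(np.trace(M))

# Two basins {0, 1} and {2, 3} connected by rare transitions
T = np.array([[0.89, 0.10, 0.01, 0.00],
              [0.10, 0.89, 0.00, 0.01],
              [0.01, 0.00, 0.89, 0.10],
              [0.00, 0.01, 0.10, 0.89]])
membership = sign_bipartition(T)
M = lump(T, membership)
Q = metastability(M)
```

Here the second eigenvector changes sign exactly between the two basins, and the lumped model keeps each macrostate with probability 0.99, giving a metastability of 1.98 out of a maximum of 2.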
Several alternative methods have been proposed to address the sensitivity to poor sampling, including the Bayesian agglomerative clustering engine (BACE),121 the most probable paths algorithm (MPP),122 and the super-level-set hierarchical clustering algorithm (SHC).123 In particular, the Hierarchical Nyström Extension Graph (HNEG) algorithm developed by Yao et al.124 reduces the noise dependence by putting more emphasis on the more populated states. Another recently developed method, renormalization group clustering (RGC),125 provides the optimal microstate representation of the kinetics of the system by minimizing the error induced by projection.
It is worth noting that the choice of the number of macrostates is still an open question. In spectral-based methods, one typically decides the number of metastable states based on the eigenvalue gap of the microstate TPM, while in BACE this number is decided by the gap in the BACE Bayes factors. In HNEG, this number is automatically decided by the extensive graph, which could be an advantage over other methods.
Given a number of macrostate MSMs lumped through different methods, it is also necessary to choose the model that best represents the system. Popular criteria for such choices include the metastability ($\sum_i M_{ii}$) and the Bayes factor.126 The Bayes
[Figure 5 image: panels (a)-(c) for alanine dipeptide, villin headpiece, and β-lactamase; panels (b) and (c) compare the methods BACE, HNEG, MPP, PCCA+, and PCCA, with axes labeled log (Evidence) and Q.]
FIGURE 5 | Comparison of different kinetic lumping methods for the 1-residue alanine dipeptide, 35-residue villin headpiece, and 263-residue β-lactamase systems. (a) Crystal structures of alanine dipeptide (all-atom), villin (ribbon), and β-lactamase (ribbon). (b) Bayes factors of the five lumping methods (less negative means a better model). (c) Metastability values of the five lumping methods (a larger value means a better model) (Figure reprinted with permission from Ref 114. Copyright 2013 AIP Publishing LLC).
factor quantifies the likelihood of producing the microstate chains given a certain lumping. Based on these two criteria, Bowman et al.114 compared the performance of several lumping methods (Figure 5).
Source codes of PCCA, PCCA+, BACE, and MCSA are available at http://www.msmbuilder.org; MPP can be found at http://www.moldyn.uni-freiburg.de/software/software.html; and spectral clustering is available at http://scikit-learn.org.
Calculating the Kinetic Properties
at Macrostate Level
Although a macrostate model is useful for visualization, the lag time necessary for it to become Markovian is often beyond the length of the MD trajectories. Therefore, the kinetic properties are typically computed from the microstate MSM. For example, to calculate the MFPT from one metastable state to another, one may count the first passage events in the trajectories that are mapped, based on the lumping results, from the microstate Markov chains (either directly obtained from MD simulations or synthesized using MCMC based on the microstate MSM).
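Counting first passage events in a lumped chain can be sketched as follows. The convention used here is an assumption made for illustration: the clock starts at the first entry into the source macrostate after the previous arrival at the target, and stops at the first arrival at the target. The function name `mfpt_from_chain` and the toy data are hypothetical.

```python
import numpy as np

def mfpt_from_chain(micro_chain, lumping, source, target, dt=1.0):
    """Mean first passage time source -> target, counted on the lumped chain."""
    macro = lumping[micro_chain]          # map microstates to macrostates
    times, start = [], None
    for t, s in enumerate(macro):
        if s == source and start is None:
            start = t                     # first entry into the source
        elif s == target and start is not None:
            times.append((t - start) * dt)
            start = None                  # wait for the next entry into the source
    mean = float(np.mean(times)) if times else float("nan")
    return mean, len(times)

# Toy microstate chain; microstates 0, 1 form macrostate 0 and 2, 3 form macrostate 1
micro_chain = np.array([0, 1, 0, 2, 3, 2, 0, 1, 1, 3])
lumping = np.array([0, 0, 1, 1])
mean_fpt, n_events = mfpt_from_chain(micro_chain, lumping, source=0, target=1)
```

In practice the chain would be many orders of magnitude longer, and the number of counted events should be reported alongside the mean so that the statistical uncertainty can be judged.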
Application Example: Study of the
Backtracking Mechanism of RNA Polymerase II
RNA polymerases are the core enzymes for DNA transcription in the central dogma of biology. To understand the proofreading mechanism in DNA transcription, Da et al.2 studied, via MSMs, the backtracking process of RNA Polymerase II. To construct the MSMs, the frayed and backtracked states with an rG:dG mismatched base pair were first prepared from the corresponding crystal structures. The pretranslocation structure was built from the backtracked structure (Figure 1(a)). The Climber algorithm was then used to generate two initial paths: pretranslocation → frayed → backtracked and the reverse (Figure 1(b)). Next, four rounds of MD simulations were performed via adaptive sampling, resulting in 480 trajectories (each 100 ns) for analysis (Figure 1(d)-(g)). Subsequently, the MD samples were divided into 800 states by k-centers clustering using the RMSD of heavy atoms as the distance metric. A lag time of 8 ns was then chosen to build the microstate MSM, as it passed the C.K. test. Four macrostates were obtained from the microstate TPM by PCCA+ lumping for visualization. The 800-state microstate MSM was used to calculate all kinetic properties.
Apart from the three known states (pretranslocation, frayed, and backtracked), another kinetically important intermediate state (S3) was identified (Figure 6(a)). Backtracking was found to occur in a stepwise fashion: first, S1 quickly evolves to S2 on a submicrosecond timescale, promoted by the bending of the bridge helix (Figure 6(b)); then, S2 equilibrates with S3 on a microsecond timescale; finally, S3 backtracks to S4 on a timescale of $10^2$ microseconds, which is the rate-limiting step of the whole process.
DISCUSSION AND FUTURE
PERSPECTIVE
Notwithstanding the popularity and advancement of the MSM framework, several limitations remain, including the issue of non-Markovianity for macrostate MSMs, recrossing when counting transitions, and rare events when sampling protein-ligand dissociation. In this section, we review the alternative methods recently developed to overcome these issues, including the Hidden Markov Model (HMM)127-129 and the Hummer & Szabo scheme130 that address the non-Markovian problem of macrostate MSMs, the core-set MSM63 that tackles the recrossing issue, and the transition-based reweighting analysis method (TRAM)131-133 that handles slow transitions in protein-ligand dissociation.
Markovianity is central to MSMs. Apart from our recommendation of using the microstate MSM to compute kinetic properties and the macrostate model for interpretation, several breakthroughs have been made in constructing macrostate MSMs in the non-Markovian regime based on an available microstate MSM. Very recently, Hummer and Szabo130 applied the projection operator scheme to construct a macrostate MSM that reasonably approximates the kinetics of the aggregated states in both the short- and long-time limits. We anticipate that further improvement can be made to calculate quantities like the MFPT based on the reconstructed macrostate transition matrix.
In addition, Noé et al.127 and McGibbon et al.128 have proposed Hidden Markov Models (HMMs) to avoid constructing a Markovian model explicitly. In an HMM, the sequences of observables are assumed to be generated by hidden Markov chains (or paths) according to certain "emission probabilities". The emission probability is the probability for each hidden state to produce a certain observable, conceptually similar to the "membership function" that is used in PCCA+. The complete likelihood function is composed of two parts: the probability of producing a certain hidden path according to the TPM between the hidden states, and the probability of generating the observables according to the emission probabilities. Because the likelihood function is highly complex, one adopts the forward-backward algorithm to
estimate the optimal hidden variables (TPM of hidden states, emission probabilities, and stationary population of hidden states). The kinetic properties of the system can then be computed from the TPM of the HMM. The membership probabilities of each observed state are accordingly computed from the emission probabilities. HMMs have been successfully applied to study the dynamics of ubiquitin128 and the recognition process between the ribonuclease barnase and its inhibitor barstar.39 For these two cases, where a clear separation of timescales exists, the HMM demonstrated better performance than MSMs. However, the interpretation of the HMM is sometimes not straightforward, as there exists overlap between different states.50
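The two-part structure of the HMM likelihood can be made concrete with the forward recursion, which sums over all hidden paths without enumerating them. The sketch below only evaluates the likelihood for fixed, made-up parameters and checks it against brute-force enumeration. It does not perform the parameter estimation described above, and all names and values are assumptions.

```python
import itertools
import numpy as np

def hmm_log_likelihood(obs, T, E, p0):
    """Scaled forward recursion for log P(obs | T, E, p0)."""
    alpha = p0 * E[:, obs[0]]
    loglik = 0.0
    for o in obs[1:]:
        scale = alpha.sum()
        loglik += np.log(scale)
        # Propagate with the hidden TPM, then weight by the emission probability
        alpha = ((alpha / scale) @ T) * E[:, o]
    return float(loglik + np.log(alpha.sum()))

T = np.array([[0.9, 0.1], [0.2, 0.8]])   # TPM between hidden states (toy values)
E = np.array([[0.8, 0.2], [0.3, 0.7]])   # E[h, o]: probability that hidden state h emits o
p0 = np.array([0.5, 0.5])                # initial hidden-state distribution
obs = [0, 1, 1]                          # observed (discretized) sequence
loglik = hmm_log_likelihood(obs, T, E, p0)

# Brute-force check: sum the joint probability over all hidden paths
brute = 0.0
for path in itertools.product(range(2), repeat=len(obs)):
    p = p0[path[0]] * E[path[0], obs[0]]
    for a, b, o in zip(path[:-1], path[1:], obs[1:]):
        p *= T[a, b] * E[b, o]   # hidden transition times emission
    brute += p
```

The brute-force sum grows exponentially with the sequence length, whereas the forward recursion is linear in it, which is why recursions of this kind underlie practical HMM estimation.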
Very often, the timescales of transition events in an MSM are underestimated because of frequent recrossing events near the boundaries of the states. In one trajectory, the system can leave a metastable state, cross the state boundary, and then return to the previous state before visiting the core region of the other state. This results in an overestimation of the number of transitions between states. One intuitive but naïve solution is to define more states around the transition region,28 which will, however, increase the statistical error in the transition counts. Alternatively, one can focus on the core regions of the metastable states and use milestoning processes to estimate the transition probabilities among the cores, as implemented in the core-set MSM approach.63 The
[Figure 6 image: (a) kinetic network of four states S1-S4 spanning the pretranslocation, frayed, and backtracked states, labeled with populations (10.5%, 22.4%, 4.0%, 63.1%) and MFPTs (0.1, 0.8, 1.0, 5.1, 95.9, and 191.8 μs); (b) cartoon of S1-S4 with labels BH, TL, Mg, Y769, T831, Y836, nucleotide positions i-6, i-5, and i+1, and the Syn(rG):Anti(dG) base pair.]
FIGURE 6 | The backtracking process of RNA Polymerase II (Pol II) revealed by MSMs. (a) The stepwise process occurs among four metastable states. The equilibrium populations of the states and the MFPTs among them are labeled. These values are calculated from ultra-long macrostate chains simulated by an 800-state microstate MSM after bootstrapping the original 480 molecular dynamics (MD) trajectories. (b) A cartoon model of the backtracking mechanism. In S1 → S2, the motion of the RNA 3′-end nucleotide is triggered by the bending of the Bridge Helix (BH). In S2 → S3, the BH residue Y836 stacks with the DNA transition nucleotide and the Rpb2 residue Y769 stacks with the RNA 3′-end nucleotide through their aromatic rings. In S3 → S4, the movement of the RNA:DNA hybrid finally delivers Pol II to the backtracked state (Figure adapted with permission from Ref 2. Copyright 2016 Nature Publishing Group).
challenging and open question here is how to locate the core sets with strong metastability. For this, Lemke and Keller134 have proposed several density-based clustering methods to find the core sets.
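The effect of core-based counting on recrossing can be seen in a small example. In the sketch below, frames outside every core keep the label of the last core visited, so that a transition is counted only when a different core is actually reached. The function names, the core definition, and the toy trajectory are assumptions for illustration and do not reproduce the cited core-set MSM implementation.

```python
import numpy as np

def milestone_assign(dtraj, core_of_state):
    """Label each frame with the last core visited (-1 before any core is reached)."""
    cores = core_of_state[dtraj]
    out = np.full(len(cores), -1)
    last = -1
    for t, c in enumerate(cores):
        if c >= 0:
            last = c       # entered a core: update the current milestone
        out[t] = last      # frames outside all cores inherit the last core
    return out

def n_transitions(labels):
    """Number of label changes along the sequence."""
    return int(np.count_nonzero(np.diff(labels)))

# Microstate 1 is a boundary region between core A (state 0) and core B (state 2)
dtraj = np.array([0, 1, 0, 1, 2, 1, 2, 2])
core_of_state = np.array([0, -1, 1])      # -1 marks states outside every core

naive = np.array([0, 1, 1])[dtraj]        # boundary state simply lumped with B
milestoned = milestone_assign(dtraj, core_of_state)
```

The naive lumping counts three transitions because of the excursions into the boundary state, whereas the milestoned sequence counts the single genuine A to B transition.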
Another limitation of MSMs is that the transition events should be reversibly sampled by unbiased MD trajectories under one ensemble (thermodynamic state). This is, however, rather difficult if the transition of interest is a rare event. Although adaptive sampling (see our protocol in Figure 1) can alleviate this issue, direct sampling of extremely slow transitions in one or both of the forward and backward directions is still beyond reach for many systems. This is particularly relevant to protein-ligand recognition, where the association is easy to sample, while the dissociation can occur on timescales of seconds or longer. To tackle this rare event issue, the Noé group has proposed the transition-based reweighting analysis method (TRAM),131-133 in which the kinetic network is constructed with trajectories sampled at multiple thermodynamic states. Using the same configuration state partitioning for all thermodynamic states, a complete likelihood function (TRAM likelihood) is formulated to consider all transition events and the bias potential of each ensemble. The unbiased TPM can then be obtained by solving the maximum likelihood problem for the multiensemble Markov Models (MEMMs).
TRAM has been successfully applied to study the recognition process between the serine protease trypsin and its inhibitor benzamidine (Figure 7).131 Here the unbinding process occurs on a timescale of approximately 1 ms. To sample such rare events, 49.1 μs of unbiased trajectories and 459 umbrella sampling simulations were performed. tICA-MSM was performed on the unbiased trajectories, where the nearest neighbor heavy-atom contacts between benzamidine and all trypsin residues were selected as input features for tICA. k-Means clustering on the joint space of the umbrella coordinate and the tICs yielded a 500-state decomposition for all thermodynamic states. Multiple MSMs were constructed based on the partitioning and all the trajectories. Based on the final unbiased TPM, the transition rate for the dissociation process was reported to be 1170 s$^{-1}$.
CONCLUSION
Constructing MSMs for studying functional conformational changes of complex biomolecules has
[Figure 7 image: (a) coarse-grained kinetic network with states labeled Unbound and (i)-(iv) and numerical rate labels on the connections; (b) plot labeled "Probab. of finding koff" (0.0 to 1.0) against a horizontal axis from 0 to 100, comparing TRAM and MSM.]
FIGURE 7 | The multiensemble Markov Model (MEMM) for the protein-ligand binding of trypsin-benzamidine. (a) The coarse-grained kinetic network of the MEMM. All transition rates between macrostates are labeled in ms$^{-1}$. (b) The efficiency of the transition-based reweighting analysis method (TRAM) and the Markov State Model (MSM) in computing the unbinding kinetics $k_{\mathrm{off}}$ (Figure adapted with permission from Ref 131. Copyright 2016 National Academy of Sciences, USA).
become increasingly popular in the past decade, thanks to the rapid development of the underlying theory, validation tools, and sampling strategies. In this systematic review of recent advances in MSM construction, we proposed a protocol that integrates state-of-the-art techniques to guide beginners who want to study functional conformational changes via the MSM framework.
ACKNOWLEDGMENTS
The authors would like to thank Dr. Fu Kit Sheong for fruitful discussions. This work was supported by the Hong Kong Research Grant Council (grant numbers HKUST C6009-15G, 16305817, 16302214, 16304215, 16318816, AoE/P-705/16, M-HKUST601/13, F-HKUST605/15, and T13607/12R), King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) (OSR-2016-CRG5-3007), and Innovation and Technology Commission (ITCPD/17-9 and ITC-CNERC14SC01). W.W. acknowledges support from the Hong Kong Ph.D. Fellowship Scheme 2014/15 (PF13-14699). X.H. is the Padma Harilela Associate Professor of Science.
REFERENCES
1. Kohlhoff KJ, Shukla D, Lawrenz M, Bowman GR, Konerding DE, Belov D, Altman RB, Pande VS. Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways. Nat Chem 2014, 6:15-21.
2. Da L-T, Pardo-Avila F, Xu L, Silva D-A, Zhang L, Gao X, Wang D, Huang X. Bridge helix bending promotes RNA polymerase II backtracking through a critical and conserved threonine residue. Nat Commun 2016, 7:ncomms11244.
3. Jiang H, Sheong FK, Zhu L, Gao X, Bernauer J, Huang X. Markov state models reveal a two-step mechanism of miRNA loading into the human Argonaute protein: selective binding followed by structural re-arrangement. PLoS Comput Biol 2015, 11:e1004404.
4. Shukla D, Meng Y, Roux B, Pande VS. Activation pathway of Src kinase reveals intermediate states as targets for drug design. Nat Commun 2014, 5:ncomms4397.
5. Bowman GR, Bolin ER, Hart KM, Maguire BC, Marqusee S. Discovery of multiple hidden allosteric sites by combining Markov state models and experiments. Proc Natl Acad Sci USA 2015, 112:2734-2739.
6. Wagner JR, Lee CT, Durrant JD, Malmstrom RD, Feher VA, Amaro RE. Emerging computational methods for the rational discovery of allosteric drugs. Chem Rev 2016, 116:6370-6390.
7. Kendrew JC, Bodo G, Dintzis HM, Parrish RG, Wyckoff H, Phillips DC. A three-dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature 1958, 181:662-666.
8. Wüthrich K. Protein structure determination in solution by NMR spectroscopy. J Biol Chem 1990, 265:22059-22062.
9. Nogales E, Scheres SHW. Cryo-EM: a unique tool for the visualization of macromolecular complexity. Mol Cell 2015, 58:677-689.
10. Callender R, Dyer RB. Probing protein dynamics using temperature jump relaxation spectroscopy. Curr Opin Struct Biol 2002, 12:628-633.
11. Clore GM, Tang C, Iwahara J. Elucidating transient macromolecular interactions using paramagnetic relaxation enhancement. Curr Opin Struct Biol 2007, 17:603-616.
12. Ha T, Ting AY, Liang J, Caldwell WB, Deniz AA, Chemla DS, Schultz PG, Weiss S. Single-molecule fluorescence spectroscopy of enzyme conformational dynamics and cleavage mechanism. Proc Natl Acad Sci USA 1999, 96:893-898.
13. Shaw DE, Deneroff MM, Dror RO, Kuskin JS, Larson RH, Salmon JK, Young C, Batson B, Bowers KJ, Chao JC, et al. Anton, a special-purpose machine for molecular dynamics simulation. Commun ACM 2008, 51:91-97.
14. Eastman P, Friedrichs MS, Chodera JD, Radmer RJ, Bruns CM, Ku JP, Beauchamp KA, Lane TJ, Wang L-P, Shukla D, et al. OpenMM 4: a reusable, extensible, hardware independent library for high performance molecular simulation. J Chem Theory Comput 2013, 9:461-469.
15. Salomon-Ferrer R, Götz AW, Poole D, Le Grand S, Walker RC. Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. Explicit solvent particle mesh Ewald. J Chem Theory Comput 2013, 9:3878-3888.
16. Fitch BG, Germain RS, Mendell M, Pitera J, Pitman M, Rayshubskiy A, Sham Y, Suits F, Swope W, Ward TJC, et al. Blue matter, an application framework for molecular simulation on blue gene. J Parallel Distrib Comput 2003, 63:759-773.
17. Maximova T, Moffatt R, Ma B, Nussinov R,
Shehu A. Principles and overview of sampling
methods for modeling macromolecular structure and
dynamics. PLoS Comput Biol 2016, 12:e1004619.
18. Mitsutake A, Sugita Y, Okamoto Y. Generalized-
ensemble algorithms for molecular simulations of bio-
polymers. Biopolym-Pept Sci Sect 2001, 60:96123.
19. Zheng L, Chen M, Yang W. Random walk in orthog-
onal space to achieve efcient free-energy simulation
of complex systems. Proc Natl Acad Sci USA 2008,
105:2022720232.
20. Gao YQ. An integrate-over-temperature approach for
enhanced sampling. J Chem Phys 2008, 128:64105.
21. Zhang BW, Dai W, Gallicchio E, He P, Xia J, Tan Z,
Levy RM. Simulating replica exchange: Markov state
models, proposal schemes, and the innite swapping
limit. J Phys Chem B 2016, 120:82898301.
22. Barducci A, Bonomi M, Parrinello M. Metadynamics.
Wiley Interdiscip Rev Comput Mol Sci 2011, 1:826843.
23. Bolhuis PG, Chandler D, Dellago C, Geissler PL.
Transition path sampling: throwing ropes over Rough
Mountain passes, in the dark. Annu Rev Phys Chem
2002, 53:291318.
24. Bello-Rivas JM, Elber R. Exact milestoning. J Chem
Phys 2015, 142:94102.
25. Hamelberg D, Mongan J, McCammon JA. Acceler-
ated molecular dynamics: a promising and efcient
simulation method for biomolecules. J Chem Phys
2004, 120:1191911929.
26. Bowman GR, Pande VS, Noé F. An Introduction to
Markov State Models and their Application to Long
Timescale Molecular Simulation, vol. 797. Nether-
lands: Springer Science & Business Media; 2014.
27. Da L-T, Sheong FK, Silva D-A, Huang X. Application
of Markov state models to simulate long timescale
dynamics of biological macromolecules. In: Han K,
Zhang X, Yang M, eds. Protein Conformational
Dynamics [Internet]. Advances in Experimental Med-
icine and Biology. Switzerland: Springer International
Publishing; 2014, 2966 Available at: http://link.
springer.com/chapter/10.1007/978-3-319-02970-2_2.
(Accessed June 25, 2017).
28. Prinz J-H, Wu H, Sarich M, Keller B, Senne M, Held M,
Chodera JD, Schütte C, Noé F. Markov models of molec-
ular kinetics: generation and validation. JChemPhys
2011, 134:174105.
29. Noé F, Schütte C, Vanden-Eijnden E, Reich L,
Weikl TR. Constructing the equilibrium ensemble of
folding pathways from short off-equilibrium simula-
tions. Proc Natl Acad Sci USA 2009,
106:1901119016.
30. Beauchamp KA, McGibbon R, Lin Y-S, Pande VS.
Simple few-state models reveal hidden complexity in
protein folding. Proc Natl Acad Sci USA 2012,
109:1780717813.
31. Lane TJ, Shukla D, Beauchamp KA, Pande VS. To
milliseconds and beyond: challenges in the simulation
of protein folding. Curr Opin Struct Biol 2013,
23:5865.
32. Voelz VA, Jäger M, Yao S, Chen Y, Zhu L,
Waldauer SA, Bowman GR, Friedrichs M, Bakajin O,
Lapidus LJ, et al. Slow unfolded-state structuring in
acyl-CoA binding protein folding revealed by simula-
tion and experiment. J Am Chem Soc 2012, 134:
1256512577.
33. Buchete N-V, Hummer G. Coarse master equations
for peptide folding dynamics. J Phys Chem B 2008,
112:60576069.
34. Da L-T, Avila FP, Wang D, Huang X. A two-state
model for the dynamics of the pyrophosphate ion
release in bacterial RNA polymerase. PLoS Comput
Biol 2013, 9:e1003020.
35. Da L-T, E C, Duan B, Zhang C, Zhou X, Yu J. A
jump-from-cavity pyrophosphate ion release assisted
by a key lysine residue in T7 RNA polymerase tran-
scription elongation. PLoS Comput Biol 2015, 11:
e1004624.
36. Da L-T, Wang D, Huang X. Dynamics of pyrophos-
phate ion release and its coupled trigger loop motion
from closed to open state in RNA polymerase II. J
Am Chem Soc 2012, 134:23992406.
37. Silva D-A, Weiss DR, Avila FP, Da L-T, Levitt M,
Wang D, Huang X. Millisecond dynamics of RNA
polymerase II translocation at atomic resolution. Proc
Natl Acad Sci USA 2014, 111:76657670.
38. Weber M, Bujotzek A, Haag R. Quantifying the
rebinding effect in multivalent chemical ligand-
receptor systems. J Chem Phys 2012 Aug,
137:54111.
39. Plattner N, Doerr S, De Fabritiis G, Noé F. Complete
proteinprotein association kinetics in atomic detail
revealed by molecular dynamics simulations and
Markov modelling. Nat Chem 2017, 9:10051011.
40. Vanatta DK, Shukla D, Lawrenz M, Pande VS. A net-
work of molecular switches controls the activation of
the two-component response regulator NtrC. Nat
Commun 2015, 6:ncomms8283.
41. Silva D-A, Bowman GR, Sosa-Peinado A, Huang X.
A role for both conformational selection and induced
fit in ligand binding by the LAO protein. PLoS Comput
Biol 2011, 7:e1002054.
42. Malmstrom RD, Kornev AP, Taylor SS, Amaro RE.
Allostery through the computational microscope:
cAMP activation of a canonical signalling domain.
Nat Commun 2015, 6:ncomms8588.
43. Lawrenz M, Shukla D, Pande VS. Cloud computing
approaches for prediction of ligand binding poses and
pathways. Sci Rep 2015, 5:srep07918.
44. Buch I, Giorgino T, Fabritiis GD. Complete recon-
struction of an enzyme-inhibitor binding process by
molecular dynamics simulations. Proc Natl Acad Sci
USA 2011, 108:10184–10189.
45. Zhang L, Jiang H, Sheong FK, Pardo-Avila F,
Cheung PP-H, Huang X. Constructing kinetic net-
work models to elucidate mechanisms of functional
conformational changes of enzymes and their recogni-
tion with ligands. Methods Enzymol 2016,
578:343–371.
46. Zhang L, Pardo-Avila F, Unarta IC, Cheung PP-H,
Wang G, Wang D, Huang X. Elucidation of the
dynamics of transcription elongation by RNA poly-
merase II using kinetic network models. Acc Chem
Res 2016, 49:687–694.
47. Zhu L, Sheong FK, Zeng X, Huang X. Elucidating
conformational dynamics of multi-body systems by
constructing Markov state models. Phys Chem Chem
Phys 2016, 18:30228–30235.
48. Schütte C, Sarich M. A critical appraisal of Markov
state models. Eur Phys J Spec Top 2015,
224:2445–2462.
49. Shukla D, Hernández CX, Weber JK, Pande VS. Mar-
kov state models provide insights into dynamic mod-
ulation of protein function. Acc Chem Res 2015,
48:414–422.
50. Schwantes CR, McGibbon RT, Pande VS. Perspec-
tive: Markov models for long-timescale biomolecular
dynamics. J Chem Phys 2014, 141:90901.
51. Chodera JD, Noé F. Markov state models of biomo-
lecular conformational dynamics. Curr Opin Struct
Biol 2014, 25:135–144.
52. Pande VS, Beauchamp K, Bowman GR. Everything
you wanted to know about Markov state models but
were afraid to ask. Methods 2010, 52:99–105.
53. Zwanzig R. Nonequilibrium Statistical Mechanics.
Oxford, UK: Oxford University Press; 2001.
54. Zwanzig R. From classical dynamics to continuous
time random walks. J Stat Phys 1983, 30:255–262.
55. Mori H. Transport, collective motion, and Brownian
motion. Prog Theor Phys 1965, 33:423–455.
56. Zwanzig R. Ensemble method in the theory of irreversibility.
J Chem Phys 1960, 33:1338–1341.
57. Nüske F, Keller BG, Pérez-Hernández G, Mey ASJS,
Noé F. Variational approach to molecular kinetics. J
Chem Theory Comput 2014, 10:1739–1752.
58. Husic BE, McGibbon RT, Sultan MM, Pande VS.
Optimized parameter selection reveals trends in Mar-
kov state models for protein folding. J Chem Phys
2016, 145:194103.
59. McGibbon RT, Pande VS. Variational cross-
validation of slow dynamical modes in molecular
kinetics. J Chem Phys 2015, 142:124105.
60. Schwantes CR, Pande VS. Improvements in Markov
state model construction reveal many non-native
interactions in the folding of NTL9. J Chem Theory
Comput 2013, 9:2000–2009.
61. Pérez-Hernández G, Paul F, Giorgino T, De
Fabritiis G, Noé F. Identification of slow molecular
order parameters for Markov model construction. J
Chem Phys 2013, 139:15102.
62. Naritomi Y, Fuchigami S. Slow dynamics of a protein
backbone in molecular dynamics simulation revealed
by time-structure based independent component anal-
ysis. J Chem Phys 2013, 139:215102.
63. Schütte C, Noé F, Lu J, Sarich M, Vanden-Eijnden E.
Markov state models based on milestoning. J Chem
Phys 2011, 134:204105.
64. Harrigan MP, Sultan MM, Hernandez CX, Husic BE,
Eastman P, Schwantes CR, Beauchamp KA, McGibbon
RT, Pande VS. MSMBuilder: statistical models for bio-
molecular dynamics. bioRxiv 2016;84020.
65. Beauchamp KA, Bowman GR, Lane TJ, Maibaum L,
Haque IS, Pande VS. MSMBuilder2: modeling conforma-
tional dynamics on the picosecond to millisecond scale. J
Chem Theory Comput 2011, 7:3412–3419.
66. Bowman GR, Huang X, Pande VS. Using generalized
ensemble simulations and Markov state models to identify
conformational states. Methods 2009, 49:197–201.
67. Scherer MK, Trendelkamp-Schroer B, Paul F, Pérez-
Hernández G, Hoffmann M, Plattner N, Wehmeyer C,
Prinz J-H, Noé F. PyEMMA 2: a software package for
estimation, validation, and analysis of Markov models.
J Chem Theory Comput 2015, 11:5525–5542.
68. Harvey MJ, De Fabritiis G. High-throughput molecular
dynamics: the powerful new tool for drug discovery.
Drug Discov Today 2012, 17:1059–1062.
69. Liu S, Zhu L, Sheong FK, Wang W, Huang X. Adaptive
partitioning by local density-peaks: an efficient
density-based clustering algorithm for analyzing
molecular dynamics trajectories. J Comput Chem
2017, 38:152–160.
70. Frank J. Three-Dimensional Electron Microscopy of
Macromolecular Assemblies: Visualization of Biologi-
cal Molecules in Their Native State. Oxford, UK:
Oxford University Press; 2006.
71. Rhee YM, Pande VS. Multiplexed-replica exchange
molecular dynamics method for protein folding simulation.
Biophys J 2003, 84:775–786.
72. Sugita Y, Okamoto Y. Replica-exchange molecular
dynamics method for protein folding. Chem Phys Lett
1999, 314:141–151.
73. Weiss DR, Levitt M. Can morphing methods predict
intermediate structures? J Mol Biol 2009,
385:665–674.
74. Isralewitz B, Gao M, Schulten K. Steered molecular
dynamics and mechanical functions of proteins. Curr
Opin Struct Biol 2001, 11:224–230.
75. Schlitter J, Engels M, Krüger P. Targeted molecular
dynamics: a new approach for searching pathways of
conformational transitions. J Mol Graph 1994,
12:84–89.
76. Chovancova E, Pavelka A, Benes P, Strnad O,
Brezovsky J, Kozlikova B, Gora A, Sustr V, Klvana
M, Medek P, et al. CAVER 3.0: a tool for the analy-
sis of transport pathways in dynamic protein struc-
tures. PLOS Comput Biol 2012, 8:e1002708.
77. Zhang L, Silva D-A, Pardo-Avila F, Wang D,
Huang X. Structural model of RNA polymerase II
elongation complex with complete transcription bub-
ble reveals NTP entry routes. PLoS Comput Biol
2015, 11:e1004354.
78. Zimmerman MI, Bowman GR. FAST conformational
searches by balancing exploration/exploitation trade-offs.
J Chem Theory Comput 2015, 11:5747–5757.
79. Gan W, Yang S, Roux B. Atomistic view of the confor-
mational activation of Src kinase using the string method
with swarms-of-trajectories. Biophys J 2009, 97:L8–L10.
80. Maragliano L, Fischer A, Vanden-Eijnden E,
Ciccotti G. String method in collective variables: min-
imum free energy paths and isocommittor surfaces. J
Chem Phys 2006, 125:24106.
81. Pan AC, Roux B. Building Markov state models
along pathways to determine free energies and rates
of transitions. J Chem Phys 2008, 129:64107.
82. Pan AC, Sezer D, Roux B. Finding transition pathways
using the string method with swarms of trajectories.
J Phys Chem B 2008, 112:3432–3440.
83. Maragliano L, Vanden-Eijnden E. On-the-fly string
method for minimum free energy paths calculation.
Chem Phys Lett 2007, 446:182–190.
84. Díaz Leines G, Ensing B. Path finding on high-dimensional
free energy landscapes. Phys Rev Lett
2012, 109:20601.
85. Pelt DM, Batenburg KJ. Fast tomographic reconstruction
from limited data using artificial neural networks.
IEEE Trans Image Process 2013, 22:5238–5251.
86. Theobald DL. Rapid calculation of RMSDs using a
quaternion-based characteristic polynomial. Acta
Crystallogr A 2005, 61:478–480.
87. Ernst M, Sittel F, Stock G. Contact- and distance-
based principal component analysis of protein
dynamics. J Chem Phys 2015, 143:244114.
88. Schwantes CR, Pande VS. Modeling molecular kinet-
ics with tICA and the kernel trick. J Chem Theory
Comput 2015, 11:600–608.
89. Pérez-Hernández G, Noé F. Hierarchical time-lagged
independent component analysis: computing slow modes
and reaction coordinates for large molecular systems. J
Chem Theory Comput 2016, 12:6118–6129.
90. Boninsegna L, Gobbo G, Noé F, Clementi C. Investi-
gating molecular kinetics by variationally optimized
diffusion maps. J Chem Theory Comput 2015,
11:5947–5960.
91. Rohrdanz MA, Zheng W, Clementi C. Discovering
mountain passes via torchlight: methods for the
definition of reaction coordinates and pathways in
complex macromolecular reactions. Annu Rev Phys
Chem 2013, 64:295–316.
92. Noé F, Clementi C. Collective variables for the study of
long-time kinetics from molecular trajectories: theory and
methods. Curr Opin Struct Biol 2017, 43:141–147.
93. Noé F, Clementi C. Kinetic distance and kinetic maps
from molecular dynamics simulation. J Chem Theory
Comput 2015, 11:5002–5011.
94. Sculley D. Web-scale K-means clustering. In: Proceed-
ings of the 19th International Conference on World
Wide Web [Internet] (WWW '10). New York: ACM;
2010, 1177–1178. Available at: http://doi.acm.org/10.1145/1772690.1772862
95. Gonzalez TF. Clustering to minimize the maximum intercluster
distance. Theor Comput Sci 1985, 38:293–306.
96. Zhao Y, Sheong FK, Sun J, Sander P, Huang X. A
fast parallel clustering algorithm for molecular simulation
trajectories. J Comput Chem 2013, 34:95–104.
97. Sheong FK, Silva D-A, Meng L, Zhao Y, Huang X.
Automatic state partitioning for multibody systems
(APM): an efficient algorithm for constructing Markov
state models to elucidate conformational dynamics
of multibody systems. J Chem Theory Comput
2015, 11:17–27.
98. Ward JH. Hierarchical grouping to optimize an
objective function. J Am Stat Assoc 1963,
58:236–244.
99. Husic BE, Pande VS. Ward clustering improves cross-
validated Markov state models of protein folding. J
Chem Theory Comput 2017, 13:963–967.
100. Sittel F, Stock G. Robust density-based clustering to
identify metastable conformational states of proteins.
J Chem Theory Comput 2016, 12:2426–2435.
101. Bowman GR, Beauchamp KA, Boxer G, Pande VS.
Progress and challenges in the automated construc-
tion of Markov state models for full protein systems.
J Chem Phys 2009, 131:124101.
102. Scalco R, Caflisch A. Equilibrium distribution from
distributed computing (simulations of protein folding).
J Phys Chem B 2011, 115:6358–6365.
103. Metzner P, Noé F, Schütte C. Estimating the sampling
error: distribution of transition matrices and func-
tions of transition matrices for given trajectory data.
Phys Rev E 2009, 80:21106.
104. Efron B, Tibshirani RJ. An Introduction to the Boot-
strap. New York: Chapman & Hall; 1994.
105. Bowman GR, Ensign DL, Pande VS. Enhanced
modeling via network theory: adaptive sampling of
Markov state models. J Chem Theory Comput 2010,
6:787–794.
106. Doerr S, Harvey MJ, Noé F, De Fabritiis G. HTMD:
high-throughput molecular dynamics for molecular
discovery. J Chem Theory Comput 2016,
12:1845–1852.
107. Hinrichs NS, Pande VS. Calculation of the distribu-
tion of eigenvalues and eigenvectors in Markovian
state models for molecular dynamics. J Chem Phys
2007, 126:244101.
108. Voelz VA, Elman B, Razavi AM, Zhou G. Surprisal
metrics for quantifying perturbed conformational
dynamics in Markov state models. J Chem Theory
Comput 2014, 10:5716–5728.
109. Weinan E, Vanden-Eijnden E. Transition-path theory
and path-finding algorithms for the study of rare
events. Annu Rev Phys Chem 2010, 61:391–420.
110. Metzner P, Schütte C, Vanden-Eijnden E. Transition
path theory for Markov jump processes. Multiscale
Model Simul 2009, 7:1192–1219.
111. Meng L, Sheong FK, Zeng X, Zhu L, Huang X. Path
lumping: an efficient algorithm to identify metastable
path channels for conformational dynamics of multi-
body systems. J Chem Phys 2017, 147:44112.
112. Zhou G, Pantelopulos GA, Mukherjee S, Voelz VA.
Bridging microscopic and macroscopic mechanisms
of p53-MDM2 binding using molecular simulations
and kinetic network models. bioRxiv 2016;86272.
113. Zheng X, Zhu L, Zeng X, Meng L, Zhang L, Wang
D, Huang X. Kinetics-controlled amphiphile self-
assembly processes. J Phys Chem Lett 2017,
8:1798–1803.
114. Bowman GR, Meng L, Huang X. Quantitative com-
parison of alternative methods for coarse-graining
biological networks. J Chem Phys 2013, 139:121905.
115. Deuflhard P, Huisinga W, Fischer A, Schütte C. Identification
of almost invariant aggregates in reversible
nearly uncoupled Markov chains. Linear Algebra
Appl 2000, 315:39–59.
116. Deuflhard P, Weber M. Robust Perron cluster analysis
in conformation dynamics. Linear Algebra Appl
2005, 398:161–184.
117. Röblitz S, Weber M. Fuzzy spectral clustering by
PCCA+: application to Markov state models and data
classification. Adv Data Anal Classif 2013,
7:147–179.
118. Shi J, Malik J. Normalized cuts and image segmenta-
tion. IEEE Trans Pattern Anal Mach Intell 2000,
22:888–905.
119. Ng AY, Jordan MI, Weiss Y. On spectral clustering:
analysis and an algorithm. In: Proceedings of the
14th International Conference on Neural Information
Processing Systems: Natural and Synthetic [Internet]
(NIPS'01). Cambridge, MA: MIT Press; 2001,
849–856. Available at: http://dl.acm.org/citation.cfm?id=2980539.2980649
120. Chodera JD, Singhal N, Pande VS, Dill KA,
Swope WC. Automatic discovery of metastable states
for the construction of Markov models of macromo-
lecular conformational dynamics. J Chem Phys 2007,
126:155101.
121. Bowman GR. Improved coarse-graining of Markov
state models via explicit consideration of statistical
uncertainty. J Chem Phys 2012, 137:134111.
122. Jain A, Stock G. Identifying metastable states of folding
proteins. J Chem Theory Comput 2012, 8:3810–3819.
123. Huang X, Yao Y, Bowman GR, Sun J, Guibas LJ,
Carlsson G, Pande VS. Constructing multi-resolution
Markov state models (MSMS) to elucidate RNA hair-
pin folding mechanisms. In: Biocomputing 2010
[Internet]. World Scientific; 2009, 228–239. Available
at: http://www.worldscientific.com/doi/abs/10.1142/9789814295291_0025
124. Yao Y, Cui RZ, Bowman GR, Silva D-A, Sun J,
Huang X. Hierarchical Nyström methods for con-
structing Markov state models for conformational
dynamics. J Chem Phys 2013, 138:174106.
125. Orioli S, Faccioli P. Dimensional reduction of Mar-
kov state models from renormalization group theory.
J Chem Phys 2016, 145:124120.
126. Bacallado S, Chodera JD, Pande V. Bayesian compar-
ison of Markov models of molecular dynamics with
detailed balance constraint. J Chem Phys 2009,
131:45106.
127. Noé F, Wu H, Prinz J-H, Plattner N. Projected and
hidden Markov models for calculating kinetics and
metastable states of complex molecules. J Chem Phys
2013, 139:184114.
128. McGibbon R, Ramsundar B, Sultan M, Kiss G,
Pande V. Understanding protein dynamics with L1-
regularized reversible hidden Markov models. In:
International Conference on Machine Learning;
2014, 1197–1205.
129. Shukla S, Shamsi Z, Moffett A, Selvam B, Shukla D.
Application of hidden Markov models in biomolecu-
lar simulations. In: Westhead DR, Vijayabaskar MS,
eds. Hidden Markov Models [Internet]. Methods in
Molecular Biology. New York: Springer; 2017,
29–41. https://doi.org/10.1007/978-1-4939-6753-7_3.
130. Hummer G, Szabo A. Optimal dimensionality reduc-
tion of multistate kinetic and Markov-state models. J
Phys Chem B 2015, 119:9029–9037.
131. Wu H, Paul F, Wehmeyer C, Noé F. Multiensemble
Markov models of molecular thermodynamics and kinetics.
Proc Natl Acad Sci USA 2016, 113:E3221–E3230.
132. Mey ASJS, Wu H, Noé F. xTRAM: estimating equi-
librium expectations from time-correlated simulation
data at multiple thermodynamic states. Phys Rev X
2014, 4:41018.
133. Wu H, Mey ASJS, Rosta E, Noé F. Statistically opti-
mal analysis of state-discretized trajectory data from
multiple thermodynamic states. J Chem Phys 2014,
141:214106.
134. Lemke O, Keller BG. Density-based cluster algo-
rithms for the identication of core sets. J Chem Phys
2016, 145:164104.
... 31,32 Employing furthermore single-molecule (i.e., single-trajectory) information, 33 we can recover the free-energy landscape of the model and construct a Langevin equation [34][35][36] or a Markov state model. [37][38][39][40] The analyses are shown to result in a multiexponential response function with discrete timescales, giving rise to log-periodic oscillations. ...
... As explained above, we wish to analyze a time series given from a nonequilibrium experiment or an MD simulation, using three theoretical formulations: maximum entropy timescale analysis, 32 Markov state modeling, [37][38][39][40] and discrete scale invariance. 20 Moreover, we discuss if the same effects can also be observed under equilibrium conditions. ...
... If the free-energy landscape ΔG(x) of the system is known (e.g., from single-trajectories), we may construct a Markov state model (MSM), [37][38][39][40] which describes the dynamics in terms of memory-less jumps between N metastable conformational states of the system. Assuming a timescale separation between fast intrastate fluctuations and rarely occurring interstate transitions (i.e., the Markov approximation), the dynamics of the system is completely determined by the transition matrix T(τ_lag) containing the probabilities T_ij that the system jumps from state j to i within a lag time τ_lag. ...
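As a reading aid for the excerpt above, the following sketch writes out what the stated convention (T_ij is the probability of jumping from state j to state i) implies. The population vector p, the eigenvalues λ_k, the implied timescales t_k, and the two-state rates a and b are notation introduced here for illustration and do not appear in the excerpt.

```latex
% Column-stochastic convention: T_{ij} = Prob(j -> i within one lag time)
\begin{align}
  p_i(t+\tau_{\mathrm{lag}}) &= \sum_{j=1}^{N} T_{ij}(\tau_{\mathrm{lag}})\, p_j(t),
  \qquad \sum_{i=1}^{N} T_{ij}(\tau_{\mathrm{lag}}) = 1, \\
  \mathbf{p}(n\,\tau_{\mathrm{lag}}) &= \mathbf{T}(\tau_{\mathrm{lag}})^{\,n}\, \mathbf{p}(0), \\
  t_k &= -\frac{\tau_{\mathrm{lag}}}{\ln \lambda_k}, \qquad 0 < \lambda_k < 1 .
\end{align}
% Illustrative two-state case (a, b are assumed jump probabilities):
\begin{equation}
  \mathbf{T} = \begin{pmatrix} 1-a & b \\ a & 1-b \end{pmatrix},
  \qquad \lambda_1 = 1, \quad \lambda_2 = 1-a-b,
  \qquad \mathbf{p}^{\mathrm{eq}} = \frac{1}{a+b}\begin{pmatrix} b \\ a \end{pmatrix}.
\end{equation}
```

For example, with a = 0.1 and b = 0.2 the single relaxation eigenvalue is λ_2 = 0.7, which corresponds to an implied timescale of about 2.8 τ_lag.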
Article
The time-dependent relaxation of a dynamical system may exhibit a power-law behavior that is superimposed by log-periodic oscillations. D. Sornette [Phys. Rep. 297, 239 (1998)] showed that this behavior can be explained by a discrete scale invariance of the system, which is associated with discrete and equidistant timescales on a logarithmic scale. Examples include such diverse fields as financial crashes, random diffusion, and quantum topological materials. Recent time-resolved experiments and molecular dynamics simulations suggest that discrete scale invariance may also apply to hierarchical dynamics in proteins, where several fast local conformational changes are a prerequisite for a slow global transition to occur. Employing entropy-based timescale analysis and Markov state modeling to a simple one-dimensional hierarchical model and biomolecular simulation data, it is found that hierarchical systems quite generally give rise to logarithmically spaced discrete timescales. By introducing a one-dimensional reaction coordinate that collectively accounts for the hierarchically coupled degrees of freedom, the free energy landscape exhibits a characteristic staircase shape with two metastable end states, which causes the log-periodic time evolution of the system. The period of the log-oscillations reflects the effective roughness of the energy landscape and can, in simple cases, be interpreted in terms of the barriers of the staircase landscape.
... 22,23 The choice of the number of macrostates is still an open question. 24,25 In spectral-based methods, like PCCA++ used in this study, the number of coarse-grained macrostates has often been chosen based on the existence of a gap in the eigenvalue spectrum of the transition probability matrix (Fig. S3b). 24,25 However, the choice of the number of macrostates is generally very subjective due to the continuum of eigenvalues. ...
... 24,25 In spectral-based methods, like PCCA++ used in this study, the number of coarse-grained macrostates has often been chosen based on the existence of a gap in the eigenvalue spectrum of the transition probability matrix (Fig. S3b). 24,25 However, the choice of the number of macrostates is generally very subjective due to the continuum of eigenvalues. 25 We first plotted the free energy landscape by mapping all MD conformations onto the top two tICs, which clearly indicates six distinct low-energy basins (Fig. S3c). ...
Article
Through constructing a kinetic model based on extensive all-atom molecular dynamics simulations, the key structural motifs in ApNGT Q469A responsible for mediating the donor-substrate loading are pinpointed.
... Herein, we resolve the conformational landscapes depicting AtSWEET13 apo, holo glucose (GLC), and holo sucrose (SUC) transport cycles. After statistically validating our ~450 µs of aggregate simulation with Markov State Models (MSMs) [24][25][26][27][28] , we identified regions along the AtSWEET13 transmembrane channel critical for differentiating GLC from SUC molecular recognition. ...
Article
Transporters are targeted by endogenous metabolites and exogenous molecules to reach cellular destinations, but it is generally not understood how different substrate classes exploit the same transporter’s mechanism. Any disclosure of plasticity in transporter mechanism when treated with different substrates becomes critical for developing general selectivity principles in membrane transport catalysis. Using extensive molecular dynamics simulations with an enhanced sampling approach, we select the Arabidopsis sugar transporter AtSWEET13 as a model system to identify the basis for glucose versus sucrose molecular recognition and transport. Here we find that AtSWEET13 chemical selectivity originates from a conserved substrate facial selectivity demonstrated when committing alternate access, despite mono-/di-saccharides experiencing differing degrees of conformational and positional freedom throughout other stages of transport. However, substrate interactions with structural hallmarks associated with known functional annotations can help reinforce selective preferences in molecular transport.
... 3,4 Markov state models (MSMs) have emerged as a popular approach to bridge this timescale gap by predicting long timescale dynamics based on numerous short MD trajectories. [5][6][7][8][9] In an MSM, the conformational space is partitioned into metastable states, such that intrastate transitions are fast but interstate changes are slow. The dynamics of populations in the targeted states over long times are predicted by a Markovian master equation governed by a matrix of transition rates among them. ...
Article
A Markov state model is a powerful tool that can be used to track the evolution of populations of configurations in an atomistic representation of a protein. For a coarse-grained linear chain model with discontinuous interactions, the transition rates among states that appear in the Markov model when the monomer dynamics is diffusive can be determined by computing the relative entropy of states and their mean first passage times, quantities that are unchanged by the specification of the energies of the relevant states. In this paper, we verify the folding dynamics described by a diffusive linear chain model of the crambin protein in three distinct solvent systems, each differing in complexity: a hard-sphere solvent, a solvent undergoing multi-particle collision dynamics, and an implicit solvent model. The predicted transition rates among configurations agree quantitatively with those observed in explicit molecular dynamics simulations for all three solvent models. These results suggest that the local monomer–monomer interactions provide sufficient friction for the monomer dynamics to be diffusive on timescales relevant to changes in conformation. Factors such as structural ordering and dynamic hydrodynamic effects appear to have minimal influence on transition rates within the studied solvent densities.
Article
Protein aggregation is a widespread phenomenon implicated in debilitating diseases like Alzheimer's, Parkinson's, and cataracts, presenting complex hurdles for the field of molecular biology. In this review, we explore the evolving realm of computational methods and bioinformatics tools that have revolutionized our comprehension of protein aggregation. Beginning with a discussion of the multifaceted challenges associated with understanding this process and emphasizing the critical need for precise predictive tools, we highlight how computational techniques have become indispensable for understanding protein aggregation. We focus on molecular simulations, notably molecular dynamics (MD) simulations, spanning from atomistic to coarse-grained levels, which have emerged as pivotal tools in unraveling the complex dynamics governing protein aggregation in diseases such as cataracts, Alzheimer's, and Parkinson's. MD simulations provide microscopic insights into protein interactions and the subtleties of aggregation pathways, with advanced techniques like replica exchange molecular dynamics, Metadynamics (MetaD), and umbrella sampling enhancing our understanding by probing intricate energy landscapes and transition states. We delve into specific applications of MD simulations, elucidating the chaperone mechanism underlying cataract formation using Markov state modeling and the intricate pathways and interactions driving the toxic aggregate formation in Alzheimer's and Parkinson's disease. Transitioning, we highlight how computational techniques, including bioinformatics, sequence analysis, structural data, machine learning algorithms, and artificial intelligence, have become indispensable for predicting protein aggregation propensity and locating aggregation-prone regions within protein sequences.
Throughout our exploration, we underscore the symbiotic relationship between computational approaches and empirical data, which has paved the way for potential therapeutic strategies against protein aggregation-related diseases. In conclusion, this review offers a comprehensive overview of advanced computational methodologies and bioinformatics tools that have catalyzed breakthroughs in unraveling the molecular basis of protein aggregation, with significant implications for clinical interventions, standing at the intersection of computational biology and experimental research.
Article
Protein conformational changes play crucial roles in their biological functions. In recent years, the Markov State Model (MSM) constructed from extensive Molecular Dynamics (MD) simulations has emerged as a powerful tool for modeling complex protein conformational changes. In MSMs, dynamics are modeled as a sequence of Markovian transitions among metastable conformational states at discrete time intervals (called lag time). A major challenge for MSMs is that the lag time must be long enough to allow transitions among states to become memoryless (or Markovian). However, this lag time is constrained by the length of individual MD simulations available to track these transitions. To address this challenge, we have recently developed Generalized Master Equation (GME)-based approaches, encoding non-Markovian dynamics using a time-dependent memory kernel. In this Tutorial, we introduce the theory behind two recently developed GME-based non-Markovian dynamic models: the quasi-Markov State Model (qMSM) and the Integrative Generalized Master Equation (IGME). We subsequently outline the procedures for constructing these models and provide a step-by-step tutorial on applying qMSM and IGME to study two peptide systems: alanine dipeptide and villin headpiece. This Tutorial is available at https://github.com/xuhuihuang/GME_tutorials. The protocols detailed in this Tutorial aim to be accessible for non-experts interested in studying the biomolecular dynamics using these non-Markovian dynamic models.
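The role of the lag time described in the abstract above can be illustrated with a minimal sketch of the Markovian baseline only (it is not the qMSM or IGME method and is not taken from the linked tutorial). It counts transitions in a discrete state trajectory at a chosen lag, builds a row-stochastic transition matrix (row i to column j, an assumption of this sketch), and converts its eigenvalues into implied timescales. The function names `simulate_chain`, `estimate_transition_matrix`, and `implied_timescales`, the count-symmetrization step used to enforce reversibility, and the two-state test chain are all hypothetical choices made here.

```python
import numpy as np


def simulate_chain(P, n_steps, rng):
    """Generate a discrete trajectory from a row-stochastic matrix P (toy data)."""
    dtraj = np.zeros(n_steps, dtype=int)
    for t in range(1, n_steps):
        dtraj[t] = rng.choice(len(P), p=P[dtraj[t - 1]])
    return dtraj


def estimate_transition_matrix(dtraj, n_states, lag):
    """Row-stochastic transition matrix from sliding-window counts at a given lag."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i, j] += 1.0
    # Symmetrizing the counts is a simple (not maximum-likelihood) way to
    # obtain a reversible matrix with real eigenvalues.
    counts = 0.5 * (counts + counts.T)
    row_sums = counts.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("some state has no observed transitions at this lag")
    return counts / row_sums


def implied_timescales(T, lag):
    """Implied timescales t_k = -lag / ln(lambda_k), skipping the stationary eigenvalue."""
    eigvals = np.sort(np.real(np.linalg.eigvals(T)))[::-1]
    lam = eigvals[1:]                      # drop the largest eigenvalue (close to 1)
    lam = lam[(lam > 0.0) & (lam < 1.0)]   # keep eigenvalues with a defined timescale
    return -lag / np.log(lam)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P_true = np.array([[0.9, 0.1], [0.2, 0.8]])  # assumed toy kinetics
    dtraj = simulate_chain(P_true, 20000, rng)
    for lag in (1, 2, 5):
        T = estimate_transition_matrix(dtraj, n_states=2, lag=lag)
        print(lag, implied_timescales(T, lag))
```

For a trajectory that is truly Markovian at the level of the chosen states, the implied timescales should be roughly independent of the lag; a systematic drift with increasing lag is the usual sign that a longer lag time, or a non-Markovian model of the kind the abstract describes, is needed.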
Article
RNA interference is an upcoming methodology being designed to specifically target viral infections. The current study suggests a strategy to design probable small interfering RNAs (siRNA) for targeting the viral genome of SARS-CoV-2, as a case study. siRNAs were designed against the targets from a highly conserved region of the spike gene of SARS-CoV-2 having no significant matches within the human genome. Four targets/viral RNAs (vRNA) with high predicted inhibition values were selected for further evaluation. The predicted siRNAs were examined for their properties and stability using molecular dynamics (MD) simulations. Further, to understand the RNA-Induced Silencing Complex (RISC) mechanism of the predicted siRNA targets of SARS-CoV-2, the human argonaute (Ago2) protein in complex with the four siRNA-vRNA duplexes was built. MD simulations of apo-Ago2, four selected siRNA-vRNA duplexes and four Ago2 bound to these siRNA-vRNA duplexes were carried out for 1 μs each. Amongst the four duplex-bound Ago2 simulation systems, the siRNA-vRNA3 duplex showed stable base pairing in the seed region, favourable and strong interactions with functionally important residues of Ago2 protein through the simulation length. Therefore, the designed siRNA3 molecule may act as an effective therapeutic agent against the SARS-CoV-2. The reported in-silico strategy may be beneficial for the identification and designing of probable siRNAs against any viral genome in RNAi therapeutics. However, the experimental validation of these molecules would be required for proving their use as therapeutics.
Article
The introduction of nuclear magnetic resonance (NMR) spectroscopy as a second method for protein structure determination at atomic resolution, in addition to x-ray diffraction in single crystals, has already led to a significant increase in the number of known protein structures. The NMR method provides data that are in many ways complementary to those obtained from x-ray crystallography and thus promises to widen our view of protein molecules, giving a clearer insight into the relation between structure and function.
Article
Constructing Markov state models from large-scale molecular dynamics simulation trajectories is a promising approach to dissect the kinetic mechanisms of complex chemical and biological processes. Combined with transition path theory, Markov state models can be applied to identify all pathways connecting any conformational states of interest. However, the identified pathways can be too complex to comprehend, especially for multi-body processes where numerous parallel pathways with comparable flux probability often coexist. Here, we have developed a path lumping method to group these parallel pathways into metastable path channels for analysis. We define the similarity between two pathways as the intercrossing flux between them and then apply the spectral clustering algorithm to lump these pathways into groups. We demonstrate the power of our method by applying it to two systems: a 2D-potential consisting of four metastable energy channels and the hydrophobic collapse process of two hydrophobic molecules. In both cases, our algorithm successfully reveals the metastable path channels. We expect this path lumping algorithm to be a promising tool for revealing unprecedented insights into the kinetic mechanisms of complex multi-body processes.
Article
Amphiphile self-assembly is an essential bottom-up approach for fabricating advanced functional materials. Self-assembled materials with desired structures are often obtained through thermodynamic control. Here, we demonstrate that the selection of kinetic pathways can lead to drastically different self-assembled structures, underlining the significance of kinetic control in self-assembly. By constructing kinetic network models from large-scale molecular dynamics simulations, we show that two largely similar amphiphiles PYR and PYN prefer distinct kinetic assembly pathways. While PYR prefers an incremental growth mechanism and forms a nanotube, PYN favors a hopping growth pathway leading to a vesicle. Such preference was found to originate from the subtle difference in the distributions of hydrophobic and hydrophilic groups in their chemical structures, which leads to different rates of the adhesion process among the aggregating micelles. Our results are in good agreement with experimental results, and accentuate the role of kinetics in the rational design of amphiphile self-assembly.
Preprint
Under normal cellular conditions, the tumor suppressor protein p53 is kept at low levels in part due to ubiquitination by MDM2, a process initiated by binding of MDM2 to the intrinsically disordered transactivation domain (TAD) of p53. Although many experimental and simulation studies suggest that disordered domains such as p53 TAD bind their targets nonspecifically before folding to a tightly-associated conformation, the molecular details are unclear. Toward a detailed prediction of binding mechanism, pathways and rates, we have performed large-scale unbiased all-atom simulations of p53-MDM2 binding. Markov State Models (MSMs) constructed from the trajectory data predict p53 TAD peptide binding pathways and on-rates in good agreement with experiment. The MSM reveals that two key bound intermediates, each with a non-native arrangement of hydrophobic residues in the MDM2 binding cleft, control the overall on-rate. Using microscopic rate information from the MSM, we parameterize a simple four-state kinetic model to (1) determine that induced-fit pathways dominate the binding flux over a large range of concentrations, and (2) predict how modulation of residual p53 helicity affects binding, in good agreement with experiment. These results suggest new ways in which microscopic models of bound-state ensembles can be used to understand biological function on a macroscopic scale.
Author summary: Many cell signaling pathways involve protein-protein interactions in which an intrinsically disordered peptide folds upon binding its target. Determining the molecular mechanisms that control these binding rates is important for understanding how such systems are regulated. In this paper, we show how extensive all-atom simulations combined with kinetic network models provide a detailed mechanistic understanding of how tumor suppressor protein p53 binds to MDM2, an important target of new cancer therapeutics. A simple four-state model parameterized from the simulations shows a binding-then-folding mechanism, and recapitulates experiments in which residual helicity boosts binding. This work goes beyond previous simulations of small-molecule binding, to achieve pathways and binding rates for a large peptide, in good agreement with experiment.
Article
Under normal cellular conditions, the tumor suppressor protein p53 is kept at low levels in part due to ubiquitination by MDM2, a process initiated by binding of MDM2 to the intrinsically disordered transactivation domain (TAD) of p53. Many experimental and simulation studies suggest that disordered domains such as p53 TAD bind their targets nonspecifically before folding to a tightly associated conformation, but the microscopic details are unclear. Toward a detailed prediction of binding mechanisms, pathways, and rates, we have performed large-scale unbiased all-atom simulations of p53-MDM2 binding. Markov state models (MSMs) constructed from the trajectory data predict p53 TAD binding pathways and on-rates in good agreement with experiment. The MSM reveals that two key bound intermediates, each with a nonnative arrangement of hydrophobic residues in the MDM2 binding cleft, control the overall on-rate. Using microscopic rate information from the MSM, we parameterize a simple four-state kinetic model to 1) determine that induced-fit pathways dominate the binding flux over a large range of concentrations, and 2) predict how modulation of residual p53 helicity affects binding, in good agreement with experiment. These results suggest new ways in which microscopic models of peptide binding, coupled with simple few-state binding flux models, can be used to understand biological function in physiological contexts.
Article
Protein–protein association is fundamental to many life processes. However, a microscopic model describing the structures and kinetics during association and dissociation is lacking on account of the long lifetimes of associated states, which have prevented efficient sampling by direct molecular dynamics (MD) simulations. Here we demonstrate protein–protein association and dissociation in atomistic resolution for the ribonuclease barnase and its inhibitor barstar by combining adaptive high-throughput MD simulations and hidden Markov modelling. The model reveals experimentally consistent intermediate structures, energetics and kinetics on timescales from microseconds to hours. A variety of flexibly attached intermediates and misbound states funnel down to a transition state and a native basin consisting of the loosely bound near-native state and the tightly bound crystallographic state. These results offer a deeper level of insight into macromolecular recognition and our approach opens the door for understanding and manipulating a wide range of macromolecular association processes.
Chapter
Hidden Markov models (HMMs) provide a framework to analyze large trajectories of biomolecular simulation datasets. HMMs decompose the conformational space of a biological molecule into a finite number of states that interconvert with certain rates. HMMs simplify long-timescale trajectories for human comprehension and allow comparison of simulations with experimental data. In this chapter, we provide an overview of building HMMs for analyzing biomolecular simulation datasets. We demonstrate the procedure for building a hidden Markov model for a Met-enkephalin peptide simulation dataset and compare the timescales of the process.
Article
Collective variables are an important concept for studying high-dimensional dynamical systems, such as the molecular dynamics of macromolecules, liquids, or polymers, in particular for defining relevant metastable states and state transitions or phase transitions. Over the past decade, a rigorous mathematical theory has been formulated to define optimal collective variables to characterize slow dynamical processes. Here we review recent developments, including a variational principle to find optimal approximations to slow collective variables from simulation data, and algorithms such as the time-lagged independent component analysis. Using these concepts, a distance metric can be defined that quantifies how slowly molecular conformations interconvert. Extensions and open questions are discussed.
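Time-lagged independent component analysis, mentioned in this abstract, can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the implementation used in any of the cited works: the function name `tica`, the symmetrized covariance estimators, and the toy two-dimensional trajectory (a slow and a fast autoregressive process, linearly mixed) are all introduced here for illustration.

```python
import numpy as np

def tica(X, lag, n_components=2):
    """Minimal tICA: solve C_tau v = lambda C_0 v with symmetrized
    covariance estimates and return eigenvalues and projected data."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                 # remove the mean
    A, B = X[:-lag], X[lag:]               # time-lagged pairs
    n = len(A)
    C0 = (A.T @ A + B.T @ B) / (2 * n)     # instantaneous covariance
    Ct = (A.T @ B + B.T @ A) / (2 * n)     # symmetrized lagged covariance
    s, U = np.linalg.eigh(C0)
    W = U / np.sqrt(s)                     # whitening: W.T @ C0 @ W = I
    eigvals, V = np.linalg.eigh(W.T @ Ct @ W)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = W @ V[:, order]           # tICA eigenvectors
    return eigvals[order], X @ components

# Hypothetical toy trajectory: one slow and one fast AR(1) process,
# mixed so that neither input coordinate is purely slow or fast.
rng = np.random.default_rng(0)
n_frames = 5000
slow = np.zeros(n_frames)
fast = np.zeros(n_frames)
for t in range(1, n_frames):
    slow[t] = 0.99 * slow[t - 1] + rng.normal()
    fast[t] = 0.50 * fast[t - 1] + rng.normal()
X = np.column_stack([slow + fast, slow - fast])

eigvals, tics = tica(X, lag=10)
corr = np.corrcoef(tics[:, 0], slow)[0, 1]
print(eigvals, abs(corr))
```

The leading component has the largest autocorrelation at the chosen lag time, so in this toy case it should track the slow process; its sign is arbitrary, which is why only the absolute correlation is inspected.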
Article
Markov state models (MSMs) are a powerful framework for analyzing protein dynamics. MSMs require the decomposition of conformation space into states via clustering, which can be cross-validated when a prediction method is available for the clustering method. We present an algorithm for predicting cluster assignments of new data points with Ward's minimum variance method. We then show that clustering with Ward's method produces better or equivalent cross-validated MSMs for protein folding than other clustering algorithms.