
NAMD: Biomolecular Simulation on Thousands of Processors

December 2002


DOI:10.1109/SC.2002.10019

Source
IEEE Xplore

Conference: Supercomputing, ACM/IEEE 2002 Conference

Authors:

James C. Phillips

University of Illinois, Urbana-Champaign

Gengbin Zheng

Intel

Sameer Kumar

IBM India Research Lab




James C. Phillips∗, Gengbin Zheng†, Sameer Kumar†, Laxmikant V. Kalé†

Abstract

NAMD is a fully featured, production molecular dynamics program for high performance simulation of large biomolecular systems. We have previously, at SC2000, presented scaling results for simulations with cutoff electrostatics on up to 2048 processors of the ASCI Red machine, achieved with an object-based hybrid force and spatial decomposition scheme and an aggressive measurement-based predictive load balancing framework. We extend this work by demonstrating similar scaling on the much faster processors of the PSC Lemieux Alpha cluster, and for simulations employing efficient (order N log N) particle mesh Ewald full electrostatics. This unprecedented scalability in a biomolecular simulation code has been attained through latency tolerance, adaptation to multiprocessor nodes, and the direct use of the Quadrics Elan library in place of MPI by the Charm++/Converse parallel runtime system.

1 Introduction

NAMD is a parallel, object-oriented molecular dynamics program designed for high performance simulation of large biomolecular systems [6]. NAMD employs the prioritized message-driven execution capabilities of the Charm++/Converse parallel runtime system, allowing excellent parallel scaling on both massively parallel supercomputers and commodity workstation clusters. NAMD is distributed free of charge via the web to over 4000 registered users as both source code and convenient precompiled binaries. NAMD development and support is a service of the National Institutes of Health Resource for Macromolecular Modeling and Bioinformatics, located at the University of Illinois at Urbana-Champaign.

In a molecular dynamics (MD) simulation, full atomic coordinates of the proteins, nucleic acids,

and/or lipids of interest, as well as explicit water and ions, are obtained from known crystallographic

or other structures. An empirical energy function, which consists of approximations of covalent in-

teractions in addition to long-range Lennard-Jones and electrostatic terms, is applied. The resulting

Newtonian equations of motion are typically integrated by symplectic and reversible methods using

a timestep of 1 fs. Modiﬁcations are made to the equations of motion to control temperature and

pressure during the simulation.
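As a minimal illustration of the kind of symplectic, time-reversible integrator mentioned above, the sketch below applies the velocity Verlet scheme to a single one-dimensional harmonic oscillator. The oscillator, the reduced units, and all function names are assumptions chosen for illustration; this is not NAMD's integrator, and it omits the temperature and pressure control terms.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Advance one particle by `steps` velocity Verlet steps (symplectic, time-reversible)."""
    f = force(x)
    for _ in range(steps):
        v += 0.5 * dt * f / mass   # half kick with the current force
        x += dt * v                # full drift
        f = force(x)               # force at the new position
        v += 0.5 * dt * f / mass   # second half kick
    return x, v


def total_energy(x, v, k, mass):
    """Kinetic plus harmonic potential energy."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


# Reduced units, chosen only for illustration.
K, MASS, DT, STEPS = 1.0, 1.0, 0.01, 10_000


def spring(x):
    return -K * x


x0, v0 = 1.0, 0.0
x1, v1 = velocity_verlet(x0, v0, spring, MASS, DT, STEPS)
drift = abs(total_energy(x1, v1, K, MASS) - total_energy(x0, v0, K, MASS))

# Time reversibility: flip the velocity, integrate again, and recover the start.
xb, vb = velocity_verlet(x1, -v1, spring, MASS, DT, STEPS)

print(f"energy change after {STEPS} steps: {drift:.2e}")
print(f"round-trip position error: {abs(xb - x0):.2e}")
```

The energy error stays bounded rather than growing with the number of steps, which is the property that makes such integrators suitable for long simulations.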

With continuing increases in high performance computing technology, the domain of biomolecular simulation has rapidly expanded from isolated proteins in solvent to include complex aggregates, often in a lipid environment. Such simulations can easily exceed 100,000 atoms (see Fig. 1). Similarly, studying the function of even the simplest of biomolecular machines requires simulations of 10 ns or longer, even when techniques for accelerating processes of interest are employed. The goal of interactive simulation of smaller molecules places even greater demands on the performance of MD software, as the sensitivity of the haptic interface increases quadratically with simulation speed [11].

Figure 1: Simulations have increased exponentially in size, from BPTI (upper left, about 3K atoms), through the estrogen receptor (lower left, 36K atoms, 1996), to F1-ATPase (right, 327K atoms, 2001). (Atom counts include solvent.)

∗ Beckman Institute, University of Illinois at Urbana-Champaign.

† Department of Computer Science and Beckman Institute, University of Illinois at Urbana-Champaign.

Charm++: http://charm.cs.uiuc.edu/

NAMD: http://www.ks.uiuc.edu/Research/namd/

Resource: http://www.ks.uiuc.edu/

0-7695-1524-X/02 $17.00 © 2002 IEEE

Despite the seemingly unending progress in microprocessor performance indicated by Moore’s

law, the urgent nature and computational needs of biomedical research demand that we pursue

the additional factors of tens, hundreds, or thousands in total performance which may be had by

harnessing a multitude of processors for a single calculation. While the MD algorithm is blessed

with a large ratio of calculation to data, its parallelization to large numbers of processors is not

straightforward [4].

2 NAMD Parallelization Strategy

We have approached the scalability challenge by adopting message-driven approaches and reducing

the complexity associated with these methods by combining multi-threading and an object-oriented

implementation in C++.

The dynamic components of NAMD are implemented in the Charm++[8] parallel language.

Charm++ implements an object-based message-driven execution model. In Charm++ applications,

there are collections of C++ objects, which communicate by remotely invoking methods on other

objects by messages.

Compared with conventional programming models such as message passing, shared memory or

data parallel programming, Charm++ has several advantages in improving locality, parallelism and

load balance [3, 7]. The ﬂexibility provided by Charm++ is a key to the high performance achieved

by NAMD on thousands of processors.

In Charm++ applications, users decompose the problem into objects, and since they decide

the granularity of the objects, it is easier for them to control the degree of parallelism. As de-

scribed below, NAMD uses a novel way of decomposition that easily generates the large amount of

parallelism needed to occupy thousands of processors.

Charm++’s object-based decomposition also helps users to improve data locality. Objects encapsulate state: Charm++ objects are only allowed to directly access their own local memory. Access to other data is only possible via asynchronous method invocation on other objects.

Charm++’s parallel objects and data-driven execution adaptively overlap communication and computation and hide communication latency: when an object is waiting for some incoming data, entry functions of other objects with all data ready are free to execute.

In Charm++, objects may even migrate from processor to processor at runtime. Object migration is controlled by the Charm++ load balancer. Charm++ implements a measurement-based load balancing framework which automatically instruments all Charm++ objects, collects computation load and communication patterns during execution, and stores them in a “load balancing database”. Charm++ then provides a collection of load balancing strategies whose job is to decide on a new mapping of objects to processors based on information from the database. Load balancing strategies are implemented in Charm++ as libraries. Programmers can easily experiment with different existing strategies by linking different strategy modules and specifying which strategy to use at runtime via command line options. This requires very little effort from programmers while achieving significant performance improvements in adaptive applications. Application-specific load balancing strategies can also be developed by users and plugged in easily. In the following paragraphs, we describe the load balancing strategies optimized for NAMD in detail.

NAMD 1 is parallelized via a form of spatial decomposition using cubes whose dimensions are

slightly larger than the cutoﬀ radius. Thus, atoms in one cube need to interact only with their 26

neighboring cubes. However, one problem with this spatial decomposition is that the number of

cubes is limited by the simulation space. Even on a relatively large molecular system, such as the

92K atom ApoA1 benchmark, we only have 144 (6 × 6 × 4) cubes. Further, as the density of the system

varies across space, one may encounter strong load imbalances.

NAMD 2 addresses this problem with a novel combination of force [10] and spatial decomposi-

tion. For each pair of neighboring cubes, we assign a non-bonded force computation object, which

can be independently mapped to any processor. The number of such objects is therefore 14 times

(26/2 + 1 self-interaction) the number of cubes. To further increase the number and reduce the

granularity of these compute objects, they are split into subsets of interactions, each of roughly

equal work.
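The factor of 14 can be checked by brute force. The sketch below enumerates unordered pairs of neighboring cubes in a periodic grid, plus one self-interaction per cube, for the two patch grids quoted in this paper (6 × 6 × 4 for ApoA1 and 11 × 8 × 8 for ATPase). The function name is hypothetical, and the count is taken before the further splitting into subsets described above.

```python
from itertools import product


def count_compute_objects(nx, ny, nz):
    """Count self plus neighbor-pair objects on a periodic nx x ny x nz grid of cubes."""
    dims = (nx, ny, nz)
    objects = set()
    for cell in product(range(nx), range(ny), range(nz)):
        for offset in product((-1, 0, 1), repeat=3):
            neighbor = tuple((c + o) % d for c, o, d in zip(cell, offset, dims))
            # An unordered pair; the zero offset collapses to the self-interaction.
            objects.add(frozenset((cell, neighbor)))
    return len(objects)


apoa1 = count_compute_objects(6, 6, 4)     # 144 cubes -> 2016 objects
atpase = count_compute_objects(11, 8, 8)   # 704 cubes -> 9856 objects
print(apoa1, apoa1 // (6 * 6 * 4))
print(atpase, atpase // (11 * 8 * 8))
```

Even for the smaller benchmark this gives roughly two thousand independently placeable objects where plain spatial decomposition offers 144 cubes.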

The cubes described above are represented in NAMD 2 by objects called home patches. Each

home patch is responsible for distributing coordinate data, retrieving forces, and integrating the

equations of motion for all of the atoms in the cube of space owned by the patch. The forces used

by the patches are computed by a variety of compute objects. There are several varieties of compute

objects, responsible for computing the diﬀerent types of forces (bond, electrostatic, constraint, etc.).

Some compute objects require data from one patch, and only calculate interactions between atoms

within that single patch. Other compute objects are responsible for interactions between atoms

distributed among neighboring patches. Relationships among objects are illustrated in Fig. 2.

Figure 2: NAMD 2 hybrid force/spatial decomposition. Atoms are spatially decomposed into patches, which are represented on other nodes by proxies. Interactions between atoms are calculated by several classes of compute objects. (Diagram labels: Patch, Proxy, Bonded Force Objects, Non-bonded Pair Compute Objects, Non-bonded Self Compute Objects.)

When running in parallel, some compute objects require data from patches not on the compute object’s processor. In this case, a proxy patch takes the place of the home patch on the compute

object’s processor. During each time step, the home patch requests new forces from local compute

objects, and sends its atom positions to all its proxy patches. Each proxy patch informs the

compute objects on the proxy patch’s processor that new forces must be calculated. When the

compute objects provide the forces to the proxy, the proxy returns the data to the home patch,

which combines all incoming forces before integrating. Thus, all computation and communication

is scheduled based on priority and the availability of required data.

Some compute objects are permanently placed on processors at the start of the simulation, but

others are moved during periodic load balancing phases. Ideally, all compute objects would be able

to be moved around at any time. However, where calculations must be performed for atoms in

several patches, it is more eﬃcient to assume that some compute objects will not move during the

course of the simulation. In general, the bulk of the computational load is represented by the non-

bonded (electrostatic and van der Waals) interactions, and certain types of bonds. These objects

are designed to be able to migrate during the simulation to optimize parallel eﬃciency. The non-

migratable objects, including computations for bonds spanning multiple patches, represent only a

small fraction of the work, so good load balance can be achieved without making them migratable.

NAMD uses a measurement-based load balancer, employing the Charm++ load balancing

framework. When a simulation begins, patches are distributed according to a recursive coordi-

nate bisection scheme [1], so that each processor receives a number of neighboring patches. All

compute objects are then distributed to a processor owning at least one home patch, ensuring that

each patch has at most seven proxies. The dynamic load balancer uses the load measurement

capabilities of Converse to reﬁne the initial distribution. The framework measures the execution

time of each compute object (the object loads), and records other (non-migratable) patch work as

“background load.” After the simulation runs for several time-steps (typically several seconds to

several minutes), the program suspends the simulation to trigger the initial load balancing. NAMD

retrieves the object times and background load from the framework, computes an improved load

distribution, and redistributes the migratable compute objects.

The initial load balancer is aggressive, starting from the set of required proxies and assigning

compute objects in order from larger to smaller, avoiding creating new proxies unless necessary.

To assist this algorithm when the number of processors is larger than the number of patches,

unoccupied processors are seeded with a proxy from a home patch on a close processor. In addition,

when placing compute objects which only require a single patch (calculating interactions within

that patch) the load balancer will prefer a processor with a proxy over the one with the actual

home patch, reserving the home patch nodes for objects which need them more. After this initial

balancing, only small reﬁnements are made, attempting to transfer single compute objects oﬀ of

overloaded processors without increasing communication. Two additional cycles of load balancing

follow immediately in order to account for increased communication load, after which load balancing

occurs periodically to maintain load balance.
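The flavor of the initial assignment pass can be sketched as a greedy heuristic: take compute objects from largest to smallest, and place each on a processor that stays under a load target while needing as few new proxies as possible. This is a simplified illustration, not NAMD's load balancer. The function names, the `slack` factor, and the tie-breaking rule are assumptions, and the sketch leaves out the refinement passes, the seeding of empty processors, and the preference for proxy processors over home-patch processors.

```python
def missing_patches(proc, patches, patch_locations):
    """How many of `patches` have neither a home patch nor a proxy on `proc`."""
    return sum(1 for patch in patches if proc not in patch_locations[patch])


def greedy_assign(objects, background, patch_locations, slack=1.1):
    """Place (load, patches) objects largest-first; patch_locations is updated in place."""
    load = list(background)
    total = sum(load) + sum(obj_load for obj_load, _ in objects)
    target = slack * total / len(load)          # assumed per-processor load ceiling
    placement, new_proxies = {}, 0
    order = sorted(range(len(objects)), key=lambda i: -objects[i][0])
    for i in order:
        obj_load, patches = objects[i]
        fits = [p for p in range(len(load)) if load[p] + obj_load <= target]
        candidates = fits or list(range(len(load)))
        # Fewest new proxies first, then the least loaded processor.
        best = min(candidates,
                   key=lambda p: (missing_patches(p, patches, patch_locations), load[p]))
        for patch in patches:
            if best not in patch_locations[patch]:
                patch_locations[patch].add(best)   # create a proxy on `best`
                new_proxies += 1
        load[best] += obj_load
        placement[i] = best
    return placement, load, new_proxies


# Hypothetical toy problem: two patches, four processors, two of them empty.
background = [1.0, 1.0, 0.0, 0.0]               # non-migratable patch work
patch_locations = {"A": {0}, "B": {1}}          # home patches only
objects = [(4.0, ("A", "B")), (3.0, ("A",)), (3.0, ("B",)),
           (2.0, ("A", "B")), (1.0, ("A",)), (1.0, ("B",))]
placement, load, new_proxies = greedy_assign(objects, background, patch_locations)
print(placement, load, new_proxies)
```

In this toy case the measured background load is what pushes the largest pair object onto an empty processor, which is the effect the measurement-based approach relies on.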

3 Particle Mesh Ewald in NAMD

Particle mesh Ewald (PME) [5] full electrostatics calculations in NAMD have been parallelized in

several stages, and this one feature has greatly aﬀected the performance and scalability observed

by users. As seen in Fig. 3, signiﬁcant progress has been made. We initially incorporated the

external DPME

package into NAMD 2.0, providing a stable base functionality. The PME recip-

rocal sum was serialized in this version because the target workstation clusters for which DPME

was developed obtained suﬃcient scaling by only distributing the direct interactions. In NAMD,

the direct interactions were incorporated into the existing parallel cutoﬀ non-bonded calculation.

The reciprocal sum was reimplemented more eﬃciently in NAMD 2.1, providing a base for later

parallelization.

An eﬀective parallel implementation of the PME reciprocal sum was ﬁnally achieved in NAMD 2.2.

The ﬁnal design, while elegant, was far from obvious, as numerous false starts were attempted along

the way. In particular, the parallel implementation of FFTW

was found to be inappropriate for

our purposes, due to an ineﬃcient transpose operation designed to conserve memory. Instead,

the serial components of the FFTW 3D FFT (2D and 1D FFTs) were used in combination with

Charm++ messages to implement the data transpose. The reciprocal sum is currently parallelized

only to the size of the PME grid, which is typically between 50 and 200. However, this is suﬃcient

to allow NAMD simulations to scale eﬃciently, as the bottleneck has been signiﬁcantly reduced.

The PME calculation involves ﬁve phases of calculation and four of communication, which the

message-driven design of NAMD interleaves with other work, giving good performance even on

high latency networks. The basic parallel structure is illustrated in Fig. 4. This overlap allows the

FFT components of the PME operations and real-space direct force computations to execute on

the same processor, in an interleaved fashion. When the number of processors is comparable to the

number of PME processors (e.g. on 512 processors) this is a substantial advantage.

Distributed Particle–Mesh Ewald (DPME), http://www.ee.duke.edu/Research/SciComp/SciComp.html

Fastest Fourier Transform in the West (FFTW), http://www.fftw.org/

Figure 3: Total resources consumed per step for 92K atom benchmark with PME every four steps by NAMD versions 2.1–2.4 on varying numbers of processors of the PSC T3E. Perfect linear scaling is a horizontal line. Diagonal scale shows runtime per ns. (Plot title: NAMD ApoA1 (PME) Benchmark on PSC Cray T3E. Horizontal axis: number of processors, 8 to 512. Vertical axis: processors × time per step, in seconds. Diagonal scale: runtime for a 1 ns simulation, from 1 month down to 2 days. Curves: NAMD 2.1, 2.2, 2.3, 2.4.)

In the first and most expensive calculation, atomic charges are interpolated smoothly onto (typically) 4 × 4 × 4 subsections of a three dimensional mesh; this is done on each processor for the atoms in the patches it possesses, as the process is strictly additive. Next, the grid is composited and decomposed along its first dimension to as many processors as possible. The spatial decomposition of the patches is used to avoid unnecessary messages or empty parts of the grid being transmitted. A local 2D FFT is performed by each processor on the second and third dimension of the grid, and the grid is then redistributed along its second dimension for a final 1D FFT on the first dimension. The transformed grid is then multiplied by the appropriate Ewald electrostatic kernel, and a backward FFT performed on the first dimension. The grid is redistributed back along its first dimension, and a backward 2D FFT performed, producing real-space potentials. The initial communication pattern is then reversed, sending the exact data required to extract atomic forces back to the processors with patches.
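The forward half of this scheme relies on the fact that a 3D transform factors into a 2D transform on each plane, a transpose, and a 1D transform along the remaining dimension. The sketch below checks that identity on a tiny grid. It uses a naive pure-Python DFT in place of FFTW and plain list indexing in place of Charm++ messages; the grid size and all function names are assumptions for illustration.

```python
import cmath


def dft1d(values):
    """Naive forward discrete Fourier transform of a sequence."""
    n = len(values)
    return [sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def dft2d(plane):
    """2D DFT of plane[y][z]: transform along z, then along y."""
    rows = [dft1d(row) for row in plane]
    cols = [dft1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def dft3d_direct(g):
    """Reference 3D DFT evaluated straight from the definition."""
    nx, ny, nz = len(g), len(g[0]), len(g[0][0])

    def term(kx, ky, kz):
        return sum(g[x][y][z]
                   * cmath.exp(-2j * cmath.pi * (kx * x / nx + ky * y / ny + kz * z / nz))
                   for x in range(nx) for y in range(ny) for z in range(nz))

    return [[[term(kx, ky, kz) for kz in range(nz)] for ky in range(ny)]
            for kx in range(nx)]


def dft3d_slab(g):
    """Plane-wise 2D DFT, transpose, then 1D DFT along the first dimension."""
    nx, ny, nz = len(g), len(g[0]), len(g[0][0])
    # Phase 1: each owner of an x-plane transforms it locally over (y, z).
    planes = [dft2d(g[x]) for x in range(nx)]
    # Transpose: regroup by y so that each owner holds complete x-columns.
    pencils = [[[planes[x][y][z] for x in range(nx)] for z in range(nz)]
               for y in range(ny)]
    # Phase 2: 1D transform along x.
    done = [[dft1d(pencils[y][z]) for z in range(nz)] for y in range(ny)]
    return [[[done[y][z][x] for z in range(nz)] for y in range(ny)] for x in range(nx)]


NX, NY, NZ = 4, 3, 2
grid = [[[float((7 * x + 3 * y + 5 * z) % 11) for z in range(NZ)]
         for y in range(NY)] for x in range(NX)]
a, b = dft3d_direct(grid), dft3d_slab(grid)
max_err = max(abs(a[x][y][z] - b[x][y][z])
              for x in range(NX) for y in range(NY) for z in range(NZ))
print(f"max difference between direct and slab transforms: {max_err:.2e}")
```

Only the transpose step moves data between owners, which is why the text treats the transposes as the communication phases of PME.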

Improvements to the PME direct sum in NAMD 2.3 were obtained by eliminating expensive calls to erfc(). This was accomplished by incorporating the entire short-range non-bonded potential into an interpolation table. By interpolating based on the square of the interaction distance, the calculation of 1/√(r²), with r the interaction distance, was eliminated as well. The interpolation table is carefully constructed to avoid cancellation errors and is iteratively refined during program startup to eliminate discontinuities in the calculated forces. Simulations performed with the new code are up to 50% faster than before and are of equivalent accuracy.
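The idea of tabulating in r² can be sketched as follows. The example tabulates erfc(βr)/r on a uniform grid in r² and evaluates it by linear interpolation, so the inner loop needs neither a square root nor erfc(). The value of β, the table size, the lower bound on r², and the use of linear interpolation are all assumptions for illustration; NAMD's actual table covers the full short-range potential and is constructed and refined more carefully than this.

```python
import math

BETA = 0.3       # assumed Ewald splitting parameter, in 1/Å
CUTOFF = 12.0    # Å, the cutoff used for the benchmarks in this paper
R2_MIN = 1.0     # assumed smallest tabulated squared distance, in Å²
N = 4096         # assumed number of table intervals


def exact(r2):
    """Reference value: needs both sqrt and erfc."""
    r = math.sqrt(r2)
    return math.erfc(BETA * r) / r


STEP = (CUTOFF ** 2 - R2_MIN) / N
TABLE = [exact(R2_MIN + i * STEP) for i in range(N + 1)]   # built once at startup


def lookup(r2):
    """Linear interpolation in r2: no sqrt and no erfc per interaction."""
    t = (r2 - R2_MIN) / STEP
    i = min(int(t), N - 1)
    frac = t - i
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])


samples = [(1.0 + 0.011 * i) ** 2 for i in range(1000)]    # r from 1 Å to just under 12 Å
max_err = max(abs(lookup(r2) - exact(r2)) for r2 in samples)
print(f"largest absolute interpolation error: {max_err:.2e}")
```

In a pairwise loop r² comes directly from the coordinate differences, so a table indexed by r² removes the square root from the critical path.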

The erfc(x) function computes the complement of the error function of x.

Figure 4: Parallel structure of PME in NAMD. When PME is required, typically every four steps, PME interleaves with other calculation. (Diagram labels: Patches: Integration, Angle Compute Objects, Pairwise Compute Objects, PME, Transposes, Asynchronous Reductions, Multicast, Point to Point.)

Much of the improvement observed in NAMD 2.4 is due to the elimination of unnecessary messages and data, which could be excluded based on the overlap of patches and the PME grid. Distributing PME calculations to at most one processor per node for large machines in NAMD 2.4 has reduced contention for available interconnect bandwidth and other switch or network interface resources. Patches were similarly distributed, avoiding processors participating in PME when possible. This has increased the scalability of NAMD on the existing machines at NCSA, PSC, and SDSC with 2, 4, and 8 processors per node. A more aggressive optimization would employ different sets of processors for each phase of the PME calculation on a sufficiently large machine. Figure 5 illustrates the portable scalability of NAMD 2.4 on a variety of recent platforms employed for production simulations by researchers at the Resource.

Continuing development for NAMD 2.5 as reported here has resulted in a more eﬃcient inter-

polation table implementation and the elimination of unnecessary calculations from non-multiple-

timestepping simulations, resulting in substantial improvements to serial performance. To improve

locality of communication, patches and the PME grid planes with which they communicate have

been aligned on the processors. In order to improve performance on Lemieux, various manipulations

of the load balancer have been attempted, including removing all work from processors involved in

the PME reciprocal sum, with limited success.

4 Optimization for PSC Lemieux

The Pittsburgh Supercomputing Center’s Lemieux is a high performance Alpha cluster with 3000

processors and a peak performance of about 6 TFLOPS. Lemieux uses the Quadrics intercon-

nection network, which surpasses other communication networks in speed and programmability

[9]. Lemieux provides a default MPI library to access its interconnect. Our NAMD/Charm++

implementation on top of MPI uses MPI_Iprobe to get the size of the received message and then

allocates a buffer for it. MPI_Iprobe turned out to be a very expensive call and reduced the speedup

of NAMD.

Figure 5: Total resources consumed per step for 92K atom ApoA1 PME MTS benchmark by NAMD 2.4 on varying numbers of processors for recent parallel platforms. Perfect linear scaling is a horizontal line. Diagonal scale shows runtime per ns. (Plot title: NAMD 2.4 ApoA1 (PME) Benchmark. Horizontal axis: number of processors, 1 to 512. Vertical axis: processors × time per step, in seconds. Diagonal scale: runtime for a 1 ns simulation, from 1 month down to 12 hours. Curves: NCSA Platinum 2xP3/1000 Myri; SDSC IBM SP 8xPWR3/375; Resource Linux K7/1333 100bT; NCSA Titan 2xIA64/800 Myri; PSC LeMieux 4xev6/1000.)

The network interface in Quadrics can also be accessed through the Elan communication library [12]. The Elan message passing library has a low latency of 5 µs for small messages and a peak bandwidth of over 300 MB/s. The flexibility and programmability of the Elan library motivated us to create a NAMD/Charm++ implementation directly on top of it. Our implementation has two types of messages, small and large, distinguished by different tags. Small messages are sent asynchronously along with the message envelope and are received in preallocated buffers. The maximum size of small messages and the number of preallocated buffers are parameters to the program (we set the size of the buffers to about 5 KB and the number of buffers to the number of processors in the system). For large messages, the sender first sends the message envelope; on receiving it, the receiver allocates a buffer for the message, and a DMA copy is performed from the sender’s memory into the allocated buffer. This framework is very easy to implement in Elan. A comparison of NAMD’s performance on the MPI and Elan implementations is shown in Table 1.
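The small/large message split described above can be modeled in a few lines. The sketch below is a single-process toy model and does not call the Elan library; the class and function names are hypothetical, and an ordinary byte copy stands in for the DMA transfer.

```python
SMALL_MAX = 5 * 1024   # about 5 KB, the small-message buffer size quoted in the text


class Receiver:
    """Toy receiver with preallocated buffers for small messages."""

    def __init__(self, n_buffers):
        self.free = [bytearray(SMALL_MAX) for _ in range(n_buffers)]
        self.delivered = []

    def recv_small(self, payload):
        buf = self.free.pop()                  # no allocation on the receive path
        buf[:len(payload)] = payload
        self.delivered.append(("small", bytes(buf[:len(payload)])))
        self.free.append(buf)                  # buffer is recycled after handling

    def recv_envelope(self, size, fetch):
        buf = bytearray(size)                  # allocate once the envelope gives the size
        buf[:] = fetch()                       # stands in for the DMA copy from the sender
        self.delivered.append(("large", bytes(buf)))


def send(receiver, payload):
    """Send eagerly when the payload fits a preallocated buffer, else envelope first."""
    if len(payload) <= SMALL_MAX:
        receiver.recv_small(payload)
    else:
        receiver.recv_envelope(len(payload), lambda: payload)


receiver = Receiver(n_buffers=4)
send(receiver, b"x" * 100)       # small: goes straight into a preallocated buffer
send(receiver, b"y" * 10_000)    # large: envelope first, then the bulk copy
print([kind for kind, _ in receiver.delivered])
```

The receiver never has to probe for a message size on the small-message path, which is the cost the MPI-based implementation paid on every receive.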

The Elan network interface also allows user threads to be run on its processor. These threads

could process the messages as soon as they are received and not disturb the main processors which

are busy with computation. Thus multicast and reduction messages can be immediately sent to

their destinations and other messages can be copied into local memory to be handled by the main

processors when they become idle. This would greatly improve the eﬃciency of high level message

passing systems like Charm++. This approach is currently being investigated.

We also implemented a message combining library that reduces the number of messages sent during the PME transposes from 192 or 144 for ATPase to 28. The library imposes a virtual mesh topology on the processors, and each processor first combines and sends data to its column neighbors and then to its row neighbors. Thus each processor sends 2√p messages instead of the p messages needed without the library. This reduces the per-message cost and is very useful for small messages. Note, however, that each byte is sent twice over the communication network. For ApoA1 the message size is around 600 B and the library effectively reduces the step time. For ATPase the message size is around 900 B, which limits the gains of the library: it reduces the step time by about 1 ms for runs of over 2000 processors.
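The two-phase routing can be simulated to confirm both the message count and that every item still arrives. The sketch below assumes a full rectangular mesh; the 12 × 16 shape used for 192 processors and the function names are illustrative assumptions, not the layout used by the library.

```python
def coord(rank, cols):
    """Row and column of a processor in a row-major virtual mesh."""
    return divmod(rank, cols)


def mesh_all_to_all(rows, cols):
    """Route one item from every processor to every processor in two combined phases."""
    p = rows * cols
    sent = [0] * p
    stage = {i: [] for i in range(p)}
    # Phase 1: bundle items by destination row; send within the sender's own column.
    for src in range(p):
        _, c = coord(src, cols)
        bundles = {}
        for dst in range(p):
            r_dst, _ = coord(dst, cols)
            bundles.setdefault(r_dst * cols + c, []).append((src, dst))
        for mid, items in bundles.items():
            stage[mid].extend(items)
            if mid != src:
                sent[src] += 1
    # Phase 2: bundle by final destination; all of them lie in the forwarder's row.
    delivered = {i: [] for i in range(p)}
    for mid in range(p):
        bundles = {}
        for src, dst in stage[mid]:
            bundles.setdefault(dst, []).append(src)
        for dst, sources in bundles.items():
            delivered[dst].extend(sources)
            if dst != mid:
                sent[mid] += 1
    return sent, delivered


sent, delivered = mesh_all_to_all(12, 16)   # 192 processors
print(max(sent), "combined messages per processor instead of", 12 * 16 - 1)
```

For this mesh each processor sends 11 + 15 = 26 combined messages, close to the 2√p ≈ 28 quoted above, and every item makes two hops, which is the doubled byte traffic the text warns about.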

     Processors            Time/step             Speedup          GFLOPS
  Total  Per Node        MPI        Elan     MPI    Elan     MPI    Elan
      1         1    28.08 s     28.08 s       1       1   0.480   0.480
    128         4   248.3 ms    234.6 ms     113     119      54      57
    256         4   135.2 ms    121.9 ms     207     230      99     110
    512         4    65.8 ms     63.8 ms     426     440     204     211
    510         3    65.7 ms     63.0 ms     427     445     205     213
   1024         4    41.9 ms     36.1 ms     670     778     322     373
   1023         3    35.1 ms     33.9 ms     799     829     383     397
   1536         4    35.4 ms     32.9 ms     792     854     380     410
   1536         3    26.7 ms     24.7 ms    1050    1137     504     545
   2048         4    31.8 ms     25.9 ms     883    1083     423     520
   1800         3    25.8 ms     22.3 ms    1087    1261     521     605
   2250         3    19.7 ms     18.4 ms    1425    1527     684     733
   2400         4    32.4 ms     27.2 ms     866    1032     416     495
   2800         4    32.3 ms     32.1 ms     869     873     417     419
   3000         4    32.5 ms     28.8 ms     862     973     414     467

Table 1: NAMD performance on the 327K atom ATPase benchmark system with multiple timestepping (PME every four steps) for Charm++ based on MPI and Elan.
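The derived columns of Table 1 follow from the step times. The sketch below recomputes speedup and GFLOPS for two Elan rows, using the one-processor time from Table 1 and the ATPase MTS operation count from Table 2. Parallel efficiency (speedup divided by processor count) is an extra quantity added here and is not a column of the table; the function name is hypothetical.

```python
SERIAL_TIME_S = 28.08    # one-processor time per step (Table 1)
GFLOP_PER_STEP = 13.48   # ATPase, MTS mode (Table 2)


def derived(procs, step_time_ms):
    """Return (speedup, parallel efficiency, GFLOPS) for one row of the table."""
    t = step_time_ms / 1000.0
    speedup = SERIAL_TIME_S / t
    return speedup, speedup / procs, GFLOP_PER_STEP / t


for procs, ms in [(1024, 36.1), (2250, 18.4)]:     # Elan step times from Table 1
    speedup, efficiency, gflops = derived(procs, ms)
    print(f"{procs:5d} procs: speedup {speedup:7.1f}, "
          f"efficiency {efficiency:.2f}, {gflops:6.1f} GFLOPS")
```

The results agree with the tabulated speedups and GFLOPS to within the rounding of the published step times.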

5 Performance Analysis and Breakdown

One of the lessons from our previous parallel scaling eﬀort was that we must keep the computation

time of each individual object substantially lower than the per-step time we expect to achieve on the

largest conﬁguration. Since the heaviest computation happens in the pairwise force computation

objects, we analyzed their grainsize, as shown in Fig. 6. This data was taken from 32 timesteps

of a cutoﬀ-only run on 2100 processors (although it will roughly be the same independent of the

number of processors: the same set of objects are mapped to the available processors by Charm++

load balancing schemes). It shows that most objects’ execution time is less than 800 microseconds,

and the maximum is limited by 2.0 msecs or so. (The ﬁgure shows 905,312 execution instances of

pairwise computation objects in all, over 32 steps. So, there are 28,291 pairwise compute objects

per step. These objects account for about 845.53 seconds of execution time, or about 26.42 seconds

per timestep. The average grainsize — the amount of computation per object — is about 934

microseconds).
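The figures quoted in this paragraph are mutually consistent, as the following short check shows. It uses only the numbers stated above; the variable names are arbitrary.

```python
instances = 905_312      # pairwise compute executions over the whole trace
steps = 32               # timesteps covered by the trace
total_seconds = 845.53   # summed execution time of those instances

objects_per_step = instances // steps                # 28291
seconds_per_step = total_seconds / steps             # about 26.42 s of work per step
average_grain_us = total_seconds / instances * 1e6   # about 934 microseconds

print(objects_per_step, round(seconds_per_step, 2), round(average_grain_us))
```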

Fig. 7 shows where the processors spent their time. As expected, the amount of time spent on communication and related activities increases with the number of processors. But the percentage increase is moderate once the communication time has risen to a significant percentage at 128 processors. The idle time is due almost entirely to load imbalances (since there are no blocking receives in Charm++), which are handled relatively efficiently by the measurement-based load balancing of Charm++.

Figure 6: Grainsize of pairwise nonbonded computations on a 1536 processor ATPase cutoﬀ run

over 32 timesteps. Uniform grainsize aids load balancing and the interleaving of higher priority

tasks. Figure generated by the projections utility of Charm++.

Figure 7: Time spent in various activities, during 8 timesteps of an ATPase PME MTS run as a

function of processor count.

                     GOP per step             GFLOP per step
System   Atoms     Cut     PME     MTS      Cut     PME     MTS
ApoA1     92K     9.68   10.64   10.75     3.52    3.65    3.84
ATPase   327K    32.07   35.66   36.05    12.29   12.80   13.48

Table 2: Measured operation counts for NAMD on benchmark systems and simulation modes cutoff (Cut), PME every step (PME), and multiple timestepping with PME every four steps (MTS).

6 Benchmark Methodology

NAMD is a production simulation engine, and has been benchmarked against the standard com-

munity codes AMBER [13] and CHARMM [2]. The serial performance of NAMD is comparable to

these established packages, while its scalability is much better. NAMD is an optimized and eﬃcient

implementation of the MD algorithm as applied to biomolecular systems.

In order to demonstrate the scalability of NAMD for the real problems of biomedical researchers,

we have drawn benchmarks directly from simulations being conducted by our NIH-funded collab-

orators. The smaller ApoA1 benchmark comprises 92K atoms of lipid, protein, and water, and

models a high density lipoprotein particle found in the bloodstream. The larger ATPase benchmark is taken from simulations still in progress, consists of 327K atoms of protein and water, and models the F1 subunit of ATP synthase, a component of the energy cycle in all life. Both systems

comprise a solvated biomolecular aggregate and explicit water in a periodic cell. While we have fa-

vored larger simulations to demonstrate scalability, the simulations themselves are not gratuitously

large, but were created by users of NAMD to be as small as possible but still scientiﬁcally valid. We

have not scaled the benchmark simulations in any way, or created an artiﬁcially large simulation

in order to demonstrate scalability.

The number of short-range interactions to be evaluated in a simulation, and hence its serial

runtime, is proportional to the cube of the cutoﬀ distance. The realistic selection of this value is

vital to the validity of the benchmark, as excessive values inﬂate the ratio of computation to data.

For all benchmarks here, short-range nonbonded interactions were cut off at 12 Å as specified by the CHARMM force field. The spatial decomposition employed in NAMD bases domain size on the cutoff distance and, therefore, NAMD will sometimes scale better with smaller cutoffs. Our 12 Å cutoff results in 6 × 6 × 4 = 144 patches (domains) for ApoA1 and 11 × 8 × 8 = 704 patches

for ATPase; scaling is harder when more processors than patches are used.

The serial performance implications of the selection of the PME full electrostatics parameters are minor. Provided the dimensions of the charge grid have sufficient factors to allow O(N log N) scaling of the 3D FFT, the bulk of the work is proportional to the number of atoms and the number of points each charge is interpolated to. The default 4 × 4 × 4 interpolation was used and the grid was set at a spacing of approximately 1 Å: 108 × 108 × 80 for ApoA1 and 192 × 144 × 144 for ATPase. In typical usage, full electrostatics interactions are calculated every four timesteps using a multiple timestepping integrator; such a simulation is of equivalent accuracy to one in which PME is evaluated every step. To observe the effect of PME on performance and scalability, we benchmark three variations: cutoff only, with no PME (Cut); PME every step (PME); and PME every four steps (MTS).

Operation counts for both benchmarks were measured using the perfex utility for monitoring the hardware performance counters on the SGI Origin 2000 at NCSA. Total and floating point operations were measured for all combinations of benchmark systems (ApoA1, ATPase) and simulation modes (Cut, PME, MTS) for runs of 20 and 40 steps on a single processor (to eliminate any parallel overhead). The results of the 40 and 20 step simulations were subtracted to eliminate startup calculations and yield an estimate of the marginal operations per step for a continuing simulation. The results of this calculation are presented in Table 2 and used as the basis for operation counts in parallel runs.
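The subtraction described above amounts to a two-point estimate of the per-step cost. A minimal Python sketch follows; the counter values are invented for illustration and are not measurements from the paper.

```python
def marginal_ops_per_step(ops_short, ops_long, steps_short=20, steps_long=40):
    """Estimate operations per step from two runs of different length.

    Startup work is common to both runs, so it cancels in the difference.
    """
    return (ops_long - ops_short) / (steps_long - steps_short)


if __name__ == "__main__":
    # Hypothetical counts: 5e9 startup operations plus 3.5e9 operations per step.
    startup, per_step = 5_000_000_000, 3_500_000_000
    ops_20 = startup + 20 * per_step
    ops_40 = startup + 40 * per_step
    print(marginal_ops_per_step(ops_20, ops_40))
```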

Based on serial runtimes, NAMD executes around 0.5 floating point operations and 1.3 total operations per clock cycle on the 1 GHz processors of Lemieux. The observed ratio of integer to floating point operations reflects NAMD's use of efficient algorithms that avoid unnecessary floating point operations whenever possible. For example, a recent optimization manipulates an IEEE float as an integer in order to index into the short-range electrostatics interpolation table.
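The float-as-integer idea can be illustrated in Python with the standard `struct` module. This is only a sketch of the general technique, not NAMD's actual code: the function names, the choice of 4 mantissa bits, the table origin `r2_min`, and the use of a squared distance as the lookup key are all assumptions made for illustration.

```python
import struct


def float32_bits(x):
    """Reinterpret x, stored as an IEEE 754 single, as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def table_index(r2, r2_min=1.0, mantissa_bits=4):
    """Map a positive value r2 >= r2_min to a table slot using integer operations only.

    For positive floats the bit pattern increases with the value, so keeping the
    exponent and the top mantissa bits gives 2**mantissa_bits slots per power of two.
    """
    shift = 23 - mantissa_bits  # an IEEE single has 23 explicit mantissa bits
    return (float32_bits(r2) >> shift) - (float32_bits(r2_min) >> shift)


if __name__ == "__main__":
    for r2 in (1.0, 1.5, 2.0, 144.0):
        print(r2, table_index(r2))
```

The resulting table is denser at short range, where the interaction varies most rapidly, and the index is obtained with a shift and a subtraction rather than a floating point division or square root.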

Measured by operation count, PME is only slightly more expensive than cutoff calculations. This is due to the O(N log N) complexity of the long-range calculation and the efficient incorporation of the short-range correction into the electrostatic interaction. Operation counts are nearly proportional to the number of atoms in the simulation, demonstrating the near-linear scaling of the PME algorithm. The additional operations in MTS (despite fewer PME calculations!) are due to the need to separate the short-range correction result rather than including it in the short-range results. Note, however, that the actual runtime of MTS is less than that of PME, since the short-range correction is calculated very efficiently.

For scaling benchmark runs, execution time is measured over the final 2000 steps of a 2500 step simulation. Initial load balancing occurs during the first 500 steps, broken down as 100 steps ignored (to eliminate any lingering startup effects), 100 steps of measurement and a complete reassignment, 100 steps of measurement and a first refinement, another 100 steps of measurement and a second refinement, and finally another 100 ignored steps. In a continuing simulation, load balancing would occur every 4000 steps (i.e., at steps 4400, 8400, etc.), preceded immediately by 100 steps of measurement.
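The schedule described above can be written down as a small lookup. This Python sketch is a reading of the text rather than NAMD's load balancer: the phase labels, the constant names, and the exact step at which each window begins or ends are assumptions.

```python
# (exclusive end step, activity) for the initial 500 steps described in the text.
STARTUP_PHASES = [
    (100, "ignore"),
    (200, "measure for full reassignment"),
    (300, "measure for first refinement"),
    (400, "measure for second refinement"),
    (500, "ignore"),
]
REBALANCE_PERIOD = 4000
FIRST_PERIODIC_REBALANCE = 4400
MEASURE_WINDOW = 100


def phase(step):
    """Return what the load balancing framework is doing at a given step."""
    for end, label in STARTUP_PHASES:
        if step < end:
            return label
    if step >= FIRST_PERIODIC_REBALANCE - MEASURE_WINDOW:
        offset = (step - FIRST_PERIODIC_REBALANCE) % REBALANCE_PERIOD
        if offset >= REBALANCE_PERIOD - MEASURE_WINDOW:
            return "measure for periodic rebalance"
    return "run"


def is_rebalance_step(step):
    """True at steps 4400, 8400, ... where periodic load balancing occurs."""
    return (step >= FIRST_PERIODIC_REBALANCE
            and (step - FIRST_PERIODIC_REBALANCE) % REBALANCE_PERIOD == 0)


if __name__ == "__main__":
    for step in (50, 150, 450, 1000, 4350, 4400, 8399):
        print(step, phase(step), is_rebalance_step(step))
```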

In order to demonstrate that the reported scaling and performance are sustainable in a production simulation, we have run the ApoA1 MTS benchmark for 100,000 steps on 512 processors of Lemieux, recording the average time per step for each 100 step interval. This number of processors was chosen to ensure that performance was heavily influenced by the load balancer rather than some other bottleneck in the calculation. As seen in Figure 8, time spent load balancing is hardly perceptible, while the benefits of periodic refinement over the course of the run are substantial. The average performance over all 100,000 steps including initial load balancing is 23.3 ms/step, while the average for steps 500-2500 as employed in speedup calculations is 26.3 ms/step, some 10% slower than the performance experienced by an actual user. The value reported for the matching 2500 step run in the speedup table is 23.9 ms/step, which is again higher than the overall average seen here. We can therefore confidently state that the results reported below are conservative and indicative of the true performance observed during a production run.
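The per-interval averaging used for this long run is simple to reproduce from a list of per-step timings. In the Python sketch below, the function names and the synthetic timings are hypothetical; only the 100 step interval and the 500-2500 window come from the text.

```python
def interval_averages(step_times, interval=100):
    """Average time per step over consecutive, non-overlapping intervals.

    A trailing partial interval is dropped.
    """
    last_start = len(step_times) - interval
    return [
        sum(step_times[start:start + interval]) / interval
        for start in range(0, last_start + 1, interval)
    ]


def window_average(step_times, first=500, last=2500):
    """Average over steps first..last-1, the window used for speedup calculations."""
    window = step_times[first:last]
    return sum(window) / len(window)


if __name__ == "__main__":
    # Synthetic timings (ms): slower before a refinement at step 2500, faster after.
    times = [26.0] * 2500 + [23.0] * 2500
    print(interval_averages(times)[:3], "...")
    print("window average:", window_average(times))
    print("overall average:", sum(times) / len(times))
```

In this synthetic case the benchmark window reports a slower time per step than the whole run, which is the sense in which the text calls the reported results conservative.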

7 Scalability Results

Table 3 shows the performance of NAMD on the smaller ApoA1 benchmark system. While reasonable efficiency is only obtained to 512 processors, by using larger numbers of processors the time per step can be reduced to 11.2 ms cutoff and 12.6 ms PME (MTS). The ApoA1 data also reveals superior scaling when only three of the four processors per node available on Lemieux are used.

| Processors (total) | Per node | Time/step Cut | Time/step PME | Time/step MTS | Speedup Cut | Speedup PME | Speedup MTS | GFLOPS Cut | GFLOPS PME | GFLOPS MTS |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 7.08 s | 8.24 s | 7.86 s | 1 | 1 | 1 | 0.497 | 0.443 | 0.489 |
| 128 | 4 | 62.5 ms | 80.0 ms | 71.5 ms | 113 | 103 | 109 | 56 | 45 | 53 |
| 256 | 4 | 37.4 ms | 43.7 ms | 40.3 ms | 189 | 188 | 194 | 94 | 83 | 95 |
| 512 | 4 | 21.0 ms | 26.6 ms | 23.9 ms | 336 | 309 | 329 | 167 | 137 | 161 |
| 510 | 3 | 20.5 ms | 24.9 ms | 23.3 ms | 344 | 331 | 337 | 171 | 146 | 164 |
| 1024 | 4 | 18.2 ms | 24.4 ms | 17.6 ms | 389 | 338 | 446 | 193 | 149 | 218 |
| 1023 | 3 | 13.8 ms | 15.2 ms | 14.0 ms | 512 | 540 | 559 | 254 | 239 | 273 |
| 1536 | 4 | 16.6 ms | 22.6 ms | 17.1 ms | 427 | 364 | 459 | 212 | 161 | 224 |
| 1536 | 3 | 11.2 ms | 15.0 ms | 12.6 ms | 629 | 549 | 624 | 312 | 243 | 305 |
| 2250 | 3 | 11.2 ms | 15.0 ms | 12.8 ms | 629 | 549 | 613 | 313 | 243 | 299 |

Table 3: NAMD performance on 92K atom ApoA1 benchmark system with simulation modes cutoff (Cut), PME every step (PME), and multiple timestepping with PME every four steps (MTS) for Charm++ based on Elan.

| Processors (total) | Per node | Time/step Cut | Time/step PME | Time/step MTS | Speedup Cut | Speedup PME | Speedup MTS | GFLOPS Cut | GFLOPS PME | GFLOPS MTS |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 24.89 s | 29.49 s | 28.08 s | 1 | 1 | 1 | 0.494 | 0.434 | 0.480 |
| 128 | 4 | 207.4 ms | 249.3 ms | 234.6 ms | 119 | 118 | 119 | 59 | 51 | 57 |
| 256 | 4 | 105.5 ms | 135.5 ms | 121.9 ms | 236 | 217 | 230 | 116 | 94 | 110 |
| 512 | 4 | 55.4 ms | 72.9 ms | 63.8 ms | 448 | 404 | 440 | 221 | 175 | 211 |
| 510 | 3 | 54.8 ms | 69.5 ms | 63.0 ms | 454 | 424 | 445 | 224 | 184 | 213 |
| 1024 | 4 | 33.4 ms | 45.1 ms | 36.1 ms | 745 | 653 | 778 | 368 | 283 | 373 |
| 1023 | 3 | 29.8 ms | 38.7 ms | 33.9 ms | 835 | 762 | 829 | 412 | 331 | 397 |
| 1536 | 4 | 25.7 ms | 44.7 ms | 32.9 ms | 968 | 660 | 854 | 477 | 286 | 410 |
| 1536 | 3 | 21.2 ms | 28.2 ms | 24.7 ms | 1175 | 1047 | 1137 | 580 | 454 | 545 |
| 2048 | 4 | 25.8 ms | 46.7 ms | 25.9 ms | 963 | 631 | 1083 | 475 | 274 | 520 |
| 1800 | 3 | 18.6 ms | 25.8 ms | 22.3 ms | 1340 | 1141 | 1261 | 661 | 495 | 605 |
| 2250 | 3 | 15.6 ms | 23.5 ms | 18.4 ms | 1599 | 1256 | 1527 | 789 | 545 | 733 |
| 2400 | 4 | 22.6 ms | 44.6 ms | 27.2 ms | 1099 | 661 | 1032 | 542 | 286 | 495 |
| 2800 | 4 | 22.1 ms | 43.6 ms | 32.1 ms | 1127 | 676 | 873 | 556 | 293 | 419 |
| 3000 | 4 | 22.6 ms | 39.6 ms | 28.8 ms | 1102 | 743 | 973 | 544 | 322 | 467 |

Table 4: NAMD performance on 327K atom ATPase benchmark system with simulation modes cutoff (Cut), PME every step (PME), and multiple timestepping with PME every four steps (MTS) for Charm++ based on Elan. Peak performance is attained on 2250 processors, three per node.

[Figure 8 plot: NAMD 2.5 ApoA1 PME MTS benchmark on 512 processors of Lemieux; 100-step average time per step (ms) versus step, from 0 to 100000.]

Figure 8: Performance variation over a longer simulation of ApoA1 MTS on 512 processors of Lemieux. Periodic load balancing maintains and improves performance over time. Overall average performance is 23.3 ms/step, while performance measured for steps 500-2500 of this run is 26.3 ms/step and performance reported in speedup tables for a 2500 step run is 23.9 ms/step.

Table 4 shows the greatest scalability attained by NAMD on the 3000 processor Lemieux cluster at PSC. NAMD scales particularly well for larger simulations, which are those for which improved performance is most greatly desired by researchers. Maximum performance is achieved employing 2250 processors, three processors per node for all of the 750 nodes of Lemieux. For the cutoff simulation, a speedup of 1599 at 789 GFLOPS was attained. For the more challenging PME (MTS) simulation, a speedup of 1527 at 733 GFLOPS was attained. This 68% efficiency for a full electrostatics biomolecular simulation on over 2000 processors indicates that the scalability problems imposed by PME have been overcome.
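The speedup, efficiency, and GFLOPS figures quoted for the 2250 processor ATPase MTS run can be cross-checked from the Table 4 entries. The Python sketch below uses hypothetical function names; the inputs are the rounded table values, so the recomputed speedup differs slightly from the published 1527, which was presumably derived from unrounded timings.

```python
def speedup(t_serial, t_parallel):
    """Speedup relative to the single-processor time per step."""
    return t_serial / t_parallel


def efficiency(speedup_value, n_processors):
    """Parallel efficiency: speedup divided by processor count."""
    return speedup_value / n_processors


def aggregate_gflops(single_processor_gflops, speedup_value):
    """Aggregate rate, assuming the operation count per step is fixed."""
    return single_processor_gflops * speedup_value


if __name__ == "__main__":
    # ATPase MTS entries from Table 4: 28.08 s on 1 processor, 18.4 ms on 2250.
    s = speedup(28.08, 18.4e-3)
    print(f"speedup from rounded timings: {s:.0f}")
    print(f"efficiency at reported speedup 1527: {efficiency(1527, 2250):.1%}")
    print(f"GFLOPS at reported speedup 1527: {aggregate_gflops(0.480, 1527):.0f}")
```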

8 Remaining Challenges

As can be seen above, we have achieved quite satisfactory performance levels when running on up to 2250 processors. The major problem remaining appears to be simply the inability to use the 4th processor on each node effectively for useful computation. As the speedup data shows, we consistently get smaller speedups (and sometimes slowdowns) when we try to use all 4 processors on a node.

Even more fundamentally, our analysis shows that the problems are due to “stretching” of various operations (typically within send or receive operations in the communication layers). The message-driven nature of NAMD and Charm++ allows us to effectively deal with small variations of this nature. For example, Fig. 9 shows the “timeline” view using the projections performance tool, for a run on 1536 processors (using 512 nodes). As can be seen, processors 900 and 933 had “stretched” events, even on this run that used 3 out of 4 processors on each node. However, other processes reordered the computations they were doing (automatically, as a result of data-driven execution) so that they were not held up for data, and the delayed processors “caught up” without causing a serious dent in performance. On the other hand, with 4 processors per node, such stretching events happen more often, making it impossible to hide them via data-driven adaptation. Further, with PME, which imposes a global synchrony across all patches, processors have less leeway to adjust for the misbehavior. The compound result of these two factors is the substantial degradation in performance seen in Fig. 10. This is the reason for the better speedups for cutoff-based runs on 2800 and 3000 processors in Table 4.

Figure 9: Projections timeline view illustrating harmful stretching and message-driven adaptation on 1536 processor cutoff run using three processors per node.

Figure 10: Projections timeline view illustrating catastrophic stretching on 3000 processor PME MTS run using four processors per node.
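Why data-driven execution can absorb an occasional stretched event can be seen in a toy model of a single processor. The Python sketch below is not Charm++ code: it assumes each task needs one incoming message and a fixed amount of work, and it compares handling messages in a fixed program order with handling them in arrival order. All names and numbers are illustrative.

```python
def finish_time_fixed_order(arrivals, work):
    """Handle tasks in program order, blocking until each message has arrived."""
    t = 0.0
    for arrival, cost in zip(arrivals, work):
        t = max(t, arrival) + cost
    return t


def finish_time_message_driven(arrivals, work):
    """Handle tasks in order of message arrival, so one late message
    does not block work whose data is already available."""
    t = 0.0
    for arrival, cost in sorted(zip(arrivals, work)):
        t = max(t, arrival) + cost
    return t


if __name__ == "__main__":
    # The first message is delayed by 5 time units (a "stretched" sender);
    # the other three are available immediately. Each task costs 1 unit.
    arrivals = [5.0, 0.0, 0.0, 0.0]
    work = [1.0, 1.0, 1.0, 1.0]
    print("fixed order:   ", finish_time_fixed_order(arrivals, work))
    print("message driven:", finish_time_message_driven(arrivals, work))
```

In this model the message-driven processor overlaps the delay with other work and finishes as soon as the late task allows. If delays become frequent, or if a global synchronization forces every processor to wait for the slowest one, there is little independent work left to overlap with, which matches the degradation described for four processors per node with PME.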

We are working with PSC and HP staff to address these problems. We are especially hopeful that a different implementation of our low level communication primitives will overcome this problem, and allow our application to utilize the entire set of processors of the machine effectively.

Parallelizing the MD algorithm to thousands of processors reduces individual timesteps to the range of 5–25 ms. The difficulty of this task is compounded by the speed of modern processors, e.g., ASCI Red ran the ApoA1 benchmark without PME at 57 s per step on one processor while a 1 GHz Alpha of Lemieux runs with PME at 7.86 s per step. The scalability results reported here demonstrate the strengths and limitations of both the software, NAMD and Charm++, and the hardware, Lemieux.

Acknowledgements

NAMD was developed as a part of biophysics research at the Theoretical Biophysics Group (Beckman Institute, University of Illinois), which operates as an NIH Resource for Macromolecular Modeling and Bioinformatics. This resource is led by principal investigators Professors Klaus Schulten (Director), Robert Skeel, Laxmikant Kalé and Todd Martinez. We are thankful for their support, encouragement, and cooperation. Prof. Skeel and coworkers have designed the specific numerical algorithms used in NAMD. We are grateful for the funding to the resource provided by the National Institutes of Health (NIH PHS 5 P41 RR05969-04).

NAMD itself is a collaborative effort. NAMD 1 was implemented by a team including: Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey and Mark Nelson. NAMD 2, the current version, was implemented and is being enhanced by a team consisting of: Milind Bhandarkar, Robert Brunner, Paul Grayson, Justin Gullingsrud, Attila Gursoy, David Hardy, Neal Krawetz, Jim Phillips, Ari Shinozaki, Krishnan Varadarajan, Gengbin Zheng, and Fangqiang Zhu.

This research has benefited directly from the Charm++ framework at the Parallel Programming Laboratory (http://charm.cs.uiuc.edu), and especially its load balancing framework and strategies, the work on the performance tracing and visualization tool (projections), and its recent extensions for using chip-level performance counters. For help with the results in this paper, as well as for relevant work on Charm++, we thank Orion Lawlor, Ramkumar Vadali, Joshua Unger, Chee-Wai Lee, and Sindhura Bandhakavi.

The Charm++ framework is also being used and supported by the Center for Simulation of Advanced Rockets (CSAR or simply the Rocket Center) at the University of Illinois at Urbana-Champaign, funded by the Department of Energy (via subcontract B341494 from Univ. of California), and the NSF NGS grant (NSF EIA 0103645) for developing this programming system for even larger parallel machines extending into PetaFLOPS level performance.

The parallel runs were carried out primarily at the Pittsburgh Supercomputing Center (PSC) and also at the National Center for Supercomputing Applications (NCSA). We are thankful to these organizations and their staff for their continued assistance and for the early access and computer time we were provided for this work. In particular we would like to thank David O’Neal, Sergiu Sanielevici, John Kochmar and Chad Vizino from PSC and Richard Foster (Hewlett Packard) for helping us make the runs at PSC Lemieux and providing us with technical support. Computer time at these centers was provided by the National Resource Allocations Committee (NRAC MCA93S028).

