Figure 1
Illustration of the geometry of the Bump3D grid with 5 blocks and its corresponding block sizes.

Source publication
Conference Paper
Full-text available
Partitioning of multi-block structured grids impacts the performance and scalability of numerical simulations. An optimal partitioner should both balance the load and minimize communication time. State-of-the-art domain decomposition algorithms do a good job of balancing the load across processors. However, even if the work is well balanced, th...

Contexts in source publication

Context 1
... explain the different partitioning strategies using an example synthetic grid called Bump3D, which consists of 5 blocks as shown in Figure 1. Bump3D has one block that is significantly larger than the others, which challenges the algorithms' ability to cut large blocks. ...
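To make the load-balancing difficulty concrete, here is a minimal C sketch of a greedy block-to-processor assignment that keeps blocks whole. The block cell counts and processor count are hypothetical, chosen only to mimic the Bump3D situation where one block dominates; they are not the sizes from Figure 1.

/* Greedy block-to-processor assignment without block splitting.
 * Block sizes and processor count are hypothetical, for illustration only. */
#include <stdio.h>

#define NBLOCKS 5
#define NPROCS  4

int main(void) {
    /* Hypothetical cell counts: one block dominates, as in Bump3D. */
    long blocks[NBLOCKS] = {800000, 120000, 100000, 90000, 60000};
    long load[NPROCS] = {0};

    /* Assign each block to the currently least-loaded processor. */
    for (int b = 0; b < NBLOCKS; b++) {
        int min_p = 0;
        for (int p = 1; p < NPROCS; p++)
            if (load[p] < load[min_p]) min_p = p;
        load[min_p] += blocks[b];
    }

    long total = 0, max_load = 0;
    for (int p = 0; p < NPROCS; p++) {
        total += load[p];
        if (load[p] > max_load) max_load = load[p];
        printf("proc %d: %ld cells\n", p, load[p]);
    }
    /* Load imbalance = max load / average load; 1.0 is perfect balance. */
    printf("imbalance = %.2f\n", (double)max_load / ((double)total / NPROCS));
    return 0;
}

With these numbers the largest block alone exceeds the average load per processor, so the imbalance stays near 2.7 no matter where the remaining blocks are placed; only a partitioner that can cut the large block can approach balance.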

Similar publications

Preprint
Full-text available
With the commencement of the exascale computing era, the majority of leadership supercomputers are heterogeneous and massively parallel even within a single node, with multiple co-processors such as GPUs alongside many cores on each node. For example, ORNL's Summit combines six NVIDIA Tesla V100 GPUs with 42 IBM Power9 cores on each node...

Citations

... This performance is due to the extra critical sections introduced by MPI communication. Much research has been done on both the application side [14] and the implementation side [1,11,16] to address the performance of MPI+Threads. To reach good performance, applications need to ensure that communications can happen concurrently, and implementations need to map those communications onto multiple communication channels so that they can proceed in parallel. ...
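As an illustration of the application-side requirement described in the excerpt above, the following C sketch (not taken from the cited works) requests MPI_THREAD_MULTIPLE and gives each OpenMP thread its own duplicated communicator, so the MPI library is free to treat the threads' operations as independent and route them over separate communication channels. It assumes the same thread count on every rank so the collective MPI_Comm_dup calls match.

/* Sketch: exposing concurrent communication in an MPI+threads code.
 * Each OpenMP thread posts its messages on its own duplicated communicator,
 * so the operations carry no ordering constraints across threads. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not supported\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Assumes the same thread count on every rank so the dups match. */
    int nthreads = omp_get_max_threads();
    MPI_Comm *comms = malloc(nthreads * sizeof(MPI_Comm));
    for (int t = 0; t < nthreads; t++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int peer = (rank + 1) % nprocs;
        int src  = (rank - 1 + nprocs) % nprocs;
        double sendbuf = rank + 0.1 * t, recvbuf;
        MPI_Request reqs[2];
        /* Ring exchange; each thread uses its own communicator and tag. */
        MPI_Irecv(&recvbuf, 1, MPI_DOUBLE, src, t, comms[t], &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_DOUBLE, peer, t, comms[t], &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    for (int t = 0; t < nthreads; t++)
        MPI_Comm_free(&comms[t]);
    free(comms);
    MPI_Finalize();
    return 0;
}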
Preprint
Full-text available
The hybrid MPI+X programming paradigm, where X refers to threads or GPUs, has gained prominence in the high-performance computing arena. This corresponds to a trend of system architectures growing more heterogeneous. The current MPI standard only specifies the compatibility levels between MPI and threading runtimes. No MPI concept or interface exists for applications to pass thread context or GPU stream context to MPI implementations explicitly. This lack has made performance optimization complicated in some cases and impossible in other cases. We propose a new concept in MPI, called MPIX stream, to represent the general serial execution context that exists in X runtimes. MPIX streams can be directly mapped to threads or GPU execution streams. Passing thread context into MPI allows implementations to precisely map the execution contexts to network endpoints. Passing GPU execution context into MPI allows implementations to directly operate on GPU streams, lowering the CPU/GPU synchronization cost.
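A minimal sketch of how the proposed streams might be used from threads follows. The MPIX_Stream_create, MPIX_Stream_comm_create, and MPIX_Stream_free names follow the MPICH prototype of this proposal; treat the exact signatures and semantics as assumptions rather than a finalized MPI interface.

/* Sketch of MPIX streams for threads, following the MPICH prototype of this
 * proposal (MPIX_Stream_create / MPIX_Stream_comm_create / MPIX_Stream_free);
 * the exact names and signatures are assumptions, not a finalized interface. */
#include <mpi.h>   /* MPICH declares the MPIX_ extensions here */
#include <omp.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int nthreads = omp_get_max_threads();
    MPIX_Stream *streams = malloc(nthreads * sizeof(MPIX_Stream));
    MPI_Comm *stream_comms = malloc(nthreads * sizeof(MPI_Comm));

    /* One stream per thread: a stream names a serial execution context, which
     * lets the implementation bind it to a dedicated network endpoint. */
    for (int t = 0; t < nthreads; t++) {
        MPIX_Stream_create(MPI_INFO_NULL, &streams[t]);
        MPIX_Stream_comm_create(MPI_COMM_WORLD, streams[t], &stream_comms[t]);
    }

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int peer = (rank + 1) % nprocs;
        int src  = (rank - 1 + nprocs) % nprocs;
        double out = rank, in;
        MPI_Request reqs[2];
        /* Ordinary point-to-point calls on the stream communicator; the stream
         * identifies which serial context the calls come from. */
        MPI_Irecv(&in, 1, MPI_DOUBLE, src, 0, stream_comms[t], &reqs[0]);
        MPI_Isend(&out, 1, MPI_DOUBLE, peer, 0, stream_comms[t], &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    for (int t = 0; t < nthreads; t++) {
        MPI_Comm_free(&stream_comms[t]);
        MPIX_Stream_free(&streams[t]);
    }
    free(streams);
    free(stream_comms);
    MPI_Finalize();
    return 0;
}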
... But the most important challenge is the dismal communication performance of MPI+threads applications. Slow multithreaded MPI communication is critical to address since most scientific simulation campaigns run close to the strong-scaling limit, where communication occupies a significant part of the application's runtime [17], [18], [19]. The MPI+threads version of the HYPRE solver [20], for instance, spends 2.81× more time in MPI than its corresponding MPI-everywhere version. ...
Article
Full-text available
Supercomputing applications are increasingly adopting the MPI+threads programming model over the traditional MPI everywhere approach to better handle the disproportionate increase in the number of cores compared with other on-node resources. In practice, however, most applications observe slower performance with MPI+threads, primarily because of poor communication performance. Recent research efforts on MPI libraries address this bottleneck by mapping logically parallel communication, that is, operations that are not subject to MPI's ordering constraints, to the underlying network parallelism. Domain scientists, however, typically do not expose such communication independence information because the existing MPI-3.1 standard's semantics can be limiting. Researchers had initially proposed user-visible endpoints to combat this issue, but such a solution requires intrusive changes to the standard (new APIs). The upcoming MPI-4.0 standard, on the other hand, allows applications to relax unneeded semantics and provides them with many opportunities to express logical communication parallelism. In this paper, we show how MPI+threads applications can achieve high performance with logically parallel communication. Through application case studies, we compare the capabilities of the new MPI-4.0 standard with those of the existing one and user-visible endpoints (upper bound). Logical communication parallelism can boost the overall performance of an application by over 2x.
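One of the MPI-4.0 mechanisms the abstract refers to is the set of communicator info assertions, which let an application promise not to rely on semantics that force the library to serialize communication. A minimal sketch follows; whether and how far a given MPI library exploits these hints is implementation-dependent.

/* Sketch: MPI-4.0 communicator info assertions declaring that the application
 * will not rely on wildcards or message ordering on this communicator. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* Reserved MPI-4.0 info keys. */
    MPI_Info_set(info, "mpi_assert_no_any_source", "true");
    MPI_Info_set(info, "mpi_assert_no_any_tag", "true");
    MPI_Info_set(info, "mpi_assert_allow_overtaking", "true");

    MPI_Comm relaxed_comm;
    MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &relaxed_comm);
    MPI_Info_free(&info);

    /* ... post halo exchanges or other logically parallel operations on
     * relaxed_comm; the assertions allow the library to map them onto
     * independent network resources ... */

    MPI_Comm_free(&relaxed_comm);
    MPI_Finalize();
    return 0;
}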
... rectangular shapes in spatial dimensions). Typically, structured grids for complex geometries such as an aircraft or turbo-machinery contain on the order of hundreds of blocks [20]. In such multiblock grids, each block, together with the iterations in time, forms an iteration space. ...