Figure 1
Illustration of the geometry of the Bump3D grid with 5 blocks and its corresponding block sizes.

Source publication
Conference Paper
Full-text available
Partitioning of multi-block structured grids impacts the performance and scalability of numerical simulations. An optimal partitioner should both balance the load and minimize communication time. State-of-the-art domain decomposition algorithms do a good job of balancing the load across processors. However, even if the work is well balanced, th...

Contexts in source publication

Context 1
... explain the different partitioning strategies using an example synthetic grid called Bump3D, which consists of 5 blocks as shown in Figure 1. Bump3D has one block that is significantly larger than the others, which challenges the algorithms' ability to cut large blocks. ...
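To make the load-balancing difficulty concrete, here is a minimal C sketch of a greedy block-to-processor assignment that keeps blocks whole. The block cell counts and processor count are hypothetical, chosen only to mimic the Bump3D situation where one block dominates; they are not the sizes from Figure 1.

/* Greedy block-to-processor assignment without block splitting.
 * Block sizes and processor count are hypothetical, for illustration only. */
#include <stdio.h>

#define NBLOCKS 5
#define NPROCS  4

int main(void) {
    /* Hypothetical cell counts: one block dominates, as in Bump3D. */
    long blocks[NBLOCKS] = {800000, 120000, 100000, 90000, 60000};
    long load[NPROCS] = {0};

    /* Assign each block to the currently least-loaded processor. */
    for (int b = 0; b < NBLOCKS; b++) {
        int min_p = 0;
        for (int p = 1; p < NPROCS; p++)
            if (load[p] < load[min_p]) min_p = p;
        load[min_p] += blocks[b];
    }

    long total = 0, max_load = 0;
    for (int p = 0; p < NPROCS; p++) {
        total += load[p];
        if (load[p] > max_load) max_load = load[p];
        printf("proc %d: %ld cells\n", p, load[p]);
    }
    /* Load imbalance = max load / average load; 1.0 is perfect balance. */
    printf("imbalance = %.2f\n", (double)max_load / ((double)total / NPROCS));
    return 0;
}

With these numbers the largest block alone exceeds the average load per processor, so the imbalance stays near 2.7 no matter where the remaining blocks are placed; only a partitioner that can cut the large block can approach balance.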

Similar publications

Preprint
Full-text available
With the commencement of the exascale computing era, the majority of leadership supercomputers are heterogeneous and massively parallel even within a single node, with multiple co-processors such as GPUs alongside many cores on each node. For example, ORNL's Summit combines six NVIDIA Tesla V100 GPUs with 42 IBM Power9 cores on each node...

Citations

... This performance is due to the extra critical sections introduced by MPI communication. Much research has been done on both the application side [14] and the implementation side [1,11,16] to address the performance of MPI+Threads. To reach good performance, applications need to ensure that communications can happen concurrently, and implementations need to map those communications onto multiple communication channels so that they can proceed in parallel. ...
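As an illustration of the application-side requirement described in the excerpt above, the following C sketch (not taken from the cited works) requests MPI_THREAD_MULTIPLE and gives each OpenMP thread its own duplicated communicator, so the MPI library is free to treat the threads' operations as independent and route them over separate communication channels. It assumes the same thread count on every rank so the collective MPI_Comm_dup calls match.

/* Sketch: exposing concurrent communication in an MPI+threads code.
 * Each OpenMP thread posts its messages on its own duplicated communicator,
 * so the operations carry no ordering constraints across threads. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not supported\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Assumes the same thread count on every rank so the dups match. */
    int nthreads = omp_get_max_threads();
    MPI_Comm *comms = malloc(nthreads * sizeof(MPI_Comm));
    for (int t = 0; t < nthreads; t++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int peer = (rank + 1) % nprocs;
        int src  = (rank - 1 + nprocs) % nprocs;
        double sendbuf = rank + 0.1 * t, recvbuf;
        MPI_Request reqs[2];
        /* Ring exchange; each thread uses its own communicator and tag. */
        MPI_Irecv(&recvbuf, 1, MPI_DOUBLE, src, t, comms[t], &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_DOUBLE, peer, t, comms[t], &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    for (int t = 0; t < nthreads; t++)
        MPI_Comm_free(&comms[t]);
    free(comms);
    MPI_Finalize();
    return 0;
}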
Preprint
Full-text available
The hybrid MPI+X programming paradigm, where X refers to threads or GPUs, has gained prominence in the high-performance computing arena. This corresponds to a trend of system architectures growing more heterogeneous. The current MPI standard only specifies the compatibility levels between MPI and threading runtimes. No MPI concept or interface exists for applications to pass thread context or GPU stream context to MPI implementations explicitly. This lack has made performance optimization complicated in some cases and impossible in other cases. We propose a new concept in MPI, called MPIX stream, to represent the general serial execution context that exists in X runtimes. MPIX streams can be directly mapped to threads or GPU execution streams. Passing thread context into MPI allows implementations to precisely map the execution contexts to network endpoints. Passing GPU execution context into MPI allows implementations to directly operate on GPU streams, lowering the CPU/GPU synchronization cost.
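A minimal sketch of how the proposed streams might be used from threads follows. The MPIX_Stream_create, MPIX_Stream_comm_create, and MPIX_Stream_free names follow the MPICH prototype of this proposal; treat the exact signatures and semantics as assumptions rather than a finalized MPI interface.

/* Sketch of MPIX streams for threads, following the MPICH prototype of this
 * proposal (MPIX_Stream_create / MPIX_Stream_comm_create / MPIX_Stream_free);
 * the exact names and signatures are assumptions, not a finalized interface. */
#include <mpi.h>   /* MPICH declares the MPIX_ extensions here */
#include <omp.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int nthreads = omp_get_max_threads();
    MPIX_Stream *streams = malloc(nthreads * sizeof(MPIX_Stream));
    MPI_Comm *stream_comms = malloc(nthreads * sizeof(MPI_Comm));

    /* One stream per thread: a stream names a serial execution context, which
     * lets the implementation bind it to a dedicated network endpoint. */
    for (int t = 0; t < nthreads; t++) {
        MPIX_Stream_create(MPI_INFO_NULL, &streams[t]);
        MPIX_Stream_comm_create(MPI_COMM_WORLD, streams[t], &stream_comms[t]);
    }

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int peer = (rank + 1) % nprocs;
        int src  = (rank - 1 + nprocs) % nprocs;
        double out = rank, in;
        MPI_Request reqs[2];
        /* Ordinary point-to-point calls on the stream communicator; the stream
         * identifies which serial context the calls come from. */
        MPI_Irecv(&in, 1, MPI_DOUBLE, src, 0, stream_comms[t], &reqs[0]);
        MPI_Isend(&out, 1, MPI_DOUBLE, peer, 0, stream_comms[t], &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    for (int t = 0; t < nthreads; t++) {
        MPI_Comm_free(&stream_comms[t]);
        MPIX_Stream_free(&streams[t]);
    }
    free(streams);
    free(stream_comms);
    MPI_Finalize();
    return 0;
}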
... But the most important challenge is the dismal communication performance of MPI+threads applications. Slow multithreaded MPI communication is critical to address since most scientific simulation campaigns run close to the strong-scaling limit, where communication occupies a significant part of the application's runtime [17], [18], [19]. The MPI+threads version of the HYPRE solver [20], for instance, spends 2.81× more time in MPI than its corresponding MPI-everywhere version. ...
Article
Full-text available
Supercomputing applications are increasingly adopting the MPI+threads programming model over the traditional MPI everywhere approach to better handle the disproportionate increase in the number of cores compared with other on-node resources. In practice, however, most applications observe slower performance with MPI+threads, primarily because of poor communication performance. Recent research efforts on MPI libraries address this bottleneck by mapping logically parallel communication, that is, operations that are not subject to MPI's ordering constraints, to the underlying network parallelism. Domain scientists, however, typically do not expose such communication independence information because the existing MPI-3.1 standard's semantics can be limiting. Researchers had initially proposed user-visible endpoints to combat this issue, but such a solution requires intrusive changes to the standard (new APIs). The upcoming MPI-4.0 standard, on the other hand, allows applications to relax unneeded semantics and provides them with many opportunities to express logical communication parallelism. In this paper, we show how MPI+threads applications can achieve high performance with logically parallel communication. Through application case studies, we compare the capabilities of the new MPI-4.0 standard with those of the existing one and user-visible endpoints (upper bound). Logical communication parallelism can boost the overall performance of an application by over 2x.
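One of the MPI-4.0 mechanisms the abstract refers to is the set of communicator info assertions, which let an application promise not to rely on semantics that force the library to serialize communication. A minimal sketch follows; whether and how far a given MPI library exploits these hints is implementation-dependent.

/* Sketch: MPI-4.0 communicator info assertions declaring that the application
 * will not rely on wildcards or message ordering on this communicator. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* Reserved MPI-4.0 info keys. */
    MPI_Info_set(info, "mpi_assert_no_any_source", "true");
    MPI_Info_set(info, "mpi_assert_no_any_tag", "true");
    MPI_Info_set(info, "mpi_assert_allow_overtaking", "true");

    MPI_Comm relaxed_comm;
    MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &relaxed_comm);
    MPI_Info_free(&info);

    /* ... post halo exchanges or other logically parallel operations on
     * relaxed_comm; the assertions allow the library to map them onto
     * independent network resources ... */

    MPI_Comm_free(&relaxed_comm);
    MPI_Finalize();
    return 0;
}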
... rectangular shapes in spatial dimensions). Typically, structured grids for complex geometries such as an aircraft or turbo-machinery contain on the order of hundreds of blocks [20]. In such multiblock grids, each block, together with the iterations in time, forms an iteration space. ...