Article (PDF available)

A Portable Method for Finding User Errors in the Usage of MPI Collective Operations


Abstract

An MPI profiling library is a standard mechanism for intercepting MPI calls by applications. Profiling libraries are so named because they are commonly used to gather runtime information about performance characteristics. Here we present a profiling library whose purpose is to detect user errors in the use of MPI's collective operations. While some errors can be detected locally (by a single process), other errors involving the consistency of arguments passed to MPI collective functions must be tested for in a collective fashion. While the idea of using such a profiling library does not originate here, we take the idea further than it has been taken before (we detect more errors, including those involving datatype inconsistencies) and present an open-source library that can be used with any MPI implementation. We describe the tests carried out, provide some details of the implementation, illustrate the usage of the library, and present performance tests.
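As a rough illustration of the interception mechanism the abstract refers to (a sketch only, not the library's actual code), a profiling wrapper can intercept MPI_Bcast, collectively compare the root argument across the communicator, report a mismatch, and then forward the call to the underlying implementation via PMPI_Bcast:

/* Hypothetical PMPI wrapper sketch; not the code of the library described here. */
#include <mpi.h>
#include <stdio.h>

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm)
{
    int minroot, maxroot, rank;

    /* Collective check: if every process passed the same root, the
     * minimum and maximum of the root values over 'comm' are equal. */
    PMPI_Allreduce(&root, &minroot, 1, MPI_INT, MPI_MIN, comm);
    PMPI_Allreduce(&root, &maxroot, 1, MPI_INT, MPI_MAX, comm);
    if (minroot != maxroot) {
        PMPI_Comm_rank(comm, &rank);
        fprintf(stderr, "[rank %d] MPI_Bcast: inconsistent root argument "
                        "(%d here, %d..%d across the communicator)\n",
                rank, root, minroot, maxroot);
    }

    /* Forward the call to the MPI implementation. */
    return PMPI_Bcast(buffer, count, datatype, root, comm);
}

The library presented in the paper goes well beyond this single check; as the abstract notes, it also verifies properties such as datatype signature consistency across processes.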
... Both dynamic and static tools have been developed to check collective communication in MPI programs. The former operate at run time and may miss errors that depend on specific program inputs [2], [3], [4], while the latter work at compile time and can check all potential program paths. ...
... Dynamic approaches have been proposed to check MPI collective communication [2], [3], [4]. [2] and [3] can check more necessary conditions for correct collective communication than our analyzer. For example, all members of a process group should provide consistent arguments for each common collective operation. ...
Article
Collective communication is widely used in MPI programs. However, its misuse may cause synchronization errors. This paper first proposes an extension to an existing static barrier analysis approach so that it can check one necessary condition for correct collective communication. Since previous analyzers do not distinguish different communicators, they may report false alarms. This paper further presents a communicator-sensitive collective communication analyzer. Moreover, this paper reports the results of comparative experiments on several real MPI programs. Compared with existing static analyzers, the proposed tool generates fewer false alarms, can check more communication behaviour, and is applicable to more programs.
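The false alarms mentioned in that abstract typically arise from code like the following hypothetical fragment: a collective is guarded by a rank-dependent condition but is collective only over a sub-communicator, so the program is correct even though a communicator-insensitive analyzer sees a "mismatched" barrier.

/* Hypothetical example of the pattern behind such false alarms. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Comm evens;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Build a communicator containing only the even ranks. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2 == 0 ? 0 : MPI_UNDEFINED,
                   rank, &evens);

    if (rank % 2 == 0) {
        /* Every member of 'evens' reaches this barrier, so the program is
         * correct; an analyzer that ignores communicators may still warn. */
        MPI_Barrier(evens);
        MPI_Comm_free(&evens);
    }

    MPI_Finalize();
    return 0;
}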
... MPI libraries: Validation can also be done inside MPI libraries or as an extension of a library (for instance in MPICH or NEC-MPI), allowing collective verification for the full MPI-2 standard [67][68][69][70]. The detection of runtime deadlock causes is, however, limited to the information available to the MPI routines. ...
Thesis
Full-text available
Supercomputing plays an important role in several innovative fields, speeding up prototyping or validating scientific theories. However, supercomputers are evolving rapidly, now reaching millions of processing units, which raises the question of their programmability. Despite the emergence of more widespread and functional parallel programming models, developing correct and effective parallel applications remains a complex task. Although debugging solutions have emerged to address this issue, they often come with restrictions, and programming-model evolutions stress the need for a convenient validation tool able to handle hybrid applications. Indeed, as current scientific applications mainly rely on the Message Passing Interface (MPI) parallel programming model, new hardware designed for Exascale, with higher node-level parallelism, clearly advocates for MPI+X solutions, with X a thread-based model such as OpenMP. But integrating two different programming models inside the same application can be error-prone, leading to complex bugs that are, unfortunately, mostly detected at runtime. In an MPI+X program, not only must the correctness of MPI be ensured but also its interactions with the multi-threaded model; for example, identical MPI collective operations cannot be performed by multiple nonsynchronized threads. This thesis aims at developing a combination of static and dynamic analyses to enable an early verification of hybrid HPC applications. The first pass statically verifies the thread level required by an MPI+OpenMP application and outlines execution paths leading to potential deadlocks. Thanks to this analysis, the code is selectively instrumented, displaying an error and synchronously interrupting all processes if the actual scheduling leads to a deadlock situation.
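As a hedged illustration of the hybrid error class mentioned above (not code from the thesis), the fragment below lets two OpenMP threads of each process issue the same collective on the same communicator without any ordering between them, which the MPI standard forbids; serializing the call through a single thread fixes it.

/* Hypothetical MPI+OpenMP example of the error class described above. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    #pragma omp parallel num_threads(2)
    {
        /* ERROR: two unsynchronized threads perform the same collective on
         * MPI_COMM_WORLD concurrently; the matching order is undefined. */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    #pragma omp parallel num_threads(2)
    {
        /* Correct variant: only one thread issues the collective. */
        #pragma omp master
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}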
... As most HPC applications are parallelized with MPI, a lot of work has been done to help programmers debug MPI applications (TASS [15], DAMPI [21], MPI-CHECK [10], Intel Message Checker [2], Marmot [9], Umpire [20], MUST [6], MPICH [3]). Existing tools, static or dynamic, are able to detect the line in the source code where an error occurred but rarely the line responsible for this situation. ...
Conference Paper
Full-text available
MPI is the most widely used parallel programming model. But the shrinking amount of memory per compute core tends to push MPI to be mixed with shared-memory approaches like OpenMP. In such cases, the interoperability of those two models is challenging. The MPI 2.0 standard defines the so-called thread level to indicate how MPI will interact with threads. But even if hybrid programs are becoming more common, there is still a lack of debugging tools, and more precisely of thread-level compliance checks. To fill this gap, we propose a static analysis to verify the thread level required by an application. This work extends PARCOACH, a GCC plugin focused on the detection of MPI collective errors in MPI and MPI+OpenMP programs. We validated our analysis on computational benchmarks and applications and measured a low overhead.
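The kind of thread-level property being verified can be sketched as follows (a hypothetical example, not PARCOACH itself): an application that makes MPI calls from every OpenMP thread must request, and actually obtain, MPI_THREAD_MULTIPLE; requesting a lower level such as MPI_THREAD_FUNNELED is precisely the non-compliance such an analysis reports.

/* Hypothetical example of the thread-level compliance property. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* MPI calls occur inside the parallel region below, so full thread
     * support is required; requesting MPI_THREAD_FUNNELED here would be
     * a thread-level violation. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "insufficient thread support: level %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    #pragma omp parallel
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* MPI call from every thread */
    }

    MPI_Finalize();
    return 0;
}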
... Realizing the importance of the reliability of parallel and distributed programs, researchers have proposed many dynamic techniques for interactive parallel debugging [40], [41], [42], [43], [44], [45], [46] and automatic bug detection [12], [19], [24], [47], [48], [49], [50]. Interactive parallel debuggers help programmers identify the bugs by exploiting automated information collection, aggregation, and visualization techniques. ...
Conference Paper
While improving performance, nonblocking communication is prone to synchronization errors between MPI applications and the underlying MPI libraries. Such a synchronization error occurs in the following way. After initiating nonblocking communication and performing overlapped computation, the MPI application reuses the message buffer before the MPI library completes its use of the same buffer, which may lead to sending out corrupted message data or reading undefined message data. This paper presents a new method called SyncChecker to detect synchronization errors in MPI nonblocking communication. To examine whether the use of message buffers is well synchronized between the MPI application and the MPI library, SyncChecker first tracks relevant memory accesses in the MPI application and the corresponding message send/receive operations in the MPI library. Then it checks whether the correct execution order between the MPI application and the MPI library is enforced by the MPI completion check routines. If not, SyncChecker reports the error with diagnostic information. To reduce runtime overhead, we propose three dynamic optimizations. We have implemented a prototype of SyncChecker on Linux and evaluated it with seven bug cases, i.e., five introduced by the original developers and two injected, in four different MPI applications. Our experiments show that SyncChecker detects all the evaluated synchronization errors and provides helpful diagnostic information. Moreover, our experiments with seven NAS Parallel Benchmarks demonstrate that SyncChecker incurs moderate runtime overhead, 1.3-9.5 times with an average of 5.2 times, making it suitable for software testing.
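The buffer-reuse error described above boils down to patterns like this hypothetical fragment, where the send buffer is modified before MPI_Wait has confirmed that the nonblocking send is done with it.

/* Hypothetical illustration of the synchronization error described above. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf[1024] = {0};
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(buf, 1024, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        buf[0] = 42;   /* ERROR: buffer reused while the MPI library may
                          still be reading it; corrupted data may be sent */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        /* Correct order: complete the request with MPI_Wait (or a
         * successful MPI_Test) first, then modify buf. */
    } else if (rank == 1) {
        MPI_Recv(buf, 1024, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}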
Article
This article presents Jdebug, a fast, non-intrusive, and scalable fault-locating tool for extreme-scale parallel applications. Large-scale debugging has drawn more attention with the increasing scale of supercomputers and applications. To eliminate the program intrusion caused by traditional instrumentation or interception during debugging-information acquisition, we introduce out-of-band management into large-scale debugging. We propose a rapid information-gathering scheme that separates user and debugging traffic to solve the scalability problem and to eliminate program interference while merging data. Observing program counters (PCs) and performance characteristics in suspended applications reveals abnormalities and helps locate abnormal threads caused by software errors or hardware failures effectively. Evaluation shows that Jdebug collects the PCs of over 20 million cores on the new Sunway supercomputer within 1.97 seconds and can locate the abnormal threads in 1.4 seconds with an accuracy of 92.5%. In running tests of three fundamental benchmarks (HPL, HPCG, Graph500) and seventeen real-world applications, Jdebug quickly and accurately locates abnormal threads to help find scalability errors and hardware failures, including memory-access failures, communication failures, and execution-component failures, which validates its effectiveness.
Conference Paper
The development of correct high performance computing applications is challenged by software defects that result from parallel programming. We present an automatic tool that provides novel correctness capabilities for developers of OpenSHMEM applications. These applications follow a Single Program Multiple Data (SPMD) model of parallel programming. A strict form of SPMD programming requires that certain types of operations are textually aligned, i.e., they need to be called from the same source code line in every process. This paper proposes and demonstrates run-time checks that assert such behavior for OpenSHMEM collective communication calls. The resulting tool helps to check program consistency in an automatic and scalable fashion. We introduce the types of checks that we cover and include strict checks that help application developers detect deviations from expected program behavior. Further, we discuss how we can utilize a parallel tool infrastructure to achieve a scalable and maintainable implementation for these checks. Finally, we discuss an extension of our checks towards further types of OpenSHMEM operations.
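Textual alignment, as checked above, can be illustrated with a small hypothetical OpenSHMEM fragment: the program below is functionally correct (every PE performs the barrier), but the collective is issued from two different source lines, which a strict alignment check would flag.

/* Hypothetical OpenSHMEM example of a textual-alignment violation. */
#include <shmem.h>

int main(void)
{
    shmem_init();

    if (shmem_my_pe() % 2 == 0)
        shmem_barrier_all();   /* call site A */
    else
        shmem_barrier_all();   /* call site B: not textually aligned with A */

    shmem_finalize();
    return 0;
}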
Article
In recent years, GPUs have emerged as an extremely cost-effective means for achieving high performance. Many application developers, including those with no prior parallel programming experience, are now trying to scale their applications using GPUs. While languages like CUDA and OpenCL have eased GPU programming for non-graphical applications, they are still explicitly parallel languages. All parallel programmers, particularly the novices, need tools that can help ensure the correctness of their programs. Like any multithreaded environment, data races on GPUs can severely affect program reliability. Thus, tool support for detecting race conditions can significantly benefit GPU application developers. Existing approaches for detecting data races on CPUs or GPUs have one or more of the following limitations: 1) being ill-suited for handling non-lock synchronization primitives on GPUs; 2) lacking scalability due to the state explosion problem; 3) reporting many false positives because of simplified modeling; and/or 4) incurring prohibitive runtime and space overhead. In this paper, we propose GRace, a new mechanism for detecting races in GPU programs that combines static analysis with a carefully designed dynamic checker for logging and analyzing information at runtime. Our design utilizes the GPU's memory hierarchy to log runtime data accesses efficiently. To improve performance, GRace leverages static analysis to reduce the number of statements that need to be instrumented. Additionally, by exploiting knowledge of the thread scheduling and execution model of the underlying GPUs, GRace can accurately detect data races with no false positives reported. Based on the above idea, we have built a prototype of GRace with two schemes, i.e., GRace-stmt and GRace-addr, for NVIDIA GPUs. Both schemes are integrated with the same static analysis. We have evaluated GRace-stmt and GRace-addr with three data race bugs in three GPU kernel functions and have also compared them with an existing approach, referred to as B-tool. Our experimental results show that both schemes of GRace are effective in detecting all evaluated cases with no false positives, whereas B-tool reports many false positives for one evaluated case. GRace-addr incurs low runtime overhead, i.e., 22-116%, and low space overhead, i.e., 9-18 MB, for the evaluated kernels, while GRace-stmt offers more help in diagnosing data races at larger overhead.
Conference Paper
Full-text available
Collective MPI communications have to be executed in the same order by all processes in their communicator and the same number of times, otherwise a deadlock occurs. As soon as the control flow involving these collective operations becomes more complex, in particular including conditionals on process ranks, ensuring the correctness of such code is error-prone. We propose in this paper a static analysis to detect when such a situation occurs, combined with a code transformation that prevents deadlocking. We show on several benchmarks the small impact on performance and the ease of integration of our techniques in the development process.
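A minimal hypothetical instance of the deadlock pattern targeted above: a collective guarded by a condition on the process rank, so only part of MPI_COMM_WORLD reaches it.

/* Hypothetical example of a rank-dependent collective mismatch. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank % 2 == 0)
        MPI_Barrier(MPI_COMM_WORLD);   /* odd ranks never call it: deadlock */

    MPI_Finalize();
    return 0;
}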
Article
In recent years, GPUs have emerged as an extremely cost-effective means for achieving high performance. While languages like CUDA and OpenCL have eased GPU programming for nongraphical applications, they are still explicitly parallel languages. All parallel programmers, particularly the novices, need tools that can help ensure the correctness of their programs. Like any multithreaded environment, data races on GPUs can severely affect program reliability. In this paper, we propose GMRace, a new mechanism for detecting races in GPU programs. GMRace combines static analysis with a carefully designed dynamic checker for logging and analyzing information at runtime. Our design utilizes the GPU's memory hierarchy to log runtime data accesses efficiently. To improve performance, GMRace leverages static analysis to reduce the number of statements that need to be instrumented. Additionally, by exploiting knowledge of the thread scheduling and execution model of the underlying GPUs, GMRace can accurately detect data races with no false positives reported. Our experimental results show that, compared to previous approaches, GMRace is more effective in detecting races in the evaluated cases and incurs much less runtime and space overhead.
Conference Paper
Full-text available
An MPI profiling library is a standard mechanism for intercepting MPI calls by applications. Profiling libraries are so named because they are commonly used to gather performance data on MPI programs. Here we present a profiling library whose purpose is to detect user errors in the use of MPI’s collective operations. While some errors can be detected locally (by a single process), other errors involving the consistency of arguments passed to MPI collective functions must be tested for in a collective fashion. While the idea of using such a profiling library does not originate here, we take the idea further than it has been taken before (we detect more errors) and offer an open-source library that can be used with any MPI implementation. We describe the tests carried out, provide some details of the implementation, illustrate the usage of the library, and present performance tests. Keywords: MPI, collective, errors, datatype, hashing
Conference Paper
Full-text available
Detecting the misuse of datatypes in an application code is a desirable feature for an MPI library. To support this goal we investigate a class of hash functions based on checksums to encode the type signatures of MPI datatypes. The quality of these hash functions is assessed in terms of hashing quality and timing, and they are compared with other functions published for this particular problem (Gropp, 7th European PVM/MPI Users’ Group Meeting, 2000) or for other applications (CRCs). In particular, hash functions based on Galois Fields enable good hashing, computation of the signature of a unidatatype in O(1), and, additionally, computation of the signature of the concatenation of two datatypes in O(1).
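The paper's Galois Field construction is not reproduced here; the following generic sketch only shows the flavor of a signature hash with O(1) concatenation, using a polynomial hash modulo a prime: a signature is summarized by a pair (hash value, radix raised to the signature length), and the pair for a concatenated signature is computed directly from the pairs of its two parts.

/* Generic sketch (not the Galois Field scheme of the paper) of a type
 * signature hash that supports O(1) concatenation. */
#include <stdint.h>
#include <stdio.h>

#define P  2147483647ULL   /* prime modulus (2^31 - 1) */
#define R  131ULL          /* radix for the polynomial hash */

typedef struct { uint64_t h, pw; } sig_hash;   /* hash value, R^length mod P */

/* Hash of a signature consisting of a single basic type id. */
static sig_hash sig_single(uint64_t type_id)
{
    sig_hash s = { type_id % P, R % P };
    return s;
}

/* Hash of the concatenation a.b, computed in O(1) from the two hashes. */
static sig_hash sig_concat(sig_hash a, sig_hash b)
{
    sig_hash s;
    s.h  = (a.h + a.pw * b.h) % P;
    s.pw = (a.pw * b.pw) % P;
    return s;
}

int main(void)
{
    /* (int, double) versus (double, int): same multiset of basic types, but
     * the order differs, so the hashes differ with high probability. */
    sig_hash id = sig_concat(sig_single(1), sig_single(2));
    sig_hash di = sig_concat(sig_single(2), sig_single(1));
    printf("%llu %llu\n", (unsigned long long)id.h, (unsigned long long)di.h);
    return 0;
}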
Conference Paper
Full-text available
The MPI standard provides a way to send and receive complex combinations of datatypes (e.g., integers and doubles) with a single communication operation. The MPI standard specifies that the type signature, that is, the sequence of basic datatypes (language-defined types such as int or DOUBLE PRECISION), must match in communication operations such as send/receive or broadcast. Because datatypes may be defined by the user in MPI, there is a limitless collection of possible type signatures. Detecting the programmer error of mismatched datatypes is difficult in this case; detecting all errors essentially requires sending a complete description of the type signature with a message. This paper discusses an alternative: send the value of a function of the type signature so that (a) identical type signatures always give the same function value, (b) different type signatures often give different values, and (c) common cases (e.g., predefined datatypes) are handled exactly. Thus, erroneous programs are often (but not always) detected; correct programs are never flagged as erroneous. The method described is relatively inexpensive to compute and uses a small (and fixed, independent of the complexity of the datatype) amount of space in the message envelope.
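A sketch of how such a hash could be used (assuming a helper, here deliberately simplified, that maps a (datatype, count) pair to a signature hash): in a broadcast wrapper, the root's hash is distributed first and each process compares it with its locally computed hash; equal values mean the signatures are almost certainly consistent, while different values prove an error.

/* Hypothetical sketch of a signature-hash check in a profiling wrapper. */
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

/* Placeholder hash: handles only a few predefined types and is NOT the
 * function described in the paper; it only stands in for it. */
static uint32_t type_signature_hash(MPI_Datatype datatype, int count)
{
    uint32_t basic = datatype == MPI_INT    ? 1u :
                     datatype == MPI_DOUBLE ? 2u :
                     datatype == MPI_CHAR   ? 3u : 0u;
    return (basic * 2654435761u) ^ (uint32_t)count;
}

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm)
{
    uint32_t mine  = type_signature_hash(datatype, count);
    uint32_t roots = mine;

    /* Distribute the root's hash and compare it with the local one. */
    PMPI_Bcast(&roots, 1, MPI_UINT32_T, root, comm);
    if (roots != mine)
        fprintf(stderr, "MPI_Bcast: type signature mismatch with root\n");

    return PMPI_Bcast(buffer, count, datatype, root, comm);
}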
Conference Paper
Full-text available
A large number of MPI implementations are currently available, each of which emphasizes different aspects of high-performance computing or is intended to solve a specific research problem. The result is a myriad of incompatible MPI implementations, all of which require separate installation, and the combination of which presents significant logistical challenges for end users. Building upon prior research, and influenced by experience gained from the code bases of the LAM/MPI, LA-MPI, and FT-MPI projects, Open MPI is an all-new, production-quality MPI-2 implementation that is fundamentally centered around component concepts. Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI. Its component architecture provides both a stable platform for third-party research as well as enabling the run-time composition of independent software add-ons. This paper presents a high-level overview of the goals, design, and implementation of Open MPI.
Conference Paper
Full-text available
The BlueGene/L supercomputer uses system-on-a-chip integration and a highly scalable 65,536-node cellular architecture to deliver 360 Teraflops of peak computing power. Efficient operation of the machine requires a fast, scalable, and standards-compliant MPI library. Researchers at IBM and Argonne National Laboratory are porting the MPICH2 library to BlueGene/L. We present the current state of the project and discuss the features critical to achieving performance and scalability.
Conference Paper
The collective communication operations of MPI, and in general MPI operations with non-local semantics, require the processes participating in the calls to provide consistent parameters, e.g., a unique root process, matching type signatures and amounts for the data to be exchanged, or the same operator. Under normal use of MPI such exhaustive consistency checks are typically too expensive to perform and would compromise optimizations for high performance in the collective routines. However, confusing and hard-to-find errors (deadlocks, wrong results, or program crashes) can be caused by inconsistent calls to collective operations. We suggest using the MPI profiling interface to provide more extensive semantic checking of calls to MPI routines with collective (non-local) semantics. With this, exhaustive semantic checks can be enabled during application development and disabled for production runs. We discuss what can reasonably be checked by such an interface and mention some inherent limitations of MPI to making a fully portable interface for semantic checking. The proposed collective semantics verification interface for the full MPI-2 standard has been implemented for the NEC proprietary MPI/SX as well as other NEC MPI implementations.
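One such non-local check can be sketched as follows (a minimal illustration, not the NEC implementation): before executing the actual collective, each wrapper first agrees collectively on an operation code, so that processes entering different collectives on the same communicator are detected.

/* Minimal sketch of checking that all processes entered the same collective. */
#include <mpi.h>
#include <stdio.h>

enum { OP_BARRIER = 1, OP_REDUCE = 2 };   /* hypothetical operation codes */

static void check_same_operation(int opcode, MPI_Comm comm)
{
    int minop, maxop;
    PMPI_Allreduce(&opcode, &minop, 1, MPI_INT, MPI_MIN, comm);
    PMPI_Allreduce(&opcode, &maxop, 1, MPI_INT, MPI_MAX, comm);
    if (minop != maxop)
        fprintf(stderr, "collective mismatch: operation codes %d..%d seen\n",
                minop, maxop);
}

int MPI_Barrier(MPI_Comm comm)
{
    check_same_operation(OP_BARRIER, comm);
    return PMPI_Barrier(comm);
}

int MPI_Reduce(const void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
{
    check_same_operation(OP_REDUCE, comm);
    return PMPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm);
}

Because every wrapper issues the same pair of allreduces before its collective, the check itself stays matched across processes even when the underlying collectives do not; enabling it during development and disabling it for production runs follows the usage model described in the abstract.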