Article

Event Synchronization Analysis for Debugging Parallel Programs


Abstract

One of the major difficulties of explicit parallel programming for a shared memory machine model is detecting the potential for nondeterminacy and identifying its causes. There will often be shared variables in a parallel program, and the tasks comprising the program may need to be synchronized when accessing these variables. This paper discusses this problem and presents a method for automatically detecting nondeterminacy in parallel programs that use event-style synchronization instructions, namely the Post, Wait, and Clear primitives. With event-style synchronization, especially when there are many references to the same event, the difficulty lies in computing the execution order that is guaranteed by the synchronization instructions and the sequential components of the program. The main result in this paper is an algorithm that computes such an execution order and yields a Task Graph upon which a nondeterminacy detection algorithm can be applied. We have focused on events ...
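As a concrete illustration of the guaranteed-ordering idea, the sketch below builds ordering edges from a toy trace of Post/Wait operations. It is a minimal sketch, not the paper's algorithm: it assumes no Clear operations and conservatively orders each Wait after every Post on the same event, and the trace format and function names are invented for the example.

```python
# A minimal sketch of deriving guaranteed-ordering edges from Post/Wait
# event synchronization. Assumes each task is a list of operations and
# ignores Clear, i.e., the simple case the full algorithm generalizes.

from collections import defaultdict

def build_task_graph(tasks):
    """tasks: dict task_id -> list of ('post'|'wait'|'work', event_or_label).
    Returns edges (task, index) -> (task, index) guaranteed to be ordered."""
    edges = []
    posts = defaultdict(list)            # event -> [(task, index)]
    for tid, ops in tasks.items():
        for i in range(len(ops) - 1):    # sequential order within a task
            edges.append(((tid, i), (tid, i + 1)))
        for i, (kind, ev) in enumerate(ops):
            if kind == 'post':
                posts[ev].append((tid, i))
    for tid, ops in tasks.items():
        for i, (kind, ev) in enumerate(ops):
            if kind == 'wait':           # conservatively order a Wait after
                for src in posts[ev]:    # every Post on the same event
                    if src[0] != tid:
                        edges.append((src, (tid, i)))
    return edges

tasks = {
    'T1': [('work', 'a=1'), ('post', 'e')],
    'T2': [('wait', 'e'), ('work', 'print(a)')],
}
print(build_task_graph(tasks))
```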

... Incompatibility of the facilities for different parts of concurrent debugging forces the programmer to either use different facilities for different parts of the cycle or debug without them. Use of multiple representations for P, M and E typically compels the debuggers to either constrain the range of behaviors that can be checked [ReSc94]; or to tolerate the ambiguities in the observed behavior [EGP89], [HMW90], [NM91a]; or to demand extra programming effort [SBN89], [LMF90], [Bat89]. ...
... Race detection facilities detect simultaneous accesses to shared data (R-W, W-W) [NM90a], [EGP89], [HMW90], [Sch89]. Their focus is race behavior in the P → E part of the debugging cycle. ...
... Shared memory debuggers typically do not record the inter-process orderings. They only approximate the orderings from the sequential traces obtained for each process [HMW90], [NM91a], [Sch89], [EGP89]. This restricts their model checking ability to only one type of expected behavior; the race behavior resulting from the non-deterministic access of shared data. ...
Thesis
Full-text available
Debugging is a process that involves establishing relationships between several entities: The behavior specified in the program, P, the model/predicate of the expected behavior, M, and the observed execution behavior, E. The thesis of the unified approach is that a consistent representation for P, M and E greatly simplifies the problem of concurrent debugging, both from the viewpoint of the programmer attempting to debug a program and from the viewpoint of the implementor of debugging facilities. Provision of such a consistent representation becomes possible when sequential behavior is separated from concurrent or parallel structuring. Given this separation, the program becomes a set of sequential actions and relationships among these actions. The debugging process, then, becomes a matter of specifying and determining relations on the set of program actions. The relations are specified in P, modeled in M and observed in E. This simplifies debugging because it allows the programmer to think in terms of the program which he understands. It also simplifies the development of a unified debugging system because all of the different approaches to concurrent debugging become instances of the establishment of relationships between the actions. The unified approach defines a formal model for concurrent debugging in which the entire debugging process is specified in terms of program actions. The unified model places all of the approaches to debugging of parallel programs such as execution replay, race detection, model/predicate checking, execution history displays and animation, which are commonly formulated as disjoint facilities, in a single, uniform framework. We have also developed a feasibility demonstration prototype implementation of this unified model of concurrent debugging in the context of the CODE 2.0 parallel programming system. This implementation demonstrates and validates the claims of integration of debugging facilities in a single framework. It is further the case that the unified model of debugging greatly simplifies the construction of a concurrent debugger. All of the capabilities previously regarded as separate for debugging of parallel programs, both in shared memory models of execution and distributed memory models of execution, are supported by this prototype.
... Static analysis is used to augment dynamic analysis [1,6,7,8,9,14,22]. Dynamic analysis detects races for a particular execution of the program; these methods generally differ in the way the information that is used in the analysis is collected and analyzed. ...
... Dynamic analysis detects races for a particular execution of the program; these methods generally differ in the way the information that is used in the analysis is collected and analyzed. The two approaches used are: post-mortem [1,6,7,8,9,12,13,18] and on-the-fly [14,18,22] analysis. Allen and Padua [1] describe a method for the post-mortem analysis of Fortran programs in a shared-memory parallel machine. ...
... Algorithms That Compute Guaranteed Ordering: Emrath, Ghosh and Padua [7,8,9] present methods for detecting general races in programs using event synchronization. They present two algorithms for post-mortem analysis, namely, the Exhaustive Pairing Algorithm (EPA) and the Common Ancestor Algorithm (CAA) to compute guaranteed orderings, which are described next. ...
Article
The increase in the number and complexity of parallel programs has led to a need for better approaches for synchronization error detection and debugging of parallel programs. This paper presents an efficient and precise algorithm for the detection of nondeterminacy (race conditions) in parallel programs. Nondeterminacy exists in a program when the program yields different outputs for different runs with the same input. We limit our attention to nondeterminacy due to errors in synchronization and to race conditions due to these unsynchronized accesses to shared variables. A directed acyclic graph called a task graph is used to represent the accesses to shared variables in a parallel program with edges representing guaranteed ordering. The algorithm proposed here constructs an augmented task graph, and then uses a modification of depth-first search to classify the edges in the augmented task graph. The edges are analyzed and the nodes that are guaranteed to execute before a...
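Once a task graph with guaranteed-ordering edges is available, the core check is whether two conflicting accesses are ordered. The sketch below uses plain DFS reachability rather than the paper's augmented-task-graph edge classification; the graph representation and names are invented for the example.

```python
# A hedged illustration of the underlying idea: two accesses to the same
# shared variable race unless one node reaches the other in the task graph.
# Plain DFS reachability, not the paper's edge-classification algorithm.

def reachable(graph, src, dst, seen=None):
    """graph: dict node -> list of successor nodes."""
    if seen is None:
        seen = set()
    if src == dst:
        return True
    seen.add(src)
    return any(reachable(graph, nxt, dst, seen)
               for nxt in graph.get(src, []) if nxt not in seen)

def find_races(graph, accesses):
    """accesses: list of (node, variable, is_write).
    Reports unordered conflicting pairs."""
    races = []
    for i, (n1, v1, w1) in enumerate(accesses):
        for n2, v2, w2 in accesses[i + 1:]:
            if v1 == v2 and (w1 or w2) and n1 != n2:
                if not reachable(graph, n1, n2) and not reachable(graph, n2, n1):
                    races.append((n1, n2, v1))
    return races

g = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D']}        # fork-join diamond
acc = [('B', 'x', True), ('C', 'x', True), ('D', 'x', False)]
print(find_races(g, acc))   # B and C both write x with no ordering: a race
```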
... Memory effects (mutation) are crucial for theoretical and practical efficiency of nested parallel programs, both in the underlying runtime system (e.g., to support communication for the purposes of scheduling), and at the application level (e.g., to implement collection data structures using mutable arrays). The challenge is that the same memory effects that improve efficiency can lead to race conditions, which typically harm correctness in complex and unpredictable ways [Adve 2010; Allen and Padua 1987; Bocchino et al. 2009, 2011; Boehm 2011; Emrath et al. 1991; Mellor-Crummey 1991; Netzer and Miller 1992; Steele Jr. 1990]. ...
... All of these systems support memory effects or destructive updates, which make it challenging to write correct parallel programs, because they can lead to determinacy or data races [Allen and Padua 1987; Emrath et al. 1991; Mellor-Crummey 1991; Netzer and Miller 1992; Steele Jr. 1990], which can be very difficult to avoid and usually lead to incorrect behavior [Adve 2010; Bocchino et al. 2009, 2011; Boehm 2011]. There has therefore been much work on ensuring race freedom by detecting or eliminating races via dynamic techniques (e.g., [Cheng et al. 1998; Feng and Leiserson 1999; Kuper and Newton 2013; Kuper et al. 2014b; Mellor-Crummey 1991; Raman et al. 2012; Steele Jr. 1990; Utterback et al. 2016]), as well as static techniques including type systems (e.g., [Bocchino et al. 2011; Flanagan and Freund 2009; Flanagan et al. 2008]). ...
Article
Full-text available
Nested parallelism has proved to be a popular approach for programming the rapidly expanding range of multicore computers. It allows programmers to express parallelism at a high level and relies on a run-time system and a scheduler to deliver efficiency and scalability. As a result, many programming languages and extensions that support nested parallelism have been developed, including in C/C++, Java, Haskell, and ML. Yet, writing efficient and scalable nested parallel programs remains challenging, primarily due to difficult concurrency bugs arising from destructive updates or effects. For decades, researchers have argued that functional programming can simplify writing parallel programs by allowing more control over effects but functional programs continue to underperform in comparison to parallel programs written in lower-level languages. The fundamental difficulty with functional languages is that they have high demand for memory, and this demand only grows with parallelism. In this paper, we identify a memory property, called disentanglement, of nested-parallel programs, and propose memory management techniques for improved efficiency and scalability. Disentanglement allows for (destructive) effects as long as concurrently executing threads do not gain knowledge of the memory objects allocated by each other. We formally define disentanglement by considering an ML-like higher-order language with mutable references and presenting a dynamic semantics for it that enables reasoning about computation graphs of nested parallel programs. Based on this graph semantics, we formalize a classic correctness property---determinacy race freedom---and prove that it implies disentanglement. This establishes that disentanglement applies to a relatively broad class of parallel programs. We then propose memory management techniques for nested-parallel programs that take advantage of disentanglement for improved efficiency and scalability. We show that these techniques are practical by extending the MLton compiler for Standard ML to support this form of nested parallelism. Our empirical evaluation shows that our techniques are efficient and scale well.
... There have been many studies on debugging data races. Some perform a post-mortem analysis based on program execution traces [8, 11, 13, 21, 22], while others perform on-the-fly analysis during program execution [2, 10, 20, 27]. Among modern shared-memory parallel programming models [9, 23, 24, 26], only Cilk++ [9] provides a data race detector called Cilkscreen [2, 9, 16]. ...
Conference Paper
Full-text available
This paper proposes a data race prevention scheme, which can prevent data races in the View-Oriented Parallel Programming (VOPP) model. VOPP is a novel shared-memory data-centric parallel programming model, which uses views to bundle mutual exclusion with data access. We have implemented the data race prevention scheme with a memory protection mechanism. Experimental results show that the extra overhead of memory protection is trivial in our applications. We also present a new VOPP implementation-Maotai 2.0, which has advanced features such as deadlock avoidance, producer/consumer view and system queues, in addition to the data race prevention scheme. The performance of Maotai 2.0 is evaluated and compared with modern programming models such as OpenMP and Cilk.
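The essence of the view approach is that mutual exclusion travels with the data rather than being left to the programmer's discipline. The sketch below is a minimal Python analogue under invented names (View, worker), not the Maotai API, and it uses a lock where VOPP uses a memory protection mechanism.

```python
# A minimal sketch of the view idea: data is only handed out between
# acquire and release, so mutual exclusion is bundled with data access.
# Names are illustrative; this is not the VOPP/Maotai implementation.

import threading

class View:
    def __init__(self, data):
        self._data = data
        self._lock = threading.Lock()
    def __enter__(self):
        self._lock.acquire()
        return self._data            # data handed out only under the lock
    def __exit__(self, *exc):
        self._lock.release()

counter = View({'n': 0})

def worker():
    for _ in range(1000):
        with counter as d:           # every access goes through the view,
            d['n'] += 1              # so the increment is race-free

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
with counter as d:
    print(d['n'])                    # always 4000
```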
... Some analyses allow the programmer to declare an association between data and locks, then check that the program holds the lock whenever it accesses the corresponding data [94, 28]. Other analyses trace the control transfers associated with the use of synchronization constructs such as the post and wait constructs used in parallel dialects of Fortran [71, 18, 36, 17], the Ada rendezvous constructs [95, 99, 33, 70, 35], or the Java wait and notify constructs [73, 74]. The goal is to determine that the synchronization actions temporally separate conflicting accesses to shared data. ...
... The characteristics of the analysis depend on the specific synchronization constructs. Researchers have developed algorithms for programs that use the post and wait constructs used in parallel dialects of Fortran [18, 36, 17], for the Ada rendezvous constructs [95, 33, 70, 35], and for the Java wait and notify constructs [73, 74]. The basic idea behind these algorithms is to match each blocking action (such as a wait or accept) with its potential corresponding trigger actions (such as post or notify) from other threads. ...
Conference Paper
The field of program analysis has focused primarily on sequential programming languages. But multithreading is becoming increasingly important, both as a program structuring mechanism and to support efficient parallel computations. This paper surveys research in analysis for multithreaded programs, focusing on ways to improve the efficiency of analyzing interactions between threads, to detect data races, and to ameliorate the impact of weak memory consistency models. We identify two distinct classes of multithreaded programs, activity management programs and parallel computing programs, and discuss how the structure of these kinds of programs leads to different solutions to these problems. Specifically, we conclude that augmented type systems are the most promising approach for activity management programs, while targeted program analyses are the most promising approach for parallel computing programs.
... The approaches presented above infer equivalent executions by just analyzing the trace, without knowledge about the program generating it. Another interesting and productive line of research attempts to use information about the actual program code to either statically detect potential bad behaviors [11, 21], or to use information about the program and about the property to be checked to further relax the models of an execution [4, 10, 23]. Purely static analysis approaches have to overcome unavoidable undecidability aspects, and typically give up soundness to increase coverage. ...
Article
Full-text available
Extracting causal models from observed executions has proved to be an effective approach to analyze concurrent programs. Most existing causal models are based on happens-before partial orders and/or Mazurkiewicz traces. Unfortunately, these models are inherently limited in the context of multithreaded systems, since multithreaded executions are mainly determined by consistency among shared memory accesses rather than by partial orders or event independence. This paper defines a novel theoretical foundation for multithreaded executions and a novel causal model, based on memory consistency constraints. The proposed model is sound and maximal: (1) all traces consistent with the causal model are feasible executions of the multithreaded program under analysis; and (2) assuming only the observed execution and no knowledge about the source code of the program, the proposed model captures more feasible executions than any other sound causal model. An algorithm to systematically generate all the feasible executions comprised by maximal causal models is also proposed, which can be used for testing or model checking of multithreaded system executions. Finally, a specialized submodel of the maximal one is presented, which gives an efficient and effective solution to on-the-fly data race detection. This data-race-focused model still captures more feasible executions than the existing happens-before-based approaches.
... In the process of monitoring, it reports detected races during the monitored execution. This approach can be a complement to static analysis [2, 6], because on-the-fly detection can be used to identify actual races from the potential races reported by static analysis approaches. On-the-fly detection also uses much less storage space than post-mortem detection [3, 13], because much of the information collected during the monitoring process can be discarded as the execution proceeds. ...
Conference Paper
Full-text available
Races might result in unintended nondeterministic execution of parallel programs, and thus race detection is one of the critical issues to be resolved in debugging of shared-memory parallel programs. On-the-fly race detection techniques have been developed as one approach to the problem. However, on-the-fly race detection suffers from huge run-time overhead because the whole execution behavior of the program being debugged must be monitored at run-time. In this paper we present a practical loop transformation technique which can significantly reduce the monitoring overhead required for detecting races on-the-fly in parallel programs. Our technique achieves the improvement by minimizing the number of iterations of each parallel loop that must be monitored, transforming the original loop accordingly. An experimental performance measurement shows a dramatic improvement in monitoring overhead, and the technique detects more races than traditional on-the-fly techniques.
... The compiler can further use this information to precisely detect statements that may execute concurrently. Several such analyses have been developed to trace the control transfers associated with synchronization constructs such as the post and wait constructs in parallel dialects of Fortran [Callahan and Subhlok 1988;Emrath et al. 1989;Callahan et al. 1990], the Ada rendezvous constructs [Taylor 1983;Duesterwald and Soffa 1991;Masticola and Ryder 1993;Dwyer and Clarke 1994], and the wait and notify constructs in Java [Naumovich and Avrunin 1998;Naumovich et al. 1999]. But these are not designed to analyze the effects of accesses to shared pointers: they either assume the absence of shared pointers or they don't analyze data accesses at all. ...
... Some of these techniques concentrate on the analysis of synchronization, and rely on the fact that detecting conflicting accesses is straightforward once the analysis determines which statements may execute concurrently [Taylor 1983;Balasundaram and Kennedy 1989;Duesterwald and Soffa 1991]. Other analyses focus on parallel programs with affine array accesses in loops, and use techniques similar to those from data dependence analysis for sequential programs [Emrath and Padua 1988;Emrath et al. 1989;Callahan et al. 1990]. However, none of these analyses is designed to detect data races in pointer-based multithreaded programs. ...
Article
This paper presents a novel interprocedural, flow-sensitive, and context-sensitive pointer analysis algorithm for multithreaded programs that may concurrently update shared pointers. The algorithm is designed to handle programs with structured parallel constructs, including fork-join constructs, parallel loops, and conditionally spawned threads. For each pointer and each program point, the algorithm computes a conservative approximation of the memory locations to which that pointer may point. The algorithm correctly handles a wide range of programming language constructs, including recursive functions, recursively generated parallelism, function pointers, structures, arrays, nested structures and arrays, pointer arithmetic, casts between different pointer types, heap and stack allocated memory, shared global variables, and thread-private global variables. We have implemented the algorithm in the SUIF compiler system and used the implementation to analyze a set of multithreaded programs written in the Cilk programming language. Our experimental results show that the analysis has good precision and converges quickly for our set of Cilk programs.
... Miller and Netzer showed that detecting race conditions in parallel programs that use multiple semaphores is NP-complete [15]. Researchers have developed exact algorithms for cases where the problem is efficiently solvable (programs that use types of synchronization weaker than semaphores such as post/wait/clear) [8, 9, 14], and heuristics for the multiple semaphore case [4, 10]. The complexity for the case of constant number of semaphores was unknown. ...
Article
We address the problem of detecting race conditions in programs that use semaphores for synchronization. Netzer and Miller showed that it is NP-complete to detect race conditions in programs that use many semaphores. We show in this paper that it remains NP-complete even if only two semaphores are used in the parallel programs. For the tractable case, i.e., using only one semaphore, we give two algorithms for detecting race conditions from the trace of executing a parallel program on p processors, where n semaphore operations are executed. The first algorithm determines in O(n) time whether a race condition exists between any two given operations. The second algorithm runs in O(np log n) time and outputs a compact representation from which one can determine in O(1) time whether a race condition exists between any two given operations. The second algorithm is near-optimal in that the running time is only O(log n) times the time required simply to write down the output.
... However, the work by Netzer [24] is for shared memory programs, while the work by Netzer and Miller [26] applies only to message-passing programs with blocking sends and receives. Finally, many researchers have studied race conditions in parallel programs that use shared memory [1, 2, 9, 11, 14, 17, 22, 24, 25]. In this paper we describe an algorithm for detecting race conditions in parallel programs. ...
Article
This paper presents an algorithm for performing on-the-fly race detection for parallel message-passing programs. The algorithm reads a trace of the communication events in a message-passing parallel program and either finds a specific race condition or reports that the traced program is race-free. It supports a rich message-passing model, including blocking and non-blocking sends and receives, synchronous and asynchronous sends, receive selectivity by source and/or tag value, and arbitrary amounts of system buffering of messages. It runs in polynomial time and is very efficient for most types of executions. A key feature of the race detection algorithm is its use of several new types of logical clocks for determining ordering relations. It is likely that these logical clocks will also be useful in other settings.
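A standard way to make such ordering relations concrete is the vector clock: each process keeps one counter per process, ticks its own slot on every event, and joins the sender's clock on a receive. The sketch below shows plain vector clocks only; it does not reproduce the paper's new specialized logical clocks.

```python
# A sketch of ordinary vector clocks for ordering message-passing events.
# The paper introduces several new logical clock types; this shows only
# the standard happens-before relation they refine.

def tick(clock, pid):
    """Advance process pid's own component by one."""
    c = list(clock)
    c[pid] += 1
    return tuple(c)

def merge(a, b):
    """Component-wise maximum, applied when a message is received."""
    return tuple(max(x, y) for x, y in zip(a, b))

def happens_before(a, b):
    """a happened before b iff a <= b component-wise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

P = 3
zero = (0,) * P
send = tick(zero, 0)                  # P0 sends a message:  (1, 0, 0)
recv = tick(merge(zero, send), 1)     # P1 receives, joins:  (1, 1, 0)
work = tick(zero, 2)                  # P2, unsynchronized:  (0, 0, 1)

print(happens_before(send, recv))     # True: ordered by the message
print(happens_before(work, recv) or
      happens_before(recv, work))     # False: concurrent, so conflicting
                                      # accesses here would be a race
```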
... There have been many studies on debugging data races. Some perform a post-mortem analysis based on program execution traces [7, 10, 12, 20, 21], while others perform on-the-fly analysis during program execution [2, 9, 19, 25]. Among modern shared-memory parallel programming models [6, 8, 22, 24], only Cilk++ [8] provides a data race detector called Cilkscreen [2, 8, 15]. ...
Article
Full-text available
Data races hamper parallel programming and threaten the reliability of future software. This paper proposes the data race prevention scheme View-Oriented Data race Prevention (VODAP), which can prevent data races in the View-Oriented Parallel Programming (VOPP) model. VOPP is a novel shared-memory data-centric parallel programming model, which uses views to bundle mutual exclusion with data access. We have implemented the data race prevention scheme with a memory protection mechanism. Experimental results show that the extra overhead of memory protection is trivial in our applications. The performance is evaluated and compared with modern programming models such as OpenMP and Cilk. Keywords: VOPP, view-oriented parallel programming, concurrent programming, SPMD, data race freedom, data-centric programming, Cilk, multicore, shared memory.
... Its applicability is therefore restricted by the huge program state space. Perry [11] described a system that automatically detects races in a parallel program. In this approach, the dynamic execution trace of the program was used to build a task graph and to log points of event-style synchronization. ...
Conference Paper
Full-text available
In real-time embedded systems, due to race conditions, the synchronization order between events may differ from one execution to another. This behavior is permissible in concurrent systems, but should be fully analyzed to ensure the correctness of the system. In this paper, a new intelligent method is presented to analyze event synchronization sequences in embedded systems. Our goal is to identify the feasible sequences, and to determine the timing parameters that lead to these sequences. Our approach adopts timed event automata (TEA) to model the targeted embedded system and uses a race condition graph (RCG) to specify the event synchronization sequence (SYN-Spec). A genetic algorithm working with simulation is used to analyze the timing parameters in the target model and to verify whether a defined SYN-Spec is satisfied or not. A case study shows that the proposed method is able to find potential execution sequences according to the event synchronization orders.
... We use a program thread structure graph, similar to a call graph, that allows parallelism between code regions to be easily determined. Synchronization analysis [3] is used to refine the results of MHP analysis. For two statements S1 and S2 that are in different threads and may happen in parallel, it attempts to determine if S1 must execute before S2, or after S2. ...
Conference Paper
Full-text available
Concurrent threads executing on a shared memory system can access the same memory locations. A consistency model defines constraints on the order of these shared memory accesses. For good run-time performance, these constraints must be as few as possible. Programmers who write explicitly parallel programs must take into account the consistency model when reasoning about the behavior of their programs. Also, the consistency model constrains compiler transformations that reorder code. It is not known what consistency models best suit the needs of the programmer, the compiler, and the hardware simultaneously. We are building a compiler infrastructure to study the effect of consistency models on code optimization and run-time performance. The consistency model presented to the user will be a programmable feature independent of the hardware consistency model. The compiler will be used to mask the hardware consistency model from the user by mapping the software consistency model onto the hardware consistency model. When completed, our compiler will be used to prototype consistency models and to measure the relative performance of different consistency models. We present preliminary experimental data for performance of a software implementation of sequential consistency using manual inter-thread analysis.
... Captured data is known as a trace, the most useful of which is an address trace (or execution trace) consisting of executed instruction addresses. It has long been established that examining such a trace can help determine behaviors that lead to software faults [8]. Such analysis is commonly performed off-line in a remote debugging scenario, depicted in Figure 1 for either real SoC or FPGA-emulated processor cores. ...
Conference Paper
Full-text available
In the multicore era, capturing execution traces of processors is indispensable to debugging complex software. The inability to transfer vast amounts of trace data off-chip without significant slow-down has impeded the debugging of such software, in both pre-silicon emulation and in real designs. We consider on-chip trace compression performed in hardware to reduce data volume, using techniques that exploit inherent higher-order redundancy in address trace data. While hardware trace compression is often restricted to poor or moderate performance due to area and memory constraints, we present a parameterizable scheme that leverages the resources already found on existing platforms. Harnessing resources such as existing trace buffers on CPUs, and unused embedded memory on FPGA emulation platforms, our trace compression scheme requires only a small additional hardware area to achieve superior compression ratios.
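To illustrate the kind of redundancy such schemes exploit, the sketch below delta-encodes and run-length-encodes a toy instruction-address trace: straight-line execution produces long runs of identical deltas that collapse to a single pair. This is a simplified software analogue, not the paper's hardware compression scheme.

```python
# A simplified software analogue of address-trace redundancy (not the
# paper's hardware design): instruction addresses mostly advance by a
# fixed stride, so delta + run-length encoding collapses straight-line runs.

def compress(addresses):
    deltas = [addresses[0]] + [b - a for a, b in zip(addresses, addresses[1:])]
    out, i = [], 0
    while i < len(deltas):
        j = i
        while j < len(deltas) and deltas[j] == deltas[i]:
            j += 1
        out.append((deltas[i], j - i))   # (delta, run length)
        i = j
    return out

def decompress(pairs):
    addrs, cur = [], 0
    for delta, count in pairs:
        for _ in range(count):
            cur += delta
            addrs.append(cur)
    return addrs

trace = [0x400, 0x404, 0x408, 0x40c, 0x500, 0x504]   # a branch after 0x40c
c = compress(trace)
print(c)                            # [(1024, 1), (4, 3), (244, 1), (4, 1)]
assert decompress(c) == trace
```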
... Miller and Netzer showed that detecting race conditions in parallel programs that use multiple semaphores is NP-complete [15]. Researchers have developed exact algorithms for cases where the problem is efficiently solvable (programs that use types of synchronization weaker than semaphores such as post/wait/clear) [8, 9, 14], and heuristics for the multiple semaphore case [4, 10]. The complexity for the case of constant number of semaphores was unknown. ...
Article
We address the problem of detecting race conditions in programs that use semaphores for synchronization. Netzer and Miller showed that it is NP-complete to detect race conditions in programs that use many semaphores. We show in this paper that it remains NP-complete even if only two semaphores are used in the parallel programs. For the tractable case, i.e., using only one semaphore, we give two algorithms for detecting race conditions from the trace of executing a parallel program on p processors, where n semaphore operations are executed. The first algorithm determines in O(n) time whether a race condition exists between any two given operations. The second algorithm runs in O(np log n) time and outputs a compact representation from which one can determine in O(1) time whether a race condition exists between any two given operations. The second algorithm is near-optimal in that the running time is only O(log n) times the time required simply to write down the output.
... A data flow analysis framework similar to theirs could be used to compute the ordering. This is essentially the same as the common ancestor algorithm developed by Emrath et al. (18, 19), except that this last algorithm was designed for memory trace analysis. The partial order computed by this method may not be totally accurate. ...
Article
Full-text available
In this paper, we present a constant propagation algorithm for explicitly parallel programs, which we call the Concurrent Sparse Conditional Constant propagation algorithm. This algorithm is an extension of the Sparse Conditional Constant propagation algorithm. Without considering the interaction between threads, classical optimizations lead to an incorrect program transformation for parallel programs. To make analyzing parallel programs possible, a new intermediate representation is needed. We introduce the Concurrent Static Single Assignment (CSSA) form to represent explicitly parallel programs with interleaving semantics and synchronization. The only parallel construct considered in this paper is cobegin/coend. A new confluence function, the π-assignment, which summarizes the information of interleaving statements between threads, is introduced. The Concurrent Control Flow Graph, which contains information about conflicting statements, control flow, and synchronization, is used as an underlying representation for the CSSA form.
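The confluence idea can be shown on the standard constant-propagation lattice: a shared variable stays constant at a π-style merge only if every interleaving supplies the same constant. The sketch below illustrates that lattice meet under invented names; it is not the paper's CSSA algorithm.

```python
# An illustrative take on the confluence idea behind π-assignments: the
# value of a shared variable at a merge of interleaved threads is constant
# only if every interleaving supplies the same constant. BOT < const < TOP
# is the usual constant-propagation lattice; names here are invented.

TOP, BOT = 'not-a-constant', 'unreached'

def meet(a, b):
    if a == BOT:
        return b
    if b == BOT:
        return a
    return a if a == b else TOP

def pi_assignment(values):
    """Merge the values a shared variable may hold after any interleaving."""
    out = BOT
    for v in values:
        out = meet(out, v)
    return out

print(pi_assignment([3, 3]))    # 3: both threads assign 3, still constant
print(pi_assignment([3, 4]))    # 'not-a-constant': interleaving-dependent
```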
... Most race detectors are dynamic tools, which detect potential races by executing the program on a given input. Some dynamic race detectors perform a post-mortem analysis based on program execution traces [7,11,13,15,16]. On-the-fly race detectors, like the one implemented in this thesis, detect races during execution of the program. ...
Article
This thesis describes the implementation of a provably good data-race detector, called the Nondeterminator-3, which runs efficiently in parallel. A data race occurs in a multithreaded program when two logically parallel threads access the same location while holding no common locks and at least one of the accesses is a write. The Nondeterminator-3 checks for data races in programs coded in Cilk [3, 10], a shared-memory multithreaded programming language. A key capability of data-race detectors is in determining the series-parallel (SP) relationship between two threads. The Nondeterminator-3 is based on a provably good parallel SP-maintenance algorithm known as SP-hybrid [2]. For a program with n threads, T1 work, and critical-path length T∞, the SP-hybrid algorithm runs in O((T1/P + P·T∞) lg n) expected time when executed on P processors. A data-race detector must also maintain an access history, which consists of, for each shared memory location, a representative subset of memory accesses to that location. The Nondeterminator-3 uses an extension of the ALL-SETS [4] access-history algorithm used by its serially running predecessor, the Nondeterminator-2. First, the ALL-SETS algorithm was extended to correctly support the inlet feature of Cilk.
... Dynamic race detectors execute the program given a particular input. Some dynamic race detectors perform a post-mortem analysis based on program-execution logs [18, 28, 35, 45-48], analyzing a log of program-execution events after the program has finished running. On-the-fly race detectors, like the one given in this thesis, report races during the execution of the program. ...
Article
A multithreaded parallel program that is intended to be deterministic may exhibit nondeterminism due to bugs called determinacy races. A key capability of race detectors is to determine whether one thread executes logically in parallel with another thread or whether the threads must operate in series. This thesis presents two algorithms, one serial and one parallel, to maintain the series-parallel (SP) relationships "on the fly" for fork-join multithreaded programs. For a fork-join program with T1 work and a critical-path length of T∞, the serial SP-Maintenance algorithm runs in O(T1) time. The parallel algorithm executes in the nearly optimal O(T1/P + P·T∞) time, when run on P processors and using an efficient scheduler. These SP-maintenance algorithms can be incorporated into race detectors to get a provably good race detector that runs in parallel. This thesis describes an efficient parallel race detector I call Nondeterminator-3. For a fork-join program with T1 work, critical-path length T∞, and v shared memory locations, the Nondeterminator-3 runs in O(T1/P + P·T∞ lg P + min[(T1 lg P)/P, v·T∞ lg P]) expected time, when run on P processors and using an efficient scheduler. Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 93-98).
... Checking every possible control-flow of an arbitrary program is typically intractable, however, so most race detectors are dynamic tools in which potential races are detected at runtime by executing the program on a given input. Some dynamic race detectors perform a post-mortem analysis based on program execution traces [8,12,16,19], while others perform an on-the-fly analysis during program execution. On-the-fly debuggers directly instrument memory accesses via the compiler [6,7,9,10,15,22], by binary rewriting [25], or by augmenting the machine's cache coherence protocol [17,23]. ...
Article
If two parallel threads access the same location and at least one of them performs a write, a race exists. The detection of races---a major problem in parallel debugging---is complicated by the presence of atomic critical sections. In programs without critical sections, the existence of a race is usually a bug leading to nondeterministic behavior. In programs with critical sections, however, accesses in parallel critical sections are not considered bugs, as the programmer, in specifying the critical sections, presumably intends them to run in parallel. Thus, a race detector should find "data races"---races between accesses not contained in atomic critical sections. We present algorithms for detecting data races in programs written in the Cilk multithreaded language. These algorithms do not verify programs, but rather find data races in all schedulings of the computation generated when a program executes serially on a given input. We present two algorithms for programs in which atomici...
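The lock-discipline half of this check is easy to state: two logically parallel conflicting accesses constitute a data race only if their lock sets are disjoint. The sketch below shows just that predicate, with logical parallelism supplied by the caller; the paper's ALL-SETS algorithm additionally maintains the series-parallel relation efficiently, which this sketch does not attempt.

```python
# A hedged sketch of the lock-discipline side of data-race detection: two
# logically parallel conflicting accesses are a data race only if they hold
# no lock in common. Logical parallelism is taken as given here.

def is_data_race(acc1, acc2, logically_parallel):
    """Each access: (variable, is_write, frozenset_of_locks_held)."""
    v1, w1, l1 = acc1
    v2, w2, l2 = acc2
    return (logically_parallel
            and v1 == v2
            and (w1 or w2)
            and not (l1 & l2))       # disjoint lock sets -> unprotected

a = ('x', True, frozenset({'L'}))
b = ('x', True, frozenset({'L'}))
c = ('x', True, frozenset())
print(is_data_race(a, b, True))      # False: both hold lock L
print(is_data_race(a, c, True))      # True: no common lock
```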
... Our algorithms can be used to exactly detect race conditions in executions of such programs. Past work has shown that exactly detecting races in programs that use multiple semaphores is NP-complete [9], and has developed exact algorithms for other cases where the problem is efficiently solvable (programs that use types of synchronization weaker than semaphores) [6,8], and heuristics for the multiple semaphore case [4,7]. The complexity for the case of a single semaphore has been an open question. ...
Conference Paper
We address a problem arising in debugging parallel programs, detecting race conditions in programs using a single semaphore for synchronization. It is NP-complete to detect races in programs that use many semaphores. For the case of a single semaphore, we give an algorithm that takes O(n^1.5 p) time, where p is the number of processors and n is the total number of semaphore operations executed. Our algorithm constructs a representation from which one can determine in constant time whether a race exists between two given events.
... Past work has shown that exactly detecting races in programs that use multiple semaphores is NPcomplete [10], and has developed exact algorithms for other cases where the problem is efficiently solvable (programs that use types of synchronization weaker than semaphores) [6,9], and heuristics for the multiple semaphore case [4,7]. The complexity for the case of constant number of semaphores has been an open question. ...
Article
We address a problem arising in debugging parallel programs, detecting race conditions in programs using semaphores for synchronization. It is NP-complete to detect race conditions in programs that use many semaphores [10]. We show in this paper that it remains NP-complete even if the programs are allowed to use only two semaphores. For the case of a single semaphore, Lu et al. [8] give the previously only-known polynomial-time algorithm, which runs in time O(n^1.5 p), where p is the number of processors and n is the total number of semaphore operations executed. Their algorithm, however, detects only a special class of race conditions. In this paper we cope with the general race-condition detection problem and give an O(np log n)-time algorithm. The output of our algorithm is a compact representation from which one can determine in constant time whether a race condition exists between two given operations.
... Techniques for race detection in the context of debugging programs have either used dynamic information from a program's execution trace or static information from an analysis of the program text [17]. A few techniques have used dynamic information as well as static information [6]. However, the static information supplements the dynamic information by ruling out races in certain parts of the program, thereby precluding the need to trace those parts. ...
Article
Shared memory in a parallel computer provides programmers with the valuable abstraction of a shared address space---through which any part of a computation can access any datum. Although uniform access simplifies programming, it also hides communication, which can lead to inefficient programs. The check-in, check-out (CICO) performance model for cache-coherent, shared-memory parallel computers helps a programmer identify the communication underlying memory references and account for its cost. CICO consists of annotations that a programmer can use to elucidate communication and a model that attributes costs to these annotations. The annotations can also serve as directives to a memory system to improve program performance. Inserting CICO annotations requires reasoning about the dynamic cache behavior of a program, which is not always easy. This paper describes Cachier, a tool that automatically inserts CICO annotations into shared-memory programs. A novel feature of this tool is its us...
Article
Recent work has proposed a memory property for parallel programs, called disentanglement, and showed that it is pervasive in a variety of programs, written in different languages, ranging from C/C++ to Parallel ML, and showed that it can be exploited to improve the performance of parallel functional programs. All existing work on disentanglement, however, considers the "fork/join" model for parallelism and does not apply to "futures", the more powerful approach to parallelism. This is not surprising: fork/join parallel programs exhibit a reasonably strict dependency structure (e.g., series-parallel DAGs), which disentanglement exploits. In contrast, with futures, parallel computations become first-class values of the language, and thus can be created, and passed between functions calls or stored in memory, just like other ordinary values, resulting in complex dependency structures, especially in the presence of mutable state. For example, parallel programs with futures can have deadlocks, which is impossible with fork-join parallelism. In this paper, we are interested in the theoretical question of whether disentanglement may be extended beyond fork/join parallelism, and specifically to futures. We consider a functional language with futures, Input/Output (I/O), and mutable state (references) and show that a broad range of programs written in this language are disentangled. We start by formalizing disentanglement for futures and proving that purely functional programs written in this language are disentangled. We then generalize this result in three directions. First, we consider state (effects) and prove that stateful programs are disentangled if they are race free. Second, we show that race freedom is sufficient but not a necessary condition and non-deterministic programs, e.g. those that use atomic read-modify-write operations and some non-deterministic combinators, may also be disentangled. Third, we prove that disentangled task-parallel programs written with futures are free of deadlocks, which arise due to interactions between state and the rich dependencies that can be expressed with futures. Taken together, these results show that disentanglement generalizes to parallel programs with futures and, thus, the benefits of disentanglement may go well beyond fork-join parallelism.
Article
Because of its many desirable properties, such as its ability to control effects and thus potentially disastrous race conditions, functional programming offers a viable approach to programming modern multicore computers. Over the past decade several parallel functional languages, typically based on dialects of ML and Haskell, have been developed. These languages, however, have traditionally underperformed procedural languages (such as C and Java). The primary reason for this is their hunger for memory, which only grows with parallelism, causing traditional memory management techniques to buckle under increased demand for memory. Recent work opened a new angle of attack on this problem by identifying a memory property of determinacy-race-free parallel programs, called disentanglement, which limits the knowledge of concurrent computations about each other's memory allocations. The work has showed some promise in delivering good time scalability. In this paper, we present provably space-efficient automatic memory management techniques for determinacy-race-free functional parallel programs, allowing both pure and imperative programs where memory may be destructively updated. We prove that for a program with sequential live memory of R*, any P-processor garbage-collected parallel run requires at most O(R* · P) memory. We also prove a work bound of O(W + R*·P) for P-processor executions, accounting also for the cost of garbage collection. To achieve these results, we integrate thread scheduling with memory management. The idea is to coordinate memory allocation and garbage collection with thread scheduling decisions so that each processor can allocate memory without synchronization and independently collect a portion of memory by consulting a collection policy, which we formulate. The collection policy is fully distributed and does not require communicating with other processors. We show that the approach is practical by implementing it as an extension to the MPL compiler for Parallel ML. Our experimental results confirm our theoretical bounds and show that the techniques perform and scale well.
Conference Paper
In this paper we address the problem of locating race conditions among synchronization primitives in execution traces of hybrid parallel programs. In hybrid parallel programs, collective and point-to-point synchronization cannot be analyzed separately. We introduce a model for synchronization primitives and formally define synchronization races with respect to the model. Based on these concepts we present an algorithm which accurately detects synchronization races and yields a task graph of the execution trace. The task graph represents the guaranteed ordering of events across thread and process boundaries. It is an essential core element for the further analysis (e.g., data race detection) of a program. Depending on the synchronization model, task graph construction can be an NP-hard problem. Our model allows us to construct an algorithm with sub-quadratic time complexity, so programs adhering to the principles of our model can be provably checked for race conditions. Therefore we argue that our model should be used as a foundation for the design and implementation of synchronization functions.
Conference Paper
The availability of multicore processors across a wide range of computing platforms has created a strong demand for software frameworks that can harness these resources. This paper overviews the Cilk++ programming environment, which incorporates a compiler, a runtime system, and a race-detection tool. The Cilk++ runtime system guarantees to load-balance computations effectively. To cope with legacy codes containing global variables, Cilk++ provides a "hyperobject" library which allows races on nonlocal variables to be mitigated without lock contention or substantial code restructuring.
Article
A parallel multithreaded program that is ostensibly deterministic may nevertheless behave nondeterministically due to bugs in the code. These bugs are called determinacy races, and they result when one thread updates a location in shared memory while another thread is concurrently accessing the location. We have implemented a provably efficient determinacy-race detector for Cilk, an algorithmic multithreaded programming language. If a Cilk program is run on a given input data set, our debugging tool, which we call the "Nondeterminator," either determines at least one location in the program that is subject to a determinacy race, or else it certifies that the program is race free when run on the data set. The core of the Nondeterminator is an asymptotically efficient serial algorithm (inspired by Tarjan's nearly linear-time least-common-ancestors algorithm) for detecting determinacy races in series-parallel directed acyclic graphs. For a Cilk program that runs in T time on one processor and uses v shared-memory locations, the Nondeterminator runs in O(T α(v,v)) time, where α is Tarjan's functional inverse of Ackermann's function, a very slowly growing function which, for all practical purposes, is bounded above by 4. The Nondeterminator uses at most a constant factor more space than does the original program. On a variety of Cilk program benchmarks, the Nondeterminator exhibits a slowdown of less than 12 compared with the serial execution time of the original optimized code, which we contend is an acceptable slowdown for debugging purposes.
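The structural fact these detectors exploit can be stated compactly: in a series-parallel parse tree of the computation, two threads are logically parallel exactly when their least common ancestor is a parallel node. The naive LCA walk below illustrates that fact under invented names; the Nondeterminator maintains the same relation far more efficiently with its Tarjan-inspired algorithm.

```python
# Illustration of the key fact: in a series-parallel parse tree, two
# threads are logically parallel iff their least common ancestor is a
# P (parallel) node. A naive LCA walk, for clarity only.

class Node:
    def __init__(self, kind, parent=None):
        self.kind = kind              # 'S', 'P', or 'leaf'
        self.parent = parent

def depth(n):
    d = 0
    while n.parent:
        n = n.parent
        d += 1
    return d

def lca(a, b):
    da, db = depth(a), depth(b)
    while da > db:
        a = a.parent; da -= 1
    while db > da:
        b = b.parent; db -= 1
    while a is not b:
        a, b = a.parent, b.parent
    return a

def logically_parallel(a, b):
    return lca(a, b).kind == 'P'

root = Node('S')                      # models  e1 ; (e2 || e3)
e1 = Node('leaf', root)
p = Node('P', root)
e2, e3 = Node('leaf', p), Node('leaf', p)
print(logically_parallel(e2, e3))     # True: may race if they conflict
print(logically_parallel(e1, e2))     # False: series-ordered
```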
Chapter
Full-text available
Static Single Assignment (SSA) form has shown its usefulness as a program representation for code optimization techniques in sequential programs. We introduce the Concurrent Static Single Assignment (CSSA) form to represent explicitly parallel programs with interleaving semantics and post-wait synchronization. The parallel construct considered in this paper is cobegin/coend. A new confluence function, the π-assignment, which summarizes the information of interleaving statements between threads, is introduced. The Concurrent Control Flow Graph, which contains information about conflicting statements, control flow, and synchronization, is used as an underlying representation for the CSSA form. An extension of the Sparse Conditional Constant propagation algorithm based on the CSSA form makes it possible to apply the constant propagation optimization to explicitly parallel programs.
Conference Paper
Full-text available
Traditional compiler techniques developed for sequential programs do not guarantee the correctness (sequential consistency) of compiler transformations when applied to parallel programs. This is because traditional compilers for sequential programs do not account for the updates to a shared variable by different threads. We present a concurrent static single assignment (CSSA) form for parallel programs containing cobegin/coend and parallel do constructs and post/wait synchronization primitives. Based on the CSSA form, we present copy propagation and dead code elimination techniques. Also, a global value numbering technique that detects equivalent variables in parallel programs is presented. By using global value numbering and the CSSA form, we extend classical common subexpression elimination, redundant load/store elimination, and loop invariant detection to parallel programs without violating sequential consistency. These optimization techniques are the most commonly used techniques for sequential programs. By extending these techniques to parallel programs, we can guarantee the correctness of the optimized program and maintain single processor performance in a multiprocessor environment.
Conference Paper
The availability of multicore processors across a wide range of computing platforms has created a strong demand for software frameworks that can harness these resources. This paper overviews the Cilk++ programming environment, which incorporates a compiler, a runtime system, and a race-detection tool. The Cilk++ runtime system guarantees to load-balance computations effectively. To cope with legacy codes containing global variables, Cilk++ provides a "hyperobject" library which allows races on nonlocal variables to be mitigated without lock contention or substantial code restructuring.
Conference Paper
We address a problem arising in debugging parallel programs, detecting race conditions in programs using semaphores for synchronization. It is NP-complete to detect race conditions in programs that use a polynomial number of semaphores [10]. We show in this paper that it remains NP-complete even if the programs are allowed to use only two semaphores, which settles the open question raised in [10]. The proof uses a technique that simulates a graph with any number of semaphores by a graph with only two semaphores. For the case of a single semaphore, Lu et al. [8] give the previously only-known polynomial-time algorithm, which runs in time O(n^1.5 p), where p is the number of processors and n is the total number of semaphore operations executed. Their algorithm, however, detects only a special class of race conditions. In this paper we cope with the general race-condition detection problem and give an O(np log n)-time algorithm. The output of our algorithm is a compact representation of size Θ(np), from which one can determine in constant time whether a race condition exists between two given operations. Our algorithm is near-optimal in that the time it takes is only O(log n) times the time required simply to write down the output.
Article
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (p. 69-72). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. by Kai Huang. M.Eng.
Article
Thesis (Ph. D.)--University of Delaware, 1998. Principal faculty adviser: Lori L. Pollock, Dept. of Computer and Information Sciences. Includes bibliographical references (leaves 250-260). Microfilm.
Article
Languages allowing explicitly parallel, multithreaded programming (e.g. Java and C#) need to specify a memory consistency model to define program behavior. The memory consistency model defines constraints on the order of memory accesses in systems with shared memory. The design of a memory consistency model affects ease of parallel programming as well as system performance. Compiler analysis can be used to mitigate the performance impact of a memory consistency model that imposes strong constraints on shared memory access orders. In this work, we explore the capability of a compiler to analyze what restrictions are imposed by a memory consistency model for the program being compiled. Our compiler analysis targets Java bytecodes. It focuses on two components: delay set analysis and synchronization analysis. Delay set analysis determines the order of shared memory accesses that must be respected within each individual thread of execution in the source program. We describe a simplified analysis algorithm that is applicable to programs with general thread structure (MIMD programs), and has polynomial time worst-case complexity. This algorithm uses synchronization analysis to improve the accuracy of the results. Synchronization analysis determines the order of shared memory accesses already enforced by synchronization in the source program. We describe a dataflow analysis algorithm for synchronization analysis that is efficient to compute, and that improves precision over previously known methods. The analysis techniques described are used in the implementation of a virtual machine that guarantees sequentially consistent execution of Java bytecodes. This implementation is used to show the effectiveness of our analysis algorithms. On many benchmark programs, the performance of programs on our system is close to 100% of the performance of the same programs executing under a relaxed memory model. Specifically, we observe an average slowdown of 10% on an Intel Xeon platform, with slowdowns of 7% or less for 7 out of 10 benchmarks. On an IBM Power3 platform, we observe an average slowdown of 26%, with slowdowns of 7% or less for 8 out of 10 benchmarks.
Conference Paper
Because of the nondeterministic timing-ordering behavior of parallel primitives, testing parallel programs is more difficult than testing serial programs. In this paper, we present a new testing strategy for parallel programs: static testing of parallel programs with their primitive dependence graphs. We have developed methods to analyze the timing-ordering dependences between primitives and to construct the primitive dependence graph. Approaches for statically detecting errors with the primitive dependence graph are also discussed.
Article
A statement is considered to be monotonic with respect to a loop if its execution, during the successive iterations of a given execution of the loop, assigns a monotonically increasing or decreasing sequence of values to a variable. We present static analysis techniques to identify loop monotonic statements. The knowledge of loop monotonicity characteristics of statements which compute array subscript expressions is of significant value in a number of applications. We illustrate the use of this information in improving the efficiency of run-time array bound checking, run-time dependence testing, and on-the-fly detection of access anomalies. Given that a significant percentage of subscript expressions are monotonic, substantial savings can be expected by using these techniques
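The payoff for bound checking is easy to see: if a subscript expression is known to be monotonic across a loop's iterations, checking its first and last values subsumes the per-iteration checks. The sketch below assumes the monotonicity fact is supplied by the caller (the paper derives it statically); all names are invented for the example.

```python
# A sketch of why monotonicity pays off for run-time bound checking: a
# monotonically increasing subscript needs only its first and last values
# checked. The monotonicity flag stands in for the paper's static analysis.

def checked_sum(a, subscript, n, monotonic=False):
    if monotonic:
        # two checks total instead of one per iteration
        for i in (0, n - 1):
            if not 0 <= subscript(i) < len(a):
                raise IndexError(f"iteration {i} out of bounds")
        return sum(a[subscript(i)] for i in range(n))
    total = 0
    for i in range(n):               # general case: check every iteration
        j = subscript(i)
        if not 0 <= j < len(a):
            raise IndexError(f"iteration {i} out of bounds")
        total += a[j]
    return total

a = list(range(100))
print(checked_sum(a, lambda i: 2 * i + 1, 10, monotonic=True))   # 100
```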
Article
The field of program analysis has focused primarily on sequential programming languages. But multithreading is becoming increasingly important, both as a program structuring mechanism and to support efficient parallel computations. This paper surveys research in analysis for multithreaded programs, focusing on ways to improve the efficiency of analyzing interactions between threads, to detect data races, and to ameliorate the impact of weak memory consistency models. We identify two distinct classes of multithreaded programs, activity management programs and parallel computing programs, and discuss how the structure of these kinds of programs leads to different solutions to these problems. Specifically, we conclude that augmented type systems are the most promising approach for activity management programs, while targeted program analyses are the most promising approach for parallel computing programs.
Article
The design of consistency models for both hardware and software is a difficult task. For programming languages it is particularly difficult because the target audience for a programming language is much wider than the audience for a hardware consistency model, making usability a more important criterion. Exacerbating this problem is the reality that the programming languages community has little experience designing programming language consistency models, and therefore each new attempt is very much a voyage into uncharted territory. A concrete example of the difficulty of the task is the current Java Memory Model: although designed to be easy for Java programmers to use, it is poorly understood, and at least one common idiom intended to exploit the model (the "double check idiom") is unsafe. In this paper we describe the design of an optimizing Java compiler that will allow a consistency model for the code to be compiled to be specified as an input. The compiler will use Shasha and Snir's delay set analysis, and our CSSA program representation, to normalize the effects of different consistency models on optimizations and analysis. When completed, the compiler will serve as a testbed to prototype new memory models, and to measure the effects of different memory models on program performance.
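For reference, the "double check idiom" mentioned above is commonly written as follows. Under the original Java Memory Model the unsynchronized read may observe a non-null reference to a partially constructed object; declaring the field volatile, as the later JSR-133 revision specifies, is the standard repair. Field and class names here are illustrative.

    // The double-check idiom; `instance` must be volatile for the idiom
    // to be safe under the revised (JSR-133) Java Memory Model. Without
    // it, a thread may see a non-null, partially constructed object.
    public class Singleton {
        private static volatile Singleton instance;
        private final int payload;

        private Singleton() { payload = 42; }  // illustrative field

        public static Singleton getInstance() {
            if (instance == null) {                   // first, unsynchronized check
                synchronized (Singleton.class) {
                    if (instance == null)             // second, synchronized check
                        instance = new Singleton();
                }
            }
            return instance;
        }
    }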
Article
Full-text available
Static Single Assignment (SSA) form has proven useful for powerful code optimizations of sequential programs, such as constant propagation. We introduce a new Parallel Static Single Assignment (PSSA) form and a transformation algorithm for explicitly parallel programs with interleaving semantics and post/wait synchronization. The parallel construct considered in this paper is cobegin/coend. A new concept, the π-assignment, which summarizes the information of interleaving statements among threads, is introduced. The Parallel Control Flow Graph, which contains information about conflicting statements in addition to control flow and synchronization information, is used as an intermediate representation for the PSSA transformation. A parallel extension of the Sparse Conditional Constant Propagation algorithm based on the PSSA form makes it possible to apply constant propagation to explicitly parallel programs.
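A hedged illustration of the intuition behind the π-assignment (the example is ours, and the cobegin/coend construct of the paper is rendered with Java threads): when every interleaved definition of a shared variable that can reach a use assigns the same constant, the merged value is itself constant and can be propagated.

    // Both parallel branches assign the constant 1 to x, so the merge
    // point's conceptual pi-assignment x3 = pi(x1, x2) is the constant 1
    // under every interleaving, and the print below is provably "1".
    public class PssaConstProp {
        static int x = 0;  // shared; join() makes the writes visible

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> { x = 1; });  // definition x1 = 1
            Thread t2 = new Thread(() -> { x = 1; });  // definition x2 = 1
            t1.start(); t2.start();                    // "cobegin"
            t1.join();  t2.join();                     // "coend"
            System.out.println(x);                     // always prints 1
        }
    }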
Article
A multithreaded program with a bug may behave nondeterministically, and this nondeterminism typically makes the bug hard to localize. This thesis presents a debugging tool, the Nondeterminator-2, which automatically finds certain nondeterminacy bugs in programs coded in the Cilk multithreaded language. Specifically, the Nondeterminator-2 finds "dag races," which occur when two logically parallel threads access the same memory location while holding no locks in common, and at least one of the accesses writes the location. The Nondeterminator-2 contains two...
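A minimal example in the spirit of the dag-race definition above, rendered with Java threads rather than Cilk (an analogy, not the Nondeterminator-2's input language): two logically parallel computations write the same location while holding no locks in common, so the final value depends on the interleaving.

    // Two logically parallel threads perform an unprotected
    // read-modify-write on the same location: a race in the sense above.
    public class DagRace {
        static int counter = 0;  // shared, no lock guards it

        public static void main(String[] args) throws InterruptedException {
            Runnable inc = () -> {
                for (int i = 0; i < 100_000; i++)
                    counter++;   // load, add, store: increments can be lost
            };
            Thread t1 = new Thread(inc), t2 = new Thread(inc);
            t1.start(); t2.start();
            t1.join();  t2.join();
            // Nondeterministically prints anything from 100000 to 200000.
            System.out.println(counter);
        }
    }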
Conference Paper
Full-text available
This paper addresses the design and implementation of an integrated debugging system for parallel programs running on shared memory multi-processors (SMMP). We describe the use of flowback analysis to provide information on causal relationships between events in a program's execution without re-executing the program for debugging. We introduce a mechanism called incremental tracing that, by using semantic analyses of the debugged program, makes the flowback analysis practical with only a small amount of trace generated during execution. We extend flowback analysis to apply to parallel programs and describe a method to detect race conditions in the interactions of the co-operating processes.
Conference Paper
Full-text available
We describe a debugger that is being developed for distributed programs in Amoeba. A major goal in our work is to make the debugger independent of the Amoeba kernel. Our design integrates many facilities found in other debuggers, such as execution replay, ...
Article
Debugging on a parallel processor is more difficult than debugging on a serial machine because errors in a parallel program may introduce nondeterminism. The approach to parallel debugging presented here attempts to reduce the problem of debugging on a parallel machine to that of debugging on a serial machine by automatically detecting nondeterminism.
Article
More and more scientists and engineers are becoming interested in using supercomputers. Earlier barriers to using these machines are disappearing as software for their use improves. Meanwhile, new parallel supercomputer architectures are emerging that may provide rapid growth in performance. These systems may use a large number of processors with an intricate memory system that is both parallel and hierarchical; they will require even more advanced software. Compilers that restructure user programs to exploit the machine organization seem to be essential. A wide range of algorithms and applications is being developed in an effort to provide high parallel processing performance in many fields. The Cedar supercomputer, presently operating with eight processors in parallel, uses advanced system and applications software developed at the University of Illinois during the past 12 years. This software should allow the number of processors in Cedar to be doubled annually, providing rapid performance advances in the next decade.
Book
1. Introduction
2. Basic Concepts
   2.1. Relations and Graphs
   2.2. Orders on Vectors
   2.3. Program Model
3. Dependence
   3.1. Dependence Concepts
   3.2. The Dependence Problem
4. Bounds of Linear Functions
   4.1. Introduction
   4.2. Bounds in Rectangles
   4.3. Bounds in Trapezoids
5. Linear Diophantine Equations
   5.1. Introduction
   5.2. Greatest Common Divisors
   5.3. Single Equation in Two Variables
   5.4. Single Equation in Many Variables
   5.5. Systems of Equations
   Appendix to Chapter 5
6. Dependence Tests
   6.1. Introduction
   6.2. One-Dimensional Arrays, Single Loops
   6.3. One-Dimensional Arrays
   6.4. General Case
   6.5. Miscellaneous Comments
References
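As a concrete instance of the material in Chapters 5 and 6, the sketch below (our own, with assumed access shapes) implements the classic GCD dependence test for one-dimensional arrays in a single loop: accesses a[c1*i + k1] and a[c2*j + k2] can touch the same element only if gcd(c1, c2) divides k2 - k1, so the test conservatively reports a possible dependence exactly when that divisibility holds.

    // GCD dependence test for a[c1*i + k1] (one access) versus
    // a[c2*j + k2] (another access) in a single loop: integer solutions
    // of c1*i - c2*j = k2 - k1 exist only if gcd(c1, c2) divides k2 - k1.
    // The test is conservative: "true" means a dependence is possible.
    public class GcdTest {
        static int gcd(int a, int b) {
            a = Math.abs(a); b = Math.abs(b);
            while (b != 0) { int t = a % b; a = b; b = t; }
            return a;
        }

        static boolean mayDepend(int c1, int k1, int c2, int k2) {
            int g = gcd(c1, c2);
            return g == 0 ? k1 == k2 : (k2 - k1) % g == 0;
        }

        public static void main(String[] args) {
            // a[2*i] vs a[2*j + 1]: gcd(2, 2) = 2 does not divide 1 -> independent
            System.out.println(mayDepend(2, 0, 2, 1));  // false
            // a[4*i] vs a[2*j + 2]: gcd(4, 2) = 2 divides 2 -> possible dependence
            System.out.println(mayDepend(4, 0, 2, 2));  // true
        }
    }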
Conference Paper
The development of vector and multiprocessor language constructs in Fortran is outlined. The significant architectures, their languages, and their optimizers are described. A description is given of Cedar Fortran, the language for the Cedar multiprocessor, a hierarchical, shared-memory vector multiprocessor currently under development.
Manual
Cray X-MP Multitasking Programmer's Reference Manual, Cray Research, Inc., 1987.