Article

Event Synchronization Analysis for Debugging Parallel Programs


Abstract

One of the major difficulties of explicit parallel programming for a shared memory machine model is detecting the potential for nondeterminacy and identifying its causes. There will often be shared variables in a parallel program, and the tasks comprising the program may need to be synchronized when accessing these variables. This paper discusses this problem and presents a method for automatically detecting nondeterminacy in parallel programs that use event-style synchronization instructions, namely the Post, Wait, and Clear primitives. With event-style synchronization, especially when there are many references to the same event, the difficulty lies in computing the execution order that is guaranteed by the synchronization instructions and the sequential components of the program. The main result in this paper is an algorithm that computes such an execution order and yields a Task Graph upon which a nondeterminacy detection algorithm can be applied. We have focused on events ...
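As a concrete illustration of the guaranteed-ordering idea, the sketch below builds ordering edges from a toy trace of Post/Wait operations. It is a minimal sketch, not the paper's algorithm: it assumes no Clear operations and conservatively orders each Wait after every Post on the same event, and the trace format and function names are invented for the example.

```python
# A minimal sketch of deriving guaranteed-ordering edges from Post/Wait
# event synchronization. Assumes each task is a list of operations and
# ignores Clear, i.e., the simple case the full algorithm generalizes.

from collections import defaultdict

def build_task_graph(tasks):
    """tasks: dict task_id -> list of ('post'|'wait'|'work', event_or_label).
    Returns edges (task, index) -> (task, index) guaranteed to be ordered."""
    edges = []
    posts = defaultdict(list)            # event -> [(task, index)]
    for tid, ops in tasks.items():
        for i in range(len(ops) - 1):    # sequential order within a task
            edges.append(((tid, i), (tid, i + 1)))
        for i, (kind, ev) in enumerate(ops):
            if kind == 'post':
                posts[ev].append((tid, i))
    for tid, ops in tasks.items():
        for i, (kind, ev) in enumerate(ops):
            if kind == 'wait':           # conservatively order a Wait after
                for src in posts[ev]:    # every Post on the same event
                    if src[0] != tid:
                        edges.append((src, (tid, i)))
    return edges

tasks = {
    'T1': [('work', 'a=1'), ('post', 'e')],
    'T2': [('wait', 'e'), ('work', 'print(a)')],
}
print(build_task_graph(tasks))
```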

... Incompatibility of the facilities for different parts of concurrent debugging forces the programmer to either use different facilities for different parts of the cycle or debug without them. Use of multiple representations for P, M and E typically compels the debuggers to either constrain the range of behaviors that can be checked [ReSc94]; or to tolerate the ambiguities in the observed behavior [EGP89], [HMW90], [NM91a]; or to demand extra programming effort [SBN89], [LMF90], [Bat89]. ...
... Race detection facilities detect simultaneous accesses to shared data (R-W, W-W) [NM90a], [EGP89], [HMW90], [Sch89]. Their focus is race behavior in the P → E part of the debugging cycle. ...
... Shared memory debuggers typically do not record the inter-process orderings. They only approximate the orderings from the sequential traces obtained for each process [HMW90], [NM91a], [Sch89], [EGP89]. This restricts their model checking ability to only one type of expected behavior; the race behavior resulting from the non-deterministic access of shared data. ...
Thesis
Full-text available
Debugging is a process that involves establishing relationships between several entities: The behavior specified in the program, P, the model/predicate of the expected behavior, M, and the observed execution behavior, E. The thesis of the unified approach is that a consistent representation for P, M and E greatly simplifies the problem of concurrent debugging, both from the viewpoint of the programmer attempting to debug a program and from the viewpoint of the implementor of debugging facilities. Provision of such a consistent representation becomes possible when sequential behavior is separated from concurrent or parallel structuring. Given this separation, the program becomes a set of sequential actions and relationships among these actions. The debugging process, then, becomes a matter of specifying and determining relations on the set of program actions. The relations are specified in P, modeled in M and observed in E. This simplifies debugging because it allows the programmer to think in terms of the program which he understands. It also simplifies the development of a unified debugging system because all of the different approaches to concurrent debugging become instances of the establishment of relationships between the actions. The unified approach defines a formal model for concurrent debugging in which the entire debugging process is specified in terms of program actions. The unified model places all of the approaches to debugging of parallel programs such as execution replay, race detection, model/predicate checking, execution history displays and animation, which are commonly formulated as disjoint facilities, in a single, uniform framework. We have also developed a feasibility demonstration prototype implementation of this unified model of concurrent debugging in the context of the CODE 2.0 parallel programming system. This implementation demonstrates and validates the claims of integration of debugging facilities in a single framework. It is further the case that the unified model of debugging greatly simplifies the construction of a concurrent debugger. All of the capabilities previously regarded as separate for debugging of parallel programs, both in shared memory models of execution and distributed memory models of execution, are supported by this prototype.
... Static analysis is used to augment dynamic analysis [1,6,7,8,9,14,22]. Dynamic analysis detects races for a particular execution of the program; these methods generally differ in the way the information that is used in the analysis is collected and analyzed. ...
... Dynamic analysis detects races for a particular execution of the program; these methods generally differ in the way the information that is used in the analysis is collected and analyzed. The two approaches used are: post-mortem [1,6,7,8,9,12,13,18] and on-the-fly [14,18,22] analysis. Allen and Padua [1] describe a method for the post-mortem analysis of Fortran programs in a shared-memory parallel machine. ...
... Algorithms That Compute Guaranteed Ordering: Emrath, Ghosh and Padua [7,8,9] present methods for detecting general races in programs using event synchronization. They present two algorithms for post-mortem analysis, namely, the Exhaustive Pairing Algorithm (EPA) and the Common Ancestor Algorithm (CAA) to compute guaranteed orderings, which are described next. ...
Article
The increase in the number and complexity of parallel programs has led to a need for better approaches for synchronization error detection and debugging of parallel programs. This paper presents an efficient and precise algorithm for the detection of nondeterminacy (race conditions) in parallel programs. Nondeterminacy exists in a program when the program yields different outputs for different runs with the same input. We limit our attention to nondeterminacy due to errors in synchronization and to race conditions due to these unsynchronized accesses to shared variables. A directed acyclic graph called a task graph is used to represent the accesses to shared variables in a parallel program with edges representing guaranteed ordering. The algorithm proposed here constructs an augmented task graph, and then uses a modification of depth-first search to classify the edges in the augmented task graph. The edges are analyzed and the nodes that are guaranteed to execute before a...
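Once a task graph with guaranteed-ordering edges is available, the core check is whether two conflicting accesses are ordered. The sketch below uses plain DFS reachability rather than the paper's augmented-task-graph edge classification; the graph representation and names are invented for the example.

```python
# A hedged illustration of the underlying idea: two accesses to the same
# shared variable race unless one node reaches the other in the task graph.
# Plain DFS reachability, not the paper's edge-classification algorithm.

def reachable(graph, src, dst, seen=None):
    """graph: dict node -> list of successor nodes."""
    if seen is None:
        seen = set()
    if src == dst:
        return True
    seen.add(src)
    return any(reachable(graph, nxt, dst, seen)
               for nxt in graph.get(src, []) if nxt not in seen)

def find_races(graph, accesses):
    """accesses: list of (node, variable, is_write).
    Reports unordered conflicting pairs."""
    races = []
    for i, (n1, v1, w1) in enumerate(accesses):
        for n2, v2, w2 in accesses[i + 1:]:
            if v1 == v2 and (w1 or w2) and n1 != n2:
                if not reachable(graph, n1, n2) and not reachable(graph, n2, n1):
                    races.append((n1, n2, v1))
    return races

g = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D']}        # fork-join diamond
acc = [('B', 'x', True), ('C', 'x', True), ('D', 'x', False)]
print(find_races(g, acc))   # B and C both write x with no ordering: a race
```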
... Memory effects (mutation) are crucial for theoretical and practical efficiency of nested parallel programs, both in the underlying runtime system (e.g., to support communication for the purposes of scheduling), and at the application level (e.g., to implement collection data structures using mutable arrays). The challenge is that the same memory effects that improve efficiency can lead to race conditions, which typically harm correctness in complex and unpredictable ways [Adve 2010; Allen and Padua 1987; Bocchino et al. 2009, 2011; Boehm 2011; Emrath et al. 1991; Mellor-Crummey 1991; Netzer and Miller 1992; Steele Jr. 1990]. ...
... All of these systems support memory effects or destructive updates, which make it challenging to write correct parallel programs, because they can lead to determinacy or data races [Allen and Padua 1987; Emrath et al. 1991; Mellor-Crummey 1991; Netzer and Miller 1992; Steele Jr. 1990], which can be very difficult to avoid and usually lead to incorrect behavior [Adve 2010; Bocchino et al. 2009, 2011; Boehm 2011]. There has therefore been much work on ensuring race freedom by detecting or eliminating races via dynamic techniques (e.g., [Cheng et al. 1998; Feng and Leiserson 1999; Kuper and Newton 2013; Kuper et al. 2014b; Mellor-Crummey 1991; Raman et al. 2012; Steele Jr. 1990; Utterback et al. 2016]), as well as static techniques including type systems (e.g., [Bocchino et al. 2011; Flanagan and Freund 2009; Flanagan et al. 2008]). ...
Article
Full-text available
Nested parallelism has proved to be a popular approach for programming the rapidly expanding range of multicore computers. It allows programmers to express parallelism at a high level and relies on a run-time system and a scheduler to deliver efficiency and scalability. As a result, many programming languages and extensions that support nested parallelism have been developed, including in C/C++, Java, Haskell, and ML. Yet, writing efficient and scalable nested parallel programs remains challenging, primarily due to difficult concurrency bugs arising from destructive updates or effects. For decades, researchers have argued that functional programming can simplify writing parallel programs by allowing more control over effects but functional programs continue to underperform in comparison to parallel programs written in lower-level languages. The fundamental difficulty with functional languages is that they have high demand for memory, and this demand only grows with parallelism. In this paper, we identify a memory property, called disentanglement, of nested-parallel programs, and propose memory management techniques for improved efficiency and scalability. Disentanglement allows for (destructive) effects as long as concurrently executing threads do not gain knowledge of the memory objects allocated by each other. We formally define disentanglement by considering an ML-like higher-order language with mutable references and presenting a dynamic semantics for it that enables reasoning about computation graphs of nested parallel programs. Based on this graph semantics, we formalize a classic correctness property---determinacy race freedom---and prove that it implies disentanglement. This establishes that disentanglement applies to a relatively broad class of parallel programs. We then propose memory management techniques for nested-parallel programs that take advantage of disentanglement for improved efficiency and scalability. We show that these techniques are practical by extending the MLton compiler for Standard ML to support this form of nested parallelism. Our empirical evaluation shows that our techniques are efficient and scale well.
... There have been many studies on debugging data races. Some perform a post-mortem analysis based on program execution traces [8, 11, 13, 21, 22], while others perform on-the-fly analysis during program execution [2, 10, 20, 27]. Among modern shared-memory parallel programming models [9, 23, 24, 26], only Cilk++ [9] provides a data race detector called Cilkscreen [2, 9, 16]. ...
Conference Paper
Full-text available
This paper proposes a data race prevention scheme, which can prevent data races in the View-Oriented Parallel Programming (VOPP) model. VOPP is a novel shared-memory data-centric parallel programming model, which uses views to bundle mutual exclusion with data access. We have implemented the data race prevention scheme with a memory protection mechanism. Experimental results show that the extra overhead of memory protection is trivial in our applications. We also present a new VOPP implementation-Maotai 2.0, which has advanced features such as deadlock avoidance, producer/consumer view and system queues, in addition to the data race prevention scheme. The performance of Maotai 2.0 is evaluated and compared with modern programming models such as OpenMP and Cilk.
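The essence of the view approach is that mutual exclusion travels with the data rather than being left to the programmer's discipline. The sketch below is a minimal Python analogue under invented names (View, worker), not the Maotai API, and it uses a lock where VOPP uses a memory protection mechanism.

```python
# A minimal sketch of the view idea: data is only handed out between
# acquire and release, so mutual exclusion is bundled with data access.
# Names are illustrative; this is not the VOPP/Maotai implementation.

import threading

class View:
    def __init__(self, data):
        self._data = data
        self._lock = threading.Lock()
    def __enter__(self):
        self._lock.acquire()
        return self._data            # data handed out only under the lock
    def __exit__(self, *exc):
        self._lock.release()

counter = View({'n': 0})

def worker():
    for _ in range(1000):
        with counter as d:           # every access goes through the view,
            d['n'] += 1              # so the increment is race-free

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
with counter as d:
    print(d['n'])                    # always 4000
```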
... Some analyses allow the programmer to declare an association between data and locks, then check that the program holds the lock whenever it accesses the corresponding data [94, 28]. Other analyses trace the control transfers associated with the use of synchronization constructs such as the post and wait constructs used in parallel dialects of Fortran [71, 18, 36, 17], the Ada rendezvous constructs [95, 99, 33, 70, 35], or the Java wait and notify constructs [73, 74]. The goal is to determine that the synchronization actions temporally separate conflicting accesses to shared data. ...
... The characteristics of the analysis depend on the specific synchronization constructs. Researchers have developed algorithms for programs that use the post and wait constructs used in parallel dialects of Fortran [18, 36, 17], for the Ada rendezvous constructs [95, 33, 70, 35], and for the Java wait and notify constructs [73, 74]. The basic idea behind these algorithms is to match each blocking action (such as a wait or accept) with its potential corresponding trigger actions (such as post or notify) from other threads. ...
Conference Paper
The field of program analysis has focused primarily on sequential programming languages. But multithreading is becoming increasingly important, both as a program structuring mechanism and to support efficient parallel computations. This paper surveys research in analysis for multithreaded programs, focusing on ways to improve the efficiency of analyzing interactions between threads, to detect data races, and to ameliorate the impact of weak memory consistency models. We identify two distinct classes of multithreaded programs, activity management programs and parallel computing programs, and discuss how the structure of these kinds of programs leads to different solutions to these problems. Specifically, we conclude that augmented type systems are the most promising approach for activity management programs, while targeted program analyses are the most promising approach for parallel computing programs.
... The approaches presented above infer equivalent executions by just analyzing the trace, without knowledge about the program generating it. Another interesting and productive line of research attempts to use information about the actual program code to either statically detect potential bad behaviors [11, 21], or to use information about the program and about the property to be checked to further relax the models of an execution [4, 10, 23]. Purely static analysis approaches have to overcome unavoidable undecidability aspects, and typically give up soundness to increase coverage. ...
Article
Full-text available
Extracting causal models from observed executions has proved to be an effective approach to analyze concurrent programs. Most existing causal models are based on happens-before partial orders and/or Mazurkiewicz traces. Unfortunately, these models are inherently limited in the context of multithreaded systems, since multithreaded executions are mainly determined by consistency among shared memory accesses rather than by partial orders or event independence. This paper defines a novel theoretical foundation for multithreaded executions and a novel causal model, based on memory consistency constraints. The proposed model is sound and maximal: (1) all traces consistent with the causal model are feasible executions of the multithreaded program under analysis; and (2) assuming only the observed execution and no knowledge about the source code of the program, the proposed model captures more feasible executions than any other sound causal model. An algorithm to systematically generate all the feasible executions comprised by maximal causal models is also proposed, which can be used for testing or model checking of multithreaded system executions. Finally, a specialized submodel of the maximal one is presented, which gives an efficient and effective solution to on-the-fly data race detection. This data-race-focused model still captures more feasible executions than the existing happens-before-based approaches.
... In the process of monitoring, it reports detected races during the monitored execution. This approach can be a complement to static analysis [2, 6], because on-the-fly detection can be used to identify actual races from the potential races reported by static analysis approaches. On-the-fly detection also uses much less storage space than post-mortem detection [3, 13], because much of the information collected during the monitoring process can be discarded as the execution proceeds. ...
Conference Paper
Full-text available
Races might result in unintended nondeterministic execution of parallel programs, and thus race detection is one of the critical issues to be resolved in debugging of shared-memory parallel programs. On-the-fly race detection techniques have been developed as one approach to the problem. However, on-the-fly race detection suffers from huge run-time overhead because the whole execution behavior of the program being debugged must be monitored at run-time. In this paper we present a practical loop transformation technique which can significantly reduce the monitoring overhead required for detecting races on-the-fly in parallel programs. Our technique achieves the improvement by minimizing the number of iterations of each parallel loop that must be monitored, transforming the original loop accordingly. An experimental performance measurement shows a dramatic improvement in monitoring overhead, and the technique detects more races than traditional on-the-fly techniques.
... The compiler can further use this information to precisely detect statements that may execute concurrently. Several such analyses have been developed to trace the control transfers associated with synchronization constructs such as the post and wait constructs in parallel dialects of Fortran [Callahan and Subhlok 1988;Emrath et al. 1989;Callahan et al. 1990], the Ada rendezvous constructs [Taylor 1983;Duesterwald and Soffa 1991;Masticola and Ryder 1993;Dwyer and Clarke 1994], and the wait and notify constructs in Java [Naumovich and Avrunin 1998;Naumovich et al. 1999]. But these are not designed to analyze the effects of accesses to shared pointers: they either assume the absence of shared pointers or they don't analyze data accesses at all. ...
... Some of these techniques concentrate on the analysis of synchronization, and rely on the fact that detecting conflicting accesses is straightforward once the analysis determines which statements may execute concurrently [Taylor 1983;Balasundaram and Kennedy 1989;Duesterwald and Soffa 1991]. Other analyses focus on parallel programs with affine array accesses in loops, and use techniques similar to those from data dependence analysis for sequential programs [Emrath and Padua 1988;Emrath et al. 1989;Callahan et al. 1990]. However, none of these analyses is designed to detect data races in pointer-based multithreaded programs. ...
Article
This paper presents a novel interprocedural, flow-sensitive, and context-sensitive pointer analysis algorithm for multithreaded programs that may concurrently update shared pointers. The algorithm is designed to handle programs with structured parallel constructs, including fork-join constructs, parallel loops, and conditionally spawned threads. For each pointer and each program point, the algorithm computes a conservative approximation of the memory locations to which that pointer may point. The algorithm correctly handles a wide range of programming language constructs, including recursive functions, recursively generated parallelism, function pointers, structures, arrays, nested structures and arrays, pointer arithmetic, casts between different pointer types, heap and stack allocated memory, shared global variables, and thread-private global variables. We have implemented the algorithm in the SUIF compiler system and used the implementation to analyze a set of multithreaded programs written in the Cilk programming language. Our experimental results show that the analysis has good precision and converges quickly for our set of Cilk programs.
... Miller and Netzer showed that detecting race conditions in parallel programs that use multiple semaphores is NP-complete [15]. Researchers have developed exact algorithms for cases where the problem is efficiently solvable (programs that use types of synchronization weaker than semaphores such as post/wait/clear) [8, 9, 14], and heuristics for the multiple semaphore case [4, 10]. The complexity for the case of constant number of semaphores was unknown. ...
Article
We address the problem of detecting race conditions in programs that use semaphores for synchronization. Netzer and Miller showed that it is NP-complete to detect race conditions in programs that use many semaphores. We show in this paper that it remains NP-complete even if only two semaphores are used in the parallel programs. For the tractable case, i.e., using only one semaphore, we give two algorithms for detecting race conditions from the trace of executing a parallel program on p processors, where n semaphore operations are executed. The first algorithm determines in O(n) time whether a race condition exists between any two given operations. The second algorithm runs in O(np log n) time and outputs a compact representation from which one can determine in O(1) time whether a race condition exists between any two given operations. The second algorithm is near-optimal in that the running time is only O(log n) times the time required simply to write down the output.
... However, the work by Netzer [24] is for shared memory programs, while the work by Netzer and Miller [26] applies only to message-passing programs with blocking sends and receives. Finally, many researchers have studied race conditions in parallel programs that use shared memory [1, 2, 9, 11, 14, 17, 22, 24, 25]. In this paper we describe an algorithm for detecting race conditions in parallel programs. ...
Article
This paper presents an algorithm for performing on-the-fly race detection for parallel message-passing programs. The algorithm reads a trace of the communication events in a message-passing parallel program and either finds a specific race condition or reports that the traced program is race-free. It supports a rich message-passing model, including blocking and non-blocking sends and receives, synchronous and asynchronous sends, receive selectivity by source and/or tag value, and arbitrary amounts of system buffering of messages. It runs in polynomial time and is very efficient for most types of executions. A key feature of the race detection algorithm is its use of several new types of logical clocks for determining ordering relations. It is likely that these logical clocks will also be useful in other settings.
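A standard way to make such ordering relations concrete is the vector clock: each process keeps one counter per process, ticks its own slot on every event, and joins the sender's clock on a receive. The sketch below shows plain vector clocks only; it does not reproduce the paper's new specialized logical clocks.

```python
# A sketch of ordinary vector clocks for ordering message-passing events.
# The paper introduces several new logical clock types; this shows only
# the standard happens-before relation they refine.

def tick(clock, pid):
    """Advance process pid's own component by one."""
    c = list(clock)
    c[pid] += 1
    return tuple(c)

def merge(a, b):
    """Component-wise maximum, applied when a message is received."""
    return tuple(max(x, y) for x, y in zip(a, b))

def happens_before(a, b):
    """a happened before b iff a <= b component-wise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

P = 3
zero = (0,) * P
send = tick(zero, 0)                  # P0 sends a message:  (1, 0, 0)
recv = tick(merge(zero, send), 1)     # P1 receives, joins:  (1, 1, 0)
work = tick(zero, 2)                  # P2, unsynchronized:  (0, 0, 1)

print(happens_before(send, recv))     # True: ordered by the message
print(happens_before(work, recv) or
      happens_before(recv, work))     # False: concurrent, so conflicting
                                      # accesses here would be a race
```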
... There have been many studies on debugging data races. Some perform a post-mortem analysis based on program execution traces [7, 10, 12, 20, 21], while others perform on-the-fly analysis during program execution [2, 9, 19, 25]. Among modern shared-memory parallel programming models [6, 8, 22, 24], only Cilk++ [8] provides a data race detector called Cilkscreen [2, 8, 15]. ...
Article
Full-text available
Data races hamper parallel programming and threaten the reliability of future software. This paper proposes the data race prevention scheme View-Oriented Data race Prevention (VODAP), which can prevent data races in the View-Oriented Parallel Programming (VOPP) model. VOPP is a novel shared-memory data-centric parallel programming model, which uses views to bundle mutual exclusion with data access. We have implemented the data race prevention scheme with a memory protection mechanism. Experimental results show that the extra overhead of memory protection is trivial in our applications. The performance is evaluated and compared with modern programming models such as OpenMP and Cilk. Keywords: VOPP, view-oriented parallel programming, concurrent programming, SPMD, data race freedom, data-centric programming, Cilk, multicore, shared memory.
... Its applicability is therefore restricted by the huge program state space. Perry [11] described a system that automatically detects races in a parallel program. In this approach, the dynamic execution trace of the program was used to build a task graph and to log points of event-style synchronization. ...
Conference Paper
Full-text available
In real-time embedded systems, due to race conditions, the synchronization order between events may differ from one execution to another. This behavior is permissible in concurrent systems, but should be fully analyzed to ensure the correctness of the system. In this paper, a new intelligent method is presented to analyze event synchronization sequences in embedded systems. Our goal is to identify the feasible sequences, and to determine the timing parameters that lead to these sequences. Our approach adopts timed event automata (TEA) to model the targeted embedded system and uses a race condition graph (RCG) to specify the event synchronization sequence (SYN-Spec). A genetic algorithm working with simulation is used to analyze the timing parameters in the target model and to verify whether a defined SYN-Spec is satisfied or not. A case study shows that the proposed method is able to find potential execution sequences according to the event synchronization orders.
... We use a program thread structure graph, similar to a call graph, that allows parallelism between code regions to be easily determined. Synchronization analysis [3] is used to refine the results of MHP analysis. For two statements S1 and S2 that are in different threads and may happen in parallel, it attempts to determine if S1 must execute before S2, or after S2. ...
Conference Paper
Full-text available
Concurrent threads executing on a shared memory system can access the same memory locations. A consistency model defines constraints on the order of these shared memory accesses. For good run-time performance, these constraints must be as few as possible. Programmers who write explicitly parallel programs must take into account the consistency model when reasoning about the behavior of their programs. Also, the consistency model constrains compiler transformations that reorder code. It is not known what consistency models best suit the needs of the programmer, the compiler, and the hardware simultaneously. We are building a compiler infrastructure to study the effect of consistency models on code optimization and run-time performance. The consistency model presented to the user will be a programmable feature independent of the hardware consistency model. The compiler will be used to mask the hardware consistency model from the user by mapping the software consistency model onto the hardware consistency model. When completed, our compiler will be used to prototype consistency models and to measure the relative performance of different consistency models. We present preliminary experimental data for performance of a software implementation of sequential consistency using manual inter-thread analysis.
... Captured data is known as a trace, the most useful of which is an address trace (or execution trace) consisting of executed instruction addresses. It has long been established that examining such a trace can help determine behaviors that lead to software faults [8]. Such analysis is commonly performed off-line in a remote debugging scenario, depicted in Figure 1 for either real SoC or FPGA-emulated processor cores. ...
Conference Paper
Full-text available
In the multicore era, capturing execution traces of processors is indispensable to debugging complex software. The inability to transfer vast amounts of trace data off-chip without significant slow-down has impeded the debugging of such software, in both pre-silicon emulation and in real designs. We consider on-chip trace compression performed in hardware to reduce data volume, using techniques that exploit inherent higher-order redundancy in address trace data. While hardware trace compression is often restricted to poor or moderate performance due to area and memory constraints, we present a parameterizable scheme that leverages the resources already found on existing platforms. Harnessing resources such as existing trace buffers on CPUs, and unused embedded memory on FPGA emulation platforms, our trace compression scheme requires only a small additional hardware area to achieve superior compression ratios.
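To illustrate the kind of redundancy such schemes exploit, the sketch below delta-encodes and run-length-encodes a toy instruction-address trace: straight-line execution produces long runs of identical deltas that collapse to a single pair. This is a simplified software analogue, not the paper's hardware compression scheme.

```python
# A simplified software analogue of address-trace redundancy (not the
# paper's hardware design): instruction addresses mostly advance by a
# fixed stride, so delta + run-length encoding collapses straight-line runs.

def compress(addresses):
    deltas = [addresses[0]] + [b - a for a, b in zip(addresses, addresses[1:])]
    out, i = [], 0
    while i < len(deltas):
        j = i
        while j < len(deltas) and deltas[j] == deltas[i]:
            j += 1
        out.append((deltas[i], j - i))   # (delta, run length)
        i = j
    return out

def decompress(pairs):
    addrs, cur = [], 0
    for delta, count in pairs:
        for _ in range(count):
            cur += delta
            addrs.append(cur)
    return addrs

trace = [0x400, 0x404, 0x408, 0x40c, 0x500, 0x504]   # a branch after 0x40c
c = compress(trace)
print(c)                            # [(1024, 1), (4, 3), (244, 1), (4, 1)]
assert decompress(c) == trace
```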
... Miller and Netzer showed that detecting race conditions in parallel programs that use multiple semaphores is NP-complete [15]. Researchers have developed exact algorithms for cases where the problem is efficiently solvable (programs that use types of synchronization weaker than semaphores such as post/wait/clear) [8, 9, 14], and heuristics for the multiple semaphore case [4, 10]. The complexity for the case of constant number of semaphores was unknown. ...
Article
We address the problem of detecting race conditions in programs that use semaphores for synchronization. Netzer and Miller showed that it is NP-complete to detect race conditions in programs that use many semaphores. We show in this paper that it remains NP-complete even if only two semaphores are used in the parallel programs. For the tractable case, i.e., using only one semaphore, we give two algorithms for detecting race conditions from the trace of executing a parallel program on p processors, where n semaphore operations are executed. The first algorithm determines in O(n) time whether a race condition exists between any two given operations. The second algorithm runs in O(np log n) time and outputs a compact representation from which one can determine in O(1) time whether a race condition exists between any two given operations. The second algorithm is near-optimal in that the running time is only O(log n) times the time required simply to write down the output.
... A data flow analysis framework similar to theirs could be used to compute the ordering. This is essentially the same as the common ancestor algorithm developed by Emrath et al. (18, 19), except that this last algorithm was designed for memory trace analysis. The partial order computed by this method may not be totally accurate. ...
Article
Full-text available
In this paper, we present a constant propagation algorithm for explicitly parallel programs, which we call the Concurrent Sparse Conditional Constant propagation algorithm. This algorithm is an extension of the Sparse Conditional Constant propagation algorithm. Without considering the interaction between threads, classical optimizations lead to an incorrect program transformation for parallel programs. To make analyzing parallel programs possible, a new intermediate representation is needed. We introduce the Concurrent Static Single Assignment (CSSA) form to represent explicitly parallel programs with interleaving semantics and synchronization. The only parallel construct considered in this paper is cobegin/coend. A new confluence function, the π-assignment, which summarizes the information of interleaving statements between threads, is introduced. The Concurrent Control Flow Graph, which contains information about conflicting statements, control flow, and synchronization, is used as an underlying representation for the CSSA form.
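The confluence idea can be shown on the standard constant-propagation lattice: a shared variable stays constant at a π-style merge only if every interleaving supplies the same constant. The sketch below illustrates that lattice meet under invented names; it is not the paper's CSSA algorithm.

```python
# An illustrative take on the confluence idea behind π-assignments: the
# value of a shared variable at a merge of interleaved threads is constant
# only if every interleaving supplies the same constant. BOT < const < TOP
# is the usual constant-propagation lattice; names here are invented.

TOP, BOT = 'not-a-constant', 'unreached'

def meet(a, b):
    if a == BOT:
        return b
    if b == BOT:
        return a
    return a if a == b else TOP

def pi_assignment(values):
    """Merge the values a shared variable may hold after any interleaving."""
    out = BOT
    for v in values:
        out = meet(out, v)
    return out

print(pi_assignment([3, 3]))    # 3: both threads assign 3, still constant
print(pi_assignment([3, 4]))    # 'not-a-constant': interleaving-dependent
```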
... Most race detectors are dynamic tools, which detect potential races by executing the program on a given input. Some dynamic race detectors perform a post-mortem analysis based on program execution traces [7,11,13,15,16]. On-the-fly race detectors, like the one implemented in this thesis, detect races during execution of the program. ...
Article
This thesis describes the implementation of a provably good data-race detector, called the Nondeterminator-3, which runs efficiently in parallel. A data race occurs in a multithreaded program when two logically parallel threads access the same location while holding no common locks and at least one of the accesses is a write. The Nondeterminator-3 checks for data races in programs coded in Cilk [3, 10], a shared-memory multithreaded programming language. A key capability of data-race detectors is in determining the series-parallel (SP) relationship between two threads. The Nondeterminator-3 is based on a provably good parallel SP-maintenance algorithm known as SP-hybrid [2]. For a program with n threads, T1 work, and critical-path length T∞, the SP-hybrid algorithm runs in O((T1/P + P·T∞) lg n) expected time when executed on P processors. A data-race detector must also maintain an access history, which consists of, for each shared memory location, a representative subset of memory accesses to that location. The Nondeterminator-3 uses an extension of the ALL-SETS [4] access-history algorithm used by its serially running predecessor, the Nondeterminator-2. First, the ALL-SETS algorithm was extended to correctly support the inlet feature of Cilk.
... Dynamic race detectors execute the program given a particular input. Some dynamic race detectors perform a post-mortem analysis based on program-execution logs [18, 28, 35, 45-48], analyzing a log of program-execution events after the program has finished running. On-the-fly race detectors, like the one given in this thesis, report races during the execution of the program. ...
Article
A multithreaded parallel program that is intended to be deterministic may exhibit nondeterminism due to bugs called determinacy races. A key capability of race detectors is to determine whether one thread executes logically in parallel with another thread or whether the threads must operate in series. This thesis presents two algorithms, one serial and one parallel, to maintain the series-parallel (SP) relationships "on the fly" for fork-join multithreaded programs. For a fork-join program with T1 work and a critical-path length of T∞, the serial SP-Maintenance algorithm runs in O(T1) time. The parallel algorithm executes in the nearly optimal O(T1/P + P·T∞) time, when run on P processors and using an efficient scheduler. These SP-maintenance algorithms can be incorporated into race detectors to get a provably good race detector that runs in parallel. This thesis describes an efficient parallel race detector I call Nondeterminator-3. For a fork-join program with T1 work, critical-path length T∞, and v shared memory locations, the Nondeterminator-3 runs in O(T1/P + P·T∞ lg P + min[(T1 lg P)/P, v·T∞ lg P]) expected time, when run on P processors and using an efficient scheduler. Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 93-98).
... Checking every possible control-flow of an arbitrary program is typically intractable, however, so most race detectors are dynamic tools in which potential races are detected at runtime by executing the program on a given input. Some dynamic race detectors perform a post-mortem analysis based on program execution traces [8,12,16,19], while others perform an on-the-fly analysis during program execution. On-the-fly debuggers directly instrument memory accesses via the compiler [6,7,9,10,15,22], by binary rewriting [25], or by augmenting the machine's cache coherence protocol [17,23]. ...
Article
If two parallel threads access the same location and at least one of them performs a write, a race exists. The detection of races---a major problem in parallel debugging---is complicated by the presence of atomic critical sections. In programs without critical sections, the existence of a race is usually a bug leading to nondeterministic behavior. In programs with critical sections, however, accesses in parallel critical sections are not considered bugs, as the programmer, in specifying the critical sections, presumably intends them to run in parallel. Thus, a race detector should find "data races"---races between accesses not contained in atomic critical sections. We present algorithms for detecting data races in programs written in the Cilk multithreaded language. These algorithms do not verify programs, but rather find data races in all schedulings of the computation generated when a program executes serially on a given input. We present two algorithms for programs in which atomici...
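The lock-discipline half of this check is easy to state: two logically parallel conflicting accesses constitute a data race only if their lock sets are disjoint. The sketch below shows just that predicate, with logical parallelism supplied by the caller; the paper's ALL-SETS algorithm additionally maintains the series-parallel relation efficiently, which this sketch does not attempt.

```python
# A hedged sketch of the lock-discipline side of data-race detection: two
# logically parallel conflicting accesses are a data race only if they hold
# no lock in common. Logical parallelism is taken as given here.

def is_data_race(acc1, acc2, logically_parallel):
    """Each access: (variable, is_write, frozenset_of_locks_held)."""
    v1, w1, l1 = acc1
    v2, w2, l2 = acc2
    return (logically_parallel
            and v1 == v2
            and (w1 or w2)
            and not (l1 & l2))       # disjoint lock sets -> unprotected

a = ('x', True, frozenset({'L'}))
b = ('x', True, frozenset({'L'}))
c = ('x', True, frozenset())
print(is_data_race(a, b, True))      # False: both hold lock L
print(is_data_race(a, c, True))      # True: no common lock
```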
... Our algorithms can be used to exactly detect race conditions in executions of such programs. Past work has shown that exactly detecting races in programs that use multiple semaphores is NP-complete [9], and has developed exact algorithms for other cases where the problem is efficiently solvable (programs that use types of synchronization weaker than semaphores) [6,8], and heuristics for the multiple semaphore case [4,7]. The complexity for the case of a single semaphore has been an open question. ...
Conference Paper
We address a problem arising in debugging parallel programs, detecting race conditions in programs using a single semaphore for synchronization. It is NP-complete to detect races in programs that use many semaphores. For the case of a single semaphore, we give an algorithm that takes O(n^1.5 p) time, where p is the number of processors and n is the total number of semaphore operations executed. Our algorithm constructs a representation from which one can determine in constant time whether a race exists between two given events.
... Past work has shown that exactly detecting races in programs that use multiple semaphores is NPcomplete [10], and has developed exact algorithms for other cases where the problem is efficiently solvable (programs that use types of synchronization weaker than semaphores) [6,9], and heuristics for the multiple semaphore case [4,7]. The complexity for the case of constant number of semaphores has been an open question. ...
Article
We address a problem arising in debugging parallel programs, detecting race conditions in programs using semaphores for synchronization. It is NP-complete to detect race conditions in programs that use many semaphores [10]. We show in this paper that it remains NP-complete even if the programs are allowed to use only two semaphores. For the case of a single semaphore, Lu et al. [8] give the previously only-known polynomial-time algorithm, which runs in time O(n^1.5 p), where p is the number of processors and n is the total number of semaphore operations executed. Their algorithm, however, detects only a special class of race conditions. In this paper we cope with the general race-condition detection problem and give an O(np log n)-time algorithm. The output of our algorithm is a compact representation from which one can determine in constant time whether a race condition exists between two given operations.
... Techniques for race detection in the context of debugging programs have either used dynamic information from a program's execution trace or static information from an analysis of the program text [17]. A few techniques have used dynamic information as well as static information [6]. However, the static information supplements the dynamic information by ruling out races in certain parts of the program, thereby precluding the need to trace those parts. ...
Article
Shared memory in a parallel computer provides programmers with the valuable abstraction of a shared address space---through which any part of a computation can access any datum. Although uniform access simplifies programming, it also hides communication, which can lead to inefficient programs. The check-in, check-out (CICO) performance model for cache-coherent, shared-memory parallel computers helps a programmer identify the communication underlying memory references and account for its cost. CICO consists of annotations that a programmer can use to elucidate communication and a model that attributes costs to these annotations. The annotations can also serve as directives to a memory system to improve program performance. Inserting CICO annotations requires reasoning about the dynamic cache behavior of a program, which is not always easy. This paper describes Cachier, a tool that automatically inserts CICO annotations into shared-memory programs. A novel feature of this tool is its us...
Article
Recent work has proposed a memory property for parallel programs, called disentanglement, and showed that it is pervasive in a variety of programs, written in different languages, ranging from C/C++ to Parallel ML, and showed that it can be exploited to improve the performance of parallel functional programs. All existing work on disentanglement, however, considers the "fork/join" model for parallelism and does not apply to "futures", the more powerful approach to parallelism. This is not surprising: fork/join parallel programs exhibit a reasonably strict dependency structure (e.g., series-parallel DAGs), which disentanglement exploits. In contrast, with futures, parallel computations become first-class values of the language, and thus can be created, and passed between functions calls or stored in memory, just like other ordinary values, resulting in complex dependency structures, especially in the presence of mutable state. For example, parallel programs with futures can have deadlocks, which is impossible with fork-join parallelism. In this paper, we are interested in the theoretical question of whether disentanglement may be extended beyond fork/join parallelism, and specifically to futures. We consider a functional language with futures, Input/Output (I/O), and mutable state (references) and show that a broad range of programs written in this language are disentangled. We start by formalizing disentanglement for futures and proving that purely functional programs written in this language are disentangled. We then generalize this result in three directions. First, we consider state (effects) and prove that stateful programs are disentangled if they are race free. Second, we show that race freedom is sufficient but not a necessary condition and non-deterministic programs, e.g. those that use atomic read-modify-write operations and some non-deterministic combinators, may also be disentangled. Third, we prove that disentangled task-parallel programs written with futures are free of deadlocks, which arise due to interactions between state and the rich dependencies that can be expressed with futures. Taken together, these results show that disentanglement generalizes to parallel programs with futures and, thus, the benefits of disentanglement may go well beyond fork-join parallelism.
Article
Because of its many desirable properties, such as its ability to control effects and thus potentially disastrous race conditions, functional programming offers a viable approach to programming modern multicore computers. Over the past decade several parallel functional languages, typically based on dialects of ML and Haskell, have been developed. These languages, however, have traditionally underperformed procedural languages (such as C and Java). The primary reason for this is their hunger for memory, which only grows with parallelism, causing traditional memory management techniques to buckle under increased demand for memory. Recent work opened a new angle of attack on this problem by identifying a memory property of determinacy-race-free parallel programs, called disentanglement, which limits the knowledge of concurrent computations about each other's memory allocations. The work has showed some promise in delivering good time scalability. In this paper, we present provably space-efficient automatic memory management techniques for determinacy-race-free functional parallel programs, allowing both pure and imperative programs where memory may be destructively updated. We prove that for a program with sequential live memory of R*, any P-processor garbage-collected parallel run requires at most O(R* · P) memory. We also prove a work bound of O(W + R*·P) for P-processor executions, accounting also for the cost of garbage collection. To achieve these results, we integrate thread scheduling with memory management. The idea is to coordinate memory allocation and garbage collection with thread scheduling decisions so that each processor can allocate memory without synchronization and independently collect a portion of memory by consulting a collection policy, which we formulate. The collection policy is fully distributed and does not require communicating with other processors. We show that the approach is practical by implementing it as an extension to the MPL compiler for Parallel ML. Our experimental results confirm our theoretical bounds and show that the techniques perform and scale well.
Conference Paper
In this paper we address the problem of locating race conditions among synchronization primitives in execution traces of hybrid parallel programs. In hybrid parallel programs, collective and point-to-point synchronization cannot be analyzed separately. We introduce a model for synchronization primitives and formally define synchronization races with respect to the model. Based on these concepts we present an algorithm which accurately detects synchronization races and yields a task graph of the execution trace. The task graph represents the guaranteed ordering of events across thread and process boundaries. It is an essential core element for the further analysis (e.g., data race detection) of a program. Depending on the synchronization model, task graph construction can be an NP-hard problem. Our model allows us to construct an algorithm with sub-quadratic time complexity, so programs adhering to the principles of our model can be provably checked for race conditions. Therefore we argue that our model should be used as a foundation for the design and implementation of synchronization functions.
Conference Paper
The availability of multicore processors across a wide range of computing platforms has created a strong demand for software frameworks that can harness these resources. This paper overviews the Cilk++ programming environment, which incorporates a compiler, a runtime system, and a race-detection tool. The Cilk++ runtime system guarantees to load-balance computations effectively. To cope with legacy codes containing global variables, Cilk++ provides a "hyperobject" library which allows races on nonlocal variables to be mitigated without lock contention or substantial code restructuring.
Article
A parallel multithreaded program that is ostensibly deterministic may nevertheless behave nondeterministically due to bugs in the code. These bugs are called determinacy races, and they result when one thread updates a location in shared memory while another thread is concurrently accessing the location. We have implemented a provably efficient determinacy-race detector for Cilk, an algorithmic multithreaded programming language. If a Cilk program is run on a given input data set, our debugging tool, which we call the "Nondeterminator," either determines at least one location in the program that is subject to a determinacy race, or else it certifies that the program is race free when run on the data set. The core of the Nondeterminator is an asymptotically efficient serial algorithm (inspired by Tarjan's nearly linear-time least-common-ancestors algorithm) for detecting determinacy races in series-parallel directed acyclic graphs. For a Cilk program that runs in T time on one processor and uses v shared-memory locations, the Nondeterminator runs in O(T α(v,v)) time, where α is Tarjan's functional inverse of Ackermann's function, a very slowly growing function which, for all practical purposes, is bounded above by 4. The Nondeterminator uses at most a constant factor more space than does the original program. On a variety of Cilk program benchmarks, the Nondeterminator exhibits a slowdown of less than 12 compared with the serial execution time of the original optimized code, which we contend is an acceptable slowdown for debugging purposes.
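The structural fact these detectors exploit can be stated compactly: in a series-parallel parse tree of the computation, two threads are logically parallel exactly when their least common ancestor is a parallel node. The naive LCA walk below illustrates that fact under invented names; the Nondeterminator maintains the same relation far more efficiently with its Tarjan-inspired algorithm.

```python
# Illustration of the key fact: in a series-parallel parse tree, two
# threads are logically parallel iff their least common ancestor is a
# P (parallel) node. A naive LCA walk, for clarity only.

class Node:
    def __init__(self, kind, parent=None):
        self.kind = kind              # 'S', 'P', or 'leaf'
        self.parent = parent

def depth(n):
    d = 0
    while n.parent:
        n = n.parent
        d += 1
    return d

def lca(a, b):
    da, db = depth(a), depth(b)
    while da > db:
        a = a.parent; da -= 1
    while db > da:
        b = b.parent; db -= 1
    while a is not b:
        a, b = a.parent, b.parent
    return a

def logically_parallel(a, b):
    return lca(a, b).kind == 'P'

root = Node('S')                      # models  e1 ; (e2 || e3)
e1 = Node('leaf', root)
p = Node('P', root)
e2, e3 = Node('leaf', p), Node('leaf', p)
print(logically_parallel(e2, e3))     # True: may race if they conflict
print(logically_parallel(e1, e2))     # False: series-ordered
```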
Chapter
Full-text available
Static Single Assignment (SSA) form has shown its usefulness as a program representation for code optimization techniques in sequential programs. We introduce the Concurrent Static Single Assignment (CSSA) form to represent explicitly parallel programs with interleaving semantics and post-wait synchronization. The parallel construct considered in this paper is cobegin/coend. A new confluence function, the π-assignment, which summarizes the information of interleaving statements between threads, is introduced. The Concurrent Control Flow Graph, which contains information about conflicting statements, control flow, and synchronization, is used as an underlying representation for the CSSA form. An extension of the Sparse Conditional Constant propagation algorithm based on the CSSA form makes it possible to apply the constant propagation optimization to explicitly parallel programs.
Conference Paper
Full-text available
Traditional compiler techniques developed for sequential programs do not guarantee the correctness (sequential consistency) of compiler transformations when applied to parallel programs. This is because traditional compilers for sequential programs do not account for the updates to a shared variable by different threads. We present a concurrent static single assignment (CSSA) form for parallel programs containing cobegin/coend and parallel do constructs and post/wait synchronization primitives. Based on the CSSA form, we present copy propagation and dead code elimination techniques. Also, a global value numbering technique that detects equivalent variables in parallel programs is presented. By using global value numbering and the CSSA form, we extend classical common subexpression elimination, redundant load/store elimination, and loop invariant detection to parallel programs without violating sequential consistency. These optimization techniques are the most commonly used techniques for sequential programs. By extending these techniques to parallel programs, we can guarantee the correctness of the optimized program and maintain single processor performance in a multiprocessor environment.
Conference Paper
The availability of multicore processors across a wide range of computing platforms has created a strong demand for software frameworks that can harness these resources. This paper overviews the Cilk++ programming environment, which incorporates a compiler, a runtime system, and a race-detection tool. The Cilk++ runtime system guarantees to load-balance computations effectively. To cope with legacy codes containing global variables, Cilk++ provides a "hyperobject" library which allows races on nonlocal variables to be mitigated without lock contention or substantial code restructuring.
Conference Paper
We address a problem arising in debugging parallel programs, detecting race conditions in programs using semaphores for synchronization. It is NP-complete to detect race conditions in programs that use a polynomial number of semaphores [10]. We show in this paper that it remains NP-complete even if the programs are allowed to use only two semaphores, which settles the open question raised in [10]. The proof uses a technique that simulates a graph with any number of semaphores by a graph with only two semaphores. For the case of a single semaphore, Lu et al. [8] give the previously only-known polynomial-time algorithm, which runs in time O(n^1.5 p), where p is the number of processors and n is the total number of semaphore operations executed. Their algorithm, however, detects only a special class of race conditions. In this paper we cope with the general race-condition detection problem and give an O(np log n)-time algorithm. The output of our algorithm is a compact representation of size Θ(np), from which one can determine in constant time whether a race condition exists between two given operations. Our algorithm is near-optimal in that the time it takes is only O(log n) times the time required simply to write down the output.
Article
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (p. 69-72). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. by Kai Huang. M.Eng.
Article
Thesis (Ph. D.)--University of Delaware, 1998. Principal faculty adviser: Lori L. Pollock, Dept. of Computer and Information Sciences. Includes bibliographical references (leaves 250-260). Microfilm.
Article
Languages allowing explicitly parallel, multithreaded programming (e.g. Java and C#) need to specify a memory consistency model to define program behavior. The memory consistency model defines constraints on the order of memory accesses in systems with shared memory. The design of a memory consistency model affects ease of parallel programming as well as system performance. Compiler analysis can be used to mitigate the performance impact of a memory consistency model that imposes strong constraints on shared memory access orders. In this work, we explore the capability of a compiler to analyze what restrictions are imposed by a memory consistency model for the program being compiled. Our compiler analysis targets Java bytecodes. It focuses on two components: delay set analysis and synchronization analysis. Delay set analysis determines the order of shared memory accesses that must be respected within each individual thread of execution in the source program. We describe a simplified analysis algorithm that is applicable to programs with general thread structure (MIMD programs), and has polynomial time worst-case complexity. This algorithm uses synchronization analysis to improve the accuracy of the results. Synchronization analysis determines the order of shared memory accesses already enforced by synchronization in the source program. We describe a dataflow analysis algorithm for synchronization analysis that is efficient to compute, and that improves precision over previously known methods. The analysis techniques described are used in the implementation of a virtual machine that guarantees sequentially consistent execution of Java bytecodes. This implementation is used to show the effectiveness of our analysis algorithms. On many benchmark programs, the performance of programs on our system is close to 100% of the performance of the same programs executing under a relaxed memory model. Specifically, we observe an average slowdown of 10% on an Intel Xeon platform, with slowdowns of 7% or less for 7 out of 10 benchmarks. On an IBM Power3 platform, we observe an average slowdown of 26%, with slowdowns of 7% or less for 8 out of 10 benchmarks.
Conference Paper
Because of the nondeterministic timing-ordering behavior of parallel primitives, testing parallel programs is more difficult than testing serial programs. In this paper, we present a new testing strategy for parallel programs: static testing of parallel programs with their primitive dependence graphs. We have developed methods to analyze the timing-ordering dependences between primitives and to construct the primitive dependence graph. Approaches for statically detecting errors with the primitive dependence graph are also discussed.
Article
A statement is considered to be monotonic with respect to a loop if its execution, during the successive iterations of a given execution of the loop, assigns a monotonically increasing or decreasing sequence of values to a variable. We present static analysis techniques to identify loop monotonic statements. The knowledge of loop monotonicity characteristics of statements which compute array subscript expressions is of significant value in a number of applications. We illustrate the use of this information in improving the efficiency of run-time array bound checking, run-time dependence testing, and on-the-fly detection of access anomalies. Given that a significant percentage of subscript expressions are monotonic, substantial savings can be expected by using these techniques
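The payoff for bound checking is easy to see: if a subscript expression is known to be monotonic across a loop's iterations, checking its first and last values subsumes the per-iteration checks. The sketch below assumes the monotonicity fact is supplied by the caller (the paper derives it statically); all names are invented for the example.

```python
# A sketch of why monotonicity pays off for run-time bound checking: a
# monotonically increasing subscript needs only its first and last values
# checked. The monotonicity flag stands in for the paper's static analysis.

def checked_sum(a, subscript, n, monotonic=False):
    if monotonic:
        # two checks total instead of one per iteration
        for i in (0, n - 1):
            if not 0 <= subscript(i) < len(a):
                raise IndexError(f"iteration {i} out of bounds")
        return sum(a[subscript(i)] for i in range(n))
    total = 0
    for i in range(n):               # general case: check every iteration
        j = subscript(i)
        if not 0 <= j < len(a):
            raise IndexError(f"iteration {i} out of bounds")
        total += a[j]
    return total

a = list(range(100))
print(checked_sum(a, lambda i: 2 * i + 1, 10, monotonic=True))   # 100
```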
Article
The field of program analysis has focused primarily on sequential programming languages. But multithreading is becoming increasingly important, both as a program structuring mechanism and to support efficient parallel computations. This paper surveys research in analysis for multithreaded programs, focusing on ways to improve the efficiency of analyzing interactions between threads, to detect data races, and to ameliorate the impact of weak memory consistency models. We identify two distinct classes of multithreaded programs, activity management programs and parallel computing programs, and discuss how the structure of these kinds of programs leads to different solutions to these problems. Specifically, we conclude that augmented type systems are the most promising approach for activity management programs, while targeted program analyses are the most promising approach for parallel computing programs.
Article
The design of consistency models for both hardware and software is a difficult task. For programming languages it is particularly difficult because the target audience for a programming language is much wider than the audience for a hardware consistency model, making usability a more important criterion. Exacerbating this problem is the reality that the programming languages community has little experience designing programming language consistency models, and therefore each new attempt is very much a voyage into uncharted territory. A concrete example of the difficulty of the task is the current Java Memory Model: although designed to be easy for Java programmers to use, it is poorly understood, and at least one common idiom intended to exploit the model (the "double check idiom") is unsafe. In this paper we describe the design of an optimizing Java compiler that will allow a consistency model for the code to be compiled to be specified as an input. The compiler will use Shasha and Snir's delay set analysis, and our CSSA program representation, to normalize the effects of different consistency models on optimizations and analysis. When completed, the compiler will serve as a testbed to prototype new memory models, and to measure the effects of different memory models on program performance.
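For reference, the "double check idiom" mentioned above is commonly written as follows. Under the original Java Memory Model the unsynchronized read may observe a non-null reference to a partially constructed object; declaring the field volatile, as the later JSR-133 revision specifies, is the standard repair. Field and class names here are illustrative.

    // The double-check idiom; `instance` must be volatile for the idiom
    // to be safe under the revised (JSR-133) Java Memory Model. Without
    // it, a thread may see a non-null, partially constructed object.
    public class Singleton {
        private static volatile Singleton instance;
        private final int payload;

        private Singleton() { payload = 42; }  // illustrative field

        public static Singleton getInstance() {
            if (instance == null) {                   // first, unsynchronized check
                synchronized (Singleton.class) {
                    if (instance == null)             // second, synchronized check
                        instance = new Singleton();
                }
            }
            return instance;
        }
    }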
Article
Full-text available
Static Single Assignment (SSA) form has proven useful for powerful code optimizations of sequential programs, such as constant propagation. We introduce a new Parallel Static Single Assignment (PSSA) form and a transformation algorithm for explicitly parallel programs with interleaving semantics and post/wait synchronization. The parallel construct considered in this paper is cobegin/coend. A new concept, the π-assignment, which summarizes the information of interleaving statements among threads, is introduced. The Parallel Control Flow Graph, which contains information about conflicting statements in addition to control flow and synchronization information, is used as an intermediate representation for the PSSA transformation. A parallel extension of the Sparse Conditional Constant Propagation algorithm based on the PSSA form makes it possible to apply constant propagation to explicitly parallel programs.
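A hedged illustration of the intuition behind the π-assignment (the example is ours, and the cobegin/coend construct of the paper is rendered with Java threads): when every interleaved definition of a shared variable that can reach a use assigns the same constant, the merged value is itself constant and can be propagated.

    // Both parallel branches assign the constant 1 to x, so the merge
    // point's conceptual pi-assignment x3 = pi(x1, x2) is the constant 1
    // under every interleaving, and the print below is provably "1".
    public class PssaConstProp {
        static int x = 0;  // shared; join() makes the writes visible

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> { x = 1; });  // definition x1 = 1
            Thread t2 = new Thread(() -> { x = 1; });  // definition x2 = 1
            t1.start(); t2.start();                    // "cobegin"
            t1.join();  t2.join();                     // "coend"
            System.out.println(x);                     // always prints 1
        }
    }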
Article
A multithreaded program with a bug may behave nondeterministically, and this nondeterminism typically makes the bug hard to localize. This thesis presents a debugging tool, the Nondeterminator-2, which automatically finds certain nondeterminacy bugs in programs coded in the Cilk multithreaded language. Specifically, the Nondeterminator-2 finds "dag races," which occur when two logically parallel threads access the same memory location while holding no locks in common, and at least one of the accesses writes the location. The Nondeterminator-2 contains two...
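A minimal example in the spirit of the dag-race definition above, rendered with Java threads rather than Cilk (an analogy, not the Nondeterminator-2's input language): two logically parallel computations write the same location while holding no locks in common, so the final value depends on the interleaving.

    // Two logically parallel threads perform an unprotected
    // read-modify-write on the same location: a race in the sense above.
    public class DagRace {
        static int counter = 0;  // shared, no lock guards it

        public static void main(String[] args) throws InterruptedException {
            Runnable inc = () -> {
                for (int i = 0; i < 100_000; i++)
                    counter++;   // load, add, store: increments can be lost
            };
            Thread t1 = new Thread(inc), t2 = new Thread(inc);
            t1.start(); t2.start();
            t1.join();  t2.join();
            // Nondeterministically prints anything from 100000 to 200000.
            System.out.println(counter);
        }
    }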
Conference Paper
Full-text available
This paper addresses the design and implementation of an integrated debugging system for parallel programs running on shared memory multi-processors (SMMP). We describe the use of flowback analysis to provide information on causal relationships between events in a program's execution without re-executing the program for debugging. We introduce a mechanism called incremental tracing that, by using semantic analyses of the debugged program, makes the flowback analysis practical with only a small amount of trace generated during execution. We extend flowback analysis to apply to parallel programs and describe a method to detect race conditions in the interactions of the co-operating processes.
Conference Paper
Full-text available
We describe a debugger that is being developed for distributed programs in Amoeba. A major goal in our work is to make the debugger independent of the Amoeba kernel. Our design integrates many facilities found in other debuggers, such as execution replay, ...
Article
Debugging on a parallel processor is more difficult than debugging on a serial machine because errors in a parallel program may introduce nondeterminism. The approach to parallel debugging presented here attempts to reduce the problem of debugging on a parallel machine to that of debugging on a serial machine by automatically detecting nondeterminism.
Article
More and more scientists and engineers are becoming interested in using supercomputers. Earlier barriers to using these machines are disappearing as software for their use improves. Meanwhile, new parallel supercomputer architectures are emerging that may provide rapid growth in performance. These systems may use a large number of processors with an intricate memory system that is both parallel and hierarchical; they will require even more advanced software. Compilers that restructure user programs to exploit the machine organization seem to be essential. A wide range of algorithms and applications is being developed in an effort to provide high parallel processing performance in many fields. The Cedar supercomputer, presently operating with eight processors in parallel, uses advanced system and applications software developed at the University of Illinois during the past 12 years. This software should allow the number of processors in Cedar to be doubled annually, providing rapid performance advances in the next decade.
Book
1. Introduction
2. Basic Concepts
   2.1. Relations and Graphs
   2.2. Orders on Vectors
   2.3. Program Model
3. Dependence
   3.1. Dependence Concepts
   3.2. The Dependence Problem
4. Bounds of Linear Functions
   4.1. Introduction
   4.2. Bounds in Rectangles
   4.3. Bounds in Trapezoids
5. Linear Diophantine Equations
   5.1. Introduction
   5.2. Greatest Common Divisors
   5.3. Single Equation in Two Variables
   5.4. Single Equation in Many Variables
   5.5. Systems of Equations
   Appendix to Chapter 5
6. Dependence Tests
   6.1. Introduction
   6.2. One-Dimensional Arrays, Single Loops
   6.3. One-Dimensional Arrays
   6.4. General Case
   6.5. Miscellaneous Comments
References
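As a concrete instance of the material in Chapters 5 and 6, the sketch below (our own, with assumed access shapes) implements the classic GCD dependence test for one-dimensional arrays in a single loop: accesses a[c1*i + k1] and a[c2*j + k2] can touch the same element only if gcd(c1, c2) divides k2 - k1, so the test conservatively reports a possible dependence exactly when that divisibility holds.

    // GCD dependence test for a[c1*i + k1] (one access) versus
    // a[c2*j + k2] (another access) in a single loop: integer solutions
    // of c1*i - c2*j = k2 - k1 exist only if gcd(c1, c2) divides k2 - k1.
    // The test is conservative: "true" means a dependence is possible.
    public class GcdTest {
        static int gcd(int a, int b) {
            a = Math.abs(a); b = Math.abs(b);
            while (b != 0) { int t = a % b; a = b; b = t; }
            return a;
        }

        static boolean mayDepend(int c1, int k1, int c2, int k2) {
            int g = gcd(c1, c2);
            return g == 0 ? k1 == k2 : (k2 - k1) % g == 0;
        }

        public static void main(String[] args) {
            // a[2*i] vs a[2*j + 1]: gcd(2, 2) = 2 does not divide 1 -> independent
            System.out.println(mayDepend(2, 0, 2, 1));  // false
            // a[4*i] vs a[2*j + 2]: gcd(4, 2) = 2 divides 2 -> possible dependence
            System.out.println(mayDepend(4, 0, 2, 2));  // true
        }
    }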
Conference Paper
The development of vector and multiprocessor language constructs in Fortran is outlined. The significant architectures, their languages, and their optimizers are described. A description is given of Cedar Fortran, the language for the Cedar multiprocessor, a hierarchical, shared-memory vector multiprocessor currently under development.
Manual
Cray X-MP Multitasking Programmer's Reference Manual, Cray Research, Inc., 1987.