Robert H. B. Netzer's research while affiliated with Brown University and other places

Publications (43)

Conference Paper
Dynamic data race detection is a critical part of debugging shared-memory parallel programs. The races that can be detected must be refined to filter out false alarms and pinpoint only those that are direct manifestations of bugs. Most race detection methods can report false alarms because of imprecise run-time information and because some races ar...
Article
We address the problem of detecting race conditions in programs that use semaphores for synchronization. Netzer and Miller showed that it is NP-complete to detect race conditions in programs that use many semaphores. We show in this paper that it remains NP-complete even if only two semaphores are used in the parallel programs. For the tractable ca...
Article
The widespread adoption of distributed computing has accentuated the need for an effective set of support tools to facilitate debugging and monitoring of distributed programs. Unfortunately for distributed programs, this is not a trivial task. Many distributed programs are inherently non-deterministic in nature. Two runs of the same program with t...
Article
To support incremental replay of message-passing applications, processes must periodically checkpoint and the content of some messages must be logged, to break dependencies of the current state of the execution on past events. This paper shows that known adaptive logging algorithms are likely to introduce deadlocks in replay, and we introduce a new...
Article
Full-text available
A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design communication-induced checkpointing protocols that direct processes to take additional...
Article
We address a problem arising in debugging parallel programs, detecting race conditions in programs using semaphores for synchronization. It is NP-complete to detect race conditions in programs that use many semaphores [10]. We show in this paper that it remains NP-complete even if the programs are allowed to use only two semaphores. For the case of...
Conference Paper
To support incremental replay of message-passing applications, processes must periodically checkpoint and the content of some messages must be logged, to break dependencies of the current state of the execution on past events. The paper presents a new adaptive logging algorithm that dynamically decides whether to log a message based on dependencies...
Article
A global checkpoint is a set of local checkpoints, one per process. The traditional consistency criterion for global checkpoints states that a global checkpoint is consistent if it does not include messages received and not sent. The paper investigates other consistency criteria, transitlessness, and strong consistency. A global checkpoint is trans...
Article
To support incremental replay of message-passing applications, processes must periodically checkpoint and the content of some messages must be logged, to break dependencies of the current state of the execution on past events. The paper presents a new adaptive logging algorithm that dynamically decides whether to log a message based on dependencies...
Conference Paper
Full-text available
A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. The paper addresses the following important problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design a communication-induced checkpointing protocol that directs processes to take...
Conference Paper
Debugging long program runs can be difficult because of the delays required to repeatedly re-run the execution. Even a moderately long run of five minutes can incur aggravating delays. To address this problem, techniques exist that allow re-executing a distributed program from intermediate points by using combinations of checkpointing and message l...
Article
Full-text available
A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design communication-induced checkpointing protocols that direct processes to take additional...
Article
Full-text available
Consistent global checkpoints have many uses in distributed computations. A central question in applications that use consistent global checkpoints is to determine whether a consistent global checkpoint that includes a given set of local checkpoints can exist. Netzer and Xu (1995) presented the necessary and sufficient conditions under which such a...
Article
A global checkpoint is a set of local checkpoints, one per process. The traditional consistency criterion for global checkpoints states that a global checkpoint is consistent iff it does not include messages received and not sent. This paper investigates other consistency criteria, transitlessness and strong consistency. A global checkpoint is tr...
Article
Flowback analysis is a powerful technique for debugging programs. It allows the programmer to examine dynamic dependences in a program's execution history without having to re-execute the program. The goal is to present to the programmer a graphical view of the dynamic program dependences. We are building a system, called PPD, that performs flowbac...
Technical Report
Full-text available
A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following important problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design a communication-induced checkpointing protocol that directs processes to take...
Article
Full-text available
For shared-memory systems, the most commonly assumed programmer's model of memory is sequential consistency. The weaker models of weak ordering, release consistency with sequentially consistent synchronization operations, data-race-free-0, and data-race-free-1 provide higher performance by guaranteeing sequential consistency to only a restricted cl...
Conference Paper
We address a problem arising in debugging parallel programs, detecting race conditions in programs using semaphores for synchronization. It is NP-complete to detect race conditions in programs that use a polynomial number of semaphores [10]. We show in this paper that it remains NP-complete even if the programs are allowed to use only two semaphores,...
Conference Paper
We address a problem arising in debugging parallel programs, detecting race conditions in programs using a single semaphore for synchronization. It is NP-complete to detect races in programs that use many semaphores. For the case of a single semaphore, we give an algorithm that takes O(n^1.5 p) time, where p is the number of processors and n is the...
Article
In this paper we address the problem of dynamically locating unwanted nondeterminism (race conditions) in executions of explicitly parallel message-passing programs. We formally define what it means for a race to exist and show conceptually how to dynamically locate races. We also show the importance of accurate race detection as a starting point f...
Conference Paper
We present a sender-based message logging protocol for supporting fault tolerance with checkpointing and rollback recovery in distributed systems. Our scheme achieves the benefits of both optimistic and pessimistic message logging. Experimental results show that the maximum rollback induced by our protocol, and the number of messages logged, can be...
Article
As most important applications today are large-scale in nature, high-performance methods are becoming indispensable. Two promising computational paradigms for large-scale applications are dynamic and I/O-efficient computations. We give efficient dynamic data structures for several fundamental problems in computational geometry, including point loca...
Article
The overhead of saving checkpoints to stable storage is the dominant performance cost in checkpointing systems. In this paper, we present a complete study of compressed differences, a new algorithm for fast incremental checkpointing. Compressed differences reduce the overhead of checkpointing by saving only the words that have changed in the curren...
Article
Consistent global snapshots are important in many distributed applications. We prove the exact conditions for an arbitrary checkpoint, or a set of checkpoints, to belong to a consistent global snapshot, a previously open problem. To describe the conditions, we introduce a generalization of Lamport's (1978) happened-before relation called a zigzag p...
Article
Full-text available
A common debugging strategy involves re-executing a program (on a given input) over and over, each time gaining more information about bugs. Such techniques can fail on message-passing parallel programs. Because of nondeterminacy, different runs on the given input may produce different results. This non-repeatability is a serious debugging problem, s...
Conference Paper
Debugging long-running, nondeterministic message-passing parallel programs requires incremental replay, the ability to exactly replay selected parts of an execution. To support incremental replay, we must log enough messages and checkpoint processes often enough to allow any requested replay to complete quickly. We present an adaptive tracing strat...
Conference Paper
This paper presents an adaptive message logging algorithm that keeps time and space costs low by logging only a fraction of the messages. The algorithm dynamically tracks dependences among messages to determine which cause domino effects and must be traced. The domino effect can force a replay to start arbitrarily far back in the execution, and dom...
Article
Adaptive message logging, which traces dependences between messages and checkpoints and selectively logs messages, letting users accurately and efficiently replay specific portions of parallel programs, is presented. Traces are reduced by logging only messages that cannot be quickly recomputed during replay. By restarting the execution at the right...
Conference Paper
Execution replay is a debugging strategy where a program is run over and over on an input that manifests bugs. For explicitly parallel shared-memory programs, execution replay requires support of special tools --- because these programs can be nondeterministic, their executions can differ from run to run on the same input. For such programs, execut...
Article
Execution replay is a crucial part of debugging. Because explicitly parallel shared-memory programs can be nondeterministic, a tool is required that traces executions so they can be replayed for debugging. We present an adaptive tracing strategy that is optimal and records the minimal number of shared-memory references required to exactly replay ex...
Article
Full-text available
In shared-memory parallel programs that use explicit synchronization, race conditions result when accesses to shared memory are not properly synchronized. Race conditions are often considered to be manifestations of bugs since their presence can cause the program to behave unexpectedly. Unfortunately, there has been little agreement in the literatu...
Article
Full-text available
For shared-memory parallel programs that use explicit synchronization, data race detection is an important part of debugging. A data race exists when concurrently executing sections of code access common shared variables. In programs intended to be data race free, they are sources of nondeterminism usually considered bugs. Previous methods for dete...
Article
Full-text available
Flowback analysis is a powerful technique for debugging programs. It allows the programmer to examine dynamic dependences in a program's execution history without having to re-execute the program. The goal is to present to the programmer a graphical view of the dynamic program dependences. We are building a system, called PPD, that performs flowbac...
Article
Full-text available
Several methods currently exist for detecting data races in an execution of a shared-memory parallel program. Although these methods address an important aspect of parallel program debugging, they do not precisely define the notion of a data race. As a result, it is not possible to precisely state which data races are detected, nor is the meaning o...
Article
Full-text available
This paper presents results on the complexity of computing event orderings for shared-memory parallel program executions. Given a program execution, we formally define the problem of computing orderings that the execution must have exhibited or could have exhibited, and prove that computing such orderings is an intractable problem. We present a form...
Article
Full-text available
A common debugging strategy involves reexecuting a program (on a given input) over and over, each time gaining more information about bugs. Such techniques can fail on message-passing parallel programs. Because of variations in message latencies and process scheduling, different runs on the given input may produce different results. This non-repeat...

Citations

... Since then, detection algorithms have been designed for many different classes of bugs such as: race conditions [11], predicates on single global states [1], predicates based on sequences of global states [5]. Research in replaying trace computations have focussed on reducing the size of the trace by determining which events are necessary for successful replaying [9]. Our approach focusses on adding a control mechanism to the debugging process to allow computations to be run under safety constraints. ...
... They divide their work into four section including modeling and design of the systems, data collection, analysis of the collected data, and dynamic performance controlling. Also, a number of bibliographies of parallel debugging tools were presented by Pancake et al. [12] [13] [14]. ...
... Deterministic replay of multithreaded programs has several important uses. First, determinism can help developers effectively debug multithreaded programs using cyclic debugging [23] because the erroneous executions can be repeated. Furthermore, determinism is also necessary in fault detection [30], fault recovery [15], and replay-based intrusion analysis [8]. ...
... In this proof, we have to prove that if having v_a > v_b, then it is possible to conclude m is receivable at a. To prove this, we have to use one theorem (reference at Theorem 1 in [19]). The theorem is: m is receivable at a ⇔ ab (happened-before relation [13]) (*) Because of v_a > v_b, so there is no message sent from p_a to p_b received before b. ...
... Checkpointing is a well-known technique used to identify consistent global snapshots from local recorded states called checkpoints. Informally, a global snapshot is consistent if the set of checkpoints that compose it (one per process) accomplishes the following two constraints: first, all the local checkpoints in the snapshot are concurrent and no Z-path exists from one local checkpoint to another or itself [13]. This last case is called a Z − Cycle (these patterns are formally defined in Section 2.2). ...
... Starting programs from intermediate states can solve this problem. Such a solution is offered by Incremental Replay techniques [31]. They support to start a parallel program at intermediate points and investigate only a part of one process at a time. ...
... Further there are associative operations which while non-deterministic in their execution are deterministic in their results (e.g., if you add n numbers which are on different processors, the order of the addition is unimportant and you could typically just add the values you receive in messages, without worrying about the order in which the messages were sent). Non-determinacy inherently stems from races and depending on the kind of race, the behavior is either desirable or not [7]. Hence if a language does not allow the user to express desirable non-determinacy, it is sacrificing some performance. ...
... Task-based programs are susceptible to concurrency errors such as atomicity violations [81] and data races [4,23,63,64,76]. A data race occurs when two accesses, with at least one write, from different tasks are incorrectly synchronized [1]. The presence of data races in shared-memory programs often indicates the presence of other concurrency errors [24], and can affect an execution by crashing, hanging, or corrupting data [26,35]. ...
... This thesis is mainly concerned with the performance side of profiling. Numerous concurrent debugging tools, whose primary purpose is correctness, also exist [PN93]. ...
... (5) Preparing the execution environment and building the master shell script (run.sh), which will handle the whole ASR process. The shell script is executed sequentially, in order to achieve an incremental execution style, which will simplify code debugging (Netzer & Weaver, 1994). Another interesting reason for using this script file is to do parallel execution of some independent steps or some substeps of a complex step such as training on a cluster of machines. ...