Conference Paper

Debugging of Heterogeneous Parallel Systems.


Abstract

The Agora system supports the development of heterogeneous parallel programs, e.g., programs written in multiple languages and running on heterogeneous machines. Agora has been used since September 1986 in a large distributed system [1]: two versions of the application have been demonstrated in one year, contrary to the expectation of two years per version. The simplicity of debugging is one of the reasons for the productivity speedup gained. This simplicity is due both to the deeper understanding that the debugger has of parallel systems and to a novel feature: the ability to replay the execution of parallel systems built with Agora. A user can repeat a failed execution exactly, any number of times and at a slower pace. This makes it easy to identify time-dependent errors, which are peculiar to parallel and distributed systems. The debugger can also be customized to support user-defined synchronization primitives built on top of the system-provided ones. The Agora debugger tackles three sets of problems that no previous parallel debugger has simultaneously addressed: programming-in-the-large, multiple processes in different languages, and multiple machine architectures.
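The replay capability described in the abstract rests on recording the order in which processes pass through synchronization points, then enforcing that same order during re-execution. The sketch below illustrates the idea with POSIX threads; it is not Agora's implementation, and the sync_point routine, the fixed-size log, and the record/replay flag are assumptions made only for this example.

```c
/* Minimal record/replay of synchronization order (illustrative sketch only). */
#include <pthread.h>
#include <stdio.h>

#define MAX_EVENTS 1024

static int log_buf[MAX_EVENTS];     /* order of sync events observed while recording */
static int log_len = 0;             /* number of recorded events */
static int cursor = 0;              /* next event to replay */
static int replaying = 0;           /* 0 = record, 1 = replay */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

/* Every thread calls this at each synchronization point it reaches. */
static void sync_point(int thread_id)
{
    pthread_mutex_lock(&m);
    if (!replaying) {
        log_buf[log_len++] = thread_id;          /* record: remember who got here */
    } else {
        while (log_buf[cursor] != thread_id)     /* replay: wait for our turn */
            pthread_cond_wait(&c, &m);
        cursor++;
        pthread_cond_broadcast(&c);
    }
    pthread_mutex_unlock(&m);
}

static void *worker(void *arg)
{
    int id = *(int *)arg;
    for (int i = 0; i < 3; i++)
        sync_point(id);                          /* e.g. before touching shared state */
    return NULL;
}

int main(void)
{
    int ids[2] = {0, 1};
    pthread_t t[2];

    /* Record one (nondeterministic) execution. */
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);

    /* Replay it: the same interleaving of sync points is enforced. */
    replaying = 1;
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);

    for (int i = 0; i < log_len; i++) printf("%d ", log_buf[i]);
    printf("\n");
    return 0;
}
```

Recorded once, the interleaving can then be reproduced as many times as needed, which is what makes time-dependent errors repeatable.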


... In other cases, such as one task mapping a page into the memory of another task, there is no system call to use as a handle for logging. This is similar to the shared memory problem, and will not be supported in the first version of ORM. ...
Article
We have built a software layer on top of Mach 2.5 that recovers multitask Mach applications from fail-stop failures. The layer implements Optimistic Recovery (OR), a mechanism for transparent recovery from failing tasks and processors, based on asynchronous checkpointing and logging of inter-task messages. OR recovers from failure by restoring a checkpoint and replaying the logged messages. The current prototype supports message communication via sends and receives, simple port operations, and task interactions through the environment manager. This paper discusses the issues Mach raised for this implementation, the structure of the OR layer, the design of future enhancements, and comparisons with other recovery techniques. During the lifetime of a distributed system one or more of its tasks may fail. Rather than abort the entire system due to a single failure, it is preferable to make the computation fault-tolerant; that is, the computation should continu...
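The recovery mechanism summarized above hinges on re-delivering logged messages after a checkpoint is restored, instead of receiving them again from the network. The fragment below is a minimal single-process illustration of that log-then-replay pattern; the deliver function, the log layout, and the recovery flag are inventions for the example, not the structure of the OR layer.

```c
/* Illustrative sketch of message logging and replay after a failure. */
#include <stdio.h>

#define LOG_CAP 128

struct msg { int src; char data[32]; };

static struct msg log_buf[LOG_CAP];   /* log of delivered messages */
static int logged = 0;                /* messages written to the log */
static int replayed = 0;              /* messages re-delivered so far */

/* Deliver the next message: during normal operation, log it as a side
 * effect; during recovery, take it from the log instead of the network. */
static struct msg deliver(int recovering, struct msg incoming)
{
    if (recovering && replayed < logged)
        return log_buf[replayed++];   /* deterministic re-delivery */
    log_buf[logged++] = incoming;     /* normal path: log before delivery */
    return incoming;
}

int main(void)
{
    struct msg a = {1, "request"}, b = {2, "reply"}, none = {0, ""};

    /* Normal run: messages arrive and are logged. */
    deliver(0, a);
    deliver(0, b);

    /* After a crash, the task restores its checkpoint and replays the log. */
    replayed = 0;
    for (int i = 0; i < logged; i++) {
        struct msg m = deliver(1, none);
        printf("replayed message from task %d: %s\n", m.src, m.data);
    }
    return 0;
}
```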
... Providing debugging support, in the form of debugging tools, should therefore focus on the properties of the target paradigm. Indeed, both Forin (1988) and Auguston, Jeffery & Underwood (2002) remark that a common goal of supporting the debugging process is to assist developers in understanding the system in a way that is consistent with the abstractions they have used to describe it. A further reason for providing debugging support that is focused on the target paradigm is given by Tukiainen (2000). ...
... In Recap [21], the compiler inserts code for every read that may access a shared memory location. Agora [9] uses write-once memory and a history log for maintaining the correct state of a program. Instant Replay [16] logs execution by using reader-writer locking semantics for accessing shared memory. ...
Conference Paper
Full-text available
While deterministic replay of parallel programs is a powerful technique, current proposals have shortcomings. Specifically, software-based replay systems have high overheads on multiprocessors, while hardware-based proposals focus only on basic hardware-level mechanisms, ignoring the overall replay system. To be practical, hardware-based replay systems need to support an environment with multiple parallel jobs running concurrently -- some being recorded, others being replayed, and even others running without recording or replay. They also need to manage limited-size log buffers. This paper addresses these shortcomings by introducing, for the first time, a set of abstractions and a software-hardware interface for practical hardware-assisted replay of multiprocessor systems. The approach, called Capo, introduces the novel abstraction of the Replay Sphere to separate the responsibilities of the hardware and software components of the replay system. In this paper, we also design and build CapoOne, a prototype of a deterministic multiprocessor replay system that implements Capo using Linux and simulated DeLorean hardware. Our evaluation of 4-processor executions shows that CapoOne largely records with the efficiency of hardware-based schemes and the flexibility of software-based schemes.
Article
Architectures for deterministic record-replay (R&R) of multithreaded code are attractive for program debugging, intrusion analysis, and fault-tolerance uses. However, very few of the proposed designs have focused on maximizing replay speed -- a key enabling property of these systems. The few efforts that focus on replay speed require intrusive hardware or software modifications, or target whole-system R&R rather than the more useful application-level R&R. This paper presents the first hardware-based scheme for unintrusive, application-level R&R that explicitly targets high replay speed. Our scheme, called Cyrus, requires no modification to commodity snoopy cache coherence. It introduces the concept of an on-the-fly software Backend Pass during recording which, as the log is being generated, transforms it for high replay parallelism. This pass also fixes-up the log, and can flexibly trade-off replay parallelism for log size. We analyze the performance of Cyrus using full system (OS plus hardware) simulation. Our results show that Cyrus has negligible recording overhead. In addition, for 8-processor runs of SPLASH-2, Cyrus attains an average replay parallelism of 5, and a replay speed that is, on average, only about 50% lower than the recording speed.
Conference Paper
There has been significant interest in hardware-assisted deterministic Record and Replay (RnR) systems for multithreaded programs on multiprocessors. However, no proposal has implemented this technique in a hardware prototype with full operating system support. Such an implementation is needed to assess RnR practicality. This paper presents QuickRec, the first multicore Intel Architecture (IA) prototype of RnR for multithreaded programs. QuickRec is based on QuickIA, an Intel emulation platform for rapid prototyping of new IA extensions. QuickRec is composed of a Xeon server platform with FPGA-emulated second-generation Pentium cores, and Capo3, a full software stack for managing the recording hardware from within a modified Linux kernel. This paper's focus is understanding and evaluating the implementation issues of RnR on a real platform. Our effort leads to some lessons learned, as well as to some pointers for future research. We demonstrate that RnR can be implemented efficiently on a real multicore IA system. In particular, we show that the rate of memory log generation is insignificant, and that the recording hardware has negligible performance overhead. However, the software stack incurs an average recording overhead of nearly 13%, which must be reduced to enable always-on use of RnR.
Conference Paper
Shared-memory parallel programs can be highly nondeterministic due to the unpredictable order in which shared references are satisfied. However, deterministic execution is extremely important for debugging, and can also be used for fault tolerance and other replay-based algorithms. We present a hardware/software design that allows the order of memory references in a parallel program to be logged efficiently by recording a subset of the cache traffic between memory and the CPUs. This log can then be used along with hardware and software control to replay execution. Simulation of several parallel programs shows that our device records no more than 1.17 MB/second for an application exhibiting fine-grained sharing behavior on a 16-way multiprocessor consisting of 12 MIPS CPUs. In addition, no probe effect or performance degradation is introduced. This represents several orders of magnitude improvement in both performance and log size over purely software-based methods proposed previously.
Article
Developing distributed systems includes activities such as testing, verification, debugging and performance analysis. We identify the set of generic events that are commonly monitored by different development tools, and the characteristics of a generic monitoring tool. We then introduce new language-based constructs to be used for the development of event-based development tools for distributed systems. Those constructs are easy to use, and provide accurate information about the monitored system without causing any perturbation.
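One way to make the proposal concrete is to treat event reporting as something the developer writes directly in the program text. The sketch below approximates such a construct with a C macro that timestamps and buffers events; the macro name, the event fields, and the local buffer are invented for the example and are not the constructs defined in the article.

```c
/* Sketch of a language-level event-reporting construct, approximated by a
 * C macro.  All names and structures here are inventions for the example. */
#include <stdio.h>
#include <time.h>

struct event { const char *kind; const char *where; long when; };

#define EVENT_CAP 256
static struct event events[EVENT_CAP];
static int nevents = 0;

/* Record an event with its source location; a real monitor would ship this
 * to an external tool instead of buffering it locally. */
#define REPORT_EVENT(kind) \
    do { \
        if (nevents < EVENT_CAP) \
            events[nevents++] = (struct event){ (kind), __FILE__, (long)time(NULL) }; \
    } while (0)

static void handle_request(void)
{
    REPORT_EVENT("message-received");
    /* ... application work ... */
    REPORT_EVENT("message-handled");
}

int main(void)
{
    handle_request();
    for (int i = 0; i < nevents; i++)
        printf("%s at %s (t=%ld)\n", events[i].kind, events[i].where, events[i].when);
    return 0;
}
```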
Thesis
Full-text available
Debugging is a process that involves establishing relationships between several entities: The behavior specified in the program, P, the model/predicate of the expected behavior, M, and the observed execution behavior, E. The thesis of the unified approach is that a consistent representation for P, M and E greatly simplifies the problem of concurrent debugging, both from the viewpoint of the programmer attempting to debug a program and from the viewpoint of the implementor of debugging facilities. Provision of such a consistent representation becomes possible when sequential behavior is separated from concurrent or parallel structuring. Given this separation, the program becomes a set of sequential actions and relationships among these actions. The debugging process, then, becomes a matter of specifying and determining relations on the set of program actions. The relations are specified in P, modeled in M and observed in E. This simplifies debugging because it allows the programmer to think in terms of the program which he understands. It also simplifies the development of a unified debugging system because all of the different approaches to concurrent debugging become instances of the establishment of relationships between the actions. The unified approach defines a formal model for concurrent debugging in which the entire debugging process is specified in terms of program actions. The unified model places all of the approaches to debugging of parallel programs such as execution replay, race detection, model/predicate checking, execution history displays and animation, which are commonly formulated as disjoint facilities, in a single, uniform framework. We have also developed a feasibility demonstration prototype implementation of this unified model of concurrent debugging in the context of the CODE 2.0 parallel programming system. This implementation demonstrates and validates the claims of integration of debugging facilities in a single framework. It is further the case that the unified model of debugging greatly simplifies the construction of a concurrent debugger. All of the capabilities previously regarded as separate for debugging of parallel programs, both in shared memory models of execution and distributed memory models of execution, are supported by this prototype.
Article
Full-text available
The debugging cycle is the most common methodology for finding and correcting errors in sequential programs. Cyclic debugging is effective because sequential programs are usually deterministic. Debugging parallel programs is considerably more difficult because successive executions of the same program often do not produce the same results. In this paper we present a general solution for reproducing the execution behavior of parallel programs, termed Instant Replay. During program execution we save the relative order of significant events as they occur, not the data associated with such events. As a result, our approach requires less time and space to save the information needed for program replay than other methods. Our technique is not dependent on any particular form of interprocess communication. It provides for replay of an entire program, rather than individual processes in isolation. No centralized bottlenecks are introduced and there is no need for synchronized clocks or a globally consistent logical time. We describe a prototype implementation of Instant Replay on the BBN Butterfly™ Parallel Processor, and discuss how it can be incorporated into the debugging cycle for parallel programs.
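Instant Replay's central idea is to log the relative order of accesses to shared objects, for instance as per-object version numbers, rather than the data that was read or written. The sketch below shows only that bookkeeping; the real protocol also counts readers per version and runs concurrently, and all names here are invented for the example.

```c
/* Sketch of Instant Replay-style bookkeeping: record the *version* of a
 * shared object touched by each access, not the data itself. */
#include <stdio.h>

struct shared {
    int value;        /* the shared data itself (not logged) */
    unsigned version; /* incremented on every write */
};

struct access { int is_write; unsigned version; };

#define TRACE_CAP 64
static struct access trace[TRACE_CAP];
static int trace_len = 0;

static int read_shared(struct shared *s)
{
    trace[trace_len++] = (struct access){0, s->version}; /* log version read */
    return s->value;
}

static void write_shared(struct shared *s, int v)
{
    s->value = v;
    s->version++;                                        /* new version */
    trace[trace_len++] = (struct access){1, s->version}; /* log version written */
}

int main(void)
{
    struct shared x = {0, 0};
    write_shared(&x, 10);
    (void)read_shared(&x);
    write_shared(&x, 20);

    /* On replay, each access waits until the object reaches the logged
     * version, which reproduces the original interleaving. */
    for (int i = 0; i < trace_len; i++)
        printf("%s version %u\n", trace[i].is_write ? "wrote" : "read",
               trace[i].version);
    return 0;
}
```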
Article
Full-text available
Interlisp is a programming environment based on the LISP programming language [Charniak et al. 1980] [Friedman 1974]. In widespread use in the artificial intelligence community, Interlisp has an extensive set of user facilities, including syntax extension, uniform error handling, automatic error correction, an integrated structure-based editor, a sophisticated debugger, a compiler, and a filing system. Its most popular implementation is Interlisp-10, which runs under both the Tenex and Tops-20 operating systems for the DEC PDP-10 family. Interlisp-10 now has approximately 300 users at 20 different sites (mostly universities) in the US and abroad. It is an extremely well documented and maintained system. Interlisp has been used to develop and implement a wide variety of large application systems. Examples include the Mycin system for infectious disease diagnosis [Shortliffe 1976], the Boyer-Moore theorem prover [Boyer and Moore 1979], and the BBN speech understanding system [Wolf and Woods 1980]. This article describes the Interlisp environment, the facilities available in it, and some of the reasons why Interlisp developed as it has.
Article
Concurrent and distributed programs are hard to debug. This thesis argues that structuring activities as nested atomic actions can make debugging such programs much like debugging traditional sequential programs. To support the argument, the author presents a method for debugging computations in the Argus language and system. Her method is applicable to other action systems since it depends only on the atomicity properties of actions. To debug a computation in our method, the user inspects a serial execution that is equivalent to the original computation. The debugging process involves two phases. In the first phase, the user examines pre- and post-states of actions to isolate the action that exposes the bug. In the second phase, the debugging system re-executes code to reproduce the details of the culprit action. The user can repeat this re-execution and can use standard break-and-examine tools on it to isolate the bug. The debugging system supports the method by saving a partial history when an action runs. This history consists mainly of recovery versions of objects. The system also timestamps the termination of actions so it can determine from the saved versions the values of objects in an action's pre- and post-states. The debugging system itself uses pre-states to repeat actions. This thesis presents the first detailed design that uses recovery versions and timestamps for debugging. (Author)
Article
Most extant debugging aids force their users to think about errors in programs from a low-level, unit-at-a-time perspective. Such a perspective is inadequate for debugging large complex systems, particularly distributed systems. In this paper, we present a high-level approach to debugging that offers an alternative to the traditional techniques. We describe a language, edl, developed to support this high-level approach to debugging and outline a set of tools that has been constructed to effect this approach. The paper includes an example illustrating the approach and discusses a number of problems encountered while developing these debugging tools.
Article
The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.
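The clock rules the abstract refers to are compact enough to state as code: a process increments its counter on every local and send event, and on receiving a message sets the counter to one more than the maximum of its own value and the message's timestamp. Below is a minimal sketch of those two rules; the process structure and function names are assumptions for the example.

```c
/* Minimal Lamport logical clock: tick on local events, and on receipt take
 * max(local, timestamp) + 1.  Illustrative only. */
#include <stdio.h>

struct process { int id; unsigned clock; };

static unsigned local_event(struct process *p)
{
    return ++p->clock;
}

static unsigned send_event(struct process *p)        /* timestamp carried by message */
{
    return ++p->clock;
}

static unsigned recv_event(struct process *p, unsigned ts)
{
    p->clock = (p->clock > ts ? p->clock : ts) + 1;  /* max(C, ts) + 1 */
    return p->clock;
}

int main(void)
{
    struct process a = {0, 0}, b = {1, 0};

    local_event(&a);                 /* a: 1 */
    unsigned ts = send_event(&a);    /* a: 2, message carries 2 */
    local_event(&b);                 /* b: 1 */
    recv_event(&b, ts);              /* b: max(1, 2) + 1 = 3 */

    printf("a=%u b=%u\n", a.clock, b.clock);  /* prints a=2 b=3 */
    return 0;
}
```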
Article
The authors discuss Symbolics Genera, a mature software development environment that integrates features normally found in an operating system, its utilities, and the applications running under it. An environment intended for developing large, complex software systems, Genera also supports its own development and maintenance as a commercial product. Its development methodology, requirements, and implementation are described. Other issues and aspects of Genera that are examined are data-level integration, incremental change, openness, reusability, extensibility, and transparency.
Article
A system called Agora was designed and implemented that supports the development of multilanguage parallel applications for heterogeneous machines. Agora hinges on two ideas: the first is that shared memory can be a suitable abstraction to program concurrent, multilanguage modules running on heterogeneous machines. The second is that a shared memory abstraction can be efficiently supported across different computer architectures that are not connected by a physical shared memory, e.g., local area network workstations or ensemble machines. Agora has been in use for more than a year. The authors describe the Agora shared memory and its software implementation on both tightly and loosely coupled architectures. Measurements of the current implementation are also included.
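The central claim is that a shared-memory abstraction can be layered over machines that share no physical memory because every access goes through the layer, which is then free to satisfy it locally or with messages. The fragment below sketches the shape of such an interface in its simplest form; the sm_read/sm_write names and the flat table of named elements are assumptions for the example, not Agora's actual interface.

```c
/* Sketch of a software shared-memory interface: named elements are read and
 * written through calls, so the layer can back them with local memory or
 * with messages to a remote node.  Names and types invented for the example. */
#include <stdio.h>
#include <string.h>

#define MAX_ELEMS 16

struct element { char name[32]; int value; int valid; };

static struct element store[MAX_ELEMS];   /* stand-in for the shared space */

static void sm_write(const char *name, int value)
{
    for (int i = 0; i < MAX_ELEMS; i++) {
        if (!store[i].valid || strcmp(store[i].name, name) == 0) {
            snprintf(store[i].name, sizeof store[i].name, "%s", name);
            store[i].value = value;
            store[i].valid = 1;
            return;   /* a distributed version would also propagate an update */
        }
    }
}

static int sm_read(const char *name, int *out)
{
    for (int i = 0; i < MAX_ELEMS; i++)
        if (store[i].valid && strcmp(store[i].name, name) == 0) {
            *out = store[i].value;
            return 1;
        }
    return 0;         /* a distributed version might fetch from a remote node */
}

int main(void)
{
    int v;
    sm_write("hypotheses", 42);
    if (sm_read("hypotheses", &v))
        printf("hypotheses = %d\n", v);
    return 0;
}
```

Because callers never hold raw pointers into the store, the same calls can be implemented with local memory on a shared-memory multiprocessor and with message exchanges on a network of workstations.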
CFrames: A Miniature Frame System
  • R. A. Lerner
  • M. M. Bauer
A Modula-2+ Debugger
  • J. DeTreville
Loupe
  • J. DeTreville
DBX(1), Unix User's Reference Manual
  • University of California at Berkeley, 1986