Breakpoints and Halting in Distributed Programs
Barton P. Miller
Jong-Deok Choi
Computer Sciences Department
University of Wisconsin-Madison
1210 W. Dayton Street
Madison, Wisconsin 53706
Abstract
Interactive debugging requires that the programmer be able to halt a program at interesting
points in its execution. This paper presents an algorithm for halting a distributed program in a
consistent state, and presents a definition of distributed breakpoints with an algorithm for
implementing the detection of these breakpoints. The Halting Algorithm extends Chandy and
Lamport's algorithm for recording global state and solves the problem of processes that are not
fully connected or that communicate infrequently. The definition of distributed breakpoints is
based on those events that can be detected in a distributed system. Events that can be partially
ordered are detectable and form the basis for the breakpoint predicates, and from the breakpoint
definition comes the description of an algorithm that can be used in a distributed debugger to
detect these breakpoints.
Index Items - Distributed Programming, Distributed Debugging, Halting Algorithm, Distributed
Breakpoints.
1. Introduction
Interactive debugging requires that the programmer be able to halt a program at interesting points in
its execution. Halting consists of the mechanisms to stop the program's execution and the predicates,
called breakpoints, that are used to trigger the halting. This paper presents an algorithm for halting a
distributed program in a consistent state, and presents a definition of distributed breakpoints with an
algorithm for implementing these breakpoints.
Halting a single-process, sequential program is well-understood. There is a single thread of
execution that can be stopped without regard for other activities in the system. When a program
consists of cooperating processes executing on different machines, halting decisions are affected by
unpredictable communication delays between machines. We cannot instantly transmit a command to
halt all processes,
nor can we guarantee that the halt command will simultaneously reach all processes. In Section 2 of this
paper we present an algorithm for consistently halting a distributed program given the inherent
communication delays. This algorithm is derived from Chandy and Lamport's algorithm for recording
global state [1], and extends this algorithm to work for processes that communicate infrequently or are
not fully connected.

(Research supported by National Science Foundation grant MCS-8105904 and a Digital Equipment
Corporation External Research Grant.)
Breakpoints in a sequential program have an implied reference to time. When we say ‘‘stop when
procedure X is entered or when procedure Y is entered’’, we mean to stop the program when any of these
conditions becomes true. When we say ‘‘stop when procedure X is entered and i[j]=7’’, we mean to stop
the program when, at the same instant, both of these conditions are true.
We have no single, global notion of time in a distributed system [2], so we may not be able to
determine whether one condition really occurred before another. This means that we will have to
tolerate breakpoints that occur independently on different machines. Likewise, we cannot determine
whether events on different machines occurred simultaneously. This means that we must replace the
concept of simultaneous events with one that is suitable for a distributed system. In Section 3, we
present a definition of predicates for breakpoints in a distributed program. This definition is based on
detectable orderings of events. We describe an algorithm from which one can implement a satisfaction
detector for these predicates.
Section 4 discusses the application of these ideas to current research in distributed debugging.
2. Consistent Halting
This section describes how to halt all processes belonging to a distributed program so that no critical
information is lost when the processes halt. This problem is easy to solve for a single machine, because
there is only one active process at a given moment. When processes of the same program reside on
different machines, they cannot be stopped simultaneously. Therefore, some information may be lost or
recorded incorrectly.
Our halting algorithm is derived from Chandy and Lamport's algorithm for recording global states
[1]. We first summarize Chandy and Lamport's algorithm and then present an algorithm to halt the
distributed computation in such a way that, in spite of the time delay in halting processes, the final
halted states of the processes of the computation result in globally consistent states. Although the
physical instant of halting each process by our algorithm is different, we show that all the processes halt
at the same virtual time instant [2]. For any two halted processes of a computation, the halted state of a
process is not affected by the halted state of the other process and, therefore, there can be no
happened-before [2] relationship between the two halted states. Each process's view of event ordering
is preserved by our algorithm.
We show some problems with this basic halting algorithm and then present an extended algorithm
that is suitable for a debugger.
2.1. Chandy and Lamport’s Algorithm
A distributed program consists of a finite number of processes and a finite number of channels
between the processes. Figure 1 shows an example where each process is represented by a circle and
channels are represented by directed edges.
[Figure 1. A Distributed System: processes p and q, connected by channels c1 and c2.]
Processes in a distributed program communicate by sending and receiving messages. Channels are
assumed to have infinite buffers, to be error-free and to deliver messages in the order sent. Following are
some definitions from [1].
Definitions:
An event e is a 5-tuple <p,s,ss,M,c> where p is a process, s and ss are states of the process
before and after the event, M and c are the message and the channel through which the message
is sent or received by p at that event. M and c can have the special value null if no message is
involved in the event.
A global state S_r consists of the states of the processes of the computation and the states of the channels.
We briefly restate Chandy and Lamport’s algorithm, which we will call the C&L Algorithm, to
record the global state. In that algorithm, each process records its own state, and the two processes upon
which a channel is incident cooperate in recording the channel state. The algorithm, which can be initiated
independently by more than one process at the same time, is as follows:
C&L Algorithm:
    Marker-Sending Rule for a Process p.
        For each channel c, incident on, and directed away from p:
            p sends one marker along c after p records its state and before p sends
            further messages along c.
    Marker-Receiving Rule for a Process q.
        On receiving a marker along a channel c:
            if q has not recorded its state then
                q records its state;
                q records the state of c as the empty sequence
            else
                q records the state of c as the sequence of messages received along c
                after q's state was recorded and before q received the marker along c
Theorem 1.
    The recorded state S_r is globally consistent.
Proof:
    See [1].
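The marker rules above can be illustrated with a minimal, single-threaded Python sketch. This is our
illustration, not code from the paper: the two processes and the channel names c1 and c2 follow Figure 1,
but the message m1 and the scheduling of events are invented for the example.

```python
from collections import deque

MARKER = object()  # distinguished marker token, not an application message

# FIFO channels of the two-process system in Figure 1
c1 = deque()  # p -> q
c2 = deque()  # q -> p

snapshot = {}  # recorded process and channel states

# p sends an application message; it is still in flight on c1
c1.append("m1")

# q spontaneously initiates the snapshot: it records its own state, then sends
# a marker on every outgoing channel before any further messages
snapshot["q"] = "q0"
c2.append(MARKER)
q_pending_on_c1 = []  # q now records messages arriving on its incoming channel c1

# p receives the marker on c2; p has not recorded its state yet, so it records
# its state, records c2 as the empty sequence, and relays the marker on c1
assert c2.popleft() is MARKER
snapshot["p"] = "p0"
snapshot["c2"] = []
c1.append(MARKER)

# q drains c1: messages received before the marker form c1's recorded state
msg = c1.popleft()
while msg is not MARKER:
    q_pending_on_c1.append(msg)
    msg = c1.popleft()
snapshot["c1"] = q_pending_on_c1

print(snapshot)  # {'q': 'q0', 'p': 'p0', 'c2': [], 'c1': ['m1']}
```

The recorded global state captures m1 as in flight on c1, even though no process ever observed the
whole system at one instant.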
2.2. Halting Algorithm
We now present an algorithm to halt the processes to yield a globally consistent halted state S_h that
is equivalent to the recorded state S_r. These states are equivalent in the sense that the state of each
process in the halted global state S_h is the same as the state of each process in the recorded global state
S_r, and the undelivered messages in each channel in S_h are the same as the recorded messages in the
state of the channel in S_r. We begin the discussion with the same model as in [1].
2.2.1. Basic Algorithm
Our model is the same as in [1], except that we use a halt marker instead of a marker. This halt
marker carries with it a sequence number referred to as halt_id. This halt_id enables each process to
distinguish an old halt marker (to ignore) from a new halt marker. Each process also keeps track of the
latest halt_id received as last_halt_id, whose value is initially set to zero. Like the C&L Algorithm,
halting can be initiated spontaneously by more than one process. The decision as to when to halt can be
made independently by each process. We discuss how to set and detect breakpoints in a distributed
debugging system in Section 3.
Halting Algorithm:
    Marker-Sending Rule for a Process p.
        Increment last_halt_id;
        Halt Routine (p)
    Marker-Receiving Rule for a Process q.
        On receiving a halt marker along a channel c:
            Compare the halt_id with its last_halt_id;
            if halt_id is greater than last_halt_id then
                Update last_halt_id;
                Halt Routine (q);
            else
                Ignore;
    Halt Routine (x: process):
        For each channel c, incident on and directed away from process x, send a halt marker
        with a halt_id equal to the last_halt_id along c;
        Halt;
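The rules above can be sketched as a small Python class. This is our illustration, not the paper's code;
the class and method names are invented, and message delivery is driven by hand rather than by a real
network.

```python
from collections import deque

class HaltingProcess:
    """Sketch of one process's halt-marker logic; names are illustrative."""

    def __init__(self, name, out_channels):
        self.name = name
        self.out = out_channels   # FIFO channels directed away from this process
        self.last_halt_id = 0     # latest halt_id received, initially zero
        self.halted = False

    def halt_routine(self):
        # Send a halt marker on every outgoing channel, then halt.
        for c in self.out:
            c.append(("HALT", self.last_halt_id))
        self.halted = True

    def initiate_halt(self):
        # Marker-Sending Rule: a spontaneous decision to halt.
        self.last_halt_id += 1
        self.halt_routine()

    def receive_halt_marker(self, halt_id):
        # Marker-Receiving Rule: stale halt_ids are left over from previous
        # haltings and are ignored.
        if halt_id > self.last_halt_id:
            self.last_halt_id = halt_id
            self.halt_routine()

c = deque()                        # channel p -> q
p = HaltingProcess("p", [c])
q = HaltingProcess("q", [])
p.initiate_halt()                  # p halts spontaneously
_, hid = c.popleft()               # q receives the halt marker on c
q.receive_halt_marker(hid)
assert p.halted and q.halted and p.last_halt_id == q.last_halt_id == 1
```

When both processes have halted, their last_halt_id values agree, matching the claim below that each
last_halt_id is incremented exactly once per halting.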
A process halts either by receiving a halt marker from any one of its adjacent neighboring processes
or by spontaneously deciding to halt. If a process halts by receiving a halt marker, it does so on
receiving the first halt marker with the new halt_id (old halt markers are left over from previous
haltings). When all processes are finally halted, the state of each process is preserved. Each outgoing
channel contains undelivered messages with a halt marker as the last one, or is empty if the halt marker
was delivered to the receiving process. Given the assumptions of reliable channels and each process
observing the same algorithm, it can be shown that when all processes halt, the value of each process's
last_halt_id is the same. This is true because the initial value of each last_halt_id is zero and gets
incremented exactly once during the Halting Algorithm (since a process can halt only once). The global
halted state S_h consists of the halted states of the processes and undelivered messages in channels. We
claim that S_h is the same as S_r in the sense that
(1) the state of each process in S_h is the same as the recorded state of the corresponding process in
    S_r; and
(2) the undelivered messages in each channel in S_h are the same as the recorded state of the
    corresponding channel in S_r.
We begin the proof of our claim with two lemmas.
Lemma 2.1.
    The halted state of each process in S_h is the same as the recorded state of the process in S_r.
Proof:
    The Halting Algorithm is structurally identical to the C&L Algorithm. In the Halting Algorithm,
    each process halts at the instant it would record its state in the C&L Algorithm. So the halted
    state of each process p in S_h is the same as the recorded state of the process in S_r.
Lemma 2.2.
    The undelivered messages in each channel in S_h are the same as the recorded messages of the
    state of the corresponding channel in S_r.
Proof:
    In the Halting Algorithm, a process p halts as soon as it receives a halt marker on any one of its
    incoming channels. When a halt marker is received on a channel, we know that the channel is
    now empty, since the process that was sending on the channel halted as soon as it sent the halt
    marker on the channel. All of process p's other incoming channels will contain pending
    messages. Since each process sends a halt marker before it halts, the last message in each of
    these pending channels is the halt marker. Therefore, the state of an incoming channel c of a
    process p in S_h either consists of (zero or more) pending messages followed by a halt marker or
    is empty.
    In the C&L Algorithm, each process proceeds with its computation after it records its state when
    it receives the first marker from any of the incoming channels. The state of an incoming channel
    c of a process p in S_r consists of the sequence of recorded messages received on the channel
    until a marker is received on the channel. Since each process in the C&L Algorithm sends a
    marker at the instant it would send a halt marker in the Halting Algorithm, the sequence of
    recorded messages in S_r received on each incoming channel c until a marker is received is the
    same as the stored messages in the channel c in S_h.
Theorem 2.
    S_h is the same as S_r.
Proof:
    The proof follows from Lemma 2.1 and Lemma 2.2.
2.2.2. Problems with the Basic Algorithm
There are two problems with our Halting Algorithm that also occur in the C&L Algorithm. The first
problem is how to halt a process that has only infrequent interactions with the other processes of the com-
putation. The process would eventually halt, potentially long after all other processes have halted. Even
though nothing is conceptually wrong with this kind of process, it is awkward in practice.
The second problem is one that can make both the Halting Algorithm and the C&L Algorithm fail.
This problem occurs when the network connection is acyclic, as in a producer-consumer or pipeline
relationship. Figure 2 shows an example of this case.
[Figure 2. Producer - Consumer Connection: processes p and q in a one-way producer-consumer
relationship.]
If halting is initiated by the consumer process in this example, there is no way to send the halt
marker to the producer process to halt the entire computation. The C&L Algorithm avoids this problem
by assuming that the processes are strongly connected.
2.2.3. Extended Model
We now present our model of the interactive distributed debugging system that works with our
Halting Algorithm and solves the problems mentioned above. In our extended model, there is an
additional process d, the debugger process of the system, and there are two additional control channels
connecting the debugger process with each user process. The introduction of a debugger process not
only solves the problems mentioned above but is also a natural structure for an interactive debugging
system [3].
[Figure 3. A Distributed System with a Debugger Process: debugger process d connected by control
channels to user processes p and q.]
Figure 3 shows the model with user processes p, q and debugger process d. Since each process has
two control channels, one to and one from the debugger process, the network is strongly connected, i.e.,
there is always a message path from a process to any other process. In addition to guaranteeing strong
connectivity of the network, the debugger process performs the typical functions of a debugger. The
algorithm to halt the computation need not be changed, except that the debugger process d never really
halts and user processes are always willing to accept a message from the debugger process.
2.2.4. Order of Halting
A process may have more than one incoming channel. This means that a halt marker could be
received from any one of the processes attached to these channels, depending on when and from where the
halting is initiated. The order in which the processes halt can provide useful information to the
programmer, but this information is not available in our Halting Algorithm.
The halting order information can be obtained by making a small change to the halt markers while
leaving the structure of the Halting Algorithm unchanged. Each process will append its name to the halt
marker before sending the marker to the next process(es). The halt marker that a process receives then
describes which processes have already been halted.
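The name-appending change only affects what travels inside the marker, as this small Python sketch
shows (ours, with invented channel and process names; the rest of the Halting Algorithm is unchanged
and omitted here):

```python
from collections import deque

def forward_halt_marker(process_name, marker_path, out_channels):
    """Append this process's name to the halt marker before relaying it."""
    path = marker_path + [process_name]
    for c in out_channels:
        c.append(path)
    return path

c_pq, c_qr = deque(), deque()   # hypothetical channels p -> q and q -> r
forward_halt_marker("p", [], [c_pq])          # p initiates: marker carries ['p']
path_at_q = c_pq.popleft()
forward_halt_marker("q", path_at_q, [c_qr])   # q relays: marker carries ['p', 'q']
print(c_qr[0])  # ['p', 'q'] -- r learns that p and q have already halted
```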
3. Distributed Breakpoints
3.1. Types of Breakpoints in a Distributed Debugger
In sequential programming, the decision to halt the program is usually made by specifying
predicates about the program's behavior and state. The satisfaction of these predicates corresponds to
interesting points in the execution of the program, which we call breakpoints. The predicates are
expressed in terms of events that correspond to a particular behavior or change in state of the program.
A predicate that is based entirely on the execution behavior or state of a single process is called a
Simple Predicate. We can combine Simple Predicates using the disjunctive operator to make a
Disjunctive Predicate. Likewise, we can combine Simple Predicates using the conjunctive operator to
make a Conjunctive Predicate.
Predicates can also be combined to describe a sequence of events. For example, a user may want to
halt a program and examine its state when a specified sequence of events is observed during the execution
of the program. We call such predicates Linked Predicates. Linked Predicates have been used with
hardware-based debugging tools such as logic state analyzers. For example, the programmer specifies a
non-contiguous sequence of values (such as program addresses) that must occur and the debugging tool
detects when this sequence has occurred.
There is usually more than one thread of control in a distributed program, and the breakpoint
predicates can involve more than one process. We call such predicates distributed predicates. We now
describe distributed predicates and how to detect the satisfaction of these predicates. When the
distributed predicate is satisfied, the Halting Algorithm presented in Section 2 is used to halt the
computation.
3.2. Simple Predicates (SP)
Simple Predicates consist of the typical predicates used in sequential program debuggers, such as
entering a particular procedure. We also have interprocess event predicates such as a message sent or
received, a channel created or destroyed, or a process created or terminated.
3.3. Disjunctive Predicates (DP)
Disjunctive Predicates are specified by expressions using the disjunctive operator ‘‘∨’’:
    DP ::= SP [∨ SP].
The Disjunctive Predicate is satisfied when one or more of the Simple Predicates is satisfied. Halting
can be initiated at the instant when any of the SP's of the DP is satisfied. Multiple SP's of the DP can
be satisfied at the same virtual time. Since the Halting Algorithm works for simultaneous initiations
from multiple processes, each process where any SP is satisfied can initiate the Halting Algorithm.
3.4. Linked Predicates (LP)
Linked Predicates specify sequences of events that can be ordered by the happened-before relation
and are specified by expressions using the ‘‘→’’ operator:
    LP ::= DP [→ DP].
The semantics of LP can be interpreted as follows:
    Let Σ be the set of DP_i's such that Σ = {DP_i, i = 1..n}.
    Then, the Linked Predicate
        LP = DP_i → DP_j → DP_k . . .    (1 ≤ i, j, k ≤ n)
    means the following regular expression:
        LP = DP_i [Σ - DP_j]* DP_j [Σ - DP_k]* DP_k . . .
The implementation of the Linked Predicates will be described in Section 3.6.
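Under this regular-expression reading, a trace checker can be sketched in a few lines of Python. This is
our illustration, not part of the paper: the DP names in the examples are invented, and satisfaction is
declared the moment the last DP of the chain is seen, so events after it are ignored (matching the later
remark that halting is initiated when the last predicate is satisfied).

```python
def satisfies_lp(trace, lp):
    """Check a run against LP = d1 [Σ-d2]* d2 [Σ-d3]* d3 ...

    trace: the sequence of Disjunctive Predicates satisfied during a run,
    each named by a string; lp: the chain d1 -> d2 -> ... as a list.
    """
    events = iter(trace)
    if next(events, None) != lp[0]:   # the run must begin with d1
        return False
    for target in lp[1:]:
        for ev in events:
            if ev == target:          # [Σ - target]* consumed, target reached
                break
        else:
            return False              # trace ended before this DP occurred
    return True

assert satisfies_lp(["A", "C", "C", "B"], ["A", "B"])   # matches A [Σ-B]* B
assert not satisfies_lp(["A", "C"], ["A", "B"])         # B never occurs
assert not satisfies_lp(["C", "A", "B"], ["A", "B"])    # run does not begin with A
```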
3.5. Conjunctive Predicates (CP)
The Conjunctive Predicates are specified by expressions using the conjunctive operator ‘‘∧’’:
    CP ::= SP [∧ SP].
A Conjunctive Predicate is said to be satisfied at the instant when all the Simple Predicates of the
Conjunctive Predicate are satisfied. There is no single time reference across machine boundaries in a
distributed system, so we cannot precisely detect the simultaneous events needed for the Conjunctive
Predicate. This form of predicate is well defined within a single machine, but can have several
interpretations in a distributed system. Based upon the virtual time concept of a distributed system, our
interpretation is as follows.
Given two processes P_1 and P_2 residing on different machines, each process has its own virtual
time axis, called T_1 and T_2 respectively. Predicate SP_1 is on the state of P_1 and SP_2 on the state
of P_2. We define a pair of virtual time points (t_1, t_2) to describe a time when SP_1 is satisfied such
that t_1 ∈ T_1 (written as: SP_1(t_1) is true), and the time when SP_2 is satisfied such that t_2 ∈ T_2
(written as: SP_2(t_2) is true).
We define a set of these virtual time pairs, called SCP, to be:
    SCP = {(t_1, t_2) | t_1 ∈ T_1, t_2 ∈ T_2, SP_1(t_1) ∧ SP_2(t_2)}.
At any point in the set SCP, the conjunctive predicate SP_1 ∧ SP_2 is satisfied.
Since T_1 and T_2 are virtual time axes, it is not always possible to order a given virtual time t_1
on P_1 and a given virtual time t_2 on P_2 according to Lamport's happened-before relationship. We
can divide the SCP into two subsets, named ordered-SCP, where there is an ordering between t_1 and
t_2, and unordered-SCP, where there is no ordering. Since the Linked Predicates introduced in the
previous section are a mechanism to detect events with ordering, the two subsets can be expressed as
follows:
    ordered-SCP = {(t_1, t_2) | (t_1, t_2) ∈ SCP, ((SP_1)^i → (SP_2)^j) ∨ ((SP_2)^i → (SP_1)^j)
                   such that 1 ≤ i, j},
    unordered-SCP = {(t_1, t_2) | (t_1, t_2) ∈ SCP, (t_1, t_2) ∉ ordered-SCP}.
Figure 4 shows examples from each of these sets. We see an ordering in time pair (t_11, t_23) and no
ordering possible in (t_12, t_22).
† We use (SP)^i as a shorthand for SP → SP → . . . → SP. For example, (SP)^3 stands for
SP → SP → SP.
[Figure 4. Examples from the Set SCP: virtual time axes of P_1 (events t_11, t_12, t_13) and P_2
(events t_21, t_22, t_23), with messages m_1 and m_2; (t_11, t_23) is an ordered-SCP pair and
(t_12, t_22) an unordered-SCP pair.]
We can use the algorithm for detecting satisfaction of Linked Predicates (see Section 3.6) for
detecting events that occur at virtual times belonging to the set ordered-SCP. For example, if SP_1(t_11)
is true, and SP_2(t_21) and SP_2(t_22) are true, we can use the Linked Predicate SP_1 → SP_2 to
detect the events at the time pair (t_11, t_21). We can use SP_1 → (SP_2)^2 to detect the events at the
time pair (t_11, t_22). Halting is initiated at the moment when the last predicate in the ordering is
satisfied. Detecting events that occur at virtual times belonging to the unordered-SCP is more difficult.
For example, if we detect that SP_1 on P_1 is satisfied, we must also detect SP_2 on process P_2.
Since there is no common time reference in a distributed system, it is necessary to have some process
gather the information from the other process(es) before halting is to be initiated. We cannot decide
until the last notification arrives at the information gathering process, and the inherent time delay in
such information gathering makes it impossible for the processes to halt soon enough to preserve the
meaningful states of the processes.
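One concrete way to separate the two subsets is sketched below with vector clocks, a mechanism that is
not part of this paper (which relies on Lamport's happened-before relation directly): with a vector
timestamp per event, the componentwise comparison decides ordered versus concurrent. The timestamps
are invented to loosely mirror Figure 4.

```python
def happened_before(u, v):
    """u -> v iff u precedes v causally: componentwise u <= v and u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v

def classify(u, v):
    """Place a pair of timestamped events into ordered-SCP or unordered-SCP."""
    if happened_before(u, v) or happened_before(v, u):
        return "ordered-SCP"
    return "unordered-SCP"

# Invented vector timestamps: t_11 on P_1 precedes t_23 on P_2 (a message
# from P_1 reached P_2 in between), while t_12 and t_22 are concurrent.
t11, t12 = (1, 0), (2, 0)
t22, t23 = (0, 2), (1, 3)
assert classify(t11, t23) == "ordered-SCP"
assert classify(t12, t22) == "unordered-SCP"
```

As the surrounding text notes, classifying a pair this way still requires gathering both timestamps at
some process, which is exactly the step that arrives too late to halt in time for unordered pairs.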
3.6. Implementation of Linked Predicate Detection
Since the definition of the Linked Predicate is general enough to encompass the Simple Predicate
and the Disjunctive Predicate, only one algorithm is needed to detect the predicates. In addition to the
halt marker for the Halting Algorithm, we need a predicate marker to carry the Linked Predicate. If
necessary, we can append to every message originated by the program some kind of tag, so that each
process can distinguish the genuine messages from the halt markers and predicate markers introduced
by the debugging system.
To issue the Linked Predicate DP_1 → DP_2 → DP_3, the debugger process sends a predicate
marker containing the Linked Predicate to each process involved in DP_1. Upon receiving the predicate
marker, each process sets up the condition to detect when DP_1 is satisfied. When DP_1 is satisfied at
process p, process p creates a new predicate, newLP, from the remainder (DP_2 → DP_3) of the
original Linked Predicate. This new predicate is issued to each process involved in DP_2. This process
is repeated until the last Disjunctive Predicate (in this case, DP_3) in the Linked Predicate is satisfied, at
which time a process knows that it should initiate the Halting Algorithm.
Linked Predicate Detection Algorithm:
    Predicate-Marker-Sending Rule for a process p.
        Send a predicate marker containing the Linked Predicate to each process involved in
        the first Disjunctive Predicate of the Linked Predicate;
    Predicate-Marker-Receiving Rule for a process q.
        On receiving a predicate marker from another process:
            Separate the first Disjunctive Predicate from the Linked Predicate carried by
            the predicate marker;
            Make a newLP from the received Linked Predicate by excluding the first
            Disjunctive Predicate;
            When the extracted Disjunctive Predicate is met:
                if the newLP is null then
                    Initiate the Halting Algorithm;
                else
                    Send a new predicate marker containing the newLP as the new
                    Linked Predicate according to the Predicate-Marker-Sending Rule.
Halt markers are manipulated only by the Halting Algorithm and predicate markers only by the
Linked Predicate Detection Algorithm, so these two algorithms do not interfere with each other.
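The receiving rule can be condensed into a Python sketch. This is our illustration, not the paper's code:
each DP is represented as a set of simple-predicate names, the event names are invented, and the
debugger's routing of markers to ‘‘each process involved’’ in a DP is abstracted into a single queue.

```python
from collections import deque

def on_predicate_marker(linked_predicate, local_events, control_out):
    """Predicate-Marker-Receiving Rule, condensed.

    linked_predicate: list of DPs, each DP a set of simple-predicate names.
    local_events: the stream of events observed at this process.
    control_out: queue standing in for the predicate markers sent toward the
    processes involved in the next DP (routing is abstracted away).
    """
    first, new_lp = linked_predicate[0], linked_predicate[1:]
    for event in local_events:
        if event in first:               # the extracted Disjunctive Predicate is met
            if not new_lp:
                return "INITIATE_HALT"   # last DP satisfied: start the Halting Algorithm
            control_out.append(new_lp)   # forward newLP in a new predicate marker
            return "FORWARDED"
    return "PENDING"                     # DP not yet satisfied locally

# LP = DP_1 -> DP_2 with DP_1 = {"enter_X"} and DP_2 = {"recv_m"} (invented events)
ch = deque()
assert on_predicate_marker([{"enter_X"}, {"recv_m"}], ["other", "enter_X"], ch) == "FORWARDED"
assert on_predicate_marker(ch.popleft(), ["recv_m"], deque()) == "INITIATE_HALT"
```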
4. Application to Current Research
Distributed debugging is an area of active research. For our purposes, we can separate this research
into two approaches. The first approach avoids the problem of stopping a program by providing tools
only for monitoring a program's execution [3-6]. For example, Bates and Wileden [4] define an Event
Description Language (EDL) that allows a programmer to group low-level events into high-level
abstract events. EDL requires the ability to observe sequences of events and recognize patterns in these
sequences. Our algorithm for recognizing distributed predicates (Section 3.6) could be used to support
an EDL abstract event recognizer.
A second approach to distributed debugging is one that more closely approximates traditional,
single-process debuggers [7-9]. For example, IDD [8] provides a stepping mode of execution for a
collection of processes, because IDD does not guarantee that a program can be halted in a timely and
consistent manner. The suggested IDD strategy is for a programmer to individually halt processes early
enough so that the entire computation is halted before the interesting points are reached. The
programmer can then execute the program in single instruction steps to find the error. The Halting
Algorithm using distributed breakpoints would simplify the programmer's debugging task.
A variation on the second approach re-routes all normal communications through a centralized
debugger process [10, 11]. While this simplifies the detection of distributed breakpoints by providing a
single point of event ordering, it also has several disadvantages. First, there can be substantial
communication overhead in re-routing the messages through a central hub. Second, the change in
message flow could substantially change the execution of the program. Last, the facility to re-route the
communications can be complex to build.
The Linked Predicates are similar to Path Expressions [12]. Our distributed predicate detection
algorithm provides a vehicle to implement Path Expressions in a distributed system.
5. Conclusion
Interactive debugging is a familiar scenario to any programmer. The Halting Algorithm presented
in Section 2 and the definition of distributed breakpoints in Section 3 provide the programmer with the
necessary tools to apply these techniques to a distributed program. The fundamental idea is that the
program's view of event ordering and relative timing is preserved.
We have presented a definition of breakpoints in a distributed system. This definition shows what
type of logical statements make sense in such an environment. A satisfying result is that the types of
breakpoints that make sense (those excluding Conjunctive Predicates based on unordered events) are
not difficult to implement; the type of breakpoint that is difficult to implement turns out not to be
desirable.
Any software debugging tool will cause some change in the absolute timing of a program. We have
not tried to avoid this, but our algorithms should impose only a minimal change on the execution of a
program. This change should not affect any but the most timing-sensitive programs, and for these
programs a hardware monitor may be the only suitable form of debugger.
6. REFERENCES
[1] K. M. Chandy and L. Lamport, ‘‘Distributed Snapshots: Determining Global States of Distributed
Systems,’’ ACM Trans. Computing Systems 3(1) pp. 63-75 (February 1985).
[2] L. Lamport, ‘‘Time, Clocks, and the Ordering of Events in a Distributed System,’’ Communications
of the ACM 21(7) pp. 558-565 (July 1978).
[3] H. Garcia-Molina, F. Germano, Jr., and W. H. Kohler, ‘‘Debugging a Distributed System,’’ IEEE
Trans. on Software Engineering SE-10(2) pp. 210-219 (March 1984).
[4] P. Bates and J. C. Wileden, ‘‘An Approach to High Level Debugging of Distributed Programs,’’
Proc. of the SIGSOFT/SIGPLAN Symp. on High-Level Debugging, pp. 107-111 Pacific Grove,
Calif., (August 1983).
[5] R. J. LeBlanc and A. D. Robbins, ‘‘Event-Driven Monitoring of Distributed Programs,’’ Proc. of the
5th International Conf. on Distributed Computing Systems, pp. 515-522 Denver, (May 1985).
[6] B. P. Miller, C. Macrander, and S. Sechrest, ‘‘A Distributed Programs Monitor for Berkeley UNIX,’’
Software - Practice and Experience 16(2) pp. 183-200 (February 1986).
[7] E. T. Smith, ‘‘Debugging Techniques for Communicating, Loosely-Coupled Processes,’’ Ph.D.
Dissertation - Technical Report TR100, Univ. of Rochester (December 1981).
[8] P. K. Harter, Jr., D. M. Heimbigner, and R. King, ‘‘IDD: An Interactive Distributed Debugger,’’
Proc. of the 5th International Conf. on Distributed Computing Systems, pp. 498-506 Denver, (May
1985).
[9] F. Baiardi, N. De Francesco, and G. Vaglini, ‘‘Development of a Debugger for a Concurrent
Language,’’ IEEE Trans. on Software Engineering SE-12(4) pp. 547-553 (April 1986).
[10] R. Curtis and L. Wittie, ‘‘BUGNET: A Debugging System for Parallel Programming Environments,’’
Proc. of the 3rd International Conf. on Distributed Computing Systems, pp. 394-399 Denver,
(August 1982).
[11] R. D. Schiffenbaur, ‘‘Interactive Debugging in Distributed Programs,’’ M.S. Thesis, M.I.T.,
(August 1981).
[12] B. Bruegge and P. Hibbard, ‘‘Generalized Path Expressions: A High Level Debugging Mechanism,’’
Proc. of the SIGSOFT/SIGPLAN Symp. on High-Level Debugging, pp. 34-44 Pacific Grove, Calif.,
(August 1983).
... On the other hand, breakpoints are an effective tool to debug the runtime behavior of a program. In prior studies, global conditional breakpoints in a distributed system are defined as a set of primitive predicates such as entering a procedure, which are local to individual processes (hence can be detected independently by a process), tied together using relations (e.g., conjunction, disjunction, etc.) to form a distributed global predicate [59,51,82,37]. Checking the satisfaction of a global condition given that all the primitive predicates have been detected was studied in [59,82,37]. ...
... In prior studies, global conditional breakpoints in a distributed system are defined as a set of primitive predicates such as entering a procedure, which are local to individual processes (hence can be detected independently by a process), tied together using relations (e.g., conjunction, disjunction, etc.) to form a distributed global predicate [59,51,82,37]. Checking the satisfaction of a global condition given that all the primitive predicates have been detected was studied in [59,82,37]. Our work is different given its focus on data-oriented conditions. ...
Preprint
Full-text available
As data volumes grow across applications, analytics of large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A common objective pursued by these traditional cluster-based big data processing frameworks is high performance, which often means low end-to-end execution time or latency. The widespread adoption of data analytics has led to a call to improve the traditional ways of big data processing. There have been demands for making the analytics process more interactive and adaptive, especially for long running jobs. The importance of initial results in the iterative process of data wrangling has motivated a result-aware approach to big data analytics. This dissertation is motivated by these calls for improvement in data processing and the experiences over the past few years while working on the Texera project, which is a collaborative data analytics service being developed at UC Irvine. This dissertation mainly consists of three parts. The first part is about the design of the Amber engine that serves as the backend data processing framework for the Texera service. The second part is about an adaptive and result-aware skew-handling framework called Reshape. Reshape uses fast control messages to implement iterative skew mitigation techniques for a wide variety of operators. The mitigation techniques in Reshape have also been analyzed from the perspective of their effects on the results shown to the user. The last part is about a result-aware workflow scheduling framework called Maestro. This part talks about how to schedule a workflow for execution on computing clusters and make result-aware decisions while doing so. This work improves the data analytics process by bringing interactivity, adaptivity and result-awareness into the process.
... There are other examples of sequence-based patterns in the Complex Event Processing domain [43,44]. Furthermore, a message protocol violation bug [33,34] and linked predicates [37] are also related to a particular order of communicated messages. The former may occur in an actor-based program [3], as a sample of a message-based program, when the components exchange messages that are not consistent with the intended protocol of the application. ...
... Instead, we use a choreography setting, and a monitor corresponding to a message in unwanted sequences collaborates only with the monitors corresponding to the immediately preceding messages of that message in the unwanted sequences. The approach of [37] detects a linked predicate, i.e., a specific ordering among the predicates, by inserting additional information about events in the communication messages. This work has been extended in [29] by considering a negation operator in the predicate. ...
Article
Message-based systems are usually distributed in nature, and distributed components collaborate via asynchronous message passing. In some cases, particular ordering among the messages may lead to violation of the desired properties such as data confidentiality. Due to the absence of a global clock and usage of off-the-shelf components, such unwanted orderings can be neither statically inspected nor verified by revising their codes at design time. We propose a choreography-based runtime verification algorithm that given an automata-based specification of unwanted message sequences detects the formation of the unwanted sequences. Our algorithm is fully decentralized in the sense that each component is equipped with a monitor, as opposed to having a centralized monitor, and also the specification of the unwanted sequences is decomposed among monitors. In this way, when a component sends a message, its monitor inspects if there is a possibility for the formation of unwanted message sequences. As there is no global clock in message-based systems, monitors cannot determine the exact ordering among messages. In such cases, they decide conservatively and declare a sequence formation even if that sequence has not been formed. We prevent such conservative declarations in our algorithm as much as possible and then characterize its operational guarantees. We evaluate the efficiency and scalability of our algorithm in terms of the communication overhead, the memory consumption, and the latency of the result declaration through simulation.
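A drastically simplified sketch of detecting an unwanted message sequence is shown below. It collapses the per-component monitors of the decentralized approach into one automaton-like object for brevity, and all names are invented for illustration.

```python
# Sketch of monitoring an unwanted message sequence, in the spirit of
# the decentralized approach above. Illustrative simplification: the
# distributed monitors are collapsed into a single object that tracks
# how much of the unwanted prefix has been formed so far.

class SequenceMonitor:
    """Monitors one unwanted sequence of message labels, e.g. (a, b, c)."""
    def __init__(self, sequence):
        self.sequence = sequence
        self.seen_prefix = 0   # length of the unwanted prefix formed so far

    def on_send(self, label):
        """Called before a component sends `label`.
        Returns True if this send completes the unwanted sequence."""
        if self.seen_prefix < len(self.sequence) and \
                label == self.sequence[self.seen_prefix]:
            self.seen_prefix += 1
        return self.seen_prefix == len(self.sequence)

mon = SequenceMonitor(("login", "grant", "leak"))
print(mon.on_send("login"))  # False
print(mon.on_send("grant"))  # False
print(mon.on_send("leak"))   # True: unwanted sequence formed
```

In the actual choreography setting, each message label's check would run in the sending component's own monitor, consulting only the monitors of the immediately preceding labels rather than a shared object.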
... On the other hand, breakpoints are an effective tool to debug the run-time behavior of a program. In prior studies, global conditional breakpoints in a distributed system are defined as a set of primitive predicates such as entering a procedure, which are local to individual processes (hence can be detected independently by a process), tied together using relations (e.g., conjunction, disjunction, etc.) to form a distributed global predicate [17,14,26,12]. Checking the satisfaction of a global condition given that all the primitive predicates have been detected was studied in [17,26,12]. ...
Article
A long-running analytic task on big data often leaves a developer in the dark without providing valuable feedback about the status of the execution. In addition, a failed job that needs to restart from scratch can waste earlier computing resources. An effective method to address these issues is to allow the developer to debug the task during its execution, which is unfortunately not supported by existing big data solutions. In this paper we develop a system called Amber that supports responsive debugging during the execution of a workflow task. After starting the execution, the developer can pause the job at will, investigate the states of the cluster, modify the job, and resume the computation. She can also set conditional breakpoints to pause the execution when certain conditions are satisfied. In this way, the developer can gain a much better understanding of the run-time behavior of the execution and more easily identify issues in the job or data. Amber is based on the actor model, a distributed computing paradigm that provides concurrent units of computation using actors. We give a full specification of Amber, and implement it on top of the Orleans system. Our experiments show its high performance and usability of debugging on computing clusters.
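The pause/resume capability described above can be approximated by a worker loop that polls a control channel between units of work. This is a generic sketch, not Amber's actual mechanism, and all names are illustrative.

```python
# Sketch: a worker loop that interleaves data processing with checks
# for control messages (pause/resume), the kind of mechanism that lets
# an engine halt a running job responsively. Illustrative names only.
import queue

def worker(data_batches, control):
    paused = False
    results = []
    for batch in data_batches:
        # Drain control messages before each batch; block while paused.
        while not control.empty() or paused:
            try:
                msg = control.get(timeout=0.01)
            except queue.Empty:
                continue
            paused = (msg == "pause")
        results.append(sum(batch))   # stand-in for real processing
    return results

ctrl = queue.Queue()
print(worker([[1, 2], [3, 4]], ctrl))  # [3, 7]
```

Because control messages are checked at batch boundaries rather than via interrupts, the pause latency is bounded by the time to process one batch, a common trade-off in this design.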
... Breakpoints and halting in distributed programs. In [66], Miller and Choi extend Chandy and Lamport's algorithm for recording global state in [20] by introducing linked predicates, which describe a causal sequence of local states in which each state satisfies a specific local predicate. Linked predicates are used with hardware-based debugging tools such as a logic-state analyzer. ...
... Lattice construction: [24], [26], [1], [2], [57]; piggybacking: [66], [56] ...
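The piggybacking idea behind linked predicates, carrying the progress of the predicate chain on outgoing messages so that a causal sequence of local predicates can be detected, can be sketched roughly as follows. The names are invented for illustration; this is not the cited implementation.

```python
# Sketch of linked-predicate detection by piggybacking: each message
# carries the index of the furthest predicate in the chain already
# satisfied along its causal path. Illustrative names only.

class Process:
    def __init__(self, pid, predicates):
        self.pid = pid
        self.predicates = predicates   # chain P0 -> P1 -> ... as functions
        self.level = 0                 # predicates satisfied causally so far

    def local_event(self, state):
        """Advance the chain if the next predicate holds locally."""
        while self.level < len(self.predicates) and \
                self.predicates[self.level](self.pid, state):
            self.level += 1
        return self.level == len(self.predicates)   # full chain detected?

    def send(self):
        return self.level              # piggyback chain progress

    def receive(self, piggyback):
        self.level = max(self.level, piggyback)

# Chain: P0 holds on process 0, then causally later P1 holds on process 1.
chain = [lambda pid, s: pid == 0 and s == "a",
         lambda pid, s: pid == 1 and s == "b"]
p0, p1 = Process(0, chain), Process(1, chain)
p0.local_event("a")          # first predicate satisfied on p0
p1.receive(p0.send())        # message carries chain progress to p1
print(p1.local_event("b"))   # True: linked predicate detected
```

The piggybacked integer is what encodes causality here: process 1 can only reach the end of the chain after receiving a message that causally follows process 0's satisfying event.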
Thesis
Component-based design is the process of building, from a set of requirements and a collection of predefined components, a system that satisfies those requirements. Components are building blocks that encapsulate behavior. They can be composed to form composite components. Their composition must be rigorously defined so that it is possible to i) infer the behavior of composite components from their constituents, and ii) deduce global properties from the properties of the individual components. However, it is generally impossible to ensure or verify the desired properties using static verification techniques such as model checking or static analysis. This is due to the state-space explosion problem and to the fact that a property is often decidable only with information available at runtime (for example, coming from the user or the environment). Runtime verification refers to the languages, techniques, and tools for dynamically verifying system executions against properties that formally specify their behavior. In runtime verification, an execution of the system under scrutiny is analyzed using a decision procedure: a monitor. A monitor can be generated from a user-written specification (for example, a temporal-logic formula or an automaton) and aims to detect satisfactions or violations of that specification.
Typically, the monitor is a decision procedure that performs a step-by-step analysis of the execution, captured as a sequence of system states, and produces a sequence of verdicts (truth values taken from a truth domain) indicating satisfaction or violation of the specification. This thesis addresses the problem of verifying multithreaded and distributed component-based systems. We consider a general semantic model of component-based systems with multiparty interactions: intrinsically independent components whose interactions are partitioned over several schedulers. In this setting, one can obtain models with different degrees of parallelism: sequential, multithreaded, and distributed systems. However, neither the exact model nor the behavior of the system is known; neither the behavior of the components nor that of the schedulers is known. Our approach does not depend on the exact behavior of the components and schedulers. Inspired by conformance-testing theory, we call this assumption the monitoring hypothesis. The monitoring hypothesis makes our approach independent of the behavior of the components and of how that behavior is obtained. When monitoring concurrent components, the problem that arises is the unavailability of the global state at runtime. A naive solution would be to plug in a monitor that forces the system to synchronize in order to obtain a sequence of global states at runtime. Such a solution would entirely defeat the purpose of having concurrent executions and distributed systems. We define two approaches for monitoring multithreaded and distributed component-based systems. In both approaches, we attach local controllers to the schedulers to obtain events from the local traces.
The local events are sent to a monitor (a global observer) that reconstructs the set of global traces that are i) compatible with the local traces and ii) adequate for monitoring, while preserving the concurrency of the system.
... Therefore, we can restore the program state by using the postlogs recorded up to that point. The program state at any later time can then be restored by using the restored program state and the object code. Such restoration of the program state also solves the problem of halting co-operating processes in a timely fashion [24], since we can easily restore the program state of each halted process to the interesting point of program execution. ...
Article
This paper addresses the design and implementation of an integrated debugging system for parallel programs running on shared memory multi-processors (SMMP). We describe the use of flowback analysis to provide information on causal relationships between events in a program's execution without re-executing the program for debugging. We introduce a mechanism called incremental tracing that, by using semantic analyses of the debugged program, makes the flowback analysis practical with only a small amount of trace generated during execution. We extend flowback analysis to apply to parallel programs and describe a method to detect race conditions in the interactions of the co-operating processes.
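The core idea of restoring state from logs rather than re-executing the whole program can be illustrated with a toy deterministic "program". The accumulator example and all names below are invented for illustration; they are not the cited system's mechanism.

```python
# Sketch of log-based state restoration in the spirit of incremental
# tracing: a checkpoint of the state at a segment boundary, plus the
# logged inputs of that segment, lets a debugger rebuild any later
# state without re-running the whole program. Illustrative example.

def run_segment(state, inputs, log=None):
    for x in inputs:
        if log is not None:
            log.append(x)      # trace the inputs seen in this segment
        state = state + x      # deterministic step
    return state

# Forward execution with tracing.
log = []
checkpoint = 100                       # state at the segment boundary
final = run_segment(checkpoint, [1, 2, 3], log)

# Later, the debugger restores the final state by replaying the log
# against the checkpoint instead of re-executing from the start.
restored = run_segment(checkpoint, log)
print(restored == final)               # True
```

The trace stays small because only the nondeterministic inputs of a segment need logging; everything deterministic is recomputed from the checkpoint, which is the essence of trading trace size for replay work.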
... First, traditional marker-based (or "live") snapshot algorithms work by disseminating marker messages while a computation is running, potentially even halting the current computation until the snapshot is complete (Chandy and Lamport 1985; Miller and Choi 1988; Haban and Weigel 1988; Garg et al. 2006; Kshemkalyani 2009). These algorithms usually require the underlying distributed system to be fully connected. ...
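The marker rule referenced above, on first marker receipt record the local state and forward markers, and record as channel state any messages arriving before that channel's marker, can be sketched for the simplest two-process, single-initiator case. This is a simulation over FIFO queues with invented names, not a networked implementation.

```python
# Sketch of the marker-based snapshot idea for two processes P and Q
# with FIFO channels, after the Chandy-Lamport scheme cited above.
from collections import deque

def snapshot_two(p_state, q_state, qp_inflight):
    """P initiates the snapshot. qp_inflight: application messages
    already in the Q->P channel when P starts."""
    pq, qp = deque(), deque(qp_inflight)
    # P records its own state and sends a marker to Q.
    recorded_p = p_state
    pq.append("MARKER")
    # Q receives the marker (FIFO), records its state, markers back.
    assert pq.popleft() == "MARKER"
    recorded_q = q_state
    qp.append("MARKER")
    # P drains Q->P: everything before Q's marker is the channel state.
    recorded_qp = []
    while True:
        msg = qp.popleft()
        if msg == "MARKER":
            break
        recorded_qp.append(msg)
    return recorded_p, recorded_q, recorded_qp

print(snapshot_two("sP", "sQ", ["m1", "m2"]))
# ('sP', 'sQ', ['m1', 'm2'])
```

FIFO channels are essential here: the marker acts as a barrier separating messages sent before Q recorded its state from those sent after, which is exactly the assumption the "fully connected" requirement above generalizes to every channel.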
Conference Paper
The actor model is an established programming model for distributed applications. Combining event sourcing with the actor model allows the reconstruction of previous states of an actor. When this event sourcing approach for actors is enhanced with additional causality information, novel types of actor-based, retroactive computations are possible. A globally consistent state of all actors can be reconstructed retrospectively. Even retroactive changes of actor behavior, state, or messaging are possible, with partial recomputations and projections of changes in the past. We believe that this approach may provide beneficial features to actor-based systems, including retroactive bugfixing of applications, decoupled asynchronous global state reconstruction for recovery, simulations, and exploration of distributed applications and algorithms.
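Rebuilding an earlier actor state from its event log, the basic event-sourcing step that the retroactive computations above build on, might look like this minimal sketch (illustrative names only).

```python
# Sketch of event-sourced state reconstruction: an actor persists every
# state-changing event, so any previous state can be rebuilt by
# replaying a prefix of the log. Illustrative names only.

class EventSourcedCounter:
    def __init__(self):
        self.log = []

    def handle(self, event):          # event: ("add", n)
        self.log.append(event)

    def state_at(self, upto):
        """Rebuild the state after the first `upto` events."""
        total = 0
        for op, n in self.log[:upto]:
            if op == "add":
                total += n
        return total

actor = EventSourcedCounter()
for n in (5, 3, -2):
    actor.handle(("add", n))
print(actor.state_at(2))   # 8: state before the third event
print(actor.state_at(3))   # 6: current state
```

Retroactive changes fit the same shape: editing or removing a logged event and replaying yields the alternative history, while the causality information mentioned above is what makes such replays consistent across communicating actors.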
Article
Many big data systems are written in languages such as C, C++, Java, and Scala to process large amounts of data efficiently, while data analysts often use Python to conduct data wrangling, statistical analysis, and machine learning. User-defined functions (UDFs) are commonly used in these systems to bridge the gap between the two ecosystems. In this paper, we propose Udon, a novel debugger to support fine-grained debugging of UDFs. Udon encapsulates the modern line-by-line debugging primitives, such as the ability to set breakpoints, perform code inspections, and make code modifications while executing a UDF on a single tuple. It includes a novel debug-aware UDF execution model to ensure the responsiveness of the operator during debugging. It utilizes advanced state-transfer techniques to satisfy breakpoint conditions that span across multiple UDFs. It incorporates various optimization techniques to reduce the runtime overhead. We conduct experiments with multiple UDF workloads on various datasets and show its high efficiency and scalability.
Conference Paper
An increasing number of distributed, event-based systems adopt an architectural style called event sourcing, in which entities keep their entire history in an event log. Event sourcing enables data lineage and allows entities to rebuild any previous state. Restoring previous application states is a straightforward task in event-sourced systems with a global and totally ordered event log. However, the extraction of causally consistent snapshots from distributed, individual event logs is rendered non-trivial by causal relationships between communicating entities. High dynamicity of entities further increases the complexity of such reconstructions. We present approaches for retrospective and global state extraction of event-sourced applications based on distributed event logs. We provide an overview of historical approaches to distributed debugging and breakpointing, which are closely related to event log-based state reconstruction. We then introduce and evaluate our approach for non-local state extraction from distributed event logs, which is specifically adapted for dynamic and asynchronous event-sourced systems.
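One standard ingredient of such causally consistent extraction is checking a candidate cut against vector clocks: a cut is consistent iff no included event causally depends on an excluded one. A small sketch with an invented data layout:

```python
# Sketch of the consistency check behind causally consistent snapshot
# extraction: a cut (one log prefix per process) is consistent iff no
# included event depends on an excluded one, which vector clocks make
# checkable. The data layout below is invented for illustration.

def consistent(cut, clocks):
    """cut[i] = number of events of process i inside the cut.
    clocks[i][k] = vector timestamp of the (k+1)-th event of process i."""
    for i, n in enumerate(cut):
        for k in range(n):
            vc = clocks[i][k]
            if any(vc[j] > cut[j] for j in range(len(cut))):
                return False   # depends on an event outside the cut
    return True

# Two processes; P1's second event received a message sent at P0's
# second event, so its vector clock shows 2 in P0's component.
clocks = [
    [[1, 0], [2, 0]],          # P0: two local events
    [[0, 1], [2, 2]],          # P1: local event, then receive from P0
]
print(consistent([2, 2], clocks))  # True
print(consistent([1, 2], clocks))  # False: the receive needs P0's 2nd event
```

Snapshot extraction can then search the lattice of cuts for consistent ones, which is why vector-clock metadata (or the causality information mentioned above) must be persisted alongside the event logs.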
Article
This work deals with some issues concerned in the debugging of concurrent programs. A set of desirable characteristics for a debugger for concurrent languages is deduced from a review of the differences between the debugging of concurrent programs and that of sequential ones. A debugger for a concurrent language, based upon CSP, is then described. The debugger makes it possible to compare a description of the expected program behavior to the actual behavior. The description of the behavior is given in terms of expressions composed by events and/or assertions on the process state. The developed formalism is able to describe behaviors at various levels of abstraction. Lastly, some guidelines for the implementation of the debugger are given and a detailed example of program debugging is analyzed.
Article
This work discusses some issues in the debugging of concurrent programs. A set of desirable characteristics of a debugger for concurrent languages is deduced from an examination of the differences between the debugging of concurrent programs and that of sequential ones. A debugger for a concurrent language, derived from CSP, is then presented. It is based upon a semantic model of the supported language. The debugger makes it possible to compare a description of the program behaviour to the actual behaviour, as well as to evaluate assertions on the process state. The description of the behaviour is given by a formalism whose semantics is also specified. The formalism can specify program behaviours at various abstraction levels. Lastly, some guidelines for the implementation of the debugger are shown and a detailed example of program description is analyzed.
Article
This paper introduces a modified version of path expressions called Path Rules which can be used as a debugging mechanism to monitor the dynamic behaviour of a computation. Path rules have been implemented in a remote symbolic debugger running on the Three Rivers Computer Corporation PERQ computer under the Accent operating system.
Article
As part of a study of methods and strategies for problem solving in a distributed environment [Less80], we have been investigating techniques suitable for use in debugging programs written for implementation on distributed processing networks. Traditional debugging methods emphasize techniques that apply at the level of computation units and generally allow users to examine, and possibly alter, the state of a computation. Interactive debugging monitors are probably the most powerful implementations of the traditional method and usually permit a user to examine an entire snapshot of system state at any step of the computation. It is the job of the debugger (usually a person directing the error search) to determine what units are relevant to some problem, examine the units in whatever fashion is available, and then fit the results of these examinations into a model of how the computation works. Two elements essential to the successful completion of the debugging task are evident here: the ability to monitor, in some meaningful way, the relevant system activity so as to understand how system behavior differs from the debugger's model, and the ability to perform experiments based (implicitly or explicitly) on the information gathered. Through the interaction of these two elements a debugger attempts to gain an understanding of the causes of an error, or at least to note where the implementation and the expected behavior differ.
Article
Most extant debugging aids force their users to think about errors in programs from a low-level, unit-at-a-time perspective. Such a perspective is inadequate for debugging large complex systems, particularly distributed systems. In this paper, we present a high-level approach to debugging that offers an alternative to the traditional techniques. We describe a language, edl, developed to support this high-level approach to debugging and outline a set of tools that has been constructed to effect this approach. The paper includes an example illustrating the approach and discusses a number of problems encountered while developing these debugging tools.