IEEE TRANSACTIONS ON COMPUTERS, VOL. C-36, NO. 4, APRIL 1987

Debugging Parallel Programs with Instant Replay

THOMAS J. LEBLANC AND JOHN M. MELLOR-CRUMMEY
Abstract-The debugging cycle is the most common methodology for finding and correcting errors in sequential programs. Cyclic debugging is effective because sequential programs are usually deterministic. Debugging parallel programs is considerably more difficult because successive executions of the same program often do not produce the same results. In this paper we present a general solution for reproducing the execution behavior of parallel programs, termed Instant Replay. During program execution we save the relative order of significant events as they occur, not the data associated with such events. As a result, our approach requires less time and space to save the information needed for program replay than other methods. Our technique is not dependent on any particular form of interprocess communication. It provides for replay of an entire program, rather than individual processes in isolation. No centralized bottlenecks are introduced and there is no need for synchronized clocks or a globally consistent logical time. We describe a prototype implementation of Instant Replay on the BBN Butterfly™ Parallel Processor, and discuss how it can be incorporated into the debugging cycle for parallel programs.

Index Terms-CREW protocols, distributed debugging, execution replay, parallel programming, program instrumentation, shared objects.

Manuscript received September 3, 1986; revised December 5, 1986. This work was supported by the National Science Foundation under Grant DCR-8320136 and DARPA/ETL under Grant DACA76-85-C-0001. The Xerox Corporation University Grants Program provided equipment used in the preparation of this paper. J. Mellor-Crummey was supported in part by a Sproull Fellowship awarded by the University of Rochester. The authors are with the Department of Computer Science, University of Rochester, Rochester, NY 14627.
I. INTRODUCTION

Debugging sequential programs is a well-understood task that draws on tools and techniques developed over many years. One early technique was to record snapshots of the entire program state, often consisting of many pages of hexadecimal digits, for perusal by the programmer. Debugging was a programmer-intensive operation, since there were few tools for analyzing the program state. Over time this approach was replaced by interactive debuggers, which allow the programmer to examine arbitrary details of the program state during execution. Debugging became more computation-intensive, since the computer was used to reproduce execution sequences with successively greater detail. As a result, the most common methodology used today to debug sequential programs is cyclic: the program is executed until an error manifests itself, the programmer postulates a set of underlying causes for the error, trace statements or additional breakpoints are inserted to gather more information about the causes of the error, and the program is reexecuted. This technique is effective because sequential programs are usually deterministic. That is, for a fixed input, each execution of a program will always follow the same execution path and produce the same results. Debugging parallel programs is considerably more difficult because parallel programs are often not deterministic.¹

¹ We are interested in programs that exhibit true parallelism or, at the very least, appear to exhibit parallelism due to preemptive scheduling of processes. A concurrent program implemented by coroutines running on a single processor without the possibility of preemption can be debugged as if it were a sequential program.
In our model parallel programs consist of multiple asynchronous processes that communicate using some form of message-passing or shared memory. No assumption may be made about the relative speed of processes; we can only assume finite progress by each process. Since parallel programs do not fully specify all possible execution sequences, the execution behavior of a parallel program in response to a fixed input may be indeterminate, with the results depending on a particular resolution of race conditions existing among processes. Therefore, cyclic debugging techniques for error isolation are not guaranteed to work because successive executions of the same parallel program may not produce the same results.

We are left with two options for debugging parallel programs: we can either take snapshots of the program state during execution for later examination or we can provide a mechanism that guarantees reproducible behavior of parallel programs. Only the latter approach allows reliable use of cyclic debugging techniques.
The first alternative, in which the programmer analyzes snapshots of program state taken during execution, recognizes that multiple executions of parallel programs are indeterminate; therefore, all information necessary to diagnose program errors must be collected during a single execution. Behavioral Abstraction (BA) is typical of this approach [2]. BA provides a mechanism for the hierarchical definition of events in terms of sequences of primitive events that can occur during program execution. An event recognition tool monitors the stream of primitive events that occur during program execution and presents the user with an abstract view of the program's behavior in terms of a sequence of hierarchically defined events. There are two disadvantages to this technique. First, BA requires that a user exhaustively describe interesting events which take place during execution in terms of a bottom-up specification. In creating the specification, the user must anticipate all interesting events related to an error before execution; there is no mechanism for gathering additional information about an error after it is observed. Second, the amount of information gathered tends to be voluminous. Since the technique is not cyclic, the user must collect enough
information during execution to diagnose any error that might arise. Other work based on one-shot execution of parallel programs has the same limitation [1], [9], [21].
The second alternative for debugging parallel programs is based on reproducible program execution, which allows cyclic debugging techniques to be applied. Reproducible program behavior has been studied in several domains, including concurrent programs using semaphores and monitors for communication, systems based on nested atomic transactions, and systems comprised of loosely coupled processes that communicate via messages.
Carver and Tai have considered repeatable execution for programs consisting of concurrent processes that interact through semaphores and monitors [3]. In their approach, execution of a concurrent program is characterized by a sequence of P operations (termed a P sequence) on shared semaphores. The same idea can be used to produce an M sequence for monitors, which records a series of starts of monitor procedures. A P sequence is a sequence of ordered pairs; each pair corresponds to a P operation on a specific semaphore by a specific process. Thus, a P sequence is a total order of all synchronization operations that occur in a program. P sequences can be created by the programmer to test specific synchronization sequences of a concurrent program or can be reproduced during execution to provide repeatable execution.
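To make the representation concrete, the following sketch (our Python illustration, not code from [3]; the names and data are hypothetical) treats a P sequence as a recorded list of (process, semaphore) pairs and shows how replay can admit each P operation only when it is the next pair in that total order:

    import threading

    # A P sequence: the total order of P operations observed during a
    # monitored run, as (process name, semaphore name) pairs.
    p_sequence = [("p1", "s"), ("p2", "s"), ("p1", "s")]
    position = 0
    turn = threading.Condition()

    def replay_P(process, sem_name, semaphore):
        # Perform a P operation, but only when this (process, semaphore)
        # pair is the next one in the recorded P sequence.  Assumes replay
        # issues exactly the recorded operations.
        global position
        with turn:
            while p_sequence[position] != (process, sem_name):
                turn.wait()
            semaphore.acquire()      # the actual P operation
            position += 1
            turn.notify_all()        # wake whichever pair is next

Because every P operation must pass through a single shared cursor, replay is exact but all synchronization operations are serialized, which is the cost discussed next.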
The disadvantage of this approach is that it requires that all P operations be serialized, thereby losing much of the potential for parallelism that exists in a program. While adequate for single processor systems that simulate concurrency, this technique would not be useful in a truly parallel environment. There, the serialization constraint could have such an impact on program performance that it would be impractical to monitor programs during normal execution. Use of this method would then be relegated to a distinct testing and debugging phase.
Chiu's technique for replaying a program's execution in an atomic transaction system involves checkpointing each version of all atomic objects and recording a timestamp for each atomic action during program execution [5]. A debugger uses this information to traverse action trees (corresponding to the nested atomic actions of a program execution) according to a serialization of their constituent atomic actions. Traversing an action tree permits viewing the state of atomic objects before and after each atomic update, as well as replaying execution through action sequences to isolate program flaws. The major drawback of this work is that the techniques are restricted to computations structured in terms of nested atomic actions. In addition, these techniques require significant storage overhead to maintain the necessary checkpoints of atomic objects, although the checkpoints may be required for recovery actions anyway.
Methods to reproduce the execution behavior of programs comprised of loosely coupled processes that communicate using messages typically require that the contents of each message be recorded in an event log as it is received [7], [13], [24]. The programmer can either review the events (messages) in the log, in an attempt to isolate errors, or the events can be used as input to replay the execution of a process in isolation. A similar event logging approach has also been used to monitor program performance [16].
There are several disadvantages to this approach. First, it has only been used in loosely coupled systems and there are reasons to believe it would not be well-suited to tightly coupled systems. Although the amount of data exchanged in messages could be very large, this technique exploits the fact that communication in loosely coupled systems takes place infrequently, primarily because of the high cost of communication. The additional time necessary to copy a message into an event log in local memory does not seriously affect performance when compared with the time required to send a message. This assumption does not necessarily apply to tightly coupled systems, where the cost of communication is lower, allowing more frequent communication. Another disadvantage is that the space requirements for the event log tend to be very large. Again, within the domain of loosely coupled processes, it is reasonable to assume the logs will grow slowly enough that they can be buffered in memory and then stored on external devices without seriously affecting the performance of the program. The third, and most important, drawback is that it is difficult to examine the global effects of process interactions using this technique, since the replay mechanism only operates on a single process in isolation. Previous attempts to replay groups of processes using this scheme require that a network-wide consistent time be maintained [7].
In this paper we present a general solution for reproducing the execution behavior of parallel programs, termed Instant Replay. Our emphasis is on providing repeatable execution of highly parallel programs in tightly coupled systems, although our approach naturally extends to loosely coupled systems. During program execution we save the relative order of significant events as they occur, not the data associated with such events. Since we do not require the contents of all process interactions (e.g., messages) to be saved, our approach requires less time and space to save the information needed for program replay than other methods. Our technique guarantees reproducible program behavior during the debugging cycle by using the same input from the external environment and by imposing the same relative order on events during replay that occurred during the original execution. Unlike previous techniques, Instant Replay is not dependent on the particular form of interprocess communication used. In addition, we provide replay for an entire program, rather than individual processes in isolation. Finally, we avoid introducing any global synchronization of events through the use of a fully distributed protocol; there is no centralized bottleneck and no need for synchronized clocks or a globally-consistent logical time. With these properties, Instant Replay is especially useful for debugging parallel programs on tightly coupled multiprocessors, where interprocess communication is cheaper, and therefore more frequent, than in loosely coupled systems.
In the next section we present Instant Replay, including our goals, assumptions, and approach. Section III describes a prototype implementation on the BBN Butterfly, a tightly coupled multiprocessor comprised of 128 MC68000 processors. In Section IV we discuss how Instant Replay can be incorporated into the debugging cycle for parallel programs.
Section V summarizes the advantages of our approach and describes our plans for future work.
II. INSTANT REPLAY

When debugging a sequential program, one can usually guarantee reproducible program execution by supplying the same input each time the program is executed. Successive executions with the same input produce the same behavior because sequential programs tend to be deterministic. The same is true of the individual processes in a parallel program.² If each process is supplied the same input values (corresponding to the contents of messages received or the values of shared memory locations referenced) in the same order during successive executions, it will produce the same behavior each time. In particular, each process will produce the same output values in the same order. Each of those output values may then serve as an input value for some other process. Therefore, in order to debug a parallel program, we do not need to store all input values for each process in an event log, since any input value corresponding to some output value can be recomputed during replay. By ensuring that each process sees the same input values at every step of execution, all processes will exhibit the same execution behavior during both the monitoring phase and replay. Instant Replay is based on this observation.

² For now, we assume that processes do not contain nondeterministic statements. In particular, processes do not allow asynchronous interrupts.
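A minimal sketch of this observation (ours, not code from the prototype): a deterministic process fed the same input values in the same order reproduces the same output values, so the values themselves never need to be logged.

    def process(inputs):
        # A stand-in for one deterministic process: its outputs are a
        # function of the sequence of input values it consumes, nothing else.
        outputs, total = [], 0
        for value in inputs:
            total += value            # arbitrary deterministic computation
            outputs.append(total * 2)
        return outputs

    # Two executions that see the same values in the same order agree exactly,
    # so during replay each process can simply recompute its outputs.
    assert process([3, 1, 4, 1, 5]) == process([3, 1, 4, 1, 5])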
In our approach, all interactions between processes are modeled as operations on shared objects. A series of modifications to a shared object is represented as a totally ordered sequence of versions. Each version has a corresponding version number, which is unique with respect to a particular object. During normal program execution (i.e., the monitoring phase) we record a partial order of the accesses to each object. (It is a partial order because we do not need to impose an ordering on multiple processes that read a particular version of a shared object.) This partial order is specified by a sequence of version numbers for each object. To record the partial order the system maintains the current version number for each object and the number of readers for each version of each object. In addition, each process records the version number of each shared object it accesses.
During program replay, we allow each process to recompute its output values, thereby providing input values for other processes. We use the record of object accesses recorded by each process to ensure that the same version of input values used by the process during the monitoring phase is used during replay. As long as the recorded information is available, the original program execution can be repeated over and over.
Our goal is to provide a flexible monitoring system, applicable in both loosely coupled and tightly coupled environments, that allows a programmer to replay arbitrary execution sequences produced by a parallel program. Since we cannot predict when it may be desirable to replay a particular execution sequence, it must be practical for the monitoring mechanisms to be in place during every execution. Therefore, our mechanisms must have minimal impact on program performance. Instant Replay provides reproducible behavior of parallel programs with minimal impact on performance by a) simulating the original external environment during replay, b) modeling all interprocess events as operations on shared data, subsuming both shared-memory and message-passing primitives, c) recording only the version number of data values that are input to each process, not the values themselves, thereby minimizing the amount of information recorded, and d) using a distributed data collection mechanism, so that no central bottleneck is present when a program is being monitored or replayed. We will explore each of these aspects in the following sections.
A. Simulating the External Environment

As with any cyclic debugging system, we assume that the original execution of a program and subsequent replays occur in equivalent virtual machine environments. Two virtual machines A and B are said to be equivalent with respect to program P if program P can exhibit the same behavior whether executed on virtual machine A or B. For practical reasons, we do not require equivalent physical machine states, since that would include the contents of all external devices, the exact value of the clock, and the internal states of all components. In particular, A and B need not have identical real-time clock values if P's execution does not depend on the real-time clock. Similarly, the contents of file F on machine A and B can differ if P does not attempt to reference F. If program P depends on physical details of its virtual machine during execution, it becomes difficult, if not impossible, to simulate the virtual machine during replay. Real-time programs, in particular, may not be good candidates for Instant Replay because it is so difficult to simulate equivalent virtual machines.³
We require that programs receive identical input from the environment during both execution and replay. However, it is not sufficient simply to supply the same input to the process; we must also supply it at the same points during program execution. This can be very difficult for real-time programs since they often receive input as a result of asynchronous interrupts. Without making special provisions to record when interrupts occur during program execution, which could severely degrade performance, we cannot accurately simulate the original virtual machine environment.
It is important to note that the problem of finding equivalent virtual machines also arises when debugging sequential programs; it is orthogonal to the specific problem of debugging parallel programs. We do not depend on a particular simulation of virtual machines, so any techniques developed for sequential program debugging can probably be used. Specifically, we assume that programs do not exploit the physical characteristics of any resources allocated by the system; therefore, we need only ensure that the amount of resources available during replay is at least the amount used by the program during the original execution. Any unsuccessful attempt to allocate resources during execution can be recorded, so that the same behavior can be recreated during replay.
³ To our knowledge, no significant software debugging system exists for real-time programs.
B. Communication Through Shared Objects

If processes in a parallel program do not communicate, each process can be debugged using traditional techniques, since other processes in the program would have no effect on the execution path of a particular process. It is only when processes interact, via communication and synchronization primitives, that the potential for nonrepeatable behavior arises. Examples of process interactions include P and V primitives applied to a shared semaphore [8], monitor entry procedures [10], send/receive message primitives, and general sharing of memory.
Instant Replay models all process interactions in a parallel program as operations on shared data. This characterization of process interactions is not restrictive since all communication and synchronization primitives can be reduced to operations on shared data. In particular, message passing can be modeled as communication through a shared port, mailbox, or memory segment. Our approach exploits the fact that values exchanged between processes via shared data depend only on the initial values in shared objects, the order in which processes are granted access to shared objects, and the deterministic nature of processes.
Operations on shared data objects can be separated into two classes: read operations, which do not change the state of an object, and write operations, which do. By recording the sequence of write operations on each shared object, it is possible to recreate the proper sequence of state transitions for all shared objects during program replay. Similarly, by recording the version number of each shared object read by a process, it is possible to recreate the proper input values for that process during replay. This is exactly the information we record during the monitoring phase.
Instant Replay requires that the set of operations on each shared object have a valid serialization. A set of operations has a valid serialization if the result of each individual operation is the same as it would be if the operations had all been executed in some sequential order. A protocol that ensures a valid serialization, such as a concurrent-read-exclusive-write (CREW) protocol [6], must be used for access to each shared object.
In choosing a protocol, we look for one that guarantees serializability, while exerting minimal impact on shared object access and allowing maximal parallelism. If an access protocol that guarantees serializability for operations on shared objects is already present in the application or the system, it is not necessary to superimpose another. Therefore, our techniques are applicable to programs that incorporate results of current research efforts on how to structure interprocess communication to admit the most parallelism. For example, Lamport [12], Peterson [20], and Vitanyi and Awerbuch [25] present algorithmic solutions for the concurrent-reading-while-writing (CRWW) problem that permit concurrency among readers and writers, as well as among writers themselves. Instrumentation for Instant Replay can be added to systems that use such protocols, if a serialization order of operations on each shared object can be determined.⁴
⁴ For a serialization of operations on a shared object to be possible the object must be regular [12]. An object is regular when all reads not concurrent with a write get correct values, and any read that overlaps a series of writes obtains either the value of the object before the first of the writes, or one of the values being written.
For the remainder of this paper, we will illustrate our technique using a CREW protocol for access to shared objects. A CREW protocol ensures a total order of writers with respect to each shared object, a total order of readers with respect to writers of each shared object, and a partial order of readers with respect to each shared object.
Although we could use a protocol that requires mutually exclusive (ME) access to shared objects, resulting in a total order on accesses to each object, many parallel programs allow concurrent readers. An exclusive access protocol would artificially limit the parallelism in such programs. Since the execution path of a program can be characterized by a partial order on the operations with respect to each shared object, we will not require a total order.
In addition to being independent of a particular protocol, Instant Replay does not rely on a particular granularity of interprocess communication. The granularity of access to shared objects is implementation-dependent. Message-passing systems only require the protocol during shared buffer access; shared-memory systems may require the protocol to be used whenever shared storage is referenced.
C. Data Structures for Program Monitoring

In order to record the partial order of accesses to objects that characterizes an execution, we use a set of process history tapes. During the monitoring phase, a process history tape is used to record the version number of each shared object accessed by a process; it is modified only by the corresponding process. Since the relevant information is read and recorded as part of the access to an object, the monitoring phase imitates whatever parallelism is exhibited by the application.
Each history tape has a header containing several fields: a pointer to the current square on the tape, a pointer to the last non-blank square on the tape, and a pointer to the initial square on the tape. The two operations that can be applied to a history tape are ReadHistoryTape, which reads the value written in the current square, and WriteHistoryTape, which writes a value in the current square. Each of these operations advances the current square pointer of the tape.
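A minimal Python sketch of such a tape (our illustration; the prototype operates at the level of Chrysalis primitives, and the class and method names below are ours):

    class HistoryTape:
        # An append-only tape of squares with a current-square cursor; both
        # operations advance the cursor, as described above.

        def __init__(self):
            self.squares = []        # filled in during the monitoring phase
            self.current = 0         # index of the current square

        def write(self, value):      # WriteHistoryTape
            self.squares.append(value)
            self.current += 1

        def read(self):              # ReadHistoryTape
            value = self.squares[self.current]
            self.current += 1
            return value

        def rewind(self):            # reposition at the initial square for replay
            self.current = 0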
Upon creation, each shared object is assigned a version number of 0. Also upon creation, each process is assigned a history tape that is initially blank. During each read or write operation on a shared object by a process, information about the object is recorded on the process's history tape.
All history tapes created during the execution of a parallel program are linked together to form a tree. Each time a process spawns a child, a reference to the history tape of the child process is recorded on the history tape of the parent. This organization of history tapes enables each process history tape to be associated with the correct process during execution replay.
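A sketch of how this tree can be built and reused, building on the HistoryTape sketch above (the helper names, and the use of Python threads to stand in for processes, are our assumptions rather than the prototype's interface):

    import itertools
    import threading

    tape_ids = itertools.count()   # a real implementation would allocate these
    tapes = {}                     # tape id -> HistoryTape, kept between runs

    def spawn(parent_tape, child_body, mode):
        # Monitoring: give the child a fresh tape and record its id on the
        # parent's tape.  Replay: read the id back, so the child is re-attached
        # to the tape it filled in during the monitoring phase.
        if mode == "MONITOR":
            tape_id = next(tape_ids)
            child_tape = HistoryTape()
            tapes[tape_id] = child_tape
            parent_tape.write(tape_id)       # reference to the child's tape
        else:
            child_tape = tapes[parent_tape.read()]
            child_tape.rewind()
        child = threading.Thread(target=child_body, args=(child_tape,))
        child.start()
        return child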
In addition to the information recorded on a process's history tape regarding interactions with shared objects and child processes, arbitrary details of a process's execution can be recorded on the tape for use during replay. Specifically, the resolution of certain interesting events can be recorded on the history tape in order to replay programs containing nondeterminism. The information recorded about such events can be used to recreate the same event during program replay. A mechanism to support the recording of these events would need to be added to the implementation of the programming language at the appropriate level (i.e., compiler code generation or language runtime support). Such a mechanism would be appropriate to record the statement alternative chosen in a nondeterministic selection statement, whether or not a timeout interval had expired during execution, and clock values returned by system calls.
    ReaderEntry (object, process);
        if mode = MONITOR then
            P(object.lock);
            AtomicAdd(object.activeReaders, 1);
            V(object.lock);
            WriteHistoryTape(process, object.version);
        else
            /* Find out version read during monitoring phase */
            key := ReadHistoryTape(process);
            while object.version ≠ key do delay;
        end if;
    end ReaderEntry;

    ReaderExit (object);
        AtomicAdd(object.totalReaders, 1);
        /* Ignored in replay mode */
        AtomicAdd(object.activeReaders, -1);
    end ReaderExit;

Fig. 1.
D. Access Protocols for Shared Objects

In order to properly record a partial order of the accesses to each shared object, a protocol that ensures a valid serialization is needed. In this section we will describe such a protocol, a concurrent-read-exclusive-write (CREW) protocol that can be used to implement Instant Replay.

The CREW access protocol for shared objects consists of four procedures: entry and exit procedures for readers, and entry and exit procedures for writers. During the monitoring phase, these procedures enforce a CREW access protocol on shared objects and record a partial order of accesses to each shared object. During the replay phase, these same procedures are used to enforce the partial order recorded during the monitoring phase.
Each process that reads a shared object must use the entry procedure ReaderEntry (Fig. 1). This routine uses a semaphore associated with the object to ensure that readers do not attempt to access that object while a writer is using it. Once the reader is granted access by the semaphore, it increments the number of active readers using the object.⁵ Writers are not allowed to modify the object as long as the count of active readers is nonzero. Once the count of active readers has been updated, the reader process releases the semaphore and records the version of the object it is about to read on its process history tape.

⁵ We use atomic increment and decrement operations to maintain the reader counts for an object, thereby avoiding the need for additional synchronization.
Then, the reader is allowed to access the object. Eventually, the exit routine ReaderExit (also in Fig. 1) is called, which simply maintains a count of all readers for a particular version of the object and decrements the number of active readers for the object, thereby allowing writers a chance to proceed.
In replay mode, the entry procedure for readers proceeds as before, except that history tapes are not written; they are merely read and advanced as execution proceeds. Each reader process must wait until the version number for the target object is equal to the version number recorded on the reader's history tape. This ensures that the reader will see the correct version of the target object during replay. Once the reader has read the object, a count of readers for that version is incremented in the exit routine. This counter allows a writer to create the next version of an object only when all readers have finished with the current version.
    WriterEntry (object, process);
        if mode = MONITOR then
            P(object.lock);
            /* Wait for all current readers to finish */
            while object.activeReaders ≠ 0 do delay;
            WriteHistoryTape(process, object.version);
            WriteHistoryTape(process, object.totalReaders);
        else
            /* Read version modified during monitoring phase */
            key := ReadHistoryTape(process);
            while object.version ≠ key do delay;
            /* Read count of readers for previous version */
            key := ReadHistoryTape(process);
            while object.totalReaders < key do delay;
        end if;
    end WriterEntry;

    WriterExit (object);
        object.totalReaders := 0;
        if mode = MONITOR then
            object.version := object.version + 1;
            V(object.lock);
        else
            AtomicAdd(object.version, 1);
        end if;
    end WriterExit;

Fig. 2.
Each process that modifies a shared object must use the entry procedure WriterEntry (Fig. 2). In this routine, the writer uses a semaphore associated with the object to gain exclusive access to the object. Once the semaphore is acquired, the writer process waits for all active readers to finish. No new readers can access the object since the entry routine for a reader must also acquire the semaphore. When all readers have finished with the object, the writer is free to access the current version of the object. The writer records the current version number of the object onto its process history tape as well as the number of readers for that version. The writer may then modify the shared object.
Exclusive access is maintained because the semaphore is not released until the exit procedure is called. The WriterExit routine (also in Fig. 2) simply initializes the number of readers for the new version, increments the version number for the object, and releases exclusive access to the object by performing a V operation on the object's semaphore.
In replay mode, the object semaphore is not required for either readers or writers because the information on process history tapes, in conjunction with the counts maintained with the object, is sufficient to correctly order the operations on a target object. A writer must wait until the current version of the object matches the version number recorded on the writer's history tape. This ensures that the writer modifies the correct version. Next, the writer must make sure that the number of readers that have seen the current version of the object during replay is equal to the number of readers that saw that version
in the original execution. Since the ReaderExit routine updates the count of total readers for the object version after completing the read, a writer cannot proceed until all reads of the previous version have finished. Following the write operation, the WriterExit procedure simply initializes the number of readers for the new version and then increments the object version number. Since this is the last operation performed by a writer, no reader will attempt to access the new version until the writer has finished.
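For readers who prefer running code to pseudocode, the following is a compact Python rendering of the same CREW monitoring and replay logic (our sketch: threads stand in for processes, one Lock for the object's semaphore and another for the atomic adds, and the tape class is repeated so the sketch is self-contained; it mirrors Figs. 1 and 2 but is not the Butterfly prototype).

    import threading
    import time

    MONITOR, REPLAY = "MONITOR", "REPLAY"

    class HistoryTape:
        def __init__(self):
            self.squares, self.current = [], 0
        def write(self, value):
            self.squares.append(value); self.current += 1
        def read(self):
            value = self.squares[self.current]; self.current += 1
            return value
        def rewind(self):
            self.current = 0

    class SharedObject:
        def __init__(self, value=None):
            self.value = value
            self.version = 0                 # number of completed writes
            self.active_readers = 0          # readers currently inside (monitoring)
            self.total_readers = 0           # readers of the current version
            self.sem = threading.Lock()      # stands in for the object's semaphore
            self.counts = threading.Lock()   # stands in for AtomicAdd

    def delay():
        time.sleep(0.0001)                   # the figures' "delay"

    def reader_entry(obj, tape, mode):
        if mode == MONITOR:
            with obj.sem:                    # P(object.lock) ... V(object.lock)
                with obj.counts:
                    obj.active_readers += 1
            tape.write(obj.version)          # version about to be read
        else:
            key = tape.read()
            while obj.version != key:        # wait for the version read originally
                delay()

    def reader_exit(obj):
        with obj.counts:
            obj.total_readers += 1
            obj.active_readers -= 1          # meaningless, but harmless, in replay

    def writer_entry(obj, tape, mode):
        if mode == MONITOR:
            obj.sem.acquire()                # exclusive access until writer_exit
            while obj.active_readers != 0:   # wait for current readers to finish
                delay()
            tape.write(obj.version)
            tape.write(obj.total_readers)
        else:
            key = tape.read()
            while obj.version != key:        # wait for the version written originally
                delay()
            key = tape.read()
            while obj.total_readers < key:   # wait for that version's readers
                delay()

    def writer_exit(obj, mode):
        obj.total_readers = 0                # no readers of the new version yet
        obj.version += 1
        if mode == MONITOR:
            obj.sem.release()

During the monitoring phase each thread wraps its shared-object accesses in these calls and fills its own tape; to replay, the tapes are rewound and the same program is run with mode set to REPLAY.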
This description of a CREW access protocol is intended to be illustrative, not definitive. Instant Replay requires neither a CREW protocol nor this particular implementation of a CREW protocol. As stated previously, we could use an ME protocol to guarantee a valid serialization. A different implementation would probably be required in a loosely coupled system, one that does not use shared memory. In particular, rather than accessing shared memory locations to read and record object status information, some parts of the protocol could be implemented as remote operations. Version numbers could be used to control access to message buffers on remote nodes, preventing buffer overflow problems during replay. Also, additional machinery (e.g., buffers) would need to be added so that the communication necessary for replay does not compete for the same limited resources used by the executing program.

Nevertheless, regardless of the characteristics of a particular implementation of the access protocols, our basic approach is to record a partial order of operations on each shared object and ensure the same order during program replay.
III. MULTIPROCESSOR PROTOTYPE OF INSTANT REPLAY

A prototype implementation of Instant Replay has been developed for the BBN Butterfly™ Parallel Processor. Several considerations motivated the choice of the Butterfly as a testbed. First, we have a Butterfly at the University of Rochester, but lack methods and tools for debugging parallel programs. This, combined with the current surge of software development for the Butterfly, created an urgent need we wanted to fulfill. Second, interprocess communication on the Butterfly is inexpensive, which tends to encourage development of communication-intensive programs. Third, communication on the Butterfly is available over a wide range of granularities; process interactions can occur through direct sharing of memory, or through the use of higher level primitives for message passing. Finally, the high degree of parallelism offered by the Butterfly provides a challenging test since highly parallel, communication-intensive applications will experience the greatest performance degradation using any program monitoring technique.
A. The BBN Butterfly Parallel Processor

The BBN Butterfly Parallel Processor at the University of Rochester consists of 128 processing nodes connected by a switching network. Each switch node in the switching network is a 4-input, 4-output crossbar switch with a bandwidth of 32 Mbits/s. Each processor is an 8 MHz MC68000 with 24-bit virtual addresses. A 2901-based bit-slice coprocessor interprets every memory reference issued by the 68000 and is used to communicate with other nodes across the switching network. All the memory in the system resides on individual nodes, but any processor can address any memory through the switch. A remote memory reference (read) takes about 4 μs, roughly five times as long as a local reference.
Chrysalis [19], the Butterfly operating system, consists largely of a protected subroutine library that implements operations on a set of primitive data types, including event blocks (structures used by processes to post a word of data to the event owner), dual queues (queues that hold a sequence of long word data enqueued by processes, or alternatively, a sequence of process handles corresponding to processes waiting to dequeue data as it becomes available), shared memory segments, and a global name table. Objects of these types can be shared among all processes executing on the machine. Low-level operations on these data types are provided by Chrysalis, many of which are implemented by microcode. These primitive operations provide a general framework upon which efficient high-level communication protocols and software systems can be built.
B. Monitoring Chrysalis Operations

Our prototype implementation provides programmers with encapsulated versions of the Chrysalis primitive operations on events, dual queues, shared memory objects, and processes. The encapsulated versions of the Chrysalis primitives enforce CREW access synchronization and record a partial order on the operations as detailed in the previous section. This implementation was done at the level of primitive Chrysalis operations to make replay available to all programs; it can be used in any software system developed on top of the Chrysalis operating system. In particular, recent system development efforts at the University of Rochester that can be easily modified to incorporate Instant Replay include LYNX, a programming language and runtime system for distributed computing [22], [23], and SMP, a message-passing system that supports multicast message communication among groups of processes [15].
While encapsulating the Chrysalis primitives for events and dual queues, it became apparent that providing a CREW protocol for all operations was inappropriate. Most of the operations on events and dual queues are atomic, which means that the operations must occur serially with respect to their target data object (a characteristic of the hardware). The CREW protocol allows concurrent readers of shared objects, but introduces additional cost. Since event and dual queue operations cannot exploit concurrent execution of readers, the expense of the CREW protocol is not justified.
By replacing the CREW protocol with the simpler mutual exclusion (ME) protocol, we force the serial execution of the Chrysalis event and dual queue primitive operations, but reduce execution overhead by simplifying the entry and exit protocols. An ME protocol enables use of a single entry/exit routine pair and reduces the amount of information recorded on process history tapes, since we need not maintain a count of the readers for each version.
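A sketch of that simplification (ours, written against the SharedObject of the earlier Python sketch): every operation is treated like a write, a single version number per operation goes on the tape, and no reader counts are kept.

    import time

    def me_entry(obj, tape, mode):
        # Single entry routine for a mutual-exclusion protocol.
        if mode == "MONITOR":
            obj.sem.acquire()                # one operation at a time
            tape.write(obj.version)          # the only value recorded
        else:
            key = tape.read()
            while obj.version != key:        # wait for this operation's turn
                time.sleep(0.0001)

    def me_exit(obj, mode):
        obj.version += 1                     # every operation creates a new version
        if mode == "MONITOR":
            obj.sem.release()

Having every operation create a new version gives each object a single total order of operations, which matches the serial behavior the Chrysalis event and dual queue primitives already impose on their target objects.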
Using encapsulated versions of Chrysalis primitives in program code requires no additional effort beyond that
necessary to use the original primitives. Additional program code is only necessary for regulating access to shared memory objects. Chrysalis provides primitives for sharing segments of memory. General sharing of memory objects as provided by the Butterfly hardware and Chrysalis primitive operations imposes no restrictions on memory access other than serializing word operations on each node, since the memory hardware has only a single port. To guarantee that operations on such shared segments conform to a CREW access protocol, it is necessary to use access entry and exit routines to control sharing of these segments.

The programmer can control the granularity of operations bracketed by the access routines in response to performance concerns. By controlling the cost of the operations within an entry and exit routine pair, the programmer can balance the reduction of parallelism incurred when locking for long periods of time with the overhead of frequently executing the locking primitives. (Since the access protocol entry and exit routines have a small critical section requiring mutual exclusion, there is a serial nature to their execution.)
C. Case Studies

Two applications were chosen for experiments in program monitoring and replay: computation of a knight's tour of a chess board and Gaussian elimination. The knight's tour problem was chosen because there is an existing implementation on the Butterfly that exhibits extremely nondeterministic behavior. A parallel implementation of Gaussian elimination was chosen for study since, unlike the knight's tour program, no matter what execution path occurs when the Gaussian elimination program is run, the overall amount of computation performed by the program is constant. Also, our implementation of Gaussian elimination has already been the subject of a thorough performance study [14] and the statistics previously obtained about the program's execution behavior can be used as a baseline for comparison to determine the cost of our monitoring techniques.
1) Knight's Tour: A knight's tour is a path on a chess board for a knight that successively visits each square once and only once using legal chess moves. Our program to compute a knight's tour of a chess board consists of a master process and a user-specified number of slave processes. The master selects an initial position of the knight on the chess board and enters the corresponding board description in a global task queue. Next, the master creates a set of slave processes that cooperate to search for a knight's tour beginning with the initial board position. Each slave removes a set of board descriptions from the global task queue and replaces it with a new set of board descriptions which could be generated by adding a legal move of the knight from its previous position. The order in which these board descriptions are added and deleted from the task queue determines the breadth and depth of the search performed. Since the order in which slave processes are granted access to the task queue depends on the relative progress of the processes and resolution of memory contention for the task queue, successive executions of the program tend to produce different tours.
Calls to monitored versions of the task queue primitive operations (the task queue is a dual queue) were inserted in the program in place of the original calls to Chrysalis primitives. These modifications did not require substantial effort and caused no significant growth in code size. The effect on the performance of each individual primitive is substantial, since the original primitives are implemented in microcode and there is no such support for the history tape maintenance operations. However, the effect on overall program performance is difficult to measure due to the inherent nondeterministic nature of the knight's tour computation. We cannot obtain identical executions of the monitored and unmonitored versions of the program to compare execution times because such times vary wildly between successive invocations of the program.
We were able to measure accurately the comparative execution times for a knight's tour program during the monitoring phase and the replay phase of the same execution. The difference between the two execution times was less than 5 percent. Using 16 processors, three successive executions required 18, 38, and 52 s to find three different solutions for a 5 x 5 chess board; the executions used 12K, 36K, and 60K bytes, respectively, for history tapes.⁶ Using 64 processors, a solution was found in 43 s and required 48K bytes for history tapes.
It is not surprising that the amount of space required for the history tapes of the knight's tour program varies with the amount of time taken to find a solution. Communication is roughly a constant percentage of the computation and no matter how many processors are working on the task, communication speed, hence history tape space requirements, is limited by the need to serialize access to a single shared task queue. We estimate that the knight's tour program generates between 250 and 300 communication events per second; each communication event requires four bytes to record. From this we can estimate the space requirements for the history tape as a function of the time needed to find a particular solution.
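As a rough consistency check of that estimate: at 300 four-byte events per second, a run lasting t seconds needs about 1200t bytes of history tape, which for the 52 s execution above predicts roughly 62K bytes, close to the 60K bytes actually recorded.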
2) Gaussian Elimination: To obtain an empirical comparison of the relative cost of monitored and unmonitored program executions, an existing program to solve a system of linear equations using Gaussian elimination was instrumented. In Gaussian elimination, the total amount of work performed by the program is independent of the precise ordering of interprocess events during execution; the computation for each pivot row depends on a fixed number of other rows.

The implementation of Gaussian elimination uses a broadcast message-passing system as the basis for communication among the cooperating processes in the program.⁷ A single master process initializes shared data structures and then spawns worker processes to diagonalize the matrix. The master delegates rows of the matrix to each slave process participating in the solution. Each time the processing of a row is completed, the contents are broadcast by the process holding that row to each of the other slaves.
⁶ Our current implementation uses a 32-bit word for each entry on a history tape, although 16-bit words would suffice for our case studies, as well as most other programs. Therefore, our space requirements are conservative and could easily be reduced by a factor of 2.
⁷ The message-passing system used here is an early prototype of SMP [15]. The results described in this section are particularly relevant to programs based on SMP, or similar communication models.
    Send Message
        Find buffer
        WriterEntry(buffer, myprocess)
        Copy message into buffer
        Set number of recipients
        WriterExit(buffer)

    Receive Message
        ReaderEntryPoll(buffers, myprocess)
        Poll incoming message buffers
        Copy message into user area
        ReaderExitPoll(buffers)
        WriterEntry(buffer, myprocess)
        Decrement number of recipients
        WriterExit(buffer)

Fig. 3.
To instrument this application we replaced some dual queue and event primitives used for synchronization between the master and slaves with monitored versions of the Chrysalis primitives. The underlying message-passing system, however, required more extensive changes. Message passing was implemented using shared memory segments as communication buffers. Modifications to the send and receive primitives of the message-passing system were required to enforce the CREW access protocols, as detailed in Section II, for the shared communication buffers. Although the code overhead and programming effort to make this transformation were more substantial than that required for the knight's tour, the size of the effort was still small. The original Gaussian elimination program contains 1059 lines of code. To instrument the program for Instant Replay, 24 lines of code were altered and 17 lines of code were added. Most of the changes to the source code files occurred in the message-passing module. Fig. 3 shows the skeletal form of the monitored message-passing routines.
The performance of the Gaussian elimination implementation was degraded by the enforcement of a CREW protocol on shared object access and recording the access order to shared objects. Fig. 4 depicts the performance of monitored and unmonitored versions of the application on a 400 x 400 matrix. The unmonitored program improves dramatically in performance as additional processors become involved in the computation; however, there is no significant improvement in performance when more than 32 processors are in use. In fact, performance begins to degrade slightly beyond 32 processors because the additional communication involved is not justified by the gain in parallelism [14].
Our first attempt at monitoring this program did not incorporate any optimizations and resulted in severe performance degradation when more than 8 processors were in use, as shown in Fig. 4. This experiment demonstrates the importance of efficient monitoring operations. Modifying the monitoring protocols to reduce the size of critical sections greatly improved the performance, but still managed to roughly triple the execution time of the program on 64 processors. Examination of the monitoring cost showed that the program was spending a great deal of time monitoring and recording noncritical polling operations on buffers.⁸
⁸ New evidence has cast doubt upon the data used to plot the curve for monitoring without polling primitives. While monitoring with polling primitives is clearly preferable, we now believe the disparity between these two approaches is less severe than our graph suggests.
[Fig. 4. Gaussian elimination of a 400 x 400 matrix using message passing. The graph plots execution time in seconds against the number of processors for four versions of the program: the first attempt at monitoring, monitoring without the polling primitive, monitoring with the polling primitive, and the unmonitored program.]
To lower the cost of monitoring, we devised a special entry procedure for use with the common programming idiom in which readers poll before reading a value. Our implementation of message passing uses polling to find incoming messages. Whenever a process attempts to receive a message, a large number of buffers, one for each process in the computation, are polled. Our naive approach to monitoring operations considered each polling operation as an access to a shared object, which was duly recorded on the process's history tape.

The realization that none of the polling operations, except the last one, are necessary for replay led us to devise a special entry procedure used in conjunction with polling. With this new entry procedure, the access ordering to a buffer is recorded only when a message is found. An indication of which buffer supplied the message and the version number for that buffer are recorded on the process's history tape. During replay, only the buffer from which a process received a message during the monitoring phase is polled. Use of this entry procedure eliminated recording of nonessential ordering information during the monitoring phase, saving both time and storage space for the information collected.
The performance of the program using the special entry procedure is also shown in Fig. 4. The result: we were able to monitor a communication-intensive application for replay by imposing a performance overhead of less than 1 percent for up to 64 processors. In addition, we were able to replay the program in the same amount of time as was used by the original execution.
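The polling idea can be sketched as follows (our Python rendering with hypothetical buffer methods; it shows only what is recorded and checked, omitting the CREW locking surrounding Fig. 3's routines):

    import time

    def reader_entry_poll(buffers, tape, mode):
        # Poll a set of incoming message buffers and return the one holding a
        # message.  Only the successful poll is recorded: the buffer's index
        # and its version number (two tape entries per received message).
        if mode == "MONITOR":
            while True:
                for index, buf in enumerate(buffers):
                    if buf.has_message():        # unrecorded, noncritical poll
                        tape.write(index)        # which buffer supplied the message
                        tape.write(buf.version)  # which version of that buffer
                        return buf
                time.sleep(0.0001)
        else:
            index, key = tape.read(), tape.read()
            buf = buffers[index]                 # replay: poll only this buffer
            while buf.version != key:
                time.sleep(0.0001)
            return buf

Recording an index plus a version number is also consistent with footnote 9 below: the polling entry procedure consumes eight bytes (two 32-bit tape entries) per message instead of four.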
As we have already stated, Gaussian elimination is a communication-intensive program, which tends to produce large history tapes. Diagonalization of an 800 x 800 matrix on 64 processors requires 400K bytes for the history tapes. While this is not a small amount of space, it is worth
comparing the space requirements for our method with other techniques that save the contents of every message received by a process in an event log. Such an approach requires over 150 Mbytes of space! In general, Instant Replay will always take less space than an event log whenever large messages are involved, since we only require between 4 and 8 bytes for each message.⁹

⁹ Normally, four bytes per message are used; however, the polling entry procedure used by Gaussian elimination requires eight bytes.
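As a rough back-of-the-envelope reading of these numbers (our arithmetic; the paper does not spell out the message sizes): 400K bytes of history tape at about 8 bytes per received message corresponds to roughly 50,000 message receptions, and if each such message carries one 800-element row of 4-byte values, logging message contents instead would need about 50,000 x 3,200 bytes, or around 160 Mbytes, which is in line with the figure quoted above.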
IV. INSTANT REPLAY IN THE DEBUGGING CYCLE

Program replay makes it possible to repeat the execution of a parallel program as often as desired. Unfortunately, Instant Replay does not automatically debug programs, parallel or otherwise. How then do we use the replay capability to debug parallel programs? In this section we describe several techniques for error isolation that can be used together with our approach. We have already used some of these techniques in our own work; others require the cooperation of additional tools that we have not yet developed.
Our goal is to provide repeatable execution, so that it is possible to observe the same execution of a parallel program as often as desired. Any results that may have been ignored during previous observations can always be reproduced on demand for closer examination. This capability is especially useful for parallel programs since a) multiple processes tend to generate a lot of output, making it easy to miss important results and b) the programming environments for parallel architectures are not as mature as the programming environments for sequential machines, and often lack tools for collecting and analyzing output data. However, the most important reason for reproducible behavior is that it makes cyclic debugging possible.
The simplest form of cyclic debugging is to add output statements to an erroneous program that provide additional details about the execution of the program. Successive executions can be used to provide successively greater detail about those parts of the program under suspicion. This technique does not work with parallel programs in general because the output statements can change the relative timing of operations within the program and yield a different execution sequence. With Instant Replay, however, any number of output statements can be added to the program without changing the execution sequence provided by the replay mechanism. In fact, any type of statement may be added to the program during replay, as long as the additions do not affect the sequence of interactions with shared objects by each process. Thus, the programmer can debug parallel programs by adopting the same cyclic methodology for error isolation used in debugging sequential programs. We have found that this capability alone is a valuable tool for debugging parallel programs, particularly in the absence of other debugging tools.
Repeatable execution also makes top-down, interactive debugging possible. Hierarchical abstraction of detail is necessary to cope with the complexity of large software systems. Abstraction is particularly important in understanding the behavior of parallel programs. The programmer should not have to be concerned with the low-level details of execution of a parallel program, such as the interleaving of primitive operations. Instead, we are interested in the salient features of the execution that characterize its behavior. Our approach allows the programmer to start with a high-level view of a program's behavior, produced by normal output statements or an event mechanism similar to Behavioral Abstraction. By carefully refining that viewpoint, based on the information made available during successive replay, the programmer can study erroneous behavior at any level of detail desired. As a result, one can diagnose program errors in a top-down fashion without wading through voluminous, irrelevant detail at each step.
Another common technique used to debug sequential programs is breakpoint insertion. Breakpoints are added to the program at interesting points in the code. Execution is suspended at each breakpoint, allowing the programmer to examine the system state. Breakpoints only suspend a single thread of execution, however, which is not sufficient for parallel programs consisting of multiple threads of execution. Inserting a breakpoint in one process of a parallel program will have an effect on every process that communicates, directly or indirectly, with the suspended process. In particular, breakpoints can change the relative order of events during execution, producing a different execution sequence each time.
Fortunately, we can provide reproducible execution even in the presence of breakpoints. No matter how many breakpoints are encountered during replay, we continue to order operations based on the contents of history tapes. A process that is suspended by a breakpoint will eventually cause all other processes to wait for some shared object to be read or written (assuming a connected graph of process interactions). When the suspended process is allowed to continue beyond the breakpoint, it will eventually catch up to the other processes and the entire program will continue executing. Thus, it is possible to cycle through breakpoints in many different processes during program replay, examining system state for a different process at each breakpoint.
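The blocking behavior can be seen in another small sketch, again a simplification under assumed names rather than our actual implementation. Process A stops at a simulated breakpoint after its first recorded access; process B completes its next access but then blocks inside the replay wait for the following version until A is released, after which both processes finish in the recorded order.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* One shared object whose accesses are ordered by recorded version numbers. */
static struct { int version; pthread_mutex_t m; pthread_cond_t c; }
    obj = { 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

static int at_breakpoint = 1;          /* set while "process" A is suspended */
static pthread_mutex_t bp_m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  bp_c = PTHREAD_COND_INITIALIZER;

static void replay_access(const char *who, int want)
{
    pthread_mutex_lock(&obj.m);
    while (obj.version != want)                 /* wait for the recorded turn */
        pthread_cond_wait(&obj.c, &obj.m);
    printf("%s performs access %d\n", who, want);
    obj.version++;
    pthread_cond_broadcast(&obj.c);
    pthread_mutex_unlock(&obj.m);
}

static void *proc_a(void *arg) {
    (void)arg;
    replay_access("A", 0);
    pthread_mutex_lock(&bp_m);                  /* simulated breakpoint hit */
    while (at_breakpoint) pthread_cond_wait(&bp_c, &bp_m);
    pthread_mutex_unlock(&bp_m);
    replay_access("A", 2);
    return NULL;
}

static void *proc_b(void *arg) {
    (void)arg;
    replay_access("B", 1);
    replay_access("B", 3);                      /* blocks until A passes access 2 */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, proc_a, NULL);
    pthread_create(&b, NULL, proc_b, NULL);
    sleep(1);                                   /* examine B's state while A is stopped */
    pthread_mutex_lock(&bp_m);
    at_breakpoint = 0;                          /* "continue" command releases A */
    pthread_cond_broadcast(&bp_c);
    pthread_mutex_unlock(&bp_m);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```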
This use of breakpoints also allows the programmer to examine the global state of the computation. Due to communication delays and a reliance on local viewpoints, it is impossible to take an instantaneous snapshot of global state. However, all we really need to see are meaningful global states [4], consistent states based on the happened before ordering of Lamport [11]. For example, if we suspend a process P at breakpoint X, all events that occurred before P reached X should be reflected eventually in all other processes. In addition, other processes should not be allowed to proceed beyond any point that requires process P to proceed beyond X. This view of a computation is the best we can hope for since, if all processes are stopped as the result of setting a single breakpoint, the happened before relation cannot distinguish between the global state represented by all suspended processes and an omniscient snapshot of the global state during normal execution. We provide exactly this notion of global state, and any notion that attempts to be more precise is not likely to be meaningful in a distributed system.
We can use breakpoints, in conjunction with Instant Replay, to provide the ability to halt distributed programs in a consistent state, as in [18], without the need for additional mechanisms. By setting a local breakpoint during replay we are, in effect, setting a breakpoint in the global state. When the local breakpoint is reached, we can see the exact state of the local process containing the breakpoint, and the exact state of all other processes as they block due to enforcement of the happened before relation. Differences between the state of each process in an instantaneous snapshot and what we see at a breakpoint reflect the natural degree of asynchrony between processes in the program.
A consequence of our breakpoint capability is the ability to support single-step execution of processes. Single-step execution can be used during debugging to trace the state transitions of an individual process or the effects of interprocess communication on the internal states of communication partners. We can replay a process using single-step execution because enforcement of the happened before relation ensures that asynchrony between processes remains within allowable bounds.
Instant Replay can also be used in conjunction with an event log technique to allow repeatable execution of a subset of processes involved in a computation. As we have described it, our approach requires that the input to each process be recomputed during replay, rather than retrieved from an event log. This is both an advantage and a disadvantage. While our technique requires less time and space during the monitoring phase, it also requires that all processes be reexecuted during replay. Global replay is a disadvantage if the computational requirements to replay a program are very large, particularly when it is unnecessary to recreate the entire original set of processes to isolate an error. By using an event log together with Instant Replay, we can reexecute the subset of processes in which we are interested and simulate the rest.
There is a tradeoff between the expense of maintaining an event log during normal execution and the expense of reexecuting all processes during replay. The event log approach and Instant Replay represent two extremes, wherein the expense is shifted from the monitoring phase to the replay phase. However, a compromise between our technique and the event log approach is possible. When frequent replay of a subset of processes in a computation is desired, as would be the case when using cyclic debugging to isolate errors, it is possible to collect additional information in an event log during replay that would eliminate the need for reexecution of the entire program during subsequent replay. We can record in an event log all external inputs to the subset of processes of interest. This record would include both inputs from the external environment and inputs from processes not under scrutiny. Interactions involving processes to be reexecuted during replay are recorded, as before, as partial orders on history tapes. On subsequent executions, only the designated subset of processes would be reexecuted and their interface with the external environment, including the other processes, would be simulated using the event log.
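A minimal sketch of this hybrid scheme follows; the record format, the file name, and the function names are assumptions, and a real implementation would also have to tag each input with enough information to deliver it at the right point on the history tape. During one full replay, every input reaching the processes under scrutiny from outside that subset is appended to an event log; later replays of just that subset read the log instead of reexecuting the other processes.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* One fixed-size record per external input (hypothetical layout). */
struct input_record {
    uint32_t src;        /* id of the process or device outside the subset */
    uint32_t len;        /* number of payload bytes that follow            */
    char     data[256];  /* message contents (truncated for the sketch)    */
};

/* Called during the full replay, for every input from outside the subset. */
static void log_external_input(FILE *log, uint32_t src, const void *buf, uint32_t len)
{
    struct input_record r = { src, len, {0} };
    if (len > sizeof r.data) len = sizeof r.data;
    memcpy(r.data, buf, len);
    fwrite(&r, sizeof r, 1, log);
}

/* Called during later partial replays; returns 0 when the log is exhausted. */
static int next_external_input(FILE *log, struct input_record *r)
{
    return fread(r, sizeof *r, 1, log) == 1;
}

int main(void)
{
    FILE *log = fopen("external_inputs.log", "w+b");
    if (!log) return 1;

    /* Full replay: record one input arriving from a process outside the subset. */
    const char *msg = "result from process 7";
    log_external_input(log, 7, msg, (uint32_t)strlen(msg));

    /* Partial replay: simulate the rest of the program from the log. */
    rewind(log);
    struct input_record r;
    while (next_external_input(log, &r))
        printf("simulate input from process %u: %.*s\n", r.src, (int)r.len, r.data);

    fclose(log);
    return 0;
}
```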
Since we assume that the debugging methodology is cyclic, the set of processes that are simulated by an event log will grow larger as we look at fewer processes in greater detail (i.e., top-down debugging). Note however that we would continue to use Instant Replay in the monitoring phase because it has the least impact on normal program execution and can be used to generate event logs during the debugging cycle.¹⁰

¹⁰ In extraordinary circumstances where even a single replay is impractical, process history tapes and a partial event log could both be recorded during the monitoring phase.
Finally, we can use Instant Replay, together with techniques developed by Miller [16], [17], for both causal analysis and performance monitoring of parallel programs. These techniques use a program history graph, which represents interprocess events and the elapsed time between related events, to analyze the behavior of the program. It is possible to change some aspects of the history graph to analyze the effect of changes in the execution environment [17]; however, there is no guarantee that modifying system parameters, such as expected communication delay and processor load, will not change the execution behavior of the program. By using Instant Replay to guarantee repeatable execution behavior, it is possible to change cost labels in the history graph and replay the program under new assumptions, without changing the execution behavior of the program. (Of course, the replay mechanism would have to be modified to incorporate changes to the history graph, such as the message delay time.) In particular, one could examine the effect of communication costs on overall program performance by artificially varying the delay associated with communication.
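As an illustration, the following sketch, with assumed function names and timings, conveys the flavor of such an experiment: an artificial per-message delay stands in for a new cost label, the recorded order is still respected (elided here as a comment), and the elapsed time of each replay serves as a rough performance estimate under that cost.

```c
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Replay a fixed number of message deliveries, charging an artificial
 * communication delay to each one, and return the elapsed time in
 * microseconds. The delivery itself is elided in this sketch.          */
static long replay_with_delay_us(long per_message_delay_us, int messages)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < messages; i++) {
        usleep(per_message_delay_us);   /* substituted cost label for this message */
        /* ... deliver message i in the order given by the history tapes ... */
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_nsec - t0.tv_nsec) / 1000L;
}

int main(void)
{
    /* Compare estimated run times under two hypothetical communication costs. */
    printf("fast network: %ld us\n", replay_with_delay_us(100, 50));
    printf("slow network: %ld us\n", replay_with_delay_us(1000, 50));
    return 0;
}
```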
It is important to note that performance results derived from such an exercise are estimates, since the program is forced to obey a particular execution sequence in the presence of varying performance parameters. However, it is still possible to learn a great deal about parallel programs using these techniques, particularly when used with programs whose executions are less sensitive to race conditions.
V. CONCLUSIONS AND FUTURE WORK
One of the most important tools for analyzing and debugging software is the interactive debugger. Cyclic debugging with an interactive debugger requires the ability to reproduce program behavior on demand. We have described the design and implementation of a system for reproducible execution of parallel programs. In summary, Instant Replay:
* provides reproducible execution of parallel programs
* is not dependent on any particular form of interprocess communication
* makes possible global replay of programs, rather than processes
* introduces no centralized bottleneck, either during monitoring or replay
* does not require synchronized clocks or globally consistent logical time
* allows modifications to programs during the debugging cycle
* has only minor impact on program performance during the monitoring phase
* has reasonable space requirements
* is applicable to both loosely coupled and tightly coupled environments.
There are two potential disadvantages to our approach. First, we record a version number for each access to a shared object. If the granularity of communication is very small (e.g., one-byte messages), we could use less space by simply storing data values (i.e., the event log approach). Second, we require that all processes in a program reexecute during replay. Even though we have shown how to use event logs to eliminate some processes during successive replays, no iterative technique is well-suited to programs that are impractical to reexecute. Nevertheless, our experience has shown that Instant Replay is effective, efficient, and practical.
Additional experience with our technique is necessary, however. We must perform further empirical studies to determine the performance cost of our monitoring technique on other programming environments. Specifically, we intend to explore applications of our techniques to message-based communication in loosely coupled systems and lightweight tasks and shared memory in tightly coupled systems. Our case studies, while very different in programming style, do not address all of the programming models we wish to support.
Several optimizations to reduce further the time and space needs of our technique are also under consideration. An example of such an optimization was described in Section III. Other optimizations based on similar idempotent operations are possible. Another interesting optimization is based on the observation that some parallel programs (or segments of programs) are deterministic. The Gaussian elimination program is a good example. The processes that perform Gaussian elimination proceed in lockstep; no monitoring operations are necessary to reproduce behavior. It is possible to reduce contention and space needs for monitoring if we can determine that some sequence of interprocess operations yields a deterministic schedule. Clearly this information is application-specific and may only be obtainable with programmer assistance. Nonetheless, this approach is worth exploring for large parallel systems with deterministic components.
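One plausible form of programmer assistance is sketched below; the flag name and interface are assumptions, not part of our implementation. While a phase declared deterministic is in effect, accesses to shared objects skip version-number recording entirely, shrinking the history tapes and removing the associated contention. The correctness of the declaration is the programmer's responsibility.

```c
#include <stdio.h>
#include <stdint.h>

static int deterministic_phase = 0;   /* set and cleared with programmer assistance */
static uint32_t tape[1024];           /* simplified history tape for one object     */
static size_t   tape_len = 0;

/* Append a version number to the history tape unless the current phase
 * has been declared deterministic, in which case no record is needed.    */
static void record_access(uint32_t version)
{
    if (deterministic_phase)
        return;                        /* order is reproducible without a record */
    if (tape_len < sizeof tape / sizeof tape[0])
        tape[tape_len++] = version;
}

int main(void)
{
    for (uint32_t v = 0; v < 4; v++) record_access(v);       /* monitored phase */

    deterministic_phase = 1;                                  /* e.g., a lockstep */
    for (uint32_t v = 4; v < 1000; v++) record_access(v);     /* elimination phase */
    deterministic_phase = 0;

    printf("tape entries recorded: %zu\n", tape_len);         /* prints 4 */
    return 0;
}
```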
Finally, we intend to explore the impact of Instant Replay on the development of a general-purpose programming environment for parallel architectures. Additional tools will be constructed as a part of any such environment (e.g., source-level single-process debuggers for parallel programs, tools to monitor execution with graphical displays, compilers to automatically instrument programs), and we will want to integrate our program replay capability with those tools as they are developed.
ACKNOWLEDGMENT
We are grateful to our colleagues in the Department of Computer Science, University of Rochester, for their many comments and criticisms. In particular, N. Gafter made numerous suggestions that substantially improved the presentation.
REFERENCES
[1] F. Baiardi, N. De Francesco, and G. Vaglini, "Development of a debugger for a concurrent language," IEEE Trans. Software Eng., vol. SE-12, pp. 547-553, Apr. 1986.
[2] P. Bates and J. Wileden, "High-level debugging of distributed systems: The behavioral abstraction approach," Dep. Comput. and Inform. Sci., Univ. Massachusetts, COINS 83-29, 1983.
[3] R. Carver and K. Tai, "Reproducible testing of concurrent programs based on shared variables," in Proc. 6th Int. Conf. Distrib. Comput. Syst., Cambridge, MA, May 1986, pp. 428-433.
[4] K. M. Chandy and L. Lamport, "Distributed snapshots: Determining global states of distributed systems," ACM Trans. Comput. Syst., vol. 3, pp. 63-75, Feb. 1985.
[5] S. Y. Chiu, "Debugging distributed computations in a nested atomic action system," Dep. EECS, Massachusetts Inst. Technol., Cambridge, MIT/LCS/Tech. Rep. 327, Dec. 1984.
[6] P. J. Courtois, F. Heymans, and D. L. Parnas, "Concurrent control with readers and writers," Commun. ACM, vol. 14, pp. 667-668, Oct. 1971.
[7] R. S. Curtis and L. D. Wittie, "BugNet: A debugging system for parallel programming environments," in Proc. 3rd Int. Conf. Distrib. Comput. Syst., Miami, FL, Oct. 1982, pp. 394-399.
[8] E. W. Dijkstra, "The structure of the 'THE' multiprogramming system," Commun. ACM, vol. 11, pp. 341-346, May 1968.
[9] H. Garcia-Molina, F. Germano, and W. H. Kohler, "Debugging a distributed computing system," IEEE Trans. Software Eng., vol. SE-10, pp. 210-219, Mar. 1984.
[10] C. A. R. Hoare, "Monitors: An operating system structuring concept," Commun. ACM, vol. 17, pp. 549-556, Oct. 1974.
[11] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Commun. ACM, vol. 21, pp. 558-565, July 1978.
[12] L. Lamport, "On interprocess communication," Digital Equipment Corporation's Western Research Lab., Tech. Rep., Dec. 1985.
[13] R. J. LeBlanc and A. D. Robbins, "Event driven monitoring of distributed programs," in Proc. 5th Int. Conf. Distrib. Comput. Syst., Denver, CO, May 1985, pp. 515-522.
[14] T. J. LeBlanc, "Shared memory versus message-passing in a tightly-coupled multiprocessor: A case study," in Proc. Int. Conf. Parallel Processing, St. Charles, IL, Aug. 1986, pp. 463-466.
[15] T. J. LeBlanc, N. M. Gafter, and T. Ohkami, "SMP: A message-based programming environment for the BBN Butterfly," Dep. Comput. Sci., Univ. Rochester, Rochester, NY, Butterfly Proj. Rep. 8, July 1986.
[16] B. P. Miller, "Performance characterization of distributed programs," Ph.D. dissertation, Computer Science Division (EECS), Univ. California, Berkeley, Tech. Rep. UCB/Computer Science Dep. 85/197, Jan. 1985.
[17] B. P. Miller, "Parallelism in distributed programs: Measurement and prediction," Dep. Comput. Sci., Univ. Wisconsin, Madison, Tech. Rep., May 1985.
[18] B. P. Miller and J. D. Choi, "Breakpoints and halting in distributed programs," Dep. Comput. Sci., Univ. Wisconsin, Madison, Tech. Rep. 648, July 1986.
[19] W. Milliken et al., Chrysalis Programmer's Manual, Version 2.2, BBN Lab., June 1985.
[20] G. L. Peterson, "Concurrent reading while writing," ACM Trans. Programming Lang. Syst., vol. 5, pp. 46-55, Jan. 1983.
[21] R. D. Schiffenbauer, "Interactive debugging in a distributed computational environment," Master's thesis, Computer Science Division, Dep. EECS, Massachusetts Inst. Technol., Cambridge, MIT/LCS/Tech. Rep. 264, Sept. 1981.
[22] M. L. Scott, LYNX Reference Manual, Butterfly Proj. Rep. 7, Dep. Comput. Sci., Univ. Rochester, Rochester, NY, Mar. 1986.
[23] M. L. Scott, "The interface between distributed operating system and high-level programming language," in Proc. Int. Conf. Parallel Processing, St. Charles, IL, Aug. 1986, pp. 242-249.
[24] E. T. Smith, "Debugging tools for message-based, communicating processes," in Proc. 4th Int. Conf. Distrib. Comput. Syst., San Francisco, CA, May 1984, pp. 303-310.
[25] P. Vitanyi and B. Awerbuch, "Atomic shared register access by asynchronous hardware," in Proc. 27th Ann. Symp. Found. Comput. Sci., Toronto, Ont., Oct. 1986, pp. 233-243.
Thomas J. LeBlanc received the B.S. degree in computer science from the State University of New York in 1977 and the M.S. and Ph.D. degrees in computer science from the University of Wisconsin, Madison, in 1979 and 1982, respectively.
Since 1983 he has been an Assistant Professor in the Department of Computer Science, University of Rochester, Rochester, NY, where his research focuses on software support for parallel and distributed systems, including programming languages, operating systems, and program debugging. He is now exploring these issues in the BBN Butterfly environment, a tightly coupled multiprocessor consisting of 128 MC68000's. His paper (with S. A. Friedberg) was the recipient of the Best Paper Award at the 5th International Distributed Computing Systems Conference. He also received the Distinguished Presentation Award at the 1986 International Conference on Parallel Processing.
Dr. LeBlanc is a member of the IEEE Computer Society and the Association for Computing Machinery.
John M. Mellor-Crummey received the B.S.E. degree in electrical engineering and computer science from Princeton University, Princeton, NJ, in 1984 and the M.S. degree in computer science from the University of Rochester, Rochester, NY, in 1986.
From 1984 to 1986 he was a Sproull Fellow in the Department of Computer Science, University of Rochester, where he is currently working towards the Ph.D. degree in computer science. His research interests include parallel processing, distributed systems, VLSI, programming languages, and compiler design.