Optimistic Recovery in Distributed Systems
ROBERT E. STROM and SHAULA YEMINI
IBM Thomas J. Watson Research Center
Optimistic Recovery is a new technique supporting application-independent transparent recovery from processor failures in distributed systems. In optimistic recovery, communication, computation, and checkpointing proceed asynchronously. Synchronization is replaced by causal dependency tracking, which enables a posteriori reconstruction of a consistent distributed system state following a failure using process rollback and message replay. Because there is no synchronization among computation, communication, and checkpointing, optimistic recovery can tolerate the failure of an arbitrary number of processors and yields better throughput and response time than other general recovery techniques whenever failures are infrequent.
CR Categories and Subject Descriptors: D.4.5 [Operating Systems]: Reliability; D.4.7 [Operating Systems]: Organization and Design - distributed systems
General Terms: Algorithms, Reliability
Additional Key Words and Phrases: Distributed algorithms, fault tolerance, message replay, recovery, optimistic algorithms, orphans, transparent recovery
1. INTRODUCTION
Distributed multiprocessor configurations are replacing centralized processors as a result of an increasing demand for both higher throughput and higher availability. However, achieving high availability is more difficult in multiprocessor configurations because of the more complicated failure modes of such systems.
This paper addresses the problem of restoring a consistent state of a distributed
system following the failure of one or more of its processors.
We consider distributed systems that are constructed from processes, each of which maintains private state information and communicates with other processes by exchanging messages. As a result of communication, individual process states will become dependent on one another. A set of process states in which each pair of processes agrees on which communication between them has taken place and which has not is called a consistent system state. If the state of a process that has sent a message is ever lost, then in order for the system state to be
Fig. 1. Domino effect: If process P1 fails and is restored to checkpoint C11, it loses its memory of having received M2 and having sent M3. Thus P2 will have received M3, which was "never sent." If P2 rolls back to C22, it will have sent message M2 which P1 "never received." If P2 rolls back to C23, P1 will have sent message M1 not received by P2, necessitating a further rollback of P1.
consistent, the state change resulting from the receipt of that message in the
receiving process must be undone; that is, the process must be rolled back.
A processor failure will cause the states of some of the processes executing on
that processor to be lost. Recovery from a failure involves restoring a consistent
system state. Recovery mechanisms typically recover from the loss of a process’s
state by retrieving a saved snapshot of an earlier state of that process, called a
checkpoint. Since it is infeasible to take a checkpoint of an entire distributed system "at once," an attempt to recover may result in the unbounded cascading of rollbacks in an attempt to find a consistent set of individual process checkpoints. This problem is called the domino effect [14] and is depicted in Figure 1.
The domino effect is typically avoided by synchronizing checkpointing with
communication and computation (see, e.g., [2], [4], [7], [11], and [12]).
This paper describes optimistic recovery, a new application-independent, transparent recovery technique based on dependency tracking, which avoids the domino effect while allowing computation, communication, checkpointing, and "committing" to proceed asynchronously. Because there are no synchronization delays during normal operation, optimistic recovery can make use of stable storage, that is, storage that persists beyond processor failures [10], thereby supporting recovery from failure of an arbitrary number of processors. The elimination of
synchronization delays additionally yields improved response time over other
transparent recovery mechanisms.
In this paper we describe the optimistic recovery protocols for recovering a
consistent systemwide state following a failure of one or more processors in a
distributed system. We do not discuss (1) means for detecting failures, (2)
mechanisms for determining the new system configuration after a failure, (3)
mechanisms for implementing a stable store, (4) mechanisms for providing
reliable communication within the distributed system. Solutions to these problems are orthogonal to our recovery technique. The reader is referred to [1], [10], [13], and [19] for further discussion of these issues.
1.1 Goals for Distributed System Recovery
Our approach to distributed recovery has the following goals:
-Application-independence. The recovery technique should be applicable to arbitrary programs.
-Application-transparency. The recovery technique should be transparent to the programs being made recoverable. Application-transparency (1) simplifies programming; (2) allows both applications and recovery protocols to evolve independently, thereby avoiding the risk that the software becomes obsolete as a result of small changes in either the application or the underlying hardware; (3) enables preexisting programs to become recoverable without modification.
-High throughput. The CPU resources of all processors should be available for productive work when there are no failures.
-Maximal fault-tolerance. The recovery mechanism should provide recovery from failure of any number of processors of the system.
2. OPTIMISTIC RECOVERY
2.1 Overview
In optimistic recovery, computation, communication, and checkpointing proceed asynchronously. Instead of consistent checkpoints being maintained at all times, enough information is saved to reconstruct a consistent state after a failure.
When reconstructing a consistent state, we face the following problem: A process P1 may receive and process a message M from a process P2, but P2 may fail before having recorded enough information on stable storage to enable restoring the state from which it sent M.
Optimistic recovery solves this problem by having each process track its dependency on the states of other processes with which it communicates. As a result of dependency tracking, it is possible for P1 to detect that it has performed computations that causally depend on states that the failed process P2 has lost. Such computations are sometimes called orphans. If such computations have been performed, they will be undone by restoring an earlier state of P1 that does not depend on lost states.
A state is restored by first restoring an earlier checkpoint from stable storage and then replaying logged messages, that is, reexecuting the process by driving it from the sequence of input messages saved in stable storage. Because we control the extent of rollback by replaying the correct number of messages, the system never rolls back "too far" and hence avoids the domino effect.
The optimistic recovery protocols ensure that the externally visible behavior of a distributed system incorporating these protocols is equivalent to some failure-free execution. By "equivalent," we mean that all messages sent outside the distributed system in the failure-free execution would be sent in the same order during the actual execution and that no other messages will be sent. Despite the existence of processor failures that result in the loss of the recent state of some processes, we meet the above correctness criterion by (a) restoring an earlier possible state of the failed processes using rollback and replay, (b) rolling back
other processes whenever these have been determined by dependency tracking to
depend on lost states, and (c) committing messages to the outside as soon as it
is determined from dependency information that the states that generated the
messages will never need to be rolled back.
2.2 Concepts and Definitions
2.2.1 Logical Machine and Recovery Units. A cluster of machines incorporating optimistic recovery appears to applications executing on it and to other systems communicating with it as a single logical machine, as shown in Figure 2.
The logical machine is partitioned into a fixed number of recovery units (RUs), which communicate with one another through message passing. Application processes may be created and destroyed dynamically and are assigned to particular recovery units. In optimistic recovery, recovery units, rather than individual processes, fail and are recovered.
By allowing each recovery unit to schedule multiple processes, and by holding
the number of recovery units fixed, we simplify the recovery algorithms without
restricting the dynamic variability of the system workload. In addition, it becomes
a system design parameter whether to have many “small” recovery units or a few
“large” ones.
We make the following assumptions about the logical machine:
-Reliable, FIFO channels between recovery units. These can be implemented by any of a number of communication protocols (see, e.g., [19]). We assume nothing about the arrival order of messages sent to a recovery unit from two different sources.
-Fail-stop [15]. All failures are detected immediately and result in halting the failed recovery units and initiating recovery action. (We later weaken this assumption. It suffices to require that failures be detected before any event resulting from them is made visible ("committed") outside the logical machine.)
-Independence. Failures will not recur if the recovery unit is reexecuted on another processor. (We later weaken this assumption as well in our discussion of recovery from software faults.)
-Stable storage. Recovery units store their current state in volatile storage, which is lost upon their failure. Information needed to reconstruct volatile storage is maintained in stable storage and persists across failures [10].
-Spare processing capacity. It is always possible to relocate a failed recovery unit to some working processor, which will be able to access the previously logged recovery information on stable storage. We assume that physical processors will multiplex the workload of several recovery units. Relocating a recovery unit to another processor may degrade performance but will have no other visible effect.
2.2.2 State Intervals. We assume the behavior of each recovery unit to be repeatable and message driven. That is, the state of the recovery unit can be regenerated by restoring an earlier state, restoring the subsequent input message queue in its original order (using the message log described below), and replaying the processing of the recovery unit. Thus, we can identify a state of a recovery unit by the ordinal number of the last input message that it processed.
Fig. 2. Logical Machine: The Logical Machine is seen by
the external world as a single machine communicating
through channels, depicted by double lines. Internally, it
contains multiple communicating recovery units.
Suppose RUi has already processed its first n - 1 input messages and is ready to process its nth input message Mi(n). Its volatile storage is in some state, which we shall call Si(n). Given state Si(n) and message Mi(n), RUi's processing component will execute a series of computations, which we call state interval Ii(n) of RUi. During state interval Ii(n), RUi may conceivably generate output messages destined to other recovery units or to outside the logical machine. When RUi's processing unit is ready to dequeue message Mi(n + 1), state interval Ii(n + 1) is started (Figure 3).
The property of repeatability implies that an arbitrary state Si(n) of RUi can always be restored, provided we can recover an earlier state Si(n - d) and the subsequent messages Mi(n - d) through Mi(n - 1). Si(n) is restored by replaying the processing of these messages in order, starting at Si(n - d).
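As a minimal illustration (the names here are ours, not the paper's), this restoration can be sketched as follows, assuming a deterministic transition function step(state, msg) and a log indexed by message ordinal:

def restore(step, checkpoint_state, d, msg_log, n):
    """Rebuild S_i(n) from a checkpoint of S_i(n - d) by replaying
    messages M_i(n - d) .. M_i(n - 1) in order."""
    state = checkpoint_state
    for j in range(n - d, n):
        state = step(state, msg_log[j])  # replay is exact: step is deterministic
    return state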
2.2.3 Incarnation Numbers and State Indices. Because RUi may roll back and then resume processing, either as a result of its own failure or in response to failure of another recovery unit RUj that is unable to reconstruct states that have affected RUi, some input message ordinal numbers (and the corresponding state interval numbers) may be reused. In order to continue to have a unique way of identifying state intervals, we designate each input message of a given recovery unit, and its corresponding state interval, by a message or state index, which is a pair [ι, μ], where μ is a message number and ι is an incarnation number. The incarnation number of a recovery unit is incremented each time a recovery unit resumes processing after having rolled back (see Figure 4).
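A state index can be modeled directly as an ordered pair; the sketch below (our notation, not the paper's) exploits the fact that Python tuples compare lexicographically, which is exactly the ordering used on indices throughout this section:

from typing import NamedTuple

class StateIndex(NamedTuple):
    incarnation: int   # incremented each time the recovery unit restarts
    msg_num: int       # ordinal of the input message beginning the interval

# Lexicographic comparison comes for free with tuples:
assert StateIndex(1, 5) < StateIndex(2, 10)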
2.2.4 Live History. A state interval of a recovery unit RUi is live if it has not been rolled back. The live history of a recovery unit is the sequence of state intervals of that recovery unit that have not been rolled back. The live history constitutes a sequence of state intervals that could have arisen during a failure-free execution of the recovery unit. For example, in Figure 4, the live history consists of state intervals [1, 1] through [1, 5], [2, 6] through [2, 8], and [3, 9] through [3, 13].
A state interval [ι, μ] of RUi is a live predecessor of a state interval [ι', μ'] iff [ι, μ] precedes [ι', μ'] within the live history of RUi. We use the symbol ≺ to
Fig. 3. Numbering messages and intervals.
Fig. 4. Numbering messages in the presence of rollbacks.
denote the live predecessor relation. Thus, while the actual execution order follows the lexicographical order, [ι, μ] ≺ [ι', μ'] means that [ι, μ] < [ι', μ'] according to the lexicographical order and that no other interval that precedes [ι', μ'] supersedes [ι, μ]. For example, [1, 5] ≺ [2, 10], whereas [1, 6] ⊀ [2, 10].
In a correct implementation of optimistic recovery, all messages committed to
outside the logical machine depend only on live histories.
2.2.5 Causal Precedence. State intervals in a logical machine are partially ordered by a causality relation. Within a recovery unit, the order of the state intervals in the live history, that is, ≺, determines the causality order: Each interval is caused by its live predecessor interval. Between recovery units a partial order is induced by the sending and receiving of messages. State interval [ιi, μi] of RUi immediately causes interval [ιj, μj] of RUj whenever a message sent from
RUi during interval [ιi, μi] is dequeued by RUj to begin interval [ιj, μj]. The transitive closure of the relations "immediately causes" and ≺ results in a partial ordering over the set of state intervals in the distributed system, which we call causal precedence or dependency.
2.2.6 Possible States. A state interval is said to be impossible iff it depends on two state intervals of the same recovery unit that cannot both be live, that is, two state intervals that have identical message numbers and different incarnation numbers. Otherwise, it is said to be possible. For example, suppose a state interval depended on both states [3, 13] and [2, 10] within the recovery unit shown in Figure 4. Since [3, 10] ≺ [3, 13], the state interval also depends on interval [3, 10] and is therefore impossible.
If a state interval is possible, its dependencies can be encoded by a single index into the live history of each of the recovery units. This index denotes the "latest" state interval of each recovery unit on which the possible state interval depends and indicates a causal dependency on that state interval and all its live predecessors.
2.2.7 Logging. Each recovery unit periodically logs information to stable storage in order to support recovery. Logging to stable storage is not synchronized with communication. There are two kinds of information saved on stable storage: checkpoints and input message logs.
Checkpoints are snapshots of the complete state of a recovery unit. Checkpoint frequency is a tuning option: More frequent checkpoints may imply more disk activity; less frequent checkpoints may result in recovery taking a longer time.
An input message is said to be "logged" whenever both its data and the ordinal position in which it is processed can be obtained on demand during recovery. (There exist optimizations in which it is not necessary to actually write some or all of the message to stable store in order to "log" it, because the message is known to be reconstructible from other stable information in the system.)
2.2.8 Lost Messages, Lost States, and Orphans. Messages processed but not yet logged by a recovery unit at the time of a failure are lost messages; the corresponding state intervals are called lost state intervals. Messages and state intervals that are either lost or causally dependent on lost state intervals are called orphans. State intervals that will never become orphans are called committable.
Note that a message can be considered “lost,” even when its data value is
completely recoverable, if the ordinal position the message occupied in the input
message stream of the receiving recovery unit is unrecoverable. This is because
the relative order in which messages sent by different recovery units are merged
is not deterministic, and, therefore, upon recovery the message might be merged
in a different order, and the computation may be different.
2.3 Components and Data Structures
A recovery unit consists of (1) a set of input and output
half-sessions, (2)
a
merge
component, (3) a processing component,
and
(4)
a
recovery manager component.
Each recovery unit maintains (1) a
dependency vector, (2)
a log
vector,
and (3)
an
incarnation start table,
as well as checkpoint and message logs on stable storage
(Figure 5).
Fig. 5. Structure of a recovery unit.
2.3.1 Sessions and Boundary Functions. Each recovery unit may receive messages from other recovery units and from input devices external to the logical machine. Each recovery unit may also send messages to other recovery units and to output devices external to the logical machine.
The protocol by which a recovery unit receives data from an external device is called the input boundary function; the protocol by which a recovery unit sends data to an external device is called the output boundary function. Between each pair of recovery units, there is a pair of unidirectional sessions. A session consists of a pair of half-session protocols: an output half-session in the sender and an input half-session in the receiver. Figure 6 shows a session between recovery units in the dashed box.
Session protocols serve the purpose of detecting lost and duplicate messages
resulting from failures. Boundary function protocols are mostly device specific;
however, output boundary functions additionally delay messages intended to be
sent outside the system until they are committable.
2.3.2 Merge Component. The merge component combines all the input message streams from the input half-sessions and the input boundary functions into a merged input stream. This component assigns each message in the merged input stream an ordinal position number, which corresponds to the order in which messages will be processed by the processing component. The merge component may be implemented very simply, for example, first-come, first-served, or it may be more sophisticated. For example, it may take advantage of the fact that
Fig. 6. Sessions between recovery units.
the recovery unit schedules many processes, and it may delay inserting a message
into the input stream if it is destined for a process that is not ready to receive it.
The merge component is not required to be deterministic, because its output is
logged, and the log is used to reconstruct the merged sequence if replay is
required.
2.3.3 Processing Component. The processing component is that part of a
recovery unit where the application processes execute. Application processes
running in the processing component keep their state in volatile storage. The
application processes are unaware of the recovery protocols. If several application
processes are allocated to a single recovery unit, the processing component is
responsible for scheduling these processes.
The processing component is driven by the merged input stream. The processing component must be deterministic.
2.3.4 Recovery Manager Component. Each recovery unit’s recovery manager is
responsible for maintaining recovery information on stable storage. This includes
scheduling checkpointing, logging messages, and reclaiming obsolete checkpoints
and messages. The recovery manager is also responsible for recovering its recovery
unit following a failure. Recovery consists of restoring the recovery unit’s earliest
checkpoint and replaying the subsequent message log. When the message log is
exhausted, the recovery manager is responsible for broadcasting an appropriate
recovery message.
2.3.5 Dependency Vector. In a logical machine with m recovery units, the
causal predecessors of a possible state interval I of a recovery unit RUi can be
encoded by a vector of state indices, (d1, d2, ..., dm), provided that RUi never enters an impossible state. The causal predecessors of I in each RUj are the set of state intervals d such that d ≼ dj. We call this vector a dependency vector.
As part of its internal state, each recovery unit RUi maintains a dependency vector DVi, which identifies the causal predecessors of RUi's current state interval. The ith component of DVi, DVi(i), is the index of RUi's current state interval.
If state interval I of RUi depends on intervals d1, d2, ..., dm of the m recovery units of a logical machine, then if any one of these intervals cannot be recovered after a failure, I must be rolled back.
2.3.6 Incarnation Start Table. To be able to identify rolled-back states, each recovery unit keeps an incarnation start table, which records the earliest message number in each incarnation of each recovery unit. We use the notation IST(ι, k) to denote the earliest message number in incarnation ι of RUk. For example, if RUk is the recovery unit shown in Figure 4, then IST(1, k) = 1, IST(2, k) = 6, and IST(3, k) = 9. The incarnation start table enables the processing component to determine whether a given state interval is part of the live history of a recovery unit, or whether it has been rolled back.
An interval [ι, μ] is part of the live history of RUk iff

∄ι' (ι' > ι ∧ IST(ι', k) ≤ μ),

that is, there does not exist a later incarnation ι' of RUk that starts at message number μ or less.
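This membership test translates directly into code. The sketch below is ours, with ist_k assumed to map each known incarnation of RUk to its starting message number:

def is_live(iota: int, mu: int, ist_k: dict) -> bool:
    """True iff interval [iota, mu] of RU k has not been rolled back."""
    return not any(inc > iota and start <= mu
                   for inc, start in ist_k.items())

# With the table of Figure 4, {1: 1, 2: 6, 3: 9}:
assert is_live(1, 5, {1: 1, 2: 6, 3: 9})        # [1, 5] is live
assert not is_live(1, 6, {1: 1, 2: 6, 3: 9})    # [1, 6] was rolled back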
The incarnation start table entry for an incarnation ι is needed only as long as messages depending on incarnation numbers less than ι still exist in the logical machine. In practice, if the logical machine has been failure free for a long enough time, there is only one relevant incarnation number for each recovery unit, and the incarnation start tables can be empty.
2.3.7 Log Vector. Each recovery unit logs its input messages in the background; the more messages logged, the more computations become committable.
In order that a recovery unit be able to determine which of its computations are committable, it must know both the status of its own logging progress and the logging progress of the other recovery units in the logical machine. This information is recorded in a log vector LV, maintained in each recovery unit. The ith component of LVk is a state index li of a state in RUi such that it and all its live predecessors have been logged. LVk reflects the current logging status of the recovery units in the logical machine as perceived by RUk.
The actual logging status of a logical machine may be further ahead than
indicated by any of the local log vectors. Log vectors are updated by periodically
and asynchronously broadcasting local log vectors to other recovery units. Since
messages once logged remain logged forever, log vectors are strictly monotonically
increasing. Log vectors are used to determine the committability of state inter-
vals.
Fig. 7. Sample history of a logical machine: Message M depends on interval [0, 6] of RUi. The receipt of message M begins interval [0, 3] of RUj, which now depends on interval [0, 6] of RUi. Message N depends on both interval [0, 3] of RUj and interval [0, 6] of RUi. A failure of RUi at time t1 will necessitate a rollback of RUj, but a failure at time t2 will not.
2.4 Algorithms
2.4.1 Example. We illustrate the algorithms of optimistic recovery with an example (Figure 7) depicting two recovery units RUi and RUj. In this example, the sixth message received by RUi gives rise to state interval [0, 6]. During interval [0, 6] RUi sends a message M to RUj. The receipt of M begins state interval [0, 3] of RUj. Thus, [0, 6] of RUi is a causal predecessor of [0, 3] of RUj. As a result of M, RUj sends message N to a printer outside of the logical machine.
We illustrate the operation of the algorithms in the failure-free case and in
several possible failure cases.
2.4.2 Processing Component Algorithms. In addition to performing computations, the processing component has the responsibility of avoiding impossible states, maintaining the current dependency vector, and labeling the dependencies of each output message.
2.4.2.1 CHECKING POSSIBLE STATES AND MAINTAINING THE DEPENDENCY VECTOR. When the processing component of RUk dequeues an input message M, thereby beginning a new state interval, it is necessary to check that accepting M does not lead to an impossible state, because the dependency vector encoding is only meaningful for possible states.
The processing component of RUk maintains a dependency vector DVk = ([I1, M1], [I2, M2], ..., [Im, Mm]) subject to the following invariant:
-The current state is a possible state.
-The current state is not currently known to be an orphan; that is, for each i,

∄ι (ι > Ii ∧ IST(ι, i) ≤ Mi).
When a new message M with dependency vector ([ι1, μ1], [ι2, μ2], ..., [ιm, μm]) is dequeued, it is checked that processing M preserves the above invariant. There are three possible cases (a code sketch of these checks follows the case analysis):
(1) Usual case. For each i,

∄ι (Ii < ι ≤ ιi ∧ IST(ι, i) ≤ Mi)

and

∄ι (ιi < ι ∧ IST(ι, i) ≤ μi).

The first condition guarantees that if [Ii, Mi] lexicographically precedes [ιi, μi], it is also a live predecessor and, therefore, that dequeuing the message will lead to a possible state. The second condition guarantees that the new message is not itself known to be an orphan. (If the system has been failure free for some time, then ιi = Ii, in which case the first test is not needed, and ιi will be the most recent incarnation of RUi, in which case the second test is trivially satisfied.)
In the normal case, M is accepted, and DVk must be updated to reflect the new state dependency, as follows:

DVk(i) ← [Ii, Mi + 1]              for i = k,
DVk(i) ← max([Ii, Mi], [ιi, μi])   for i ≠ k,

where max is defined on pairs using the usual lexicographical ordering.
(2) M depends on a new incarnation of some recovery unit RUi with an as yet unknown start number, that is, such that IST(ιi, i) is undefined. (The message has arrived at RUk before the incarnation start table has been updated.) Since a new incarnation is known to exist, but its start number is not yet known, it is possible that the current state of RUk depends on a state of RUi that has been rolled back. In this case the processing of M is delayed until the recovery message announcing the new incarnation's start number has been received by the recovery manager.
Notice that this situation, which results in a delay of a recovery unit, only occurs in the case in which another recovery unit has failed and is restarting, and does not arise in failure-free operation.
In Figure 7, if RUi fails at time t1, it will restore checkpointed state C1 and replay its first five messages. RUi will then begin a new incarnation, number 1. If RUi subsequently sends a message to RUj, the dependency vector on that message will contain incarnation number 1 in the ith component. If this message arrives at RUj prior to the arrival of the recovery message containing incarnation 1's start message number, then the processing of the message will be delayed.
(3) The new message is an orphan. That is,

∃(ι, i) (ιi < ι ∧ IST(ι, i) ≤ μi).

In this case, the orphan message is discarded.
Orphans will be detected in this way whenever the sending recovery unit has
not yet learned that it depends on lost states, but the receiving recovery unit has
received a recovery message identifying these states as lost.
Since logging is not synchronized with computation, the orphan may have
already been logged. If the message has been logged, the log must be undone. If
no other recovery unit has been informed that the message was logged, the
message may simply be erased. Otherwise, the recovery unit responds as if it had
crashed itself, in order to undo the effects of logging this message; that is, it must
begin a new incarnation and send a recovery message.
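The following sketch pulls the three cases together. It is our illustration, not the paper's: dv and msg_dv are lists of (incarnation, msg_num) tuples, k is the index of the local recovery unit, and ist[i] is assumed to map each known incarnation of RUi (including the initial one) to its start message number.

def on_dequeue(dv, msg_dv, k, ist):
    """Classify a dequeued message M per cases (1)-(3) above."""
    for i, (li, mui) in enumerate(msg_dv):
        # Case (3): M depends on a rolled-back interval of RU i -- discard.
        if any(inc > li and start <= mui for inc, start in ist[i].items()):
            return "DISCARD_ORPHAN", dv
        # Case (2): M cites an incarnation whose start number is not yet
        # known -- delay until the recovery message announcing it arrives.
        if li not in ist[i]:
            return "DELAY", dv
    # Case (1), first condition: the lexicographically smaller entry must
    # be a live predecessor of the larger one; otherwise RU k's own state
    # depends on a rolled-back interval and must itself be rolled back.
    for i, ((Ii, Mi), (li, mui)) in enumerate(zip(dv, msg_dv)):
        if any(Ii < inc <= li and start <= Mi for inc, start in ist[i].items()):
            return "ROLL_BACK", dv
    # Accept: pointwise lexicographic maximum, then begin a new interval.
    new_dv = [max(d, m) for d, m in zip(dv, msg_dv)]
    Ik, Mk = dv[k]
    new_dv[k] = (Ik, Mk + 1)
    return "ACCEPT", new_dv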
2.4.2.2 SENDING MESSAGES. The processing component appends the current value of the dependency vector to the header of each message it sends.
The overhead associated with tracking dependency is (in a naive implementation) m cells in each message header, where m is the number of recovery units in the logical machine. For logical machines having large numbers of recovery units, there are space-saving optimizations that lower this overhead, such as sending only those dependency vector values that have changed since the last message on this session.
2.4.3 Half-Session Algorithms. The half-session of the sender:
(1) appends successive session sequence numbers (SSNs) to each message it
sends on the channel. Note that these sequence numbers are relative to a
particular channel between two recovery units and are unrelated to message
indices, which are relative to a particular recovery unit.
(2) “saves” (i.e., is responsible for reconstructing) each sent message until
notified by the receiver that the message has been logged.
It is safe to use volatile storage to save these messages, since, if the sender
fails, messages needed by the receiver will be recreated by replay. (As detailed
below, the sender always resumes from its earliest checkpoint, and no checkpoint
is discarded unless it is no longer needed for purposes of regenerating messages
not yet logged by the receiver.) Any messages that would not be recreated by
replay are orphans, and so are not needed, since the receiver would have had to
back them out had it not failed.
The half-session of the receiver:
(1) Notifies the sender of the messages it has logged so that the sender may give
up recovery responsibility.
(2) Maintains an “expected” session sequence number and an “expected” sender
incarnation number. The receiver compares the actual and expected session
sequence numbers on each message it receives. There are three possible cases:
(a) Normal case. The actual and expected session sequence numbers are
equal. In this case the half-session passes the message to the merge
function and increments its expected session sequence number.
(b) Sender failed. The sender's session sequence number is lower than expected. There are two possible actions, depending on whether the sender's incarnation number (which is stored in the message as part of the dependency vector) is greater than the incarnation number expected by the receiver:
(b1) The message's incarnation number is lower than or equal to the expected incarnation number. In this case the message is a duplicate, sent by the recovering unit during replay. Such a message is discarded by the input half-session.
(b2) The message's incarnation number is higher than the expected incarnation number. In this case the sender has completed recovery, and the receiver accepts the message and modifies its expected session sequence number and incarnation number to be the next higher sequence number of the new incarnation.
In the example (Figure 7), suppose that RUi fails at time t1. Suppose that just prior to the failure it had sent two messages to RUj: a message from state [0, 2] with session sequence number 101, and message M from state [0, 6], with session sequence number 102. After RUi restores checkpoint C1, it replays state intervals [0, 1] through [0, 5], thereby resending the message with session sequence number 101. Since RUj expects sequence number 103, and since message 101 arrives from the earlier incarnation, the message is discarded. After replaying, RUi starts a new incarnation, number 1. The next message sent to RUj will carry sequence number 102, but incarnation number 1. RUj will accept this message and reset its expected sequence and incarnation number.
(c) Receiver failed. In this case the session sequence number of the received message will be higher than expected. The receiver will retrieve the missing messages by obtaining any logged messages from the log and any unlogged messages from the sender.
The actions of the receiver half-session as a result of comparing session
sequence numbers are summarized in Figure 8.
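A compact sketch of this dispatch (our names, not the paper's; ssn and inc come from the arriving message, and the expected_* values are the receiver's state):

def on_session_message(ssn, inc, expected_ssn, expected_inc):
    """Return (action, new_expected_ssn, new_expected_inc)."""
    if ssn == expected_ssn:                     # (a) normal case
        return "ACCEPT", ssn + 1, expected_inc
    if ssn > expected_ssn:                      # (c) receiver failed earlier
        return "FETCH_MISSING", expected_ssn, expected_inc
    if inc <= expected_inc:                     # (b1) duplicate from replay
        return "DISCARD", expected_ssn, expected_inc
    return "ACCEPT", ssn + 1, inc               # (b2) sender's new incarnation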
2.4.4 Output Boundary Function Algorithm. The outside world, that is, any entity outside of the logical machine, is not guaranteed to be able to participate in the optimistic recovery algorithms, since it may be unable to back out computations. Therefore, all output messages with destinations outside the logical machine are buffered in output boundary functions (OBFs) until the states from which they were sent become committable. The following theorem allows us to ascertain when a state interval can be determined to be committable:

THEOREM 1. Let (d1, d2, ..., dm) be the dependency vector of an arbitrary possible state interval [ι, μ]. Then, if there is a log vector LV such that, for each i, di ≤ LV(i), then [ι, μ] is a committable state.
PROOF. Suppose that interval [ι, μ] were not committable. Then by definition of "committable," at some time in the future there will exist a lost state interval of one of the recovery units, say, RUi, on which [ι, μ] depends. Call that state
Fig. 8. Possible actions of a receiver half-session.

SSN test            Incarnation number    Meaning                                Action
actual = expected                         normal                                 accept message
actual > expected                         receiver failed                        obtain missing messages from log and/or sender
actual < expected   actual <= expected    sender failed; message is a duplicate  ignore message
actual < expected   actual > expected     sender failed; message begins a new    accept message; modify expected SSN and
                                          incarnation                            incarnation number
interval si. By the definition of the dependency vector for possible state intervals, si ≼ di. But, by the definition of the log vector, di and all its live predecessors have been logged. Therefore si must have been logged, so it cannot have been lost. □
The output boundary function of RUk may release any messages whose dependency vectors satisfy the condition

di ≤ LVk(i) for all i.

Note that committing requires only local information: the current log vector of the recovery unit committing the message and the message's dependency vector.
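As a sketch (ours, not the paper's), with state indices again as (incarnation, msg_num) tuples and each buffered message assumed to carry its dependency vector in an attribute dv, the release test is a pointwise comparison; an OBF would rescan its buffer whenever the local log vector advances:

def committable(msg_dv, lv):
    """Theorem 1's test: every dependency entry is covered by the log vector."""
    return all(d <= l for d, l in zip(msg_dv, lv))

def release_committable(buffer, lv):
    """Split the OBF buffer into releasable and still-deferred messages."""
    ready = [m for m in buffer if committable(m.dv, lv)]
    held = [m for m in buffer if not committable(m.dv, lv)]
    return ready, held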
In Figure 7, message N is not committable until both message [0, 3] of RUj and message [0, 6] of RUi have been logged. Theorem 1 asserts that the commitment can be guaranteed whenever RUj's message [0, 3] and RUi's message [0, 6] are both logged.
Data destined for outside the logical machine are subject to a commitment
delay that can be affected by (a) the speed with which units log their inputs, and
(b) the delay between logging and communicating the log vector.
If the system designer knows that the message is destined for an external
entity which itself has rollback capability (e.g., a transaction processing system),
then it is possible to improve response time by sending uncommitted messages
for early processing by the external entity and later sending it “commit” or
“abort” messages, allowing the transaction processing to proceed in parallel with
the period of uncertainty about the message. This, in effect, “extends the logical
machine” to include the entity with rollback capability.
2.4.5 Recovery Manager
2.4.5.1 LOGGING. From time to time, the recovery manager takes a checkpoint of the entire volatile state of the recovery unit. (In practice, there exist optimizations for incremental checkpointing.)
The recovery manager logs all input messages on stable storage in the order in which the merge function determines that they are to be dequeued by the processing component.
As a result of buffering, queuing, and I/O delays, messages may be logged either before or after they are processed by the processing component; there is no synchronization between logging and processing. Thus in Figure 7, by time t1, RUi has logged its input messages only up to the fifth, but is already processing its sixth input message. Because there is no synchronization, it is possible to perform optimizations in the stable storage I/O that are not possible with synchronous logs, such as blocking several log entries into a single disk track before writing them out.
2.4.5.2 MAINTAINING THE LOG VECTOR. It is necessary to let other recovery units know what has been logged, since this information is used to commit computations to outside the logical machine and to reclaim obsolete checkpoint and log storage. Whereas dependency information must be communicated on every transmission between recovery units in order for the algorithm to work correctly, log vector information (provided it is eventually transmitted) can be sent less frequently, if desired. However, the sooner the log vector is broadcast to other recovery units, the sooner these recovery units can commit computations, and hence the better the response time observed outside the logical machine.
The log vector is maintained as follows. Each recovery unit, upon receipt of a new log vector ([λ1, μ1], ..., [λm, μm]), computes the union of the sets of logged messages, indicated by taking the pointwise maximum of the components of the old and new log vectors. Let LVk = ([I1, M1], [I2, M2], ..., [Im, Mm]). Then

LVk(i) ← max([Ii, Mi], [λi, μi]) for i = 1 ... m,

where max is defined on pairs using lexicographical ordering.
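In code, the merge is one line (our sketch; tuples again compare lexicographically, matching max on pairs):

def merge_log_vectors(local_lv, received_lv):
    """Pointwise lexicographic maximum of two log vectors."""
    return [max(a, b) for a, b in zip(local_lv, received_lv)]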
A typical protocol for propagating the log vector would be to have the output
half-session piggyback the current log vector on each data message sent out of a
recovery unit. In this case, in order to ensure progress, it may be necessary to
periodically send a message containing only a log vector on any channel that has
remained idle for a sufficiently long time.
2.4.5.3 RECOVERY AFTER FAILURE. After a failure or rollback, a recovery unit restores its earliest checkpoint and replays its log until either an orphan or the end of the log is reached. It then begins a new incarnation by (1) increasing the incarnation number and (2) sending a recovery message to other recovery units in the system informing them of the starting message number of the new incarnation. Each input half-session is given an expected session sequence number and incarnation number based on the last processed message received from that half-session.
In our example, if RUi fails at time t1, it will restore checkpoint C1, replay logged messages through [0, 5], and then begin a new incarnation starting at
state [1, 6]. It will send a recovery message informing other recovery units (e.g., RUj) that the new incarnation has begun, and that states 6 or greater of earlier incarnations (in this case, incarnation number 0) are lost.
When a recovery unit RUk receives a recovery message announcing the start of a new incarnation ι of some recovery unit RUi, it updates its incarnation start table by adding a new entry IST(ι, i). It then examines its current dependency vector to see whether its current state is still a possible state. If the current state of the recovery unit depends on a state that is no longer live, that is,

DVk(i) = [Ii, Mi] and IST(ι, i) ≤ Mi,

then the recovery unit must roll back to an earlier state that does not depend on the no longer live messages of RUi.
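A sketch of this handler (ours, not the paper's; dv and ist are as in the earlier sketches, and the test assumes the announced incarnation is newer than the one we depend on):

def on_recovery_message(i, new_inc, start, dv, ist):
    """RU i has begun incarnation new_inc at message number start."""
    ist[i][new_inc] = start                 # record IST(new_inc, i)
    Ii, Mi = dv[i]
    if new_inc > Ii and start <= Mi:
        # Our current state depends on a lost interval of RU i: restore
        # the earliest checkpoint, replay up to the last non-orphan
        # message, and begin a new incarnation of our own.
        return "ROLL_BACK"
    return "CONTINUE"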
In our example (Figure 7), suppose RUi fails at time t1, after RUj has processed message M, but before RUi has logged its message [0, 6] on which M depends. After the failure, RUi will replay through messages [0, 5] and begin a new incarnation with interval [1, 6]. When RUj receives the recovery message announcing that RUi has begun incarnation 1 with message number 6, it will examine its dependency vector, which will show a dependency on RUi's interval [0, 6]. Since this interval is now known to be lost, RUj will roll back and act exactly as if it had failed. It will restore its checkpoint C1, replay messages [0, 1] and [0, 2], and then begin a new incarnation of its own, sending other recovery units a recovery message with the new incarnation start number.
2.4.5.4 RECLAIMING CHECKPOINTS AND LOGS. During normal operation, each recovery unit accumulates checkpoint and log records in stable storage.
A recovery unit may discard a checkpoint Ci whenever it knows that it will never be required to recover any interval between Ci and the following checkpoint Ci+1, either (a) to roll back for the purpose of undoing the effects of orphan messages (state backout), or (b) to roll back for the purpose of resending messages lost by a receiving recovery unit that has failed (message recovery).
A recovery unit can determine that no interval between checkpoints Ci and Ci+1 will ever need to be recovered for state backout when Ci+1's state is committable, as determined by its dependency vector and current log vector.
A recovery unit can determine that an interval is no longer needed for message recovery when all messages sent from that interval have been logged by their receiving recovery units. If no interval between Ci and Ci+1 is needed for message recovery, then Ci is not needed for message recovery.
Whenever Ci may be discarded, all the subsequent log entries up to Ci+1 can also be discarded.
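The two conditions combine into a simple predicate. This sketch is ours: next_dv is the dependency vector of Ci+1, lv the current log vector, and pending_sends counts messages sent before Ci+1 that their receivers have not yet logged.

def can_discard_checkpoint(next_dv, lv, pending_sends):
    no_state_backout = all(d <= l for d, l in zip(next_dv, lv))  # C_{i+1} committable
    no_message_recovery = (pending_sends == 0)
    return no_state_backout and no_message_recovery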
THEOREM 2. If a state interval I of RUk is committable, and if all messages sent by RUk in intervals preceding I are recoverable, then a systemwide consistent state can thereafter always be found without having to back out interval I.
PROOF. (1) By definition, state interval I does not depend on any orphan messages. Therefore, it is possible to recover all other recovery units to a point where they have sent any message that RUk has received from them (i.e., RUk will never have to be backed out to undo orphans). (2) All messages sent by RUk to other recovery units are recoverable, so they will eventually all be received
(i.e., RUk will never have to be backed out for the purpose of message recovery). □
THEOREM 3. Provided that no recovery unit indefinitely delays (1) logging its
input, (2) transmitting its log vector, and (3) taking another checkpoint, then each
recovery unit will eventually be able to safely discard its oldest checkpoint.
PROOF. Any given recovery unit will eventually take a second checkpoint (third assumption). That second checkpoint will depend on some set of states of each of the other recovery units. But the log vector of each of the recovery units will continue to increase monotonically (first assumption), and that fact will be transmitted to the given recovery unit (second assumption). Therefore the log vector will eventually encompass all the states on which the second checkpoint depends, making the state at that second checkpoint committable. Any messages sent between the first and second checkpoint will also eventually be logged. Therefore, eventually both conditions will be met for discarding the oldest checkpoint. □
COROLLARY. There is no domino effect. This follows from the fact that backout is bounded by the earliest checkpoint that has not been discarded and from the fact that checkpoints are continuously being discarded.
2.4.5.5 RECLAIMING INCARNATION START TABLES. The recovery manager also reclaims obsolete entries in the incarnation start table. Information for a given incarnation ι of a recovery unit can be discarded when there are no more messages from incarnation ι - 1 left in any of the recovery units or in the channels between them. This fact can be determined by means of a periodic broadcast from each recovery unit of the oldest (smallest) incarnation number for every other recovery unit that resides in its log or in any output message queue for which it has message recovery responsibility. The minimum of these incarnation numbers can be used to determine a bound below which any incarnation number is obsolete.
Since increments of incarnation numbers are rare and the extra storage in the
incarnation start tables is negligible, reclaiming incarnation start table entries
can be given extremely low priority and requires negligible overhead in normal
operation.
2.5 Recovery from Non-Fail-Stop Failures
The fail-stop and independence assumptions are not always met in practice, because (1) some errors are not immediately detected, and processing will continue until the faulty state results in a failure; and (2) failures resulting from faulty operating system software will repeat themselves if the identical conditions are restored.
Although the repeatability of software failures is advantageous for debugging, in that it is easier to locate the fault by replaying the crash than by examining a postmortem dump, recovery entails avoiding the replay of the events leading to the error.
To maximize the chance of full recovery from a software failure, it is necessary
to discard as much of recent past history as possible. To do so, we compute for
each RUi the earliest state index si such that no output boundary function has committed any message dependent on interval si. It is then safe for each recovery unit to roll back, acting as if all messages with indices si and later had not been merged. Those of the unmerged messages that depend only on committed states will be retained and remerged, while the rest will be discarded completely. To reduce further the likelihood of repeating the software failure, the nondeterministic merge algorithm can be perturbed, so that remerged messages will be processed in a different order.
Similarly, if a failure is detected after a lapse of time and it is determined that
several recent state intervals are invalid, recovery is still possible, provided that
no output boundary function has yet committed output that depends on the
invalid state intervals. These intervals can be backed out, together with any
computations in other recovery units that depend on them.
Although optimistic recovery cannot recover from failures that are detected
too late, or from operating system software failures that repeat themselves despite
perturbation of the system state, the fact that dependency tracking makes it
possible to reconstruct several past consistent states gives optimistic recovery an
advantage over recovery systems such as those in [2] and [4], which can reconstruct only the latest consistent state.
3. RELATED WORK
We distinguish the following categories of recovery schemes:
-application-specific recovery,
-transaction-based recovery,
-pessimistic recovery.
3.1 Application-Specific Recovery
In application-specific recovery, recovery is explicitly programmed as part of the
application. The recovery code may involve intimate knowledge of both the
application domain and of the underlying hardware (see, e.g., [16] for a survey of some such techniques). Small changes in either the application or the underlying hardware may entail substantial redesign of the recovery algorithms.
3.2 Transaction Recovery
In the transaction model [3, 6, 7, 11] computation is divided into units of work called transactions. A transaction system is expected to behave as if individual transactions were executed in some serial order (serializability), although, to reduce response time, the transactions are actually executed in parallel through multiprogramming or multiprocessing.
Transactions terminate either by aborting or by committing their updates to stable storage. In distributed environments, transaction commitment involves synchronous checkpointing ("force-writing") to stable storage by each of the processors at each transaction boundary [12].
The protocols supporting the committing and aborting of transactions are
easily extended to handle recovery from machine failures by treating failures as
Fig. 9. Short transactions: Commits are frequent, and little concurrency
is achievable, since within a given layer each message modifies the same
data.
aborts.
These protocols can be built into an operating system and can therefore
be made transparent to applications.
Unfortunately, not all programs can be expressed as transactions. Transaction systems are based on several assumptions:
-Serializability is required.
-The probability that different transactions contend for the same data is low.
-Transactions are "long enough," that is, involve enough computation to amortize the I/O delays of synchronous commits at each transaction boundary.
For applications that are structured as transactions and that satisfy the above assumptions, transaction-based recovery is likely to be cost effective, since there is little additional overhead involved. However, transactions are only a special case of the more general model of communicating asynchronous processes, for which serializability may be an unnecessary restriction.
Consider, for example, a layered communication protocol consisting of three layers: L1, L2, and L3. Each layer consists of a process that maintains state information. Messages passing through such a layered protocol typically change the same state information in each layer; for example, the layer sequence number is updated by each message. Thus the assumption of low conflict is not satisfied.
We consider two possibilities for defining transaction boundaries in such a system:
(1) Short transactions. Passing through a single layer is considered a complete transaction. Since upon completion transactions synchronously force-write information to stable storage, the resulting synchronization delays due to I/O to stable storage at each commit point would be unacceptably high. See Figure 9.
(2) Long transactions. A single transaction consists of passing through multiple layers (see Figure 10). Each layer supports messages going out into the network (down) and messages coming in from the network (up). In this case the serializability requirement of the transaction model is overly restrictive. The following sequence of updates is perfectly acceptable for the communication protocol: L1 (up); L2 (down); L2 (up); L1 (down). However, this sequence is not serializable, because L2's data sees the transactions in the order "down-up," and
Fig. 10. Long transactions: Transactions span multiple layers in order to reduce the frequency of commits. Serialization imposes delays because interleaving conflicts with serialization.
L1's data sees the order "up-down." Serialization would substantially reduce the level of concurrency in such a system.
In our view, transactions are best viewed as a technique for efficiently implementing a single logically serial process, such as a database manager, by executing its noninterfering parts in parallel, and not as a general recovery technique.
3.3 Pessimistic Recovery
Both Tandem NonStop™ [2] and Auragen™ [4] are systems that support transparent, application-independent recovery. Unlike optimistic recovery, these systems synchronize communication and computation with checkpointing. We call such systems pessimistic, since they delay processing each message until both the state of the sender and the state of the receiver have been checkpointed, to avoid an inconsistency in the rare case of a failure. To avoid the substantial delays associated with checkpointing onto a stable storage medium such as mirrored disks, pessimistic systems typically use a backup process on another processor to hold checkpoints.
All communication then requires a multiway synchronization of the primary
and backup of both sender and receiver, and multiple failures can no longer be
tolerated: If both the primary’s processor and the backup’s processor fail, recovery
is no longer possible.
4. SUMMARY AND CONCLUSIONS
Optimistic recovery is a transparent recovery mechanism. That is, applications can be written as if they were to be executed on an ideal failure-free machine.
can be written as if they were to be executed on an ideal failure-free machine.
Transparency is important for any system that does not have a single, permanent,
hard-wired application. If the application is to be modified, or if new applications
are to be written, or if old applications written without recovery in mind are to
be run on the system, then transparent recovery will save programming effort
and reduce the risk of introducing errors.
Optimistic recovery applies to any system that can be viewed as a collection of
recovery units communicating by message passing. It is not restricted, as trans-
action-based recovery is, to applications that can be structured as units of work
accessing a global database to which concurrency control is applied.
We believe that optimistic recovery pays a low price for transparent recovery.
Unlike pessimistic recovery techniques, there is no synchronization required
upon communication. Therefore, as long as the I/O bandwidth to the disk is
sufficiently high, logging delays do not slow down computation. Because logging
may be asynchronous, several log entries may be blocked into a buffer and written
out in a single I/O operation. Provided that an input message is logged by the
time the computation it engendered has completed and is ready to return a result
to the external user, there is no response time delay.
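As a rough illustration of this blocked, asynchronous logging (the AsyncLogger class and its methods are our invention, not the paper's mechanism), log entries accumulate in a buffer, several entries share one physical write, and a flush is forced only when a result is about to leave the system:

```python
class AsyncLogger:
    def __init__(self, flush_threshold=4):
        self.buffer = []              # entries not yet on stable storage
        self.stable = []              # stands in for the disk log
        self.flush_threshold = flush_threshold

    def log(self, entry):
        self.buffer.append(entry)     # no synchronous I/O on the fast path
        if len(self.buffer) >= self.flush_threshold:
            self.flush()              # several entries, one I/O operation
        return len(self.stable) + len(self.buffer) - 1   # entry's log position

    def flush(self):
        self.stable.extend(self.buffer)   # a single blocked write
        self.buffer.clear()

    def is_logged(self, position):
        return position < len(self.stable)

logger = AsyncLogger()
pos = logger.log("input message m")   # computation proceeds without waiting
if not logger.is_logged(pos):         # a reply is ready for the external user:
    logger.flush()                    # force the log only now
print(logger.is_logged(pos))          # True -- the reply may be released
```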
Optimistic recovery has the further advantages over pessimistic recovery that
a backup processor is not required for checkpointing, that recovery is possible
even after the temporary loss of all processors, and that some failures not
satisfying the fail-stop condition can be recovered.
Optimistic recovery is a special case of optimistic algorithms [18]. An optimistic
algorithm is one that guesses that an uncertain but highly probable event will
happen, and executes “guarded” computations dependent on that assumption. If
the assumption should prove true, the optimistic algorithm commits the guarded
computations; otherwise, it rolls them back. Optimistic algorithms perform better
than their pessimistic counterparts whenever the net gain (the performance
improvement when the guess succeeds less the performance loss when the guess
fails, weighted by the respective probabilities) is greater than the fixed overhead
of tracking dependencies and maintaining rollback capability.
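The pattern can be summarized in a few lines of illustrative code (run_optimistically and its arguments are our names, assuming state held in a flat dictionary):

```python
def run_optimistically(state, guarded_update, assumption_holds):
    saved = dict(state)        # maintain rollback capability (the fixed overhead)
    guarded_update(state)      # guarded computation: proceed as if the guess held
    if assumption_holds():
        return state           # guess was right: commit the guarded computation
    state.clear()
    state.update(saved)        # guess was wrong: roll back
    return state

acct = {"balance": 100}
run_optimistically(acct,
                   lambda s: s.update(balance=s["balance"] - 30),
                   assumption_holds=lambda: acct["balance"] >= 0)
print(acct)   # {'balance': 70}: the probable event held, so the update commits
```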
The “guess” in optimistic recovery is that the set of state intervals named in
the dependency vector of an input message will be made recoverable before the
next failure. The fixed overhead during failure-free operation consists of the following (a sketch of this bookkeeping appears after the list):
-appending a session sequence number to each message traveling between recovery units and checking it upon arrival;
-maintaining a dependency vector in each recovery unit, copying the vector to the header of each message sent, and updating the dependency vector on each message received;
-periodically checkpointing the full state of each recovery unit and incrementally logging input messages;
-periodically transmitting and updating the log vector;
-buffering messages in the output boundary function until they are committable.
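The first two items might look roughly as follows (a minimal sketch under our own naming; the paper does not prescribe this representation):

```python
class RecoveryUnit:
    def __init__(self, uid, n_units):
        self.uid = uid
        self.interval = 0                 # index of the current state interval
        self.dv = [0] * n_units           # dependency vector, one slot per unit

    def send(self, payload, session_seq):
        # Copy the dependency vector (plus a session sequence number)
        # into the header of every outgoing message.
        return {"from": self.uid, "session": session_seq,
                "dv": list(self.dv), "payload": payload}

    def receive(self, msg):
        self.interval += 1                # receipt begins a new state interval
        # Componentwise maximum tracks causal dependencies transitively.
        self.dv = [max(x, y) for x, y in zip(self.dv, msg["dv"])]
        self.dv[self.uid] = self.interval

a, b = RecoveryUnit(0, 2), RecoveryUnit(1, 2)
b.receive(a.send("hello", session_seq=1))
print(b.dv)   # [0, 1]: b's interval 1 now depends on a's interval 0
```

A state interval whose vector names a lost interval of another recovery unit is exactly the kind of dependent computation that must be rolled back after a failure.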
Because the optimistic recovery algorithms gamble that failures will not occur,
we expect optimistic recovery to recover somewhat more slowly when failures
occur. However, since in most distributed systems failures are very infrequent,
we expect optimistic recovery to perform significantly better overall than other
recovery techniques.
ACKNOWLEDGMENTS
The authors are indebted to David Jefferson for valuable discussions of our
research. Jefferson’s own work on virtual time and the time warp mechanism [8]
proved to be inspirational for turning our thoughts into precise algorithms.
Jefferson’s work, while designed for concurrency control and distributed simu-
lation rather than for recovery, has a number of points in common with ours.
Our work differs from Jefferson’s in our use of a separate time-line for each
recovery unit, together with a partial ordering based on causality rather than a
single global time scale as in [9], and in our use of rollback to recover from
processor failures.
We wish to thank Marc Auslander and Irving Traiger for useful discussions
regarding the implementability of our technique, and Giacomo Cioffi, Arthur
Goldberg and Yehuda Afek for valuable criticisms of earlier drafts of this paper.
REFERENCES
Note: References [5] and [17] are not cited in the text.
1. AGHILI, H., KIM, W., MCPHERSON, J., SCHKOLNICK, M., AND STRONG, R. A highly available database system. IBM Research Rep. RJ 3755, IBM, Jan. 1983.
2. BARTLETT, J. F. A 'nonstop' operating system. In 11th Hawaii International Conference on System Sciences. University of Hawaii, 1978.
3. BJORK, L. Recovery scenario for a DB/DC system. In Proceedings of the ACM Annual Conference (Atlanta, Ga., Aug. 24-29). ACM, New York, 1973, pp. 142-146.
4. BORG, A., BAUMBACH, J., AND GLAZER, S. A message system supporting fault tolerance. In 9th ACM Symposium on Operating Systems Principles (Bretton Woods, N.H., Oct. 11-13). Oper. Syst. Rev. 17, 5 (Oct. 1983), pp. 90-99.
5. CHANDY, K. M., AND LAMPORT, L. Distributed snapshots: Determining global states in distributed systems. ACM Trans. Comput. Syst. 3, 1 (Feb. 1985), 63-75.
6. DAVIES, C. T. Recovery semantics for a DB/DC system. In Proceedings of the ACM Annual Conference (Atlanta, Ga., Aug. 24-29). ACM, New York, 1973, pp. 136-141.
7. GRAY, J., MCJONES, P., BLASGEN, M., LINDSAY, B., LORIE, R., PRICE, T., PUTZOLU, F., AND TRAIGER, I. The recovery manager of the System R database manager. ACM Comput. Surv. 13, 2 (June 1981), 223-242.
8. JEFFERSON, D. Virtual time. USC Tech. Rep. TR-83-213, Univ. of Southern California, Los
Angeles, May 1983.
9. LAMPORT, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558-565.
10. LAMPSON, B., AND STURGIS, H. Crash recovery in a distributed storage system. Xerox PARC
Tech. Rep., Xerox Palo Alto Research Center, Palo Alto, Calif., Apr. 1979.
11. LISKOV, B., AND SCHEIFLER, R. Guardians and actions: Linguistic support for robust distributed programs. In The 9th Annual Symposium on Principles of Programming Languages (Albuquerque, New Mex., Jan. 25-27). ACM, New York, 1982, pp. 7-19.
12. MOHAN, C., AND LINDSAY, B. Efficient commit protocols for the tree of processes model of distributed transactions. In Proceedings of the 2nd ACM SIGACT/SIGOPS Symposium on Principles of Distributed Computing (Montreal, Canada, Aug.), 1983, pp. 76-80.
13. MOHAN, C., STRONG, H. R., AND FINKELSTEIN, S. Method for distributed transaction commit
and recovery using Byzantine agreement within clusters of processors. IBM Res. Rep. RJ 3882,
IBM, San Jose, Calif., June 1983.
14. RUSSELL, D. L. State restoration in systems of communicating processes. IEEE Trans. Softw. Eng. SE-6, 2 (Mar. 1980), 183-194.
15. SCHNEIDER, F. B. Fail-stop processors. In Digest of Papers from Spring Compcon ‘83 (Mar.).
IEEE Computer Society, San Francisco, 1983.
16. SCOTT, R. K., GAULT, J. W., MCALLISTER, D. G., AND WIGGS, J. Experimental validation of
six fault-tolerant software reliability models. In Proceedings of 14th Annual Symposium on Fault-
Tolerant Computer Systems (Kissimmee, Fla., June 20-22). 1984.
17. STROM, R. E., AND YEMINI, S. Optimistic recovery: An asynchronous approach to fault tolerance
in distributed systems. Proceedings of the 14th Annual Symposium on Fault Tolerant Computer
Systems (June 20-22, 1984).
18. STROM, R., AND YEMINI, S. Synthesizing distributed and parallel programs through optimistic
transformations. IBM Res. Rep. RC 10797, IBM, 1984.
19. TANENBAUM, A. S. Computer Networks. Prentice-Hall, Englewood Cliffs, N.J., 1981.
Received December 1983; revised February 1985; accepted April 1985