On the quality of service of failure detectors

Abstract

We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies: (1) how fast the failure detector detects actual failures and (2) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyze its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. We then present some simulation results that show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we suggest some ways to make our failure detector adaptive to changes in the probabilistic behavior of the network.
ON THE QUALITY OF SERVICE OF
FAILURE DETECTORS
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Wei Chen
May 2000
© Wei Chen 2000
ALL RIGHTS RESERVED
ON THE QUALITY OF SERVICE OF FAILURE DETECTORS
Wei Chen, Ph.D.
Cornell University 2000
Failure detectors are basic building blocks of fault-tolerant distributed systems
and are used in a wide variety of settings. They are also the basis of a paradigm for
solving several fundamental problems in fault-tolerant distributed computing such
as consensus, atomic broadcast, leader election, etc.
In this thesis, we study the quality of service (QoS) of failure detectors. By QoS,
we mean a specification that quantifies (a) how fast the failure detector detects actual
failures, and (b) how well it avoids false detections. To the best of our knowledge,
this is the first comprehensive and systematic study of the QoS of failure detectors
that provides both a rigorous mathematical foundation and practical solutions.
We first study the QoS specification of failure detectors. In particular, we propose
a set of QoS metrics that are especially suited for specifying failure detectors with
probabilistic behaviors. We then provide a rigorous mathematical foundation based
on stochastic modeling to support our QoS specification.
Next, we develop a new failure detector algorithm for systems with probabilistic
behaviors (i.e., the behaviors of message delays and message losses follow some prob-
ability distributions). We perform quantitative analysis and derive closed formulas
on the QoS metrics of the new algorithm. We show that among a large class of fail-
ure detectors, the new algorithm is optimal with respect to some of the QoS metrics.
We then show how to configure the new failure detector algorithm to satisfy QoS
requirements given by an application. In order to put the algorithm into practice, we
further explain how to modify the algorithm so that it works when the local clocks
of processes are not synchronized, and how to configure the failure detector even if
the probabilistic behaviors of the system are not known. Finally, we run simulations
of both the new algorithm and a simple failure detector algorithm commonly used in
practice. The simulation results demonstrate that the new failure detector algorithm
provides better QoS than the simple algorithm.
Biographical Sketch
Wei Chen was born on May 2, 1968 in Beijing, China. During most of his first
twenty years, he lived with his parents in a lovely neighborhood north of Long
Tan Lake and three bus stops away from the famous Temple of Heaven, near which his
wife Jian Han was brought up. At an early age, his mother fostered his interest in
mathematics, while his father sent him to a nearby amateur sports school to receive
regular soccer training. Since then, mathematics and soccer have been two of his
long-lasting interests, giving him much joy and excitement.
After six years at No. 26 Middle School (later renamed Hui Wen Middle School
during the years when Jian was studying there), where he wrote his first program
on an Apple II computer, he entered Tsinghua University in 1986 and selected
Computer Science as his major. He received his Bachelor of Engineering degree in
July 1991 with the honor of “Excellent Graduate”, and then continued at Tsinghua
for graduate study and received his Master of Engineering degree in March 1993.
Only around this time did he finally meet Jian, even though they had been brought up
in nearby neighborhoods, and had attended the same elementary and middle schools.
After graduation, he worked in the Department of Computer Science and Tech-
nology, Tsinghua University as a Teaching and Research Associate. In August 1994,
he came to the States to pursue his doctoral degree at the Department of Com-
puter Science, Cornell University. One year later, he married Jian, who since then
has accompanied and supported him throughout his study at Cornell, and in the
meantime pursued her own graduate degree in management science.
To my parents, Chen Chengda, Wang Zhengli
and my wife, Jian
Acknowledgements
More than any other person, I am indebted to my wife, Jian. Her love, understanding,
and support have made my five years at Cornell much more joyful and much less
frustrating than they could have been.
I am extremely grateful to my advisor, Sam Toueg, who has guided me through
my research work. Sam has taught me everything important to high-quality aca-
demic research, from exploring new ideas, to formulating the results, to writing every
single sentence of a paper. I cannot imagine how I could have reached this point
without his help and guidance.
I have also benefited a lot from the collaboration with Marcos Kawazoe Aguilera.
Working with Marcos is always a pleasant and informative experience.
I extend my gratitude to my other committee members Robbert van Renesse,
Joseph Halpern, David Shmoys, and Jon Kleinberg (as the proxy of Professor Shmoys
at my defense), who carefully reviewed my thesis work and provided helpful input. I
have also greatly benefited from interactions throughout the years with many peo-
ple in and outside the department. Among others, they include Ken Birman, Tushar
Chandra, Francis Chu, Vassos Hadzilacos, Narahari U. Prabhu, Michel Raynal, and
Jean-Marie Sulmont.
I would like to offer my special thanks to my friend and soccer teammate Fang
Xue, who has supplied me with much needed background knowledge in probability
theory and stochastic processes. I would also like to thank Thomas Wan and Brian
James, who read part of the thesis and helped me improve my thesis presentation.
I would like to thank my teachers and advisors at Tsinghua, in particular Pro-
fessors Dai Yiqi, Lin Xingliang, Lu Kaicheng, Huang Liansheng, and Lu Zhongwan,
who introduced me to computer science research.
This research work is partially supported by NSF grants CCR-9402896 and CCR-
9711403, and by ARPA/ONR grant N00014-96-1-1014. Any opinions, findings, or
recommendations presented in this thesis, however, are my own and do not neces-
sarily reflect the views of any of the organizations mentioned in this paragraph.
Last but not least, I thank my family, and my wife’s family, for their constant
support of my graduate study at Cornell.
Table of Contents
1 Introduction 1
1.1 On the QoS Specification of Failure Detectors . . . 4
1.2 The Design and Analysis of a New Failure Detector Algorithm . . . 6
1.3 Summary of Other Research Works . . . 8
1.3.1 Failure Detection and Consensus in the Crash-Recovery Model . . . 8
1.3.2 Achieving Quiescence with the Heartbeat Failure Detector . . . 9
1.4 Thesis Organization . . . 11
2 On the Quality-of-Service Specification of Failure Detectors 12
2.1 Introduction . . . 12
2.1.1 Background and Motivation . . . 13
2.1.2 Related Work . . . 16
2.2 Failure Detector Specification . . . 18
2.2.1 The Failure Detector Model . . . 18
2.2.2 Primary Metrics . . . 19
2.2.3 Derived Metrics . . . 21
2.3 Relations between Accuracy Metrics . . . 25
2.4 Discussion . . . 28
3 Stochastic Modeling of Failure Detectors and Their Quality-of-Service Specifications 31
3.1 Introduction . . . 31
3.2 Failure Detector Model . . . 32
3.2.1 Failure Detector Definition . . . 33
3.2.2 Failure Detector Histories as Marked Point Processes . . . 36
3.2.3 The Steady State Behaviors of Failure Detectors . . . 38
3.3 Failure Detector Specification Metrics . . . 46
3.3.1 Definitions of Metrics . . . 46
3.3.2 Relations between Accuracy Metrics . . . 50
4 The Design and Analysis of a New Failure Detector Algorithm 60
4.1 Introduction . . . 60
4.1.1 A Common Failure Detection Algorithm and its Drawbacks . . . 61
4.1.2 The New Algorithm and its QoS Analysis . . . 62
4.1.3 Related Work . . . 65
4.2 The Probabilistic Network Model . . . 67
4.3 The New Failure Detector Algorithm and Its Analysis . . . 68
4.3.1 The Algorithm . . . 68
4.3.2 The Analysis . . . 69
4.3.3 An Optimality Result . . . 84
4.3.4 Configuring the Failure Detector to Satisfy QoS Requirements . . . 88
4.4 Dealing with Unknown System Behavior and Unsynchronized Clocks . . . 92
4.4.1 Configuring the Failure Detector NFD-S When the Probabilistic Behavior of the Messages is Not Known . . . 92
4.4.2 Dealing with Unsynchronized Clocks . . . 96
4.4.3 Configuring the Failure Detector When Local Clocks are Not Synchronized and the Probabilistic Behavior of the Messages is Not Known . . . 102
4.5 Simulation Results . . . 106
4.5.1 Simulation Results of NFD-S . . . 108
4.5.2 Simulation Results of NFD-E . . . 112
4.5.3 Simulation Results of the Simple Algorithm . . . 118
A Theory of Marked Point Processes 127
Bibliography 137
List of Figures
2.1 Detection time T_D . . . 14
2.2 FD_1 and FD_2 have the same query accuracy probability of .75, but the mistake rate of FD_2 is four times that of FD_1 . . . 14
2.3 FD_1 and FD_2 have the same mistake rate 1/16, but the query accuracy probabilities of FD_1 and FD_2 are .75 and .50, respectively . . . 15
2.4 Mistake duration T_M, good period duration T_G, and mistake recurrence time T_MR . . . 20
4.1 Three scenarios of the failure detector output in one interval [τ_i, τ_{i+1}) . . . 68
4.2 The new failure detector algorithm NFD-S, with synchronized clocks, and with parameters η and δ . . . 70
4.3 The new failure detector algorithm NFD-U, with unsynchronized clocks and known expected arrival times, and with parameters η and α . . . 97
4.4 The new failure detector algorithm NFD-E, with unsynchronized clocks and estimated expected arrival times, and with parameters η and α . . . 98
4.5 The maximum detection times observed in the simulations of NFD-S (shown by +) . . . 108
4.6 The average mistake recurrence times obtained from the simulations of NFD-S (shown by +), with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —) . . . 109
4.7 The 99% confidence intervals for the expected values of mistake recurrence times of NFD-S, with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —) . . . 113
4.8 The change of the QoS of NFD-E when n increases. Parameter α = 1.90 . . . 115
4.9 The maximum detection times observed in the simulations of NFD-E (shown by ×) . . . 116
4.10 The average mistake recurrence times obtained from the simulations of NFD-E (shown by ×), with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —) . . . 117
4.11 The maximum detection times observed in the simulations of SFD-L and SFD-S . . . 119
4.12 The average mistake recurrence times obtained from the simulations of SFD-L and SFD-S, with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —) . . . 120
Chapter 1
Introduction
Fault-tolerant distributed systems are designed to provide reliable and continuous
service despite the failures of some of their components. A basic building block
in these systems is the failure detector. Failure detectors are used in a wide vari-
ety of settings, such as network communication protocols [Bra89], computer clus-
ter management [Pfi98], group membership protocols [ADKM92, BvR93, BDGB94,
vRBM96, MMSA+96, Hay98], etc.
Roughly speaking, a failure detector consists of distributed modules such that
each process has access to a local failure detector module that provides (possibly
erroneous) information about which processes have crashed. This information is
typically given in the form of a list of suspects. In general, due to the nondeter-
minism present in distributed systems, such as message delays and losses caused by
network congestion, failure detectors are not reliable: a process that has crashed is
not necessarily suspected, and a process may be erroneously suspected even though
it has not crashed.
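This suspect-list interface can be sketched in a few lines. The sketch below is only an illustration of the informal description above (the class and method names are ours, not the thesis's), using the common timeout-based rule that makes such detectors unreliable in exactly the ways just described:

```python
class FailureDetector:
    """Minimal local failure detector module: remembers the last time a
    message was received from each monitored process, and suspects any
    process that has been silent for longer than a fixed timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}  # process id -> time of the last message

    def on_message(self, process, now):
        # Any message from a process is evidence that it is still up;
        # a late message also removes an earlier (mistaken) suspicion.
        self.last_heard[process] = now

    def suspects(self, now):
        # Possibly erroneous output: a slow or lossy link can make a
        # live process look crashed, and a crashed process is suspected
        # only after its silence exceeds the timeout.
        return [p for p, t in self.last_heard.items()
                if now - t > self.timeout]
```

For example, with a timeout of 3 time units, a process last heard from at time 0 is not suspected at time 2 but is suspected at time 5, whether or not it has actually crashed.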
Chandra and Toueg [CT96] provide the first formal specification of unreliable
failure detectors and show how they can be used to solve some fundamental problems
in distributed computing, such as consensus and atomic broadcast. This approach
was later used and/or generalized in other works, e.g., [GLS95, DFKM96, FC96,
ACT, ACT00, ACT99].
In all of the above works, the failure detector specifications are defined in terms of
the eventual behaviors of failure detectors (e.g., a process that crashes is eventually
suspected). These specifications are appropriate for purely asynchronous systems
in which there is no timing assumption whatsoever. Practical distributed systems,
however, usually do have certain timing constraints. In these systems, applications
require more than just properties on the eventual behaviors of failure detectors. For
example, a failure detector that starts suspecting a process one hour after the process
crashes may still satisfy the properties necessary for solving asynchronous consensus,
but it can hardly satisfy the requirements of any application in practice. Therefore,
in practice, one needs to know the quality of service (QoS) of failure detectors. By
QoS, we mean a specification that quantifies the behavior of a failure detector. More
precisely, it specifies (a) how fast the failure detector detects actual failures, and (b)
how well it avoids false detections.
In this thesis, we focus on the QoS of failure detectors. More specifically:
1. We study how to specify the QoS of failure detectors. In particular:
(a) We propose a set of QoS metrics that are especially suited for specifying
failure detectors with probabilistic behaviors.
(b) We provide a rigorous mathematical foundation based on stochastic
modeling to support our QoS specification.
2. We develop a new failure detector algorithm, and study the QoS it provides.
In particular:
(a) We perform a quantitative analysis and derive closed formulas on the QoS
metrics of the new algorithm.
(b) We show that among a large class of failure detectors the new algorithm
is optimal with respect to some of the QoS metrics.
(c) We show how to configure the algorithm so that it meets the QoS required
by an application. More precisely, given the QoS requirements of an ap-
plication, we show how to use the closed formulas we derived to compute
the parameters of the new algorithm to satisfy the requirements.
(d) To widen the applicability of the new algorithm, we further explain how
to configure the failure detector even if the probabilistic behavior of the
system is not known, and how to modify the algorithm so that it works
when the local clocks of processes are not synchronized.
(e) We run simulations of both the new algorithm and a simple algorithm
commonly used in practice, and from the simulation results we demon-
strate that the new algorithm is better than the simple algorithm with
respect to some QoS metrics.
To the best of our knowledge, this is the first comprehensive and systematic study
of the QoS of failure detectors that provides both a rigorous mathematical foundation
and practical solutions.
1.1 On the QoS Specification of Failure Detectors
How should one specify the QoS of a failure detector? As pointed out above, a
failure detector may be slow in detecting a crash, and it may make mistakes, i.e., it
may suspect some processes that are actually up. Thus the QoS specification should
be given by a set of metrics that describes the failure detector’s speed (how fast it
detects crashes) and its accuracy (how well it avoids mistakes). Note that, when
specifying the QoS of a failure detector, we should consider the failure detector as a
“black box”: the QoS metrics should refer only to the external behavior of the failure
detector, and not to various aspects of its internal implementation.
A failure detector’s speed is easy to measure: this is the time elapsed from the mo-
ment when a process crashes to the time when the failure detector starts suspecting
the process permanently. We call this QoS metric the detection time.
The accuracy metrics should measure how well a failure detector avoids erroneous
suspicions of processes that are actually up. Therefore, when measuring the accuracy
of failure detectors, we assume that the processes being monitored do not crash. It
turns out that determining a good set of accuracy metrics is a subtle task. The
subtleties are due to the variety of the accuracy aspects that applications might be
interested in. For example, consider an application that at random times queries
a failure detector about a process being monitored. For such an application, a
natural measure of accuracy is the probability that, when queried at a random time,
the failure detector does not suspect the process, i.e., the failure detector output
is correct. We call this QoS metric the query accuracy probability. This metric,
however, is not sufficient to fully describe the accuracy of a failure detector. In fact,
it is easy to find two failure detectors that have the same query accuracy probability,
but one makes mistakes more frequently than the other. In some applications, every
mistake of the failure detector causes a costly interrupt, and for such applications the
mistake rate is an important accuracy metric. Mistake rate alone, however, cannot
fully characterize the accuracy either: one can find two failure detectors that have
the same mistake rate but different query accuracy probabilities. Moreover, even when
used together, these two metrics are still not sufficient. It is easy to find two failure
detectors such that one is better in both mistake rate and query accuracy probability,
but the other is better in some other aspect of the accuracy.
These subtleties show that there are several different aspects of accuracy that may
be important to applications, and each aspect has a corresponding accuracy metric.
We identify six accuracy metrics, and then use the theory of stochastic processes to
determine their relations. Based on these relations, we select two accuracy metrics
as the primary ones in the sense that (a) they are not redundant (one cannot be
derived from the other), and (b) together, they can be used to derive the other four
accuracy metrics. These two accuracy metrics, together with the detection time,
provide the QoS specification of failure detectors.
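To make the two accuracy metrics just named concrete, both can be estimated from a recorded output history. The helper below is a hypothetical illustration, not taken from the thesis: it reads the detector's trust/suspect transitions over an observation window in which the monitored process never crashes, and returns the query accuracy probability (the fraction of time a random query would see the correct "trust" output) and the mistake rate (new erroneous suspicions per unit time):

```python
def accuracy_metrics(transitions, horizon):
    """transitions: time-ordered (time, state) pairs with state "T"
    (trust) or "S" (suspect); the output is "T" from time 0 until the
    first transition.  horizon: total observation time.  Assumes the
    monitored process stays up, so every suspicion is a mistake."""
    trust_time = 0.0
    mistakes = 0
    prev_t, prev_state = 0.0, "T"
    for t, state in transitions:
        if prev_state == "T":
            trust_time += t - prev_t       # time spent trusting so far
        if state == "S" and prev_state == "T":
            mistakes += 1                  # a new mistake begins here
        prev_t, prev_state = t, state
    if prev_state == "T":
        trust_time += horizon - prev_t     # trusting until the horizon
    return trust_time / horizon, mistakes / horizon
```

For example, a detector that is wrongly suspicious during [8, 10) of a 16-unit window has query accuracy probability 14/16 = 0.875 and mistake rate 1/16, the style of quantities compared in Figures 2.2 and 2.3.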
The QoS metrics we proposed are especially suited for specifying failure detectors
with probabilistic behaviors (such probabilistic behaviors may be due to the fact that
(a) message losses and delays follow a certain probability distribution, or (b) the
failure detector algorithm itself uses randomization, as in [vRMH98]). We provide a
solid mathematical foundation based on stochastic modeling to formally model the
probabilistic behaviors of failure detectors and their QoS. More precisely, we use the
theory of marked point processes to formally define the failure detector model and
the QoS metrics proposed, and then we perform a rigorous analysis on the relations
between the accuracy metrics under this formal model.
1.2 The Design and Analysis of a New Failure
Detector Algorithm
When designing a failure detector algorithm, one should strive to achieve both good
speed and good accuracy. However, these are two conflicting objectives. To see this,
note that in practice a failure detector typically works as follows: the failure detector
waits for messages from the process being monitored, and if it does not receive any
message from the process for a while, it starts suspecting the process. This suspicion
could be a mistake since the messages from the process may be delayed or lost. If
the failure detector waits for a longer period of time before suspecting the process,
it reduces the chance of making a mistake, but it increases the detection time if
the process actually crashes. Conversely, if the failure detector waits for a shorter
period of time before suspecting the process, it reduces the detection time if the
process actually crashes, but increases the chance of making a mistake. Thus, to
design a good algorithm, one should find the right balance between these two
conflicting objectives.
We first examine a simple failure detector algorithm commonly used in practice,
and notice that when the variation of the message delays is large, this algorithm
cannot achieve both good speed and good accuracy. We then design a new failure
detector algorithm that overcomes the problem of the simple algorithm.
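The trade-off just described can be seen in a toy Monte Carlo experiment. The sketch below is our own illustration, not the thesis's simulation setup: heartbeats are sent once per time unit, suffer exponentially distributed delays and occasional losses (both distributions are arbitrary choices made here), and we count how often a timeout expires even though the sender never crashes:

```python
import random

def false_suspicions(timeout, period=1.0, n=10000, loss=0.01):
    """Count timeout expirations (i.e., mistakes) in a run where the
    monitored process never crashes.  Heartbeat i is sent at time
    i*period, is lost with probability `loss`, and otherwise arrives
    after an exponential delay with mean 0.5 (illustrative choices)."""
    arrivals = sorted(i * period + random.expovariate(2.0)
                      for i in range(n) if random.random() > loss)
    # A mistake occurs whenever the gap between consecutive arrivals
    # exceeds the timeout: the detector wrongly suspects the process.
    return sum(1 for a, b in zip(arrivals, arrivals[1:]) if b - a > timeout)

random.seed(42)
few = false_suspicions(timeout=4.0)   # patient: accurate, but slow to detect
many = false_suspicions(timeout=1.2)  # hasty: fast detection, many mistakes
assert few < many
```

Lengthening the timeout from 1.2 to 4.0 cuts the mistakes dramatically, but any real crash would now go undetected for up to four extra time units, which is exactly the tension the text describes.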
We analyze the QoS of the new algorithm in distributed systems with probabilistic
behaviors (i.e., the behaviors of message delays and message losses follow some prob-
ability distributions). We use the theory of stochastic processes in the analysis, and
derive closed formulas on the QoS metrics of the new algorithm. We then show the
following optimality result: Roughly speaking, among all failure detectors that send
messages at the same rate and satisfy the same upper bound on the worst-case de-
tection time, the new failure detector algorithm is optimal with respect to the query
accuracy probability. This shows that the new failure detector algorithm provides
both good speed and good accuracy. We then show that, given a set of QoS require-
ments by an application, we can use the closed formulas we derived to compute the
parameters of the new algorithm to meet these requirements.
Next, we explain how to make the new algorithm applicable to more general
settings. This involves the following two modifications: (a) When configuring the new
failure detector algorithm to meet an application’s QoS requirements, the original
configuration procedure requires knowledge of the probabilistic behaviors of the
system (i.e., the probability distributions of message delays and message losses). We
show how to configure the new failure detector even if the probabilistic behavior of
the system is not known. (b) The new failure detector algorithm is first given with
the assumption that the local clocks of processes are synchronized. We show how to
modify the new algorithm so that this assumption is no longer necessary.
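For modification (a), the underlying idea can be illustrated simply: the detector can estimate the quantities that the configuration procedure needs from the heartbeats it actually receives. The sketch below is a simplified illustration of that idea, not the thesis's estimation procedure; it assumes synchronized clocks (so a message's delay is its arrival time minus its send time), and the function name and record format are ours:

```python
def estimate_message_behavior(received):
    """received: (seq, send_time, arrival_time) for each heartbeat that
    arrived, with sequence numbers assigned consecutively from 0 by the
    sender.  Returns sample estimates of the message loss probability
    and of the mean and variance of the message delay."""
    delays = [arr - snd for _, snd, arr in received]
    n = len(delays)
    mean_delay = sum(delays) / n
    var_delay = sum((d - mean_delay) ** 2 for d in delays) / n
    # Gaps in the sequence numbers reveal heartbeats that never arrived.
    highest_seq = max(seq for seq, _, _ in received)
    loss_prob = 1.0 - n / (highest_seq + 1)
    return loss_prob, mean_delay, var_delay
```

Estimates of this kind can then be plugged into closed QoS formulas in place of the true, unknown distribution parameters, which is the spirit of configuring the detector without knowing the system's probabilistic behavior.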
Finally, we run simulations of both the new algorithm and the simple algorithm,
and provide a detailed analysis of the simulation results. The conclusions we draw
from these simulations are: (a) the simulation results of the new algorithm are con-
sistent with our mathematical analysis of the QoS metrics; (b) the new algorithm
that does not assume synchronized clocks provides similar QoS as the algorithm that
assumes synchronized clocks; and (c) when comparing the new algorithm with the
simple algorithm under the condition that both algorithms send messages at the
same rate and satisfy the same bound on the worst-case detection time, the new
algorithm provides (in some cases orders of magnitude) better accuracy than the
simple algorithm.
1.3 Summary of Other Research Works
Our research on the QoS of failure detectors aims to provide both a solid foundation
and useful solutions to practical systems. In the same spirit, our other research works
emphasize extending previous theoretical works to more practical computing models.
These works have appeared or will appear as the following journal papers [ACT00,
ACT, ACT99]. We only briefly summarize the main results of these research works
here.
1.3.1 Failure Detection and Consensus in the Crash-Recovery Model
The problem of solving consensus in asynchronous systems with unreliable failure
detectors was first investigated in [CT96, CHT96]. These works established the
paradigm of using failure detection to solve some fundamental problems in fault-
tolerant computing. However, these works only considered systems where process
crashes are permanent and links are reliable (i.e., they do not lose messages). In
practical distributed systems, processes may recover after crashing and links may
lose messages.
In [ACT00], we study the problems of failure detection and consensus in asyn-
chronous systems in which processes may crash and recover, and links may lose
messages. We first propose new failure detectors that are particularly suited for the
crash-recovery model. We next determine the conditions under which stable storage
is necessary to solve consensus in this model. Using the new failure detectors, we give
two consensus algorithms that match these conditions: one requires stable storage
and the other does not. Both algorithms tolerate link failures and are particularly
efficient in the runs that are most likely in practice, namely those with no failures or fail-
ure detector mistakes. In such runs, consensus is achieved within 3δ time units and
with 4n messages, where δ is the maximum message delay and n is the number of
processes in the system.
1.3.2 Achieving Quiescence with the Heartbeat Failure Detector
An algorithm is quiescent if it eventually stops sending messages. Quiescence is an
important property of an algorithm, but in asynchronous systems subject to both
process crashes and message losses, quiescence is not easy to achieve.
In [ACT], we study the problem of achieving reliable communication with quies-
cent algorithms in asynchronous systems with process crashes and lossy links. We
first show that it is impossible to solve this problem in purely asynchronous sys-
tems (with no failure detectors). We then show that, among failure detectors that
output lists of suspects, the weakest one that can be used to solve this problem is
◊P, a failure detector that cannot be implemented. To overcome this difficulty, we
introduce an implementable failure detector called Heartbeat and show that it can be
used to achieve quiescent reliable communication. Heartbeat is novel: in contrast to
typical failure detectors, it does not output lists of suspects and it is implementable
without timeouts. With Heartbeat, many existing algorithms that tolerate only pro-
cess crashes can be transformed into quiescent algorithms that tolerate both process
crashes and message losses. This can be applied to consensus, atomic broadcast,
k-set agreement, atomic commitment, etc.
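The contrast with suspect-list detectors can be made concrete. A Heartbeat-style detector outputs a vector of unbounded counters and consults no timeouts; interpreting the counters is left to the algorithm that uses the detector. The sketch below is our illustrative reading of that idea, not the construction from [ACT]:

```python
class HeartbeatDetector:
    """Outputs a counter per monitored process instead of a suspect
    list.  A process's counter keeps increasing as long as its
    heartbeats keep arriving; a crashed process is one whose counter
    eventually stops growing.  No timeout is ever consulted, which is
    why a detector of this style is implementable."""

    def __init__(self, processes):
        self.counts = {p: 0 for p in processes}

    def on_heartbeat(self, sender):
        self.counts[sender] += 1

    def output(self):
        # A snapshot of the counter vector; never a list of suspects.
        return dict(self.counts)
```

An algorithm using such a detector typically keeps retransmitting to a process only while that process's counter is still increasing, which is, roughly, how quiescence is obtained despite crashes and message losses.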
In [ACT99], we show how to achieve quiescent reliable communication and quies-
cent consensus in partitionable networks, in which not only may processes crash and
messages be lost, but the network may also be partitioned into disconnected
components. We first tackle the problem of reliable communication for partitionable
networks by extending the results in [ACT]. In particular, we generalize the speci-
fication of the heartbeat failure detector, show how to implement it, and show how
to use it to achieve quiescent reliable communication. We then turn our attention
to the problem of consensus for partitionable networks. We first show that, even
though this problem can be solved using a natural extension of failure detector
◊S (the one used in [CT96] to solve consensus), such solutions are not quiescent;
in other words, ◊S alone is not sufficient to achieve quiescent consensus in parti-
tionable networks. We then solve this problem using ◊S and the quiescent reliable
communication primitives that we developed.
1.4 Thesis Organization
In Chapter 2, we propose a set of metrics for the QoS specification of failure detectors.
In Chapter 3, we present the formalization of the failure detector model and the QoS
specification. In Chapter 4, we develop a new failure detector algorithm, analyze its
QoS, show its optimality result, show how to configure the algorithm to satisfy QoS
requirements given by an application, show how to make the algorithm applicable
to more general settings, and show the simulation results that provide an empirical
comparison between the new algorithm and the simple algorithm. In Appendix A,
we summarize relevant definitions and results in the theory of marked point processes
that are used in Chapter 3.
Chapter 2
On the Quality-of-Service Specification of Failure Detectors
2.1 Introduction
In this chapter, we study how to specify the quality of service (QoS) of failure de-
tectors. In particular, we propose a set of QoS metrics that are especially suited for
specifying failure detectors with probabilistic behaviors (such probabilistic behaviors
may be due to the fact that (a) message losses and delays follow a certain probabil-
ity distribution, or (b) the failure detector algorithm itself uses randomization, as in
[vRMH98]).
2.1 .1 Background and Motivation
We consider message-passing distributed systems in which processes may fail by crashing, and messages may be delayed or dropped by communication links.¹ In such systems, failure detectors typically provide a list of processes that are suspected to have crashed so far. A failure detector can be slow, i.e., it may take a long time to suspect a process that has crashed, and it can make mistakes, i.e., it may erroneously suspect some processes that are actually up (such a mistake is not necessarily permanent: the failure detector may later remove this process from its list of suspects). To be useful, a failure detector has to be reasonably fast and accurate.
In this chapter, we propose a set of metrics for the QoS specification of failure detectors. In general, these QoS metrics should be able to describe the failure detector's speed (how fast it detects crashes) and its accuracy (how well it avoids mistakes). Note that speed is with respect to processes that crash, while accuracy is with respect to processes that do not crash.

A failure detector's speed is easy to measure: this is simply the time that elapses from the moment when a process p crashes to the time when the failure detector starts suspecting p permanently. This QoS metric, called detection time, is illustrated in Fig. 2.1.
How do we measure a failure detector's accuracy? It turns out that determining a good set of accuracy metrics is a delicate task. To illustrate some of the subtleties involved, consider a system of two processes p and q connected by a lossy communication link, and suppose that the failure detector at q monitors process p. The output of the failure detector at q is either "I suspect that p has crashed" or "I trust that p is up", and it may alternate between these two outputs from time to time. For the purpose of measuring the accuracy of the failure detector at q, suppose that p does not crash.

¹We assume that process crashes are permanent, or, equivalently, that a process that recovers from a crash assumes a new identity.

Figure 2.1: Detection time T_D
Consider an application that queries q's failure detector at random times. For such an application, a natural measure of accuracy is the probability that, when queried at a random time, the output of the failure detector at q is "I trust that p is up", which is correct. This QoS metric is the query accuracy probability. For example, in Fig. 2.2, the query accuracy probability of FD_1 at q is 12/(12+4) = 0.75.
Figure 2.2: FD_1 and FD_2 have the same query accuracy probability of 0.75, but the mistake rate of FD_2 is four times that of FD_1.
Figure 2.3: FD_1 and FD_2 have the same mistake rate 1/16, but the query accuracy probabilities of FD_1 and FD_2 are 0.75 and 0.50, respectively.

The query accuracy probability, however, is not sufficient to fully describe the
accuracy of a failure detector. To see this, we show in Fig. 2.2 two failure detectors FD_1 and FD_2 such that (a) they have the same query accuracy probability, but (b) FD_2 makes mistakes more frequently than FD_1.² In some applications, every mistake causes a costly interrupt, and for such applications the mistake rate is an important accuracy metric.

Note, however, that the mistake rate alone is not sufficient to characterize accuracy: as shown in Fig. 2.3, two failure detectors can have the same mistake rate, but different query accuracy probabilities.
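The comparisons made in Figs. 2.2 and 2.3 can be reproduced with a few lines of code. The following sketch is ours (the helper names are not from the text): it computes the query accuracy probability and the mistake rate from the alternating durations of a failure detector's trust and suspect periods in a failure-free run.

```python
def query_accuracy(trust, suspect):
    # Fraction of time the output is "trust p", i.e. correct when p is up.
    return sum(trust) / (sum(trust) + sum(suspect))

def mistake_rate(trust, suspect):
    # One S-transition (one mistake) per trust/suspect cycle.
    return len(suspect) / (sum(trust) + sum(suspect))

# Fig. 2.2: FD_1 alternates 12 units of trust with 4 of suspicion,
# FD_2 alternates 3 with 1: same P_A, but FD_2's mistake rate is 4x higher.
assert query_accuracy([12] * 3, [4] * 3) == query_accuracy([3] * 12, [1] * 12) == 0.75
assert mistake_rate([3] * 12, [1] * 12) == 4 * mistake_rate([12] * 3, [4] * 3)

# Fig. 2.3: FD_1 alternates 12/4, FD_2 alternates 8/8: same mistake
# rate 1/16, but query accuracy probabilities 0.75 vs 0.50.
assert mistake_rate([12] * 3, [4] * 3) == mistake_rate([8] * 3, [8] * 3) == 1 / 16
assert query_accuracy([8] * 3, [8] * 3) == 0.5
```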
Even when used together, the above two accuracy metrics are still not sufficient. In fact, it is easy to find two failure detectors FD_1 and FD_2 such that (a) FD_1 is better than FD_2 in both measures (i.e., it has a higher query accuracy probability and a lower mistake rate), but (b) FD_2 is better than FD_1 in another respect: specifically, whenever FD_2 makes a mistake, it corrects this mistake faster than FD_1; in other words, the mistake durations in FD_2 are smaller than in FD_1. Having small mistake durations may be important to some applications.

²The failure detector at q makes a mistake every time its output changes from "trust" to "suspect" while p is actually up.
As can be seen from the above, there are several different aspects of accuracy that may be important to applications, and each aspect has a corresponding accuracy metric.

In this chapter, we first identify six accuracy metrics (since the behavior of a failure detector is probabilistic, most of these metrics are random variables). We then use the theory of stochastic processes to determine their precise relation. This analysis allows us to select two accuracy metrics as the primary ones in the sense that: (a) they are not redundant (one cannot be derived from the other), and (b) together, they can be used to derive the other four accuracy metrics.

In summary, we show that the QoS specification of failure detectors can be given in terms of three basic metrics, namely, the detection time and the two primary accuracy metrics that we identified. Taken together, these metrics can be used to characterize and compare the QoS of failure detectors.
2.1.2 Related Work
There is not much previous work on the QoS specification of failure detectors.

In [CT96], unreliable failure detectors were introduced as an abstraction that can be used to solve some fundamental problems of fault-tolerant distributed computing, such as consensus, in asynchronous systems. This approach was later used and/or generalized in other works, e.g., [GLS95, DFKM96, FC96, ACT, ACT00, ACT99]. In all of these works, the failure detector specifications are defined in terms of the eventual behaviors of failure detectors (e.g., a process that crashes is eventually suspected). The eventual behavior, however, does not describe the QoS of failure detectors (e.g., how fast a process that crashes becomes suspected).
In [GM98], Gouda and McGuire measure the performance of some failure detector protocols under the assumption that the protocol stops as soon as some process is suspected to have crashed (even if this suspicion is a mistake). This class of failure detectors is less general than the one that we study here: in our work, a failure detector can alternate between suspicion and trust many times.

In [vRMH98], van Renesse et al. propose a gossip-style randomized failure detector protocol. They measure the accuracy of this protocol in terms of the probability of premature timeouts.³ The probability of premature timeouts, however, is not an appropriate metric for the specification of failure detectors in general: it is implementation-specific and it cannot be used to compare failure detectors that use timeouts in different ways. This point is further explained in Section 2.4.
In [VR00], Veríssimo and Raynal study timing failure detectors, which detect timing failures (such as the delay of a message or the execution time of a task exceeding a given time bound). The class of timing failure detectors is more general than the class of failure detectors that detect crash failures. They also study QoS, but their work differs significantly from ours: what they study is QoS-FD failure detectors that detect quality-of-service failures. More precisely, they study failure detectors that output some index (and other derived information) to indicate the quality of service of some system services (e.g., network connectivity). What we study here is the quality of service of (crash) failure detectors, i.e., how good a failure detector is in terms of detecting process crashes, and how to configure the failure detector to satisfy QoS requirements given by an application.

³This is called "the probability of mistakes" in [vRMH98].
The rest of the chapter is organized as follows. In Section 2.2, we propose a set of QoS metrics for failure detectors. We quantify the relation between these metrics in Section 2.3, and conclude the chapter with a brief discussion in Section 2.4.

In this chapter, we keep our presentation at an intuitive level. The formal definitions of our model and of our QoS metrics are developed using the theory of stochastic processes, and are given in Chapter 3.
2.2 Failure Detector Specification
We consider a system of two processes p and q. We assume that the failure detector at q monitors p, and that q does not crash. Henceforth, real time is continuous and ranges from 0 to ∞.
2.2.1 The Failure Detector Model
The output of the failure detector at q at time t is either S or T, which means that q suspects or trusts p at time t, respectively. A transition occurs when the output of the failure detector at q changes: an S-transition occurs when the output at q changes from T to S; a T-transition occurs when the output at q changes from S to T. We assume that there are only a finite number of transitions during any finite time interval. A failure detector history describes the output of the failure detector in an entire run.
A failure pattern of process p is just a number F ∈ [0, ∞], denoting the time at which p crashes; F = ∞ means that p does not crash. A run in which p does not crash is called a failure-free run. For each failure pattern, there is a corresponding set of possible failure detector histories (as in [CT96]), and this set has a probability distribution. To understand this, consider the following example. Let F be the failure pattern in which p crashes at time 5. The failure detector at q may detect this crash at time 6, or 6.74, or 9, etc. This is the set H of failure detector histories corresponding to the failure pattern F. In a probabilistic system, some of the failure detector histories in H are more likely than others, and this is given by the probability distribution on H. With this probability distribution, quantities like "the probability that the crash of p is detected before time 8" or "the expected detection time" are now meaningful.
We consider only failure detectors whose behavior eventually reaches steady state, as we now explain.⁴ When a failure detector starts running, and for a while after, its behavior depends on the initial condition (such as whether initially q suspects p or not) and on how long it has been running. Typically, as time passes, the effect of the initial condition gradually diminishes and its behavior no longer depends on how long it has been running; i.e., eventually the failure detector's behavior reaches equilibrium, or steady state. In steady state, the probability law governing the behavior of the failure detector does not change over time. The QoS metrics that we propose refer to the behavior of a failure detector after it reaches steady state.
2.2.2 Primary Metrics
We propose three primary metrics for the QoS specification of failure detectors. The first one measures the speed of a failure detector. It is defined with respect to the runs in which p crashes.

⁴We omit the formal definition of steady state here; this definition is based on the theory of stochastic processes.

Figure 2.4: Mistake duration T_M, good period duration T_G, and mistake recurrence time T_MR
Detection time (T_D): Informally, T_D is the time that elapses from p's crash to the time when q starts suspecting p permanently. More precisely, T_D is a random variable representing the time that elapses from the time that p crashes to the time when the final S-transition (of the failure detector at q) occurs and there are no transitions afterwards (Fig. 2.1). If there is no such final S-transition, then T_D = ∞; if such an S-transition occurs before p crashes, then T_D = 0.
The next two metrics can be used to specify the accuracy of a failure detector. They are defined with respect to failure-free runs.⁵

Mistake recurrence time (T_MR): this measures the time between two consecutive mistakes. More precisely, T_MR is a random variable representing the time that elapses from an S-transition to the next one (Fig. 2.4). If no new S-transition occurs, then T_MR = ∞.

Mistake duration (T_M): this measures the time it takes the failure detector to correct a mistake. More precisely, T_M is a random variable representing the time that elapses from an S-transition to the next T-transition (Fig. 2.4). If no S-transition occurs, then T_M = 0; if no T-transition occurs after an S-transition, then T_M = ∞.

⁵In Section 2.4, we explain why these metrics also measure the failure detector accuracy in runs in which p crashes.
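As a small illustration (this sketch is ours, not part of the model), samples of T_M and T_MR can be extracted from the S-transition and T-transition times of a failure-free run:

```python
def mistake_samples(s_times, t_times):
    # Assumes the transitions of a failure-free run alternate S, T, S, T, ...
    # (each suspicion is eventually corrected, as in Fig. 2.4).
    t_m = [t - s for s, t in zip(s_times, t_times)]       # S-transition -> next T-transition
    t_mr = [b - a for a, b in zip(s_times, s_times[1:])]  # S-transition -> next S-transition
    return t_m, t_mr

# A trace with suspicions at times 10 and 26, each corrected 4 units later.
t_m, t_mr = mistake_samples([10, 26], [14, 30])
assert t_m == [4, 4] and t_mr == [16]
```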
As we discussed in the introduction, there are many aspects of failure detector accuracy that may be important to applications. Thus, in addition to T_MR and T_M, we propose four other accuracy metrics in the next section. We selected T_MR and T_M as the primary metrics because, given these two, one can compute the other four (this will be shown in Section 2.3).
2.2.3 Derived Metrics
We propose the following four additional accuracy metrics (they are defined with respect to failure-free runs).

Average mistake rate (λ_M): this measures the rate at which a failure detector makes mistakes, i.e., it is the average number of S-transitions per time unit. This metric is important to long-lived applications such as group membership and cluster management, where each mistake (each S-transition) results in a costly interrupt.

Query accuracy probability (P_A): this is the probability that the failure detector's output is correct at a random time. This metric is important to applications that interact with the failure detector by querying it at random times.
Many applications are slowed down by failure detector mistakes. Such applications prefer a failure detector with long good periods, i.e., periods in which the failure detector makes no mistakes. This observation leads to the following two metrics.

Good period duration (T_G): this measures the length of a good period. More precisely, T_G is a random variable representing the time that elapses from a T-transition to the next S-transition (Fig. 2.4). If no T-transition occurs, then T_G = 0; if no S-transition occurs after a T-transition, then T_G = ∞.
For short-lived applications, however, a closely related metric may be more relevant. Suppose that an application is started at a random time, and that this happens to occur somewhere inside a good period. In this case, we are interested in measuring the remaining portion of this good period: if it is long enough, the short-lived application will be able to complete its task within this period. The corresponding metric is as follows.

Forward good period duration (T_FG): this is a random variable representing the time that elapses from a random time at which q trusts p, to the time of the next S-transition. If no such S-transition occurs, then T_FG = ∞. If the probability that q trusts p at a random time is 0 (i.e., P_A = 0), then T_FG is always 0.

At first sight, it may seem that, on average, T_FG is just half of T_G (the length of a good period). But this is incorrect, and in Section 2.3 we give the actual relation between T_FG and T_G.
We now give a simple example to illustrate how these definitions work.

Example 1. Consider the following simple failure detector algorithm A: process p sends a heartbeat message to q every one time unit; process q suspects p initially; every time q receives a heartbeat message from p, q trusts p for one time unit, and by the end of the unit, if q has not received any new heartbeat message, then q starts suspecting p.

Suppose that algorithm A runs in the following simplified network environment: every heartbeat message is either lost, or is delivered instantaneously; each heartbeat message has an independent probability p_L ∈ (0, 1) of being lost.
We now analyze all seven metrics of this failure detector. In this system, if p does not crash, then q keeps trusting p if and only if the heartbeat messages are not lost. Once a message is lost, q starts suspecting p immediately and the suspicion is kept until a new heartbeat message is received.
For the detection time T_D, let T be the time elapsed between the time t when p sends its last message and the time t′ when p crashes. T has a uniform distribution between 0 and 1. If the last heartbeat message is lost, then q starts suspecting p permanently at time t, before p crashes, and so T_D = 0 in this case. If the last message is not lost, then q starts suspecting p permanently at time t + 1, and so T_D = 1 − T in this case. Therefore, the distribution of T_D is such that with probability p_L, T_D = 0, and with probability 1 − p_L, T_D = 1 − T, where T has a uniform distribution between 0 and 1. Thus we have

    Pr(T_D ≤ x) = 0                     if x < 0
    Pr(T_D ≤ x) = p_L                   if x = 0
    Pr(T_D ≤ x) = p_L + x(1 − p_L)      if 0 < x ≤ 1
    Pr(T_D ≤ x) = 1                     if x > 1
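The derived distribution can be transcribed directly as a function (a sketch of ours; `p_loss` stands for p_L):

```python
def detection_time_cdf(x, p_loss):
    # Pr(T_D <= x) for Example 1: an atom of mass p_L at 0 (the last
    # heartbeat is lost), plus uniform density 1 - p_L over (0, 1].
    if x < 0:
        return 0.0
    if x <= 1:
        return p_loss + x * (1 - p_loss)
    return 1.0

assert detection_time_cdf(-1.0, 0.2) == 0.0
assert detection_time_cdf(0.0, 0.2) == 0.2
assert detection_time_cdf(0.5, 0.2) == 0.2 + 0.5 * 0.8
assert detection_time_cdf(2.0, 0.2) == 1.0
```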
For the accuracy metrics, suppose that p does not crash.
For the mistake duration T_M, suppose that a message m_j is lost, which causes an S-transition of the failure detector. Then after m_j, the first message that q receives causes the next T-transition. Since message losses are independent, the probability that m_{j+i} is the first message that q receives after m_j is p_L^{i−1}(1 − p_L), for all i ≥ 1. Since messages are sent every one unit of time and are delivered instantaneously if not lost, we know that the distribution of T_M is such that with probability p_L^{i−1}(1 − p_L), T_M = i, for i ≥ 1. This is a geometric distribution with parameter 1 − p_L.
The good period duration T_G has a distribution symmetric to that of T_M. Suppose that message m_j is received and it causes a T-transition. Then after m_j, the first message that is lost causes the next S-transition. Thus the distribution of T_G is such that with probability (1 − p_L)^{i−1} p_L, T_G = i, for i ≥ 1. This is a geometric distribution with parameter p_L.
For the mistake recurrence time T_MR, starting at an S-transition, the first message that is received causes the next T-transition, and then the first message that is lost causes the next S-transition. The length of the suspicion period is independent of the length of the trust period that follows, due to the independence of message losses. Thus T_MR is the sum of two independent random variables X and Y, where X and Y have geometric distributions with parameters 1 − p_L and p_L, respectively.
For the average mistake rate λ_M, in any unit time interval in steady state, there is either no S-transition or exactly one S-transition. Thus the average number of S-transitions in the unit interval is just the probability that one S-transition occurs in the interval. An S-transition occurs in the interval if and only if the message sent in the interval is lost and the previous message is not lost. Thus the probability that an S-transition occurs in the interval is p_L(1 − p_L). Therefore, λ_M = p_L(1 − p_L).
For the query accuracy probability P_A, q trusts p at a random time t if and only if the message sent before t is not lost. Therefore, P_A = 1 − p_L.
For the forward good period duration T_FG, suppose that q trusts p at a random time t. Let T′ be the time elapsed from t to the time when the next heartbeat message is sent. Then T′ has a uniform distribution from 0 to 1. From time t on, an S-transition occurs when a heartbeat message is lost. Therefore, T_FG has the distribution such that with probability (1 − p_L)^i p_L, T_FG = T′ + i, for i ≥ 0.
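The closed-form values of λ_M and P_A derived above can be checked with a small Monte Carlo simulation of algorithm A (a sketch of ours, under the simplified network model of this example):

```python
import random

def simulate_algorithm_a(p_loss, n_msgs=200_000, seed=7):
    # Heartbeats are sent once per time unit; each is lost independently
    # with probability p_loss, otherwise delivered instantaneously.
    rng = random.Random(seed)
    received = [rng.random() >= p_loss for _ in range(n_msgs)]
    # q trusts p during unit interval i iff heartbeat i was received.
    p_a = sum(received) / n_msgs
    # An S-transition occurs in interval i iff heartbeat i is lost
    # and heartbeat i-1 was received.
    s_transitions = sum(prev and not cur for prev, cur in zip(received, received[1:]))
    lam = s_transitions / n_msgs
    return p_a, lam

p_a, lam = simulate_algorithm_a(0.1)
# The analysis predicts P_A = 1 - p_L = 0.9 and lambda_M = p_L(1 - p_L) = 0.09;
# the estimates above should agree to within Monte Carlo error.
```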
2.3 Relations between Accuracy Metrics
In Theorem 2.1 below, we state the relation between the six accuracy metrics that we defined in the previous sections. We then use this theorem to justify our choice of the primary accuracy metrics.

Henceforth, Pr(A) denotes the probability of event A, and E(X), V(X), and σ(X) denote the expected value (or mean), the variance, and the standard deviation of random variable X, respectively.
Parts (2) and (3) of Theorem 2.1 assume that in failure-free runs, the probability distribution of failure detector histories is ergodic. Roughly speaking, this means that in failure-free runs, the failure detector slowly "forgets" its past history: from any given time on, its future behavior may depend only on its recent behavior. We call failure detectors satisfying this ergodicity condition ergodic failure detectors. In Chapter 3, we formally define the ergodicity condition, prove the following theorem, and also show the relations between our accuracy metrics in the case that ergodicity does not hold.
Theorem 2.1 For any ergodic failure detector, the following results hold:

(1) T_G = T_MR − T_M.

(2) If 0 < E(T_MR) < ∞, then:

    λ_M = 1 / E(T_MR),                                              (2.1)

    P_A = E(T_G) / E(T_MR) = (E(T_MR) − E(T_M)) / E(T_MR).          (2.2)

(3) If 0 < E(T_MR) < ∞ and E(T_G) = 0, then T_FG is always 0. If 0 < E(T_MR) < ∞ and E(T_G) ≠ 0, then:

    for all x ∈ [0, ∞), Pr(T_FG ≤ x) = (1 / E(T_G)) ∫_0^x Pr(T_G > y) dy,    (2.3)

    E(T_FG^k) = E(T_G^{k+1}) / ((k + 1) E(T_G)).                    (2.4)

In particular,

    E(T_FG) = E(T_G²) / (2 E(T_G)) = (E(T_G) / 2) (1 + V(T_G) / E(T_G)²).    (2.5)
The fact that T_G = T_MR − T_M holds is immediate by definition. Equalities (2.1) and (2.2) are intuitive, but (2.3), (2.4) and (2.5), which describe the relation between T_G and T_FG, are more complex. Moreover, (2.5) is counter-intuitive: one may think that E(T_FG) = E(T_G)/2, but (2.5) says that E(T_FG) is in general larger than E(T_G)/2 (this is a version of the "waiting time paradox" in the theory of stochastic processes [All90]).
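A quick way to see the waiting-time paradox is to compute E(T_FG) empirically from a list of good-period lengths (a sketch of ours): a random query time lands in a given good period with probability proportional to that period's length, and the remaining time is then uniform within the period, which yields E(T_FG) = E(T_G²)/(2 E(T_G)) as in (2.5).

```python
def mean_forward_good_period(periods):
    # Length-biased sampling: a period of length p is hit with probability
    # proportional to p, and the expected remainder within it is p / 2.
    return sum(p * p for p in periods) / (2 * sum(periods))

periods = [1.0, 1.0, 1.0, 9.0]             # mean good period E(T_G) = 3
e_tg = sum(periods) / len(periods)
e_tfg = mean_forward_good_period(periods)  # (1 + 1 + 1 + 81) / 24 = 3.5
assert e_tfg > e_tg / 2                    # 3.5 > 1.5: far more than half
```

Intuitively, a random query is far more likely to land inside the one long period than inside any given short one, which drags E(T_FG) above E(T_G)/2.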
We now explain how Theorem 2.1 guided our selection of the primary accuracy metrics. Equalities (2.1), (2.2) and (2.3) show that λ_M, P_A and T_FG can be derived from T_MR, T_M and T_G. This suggests that the primary metrics should be selected among T_MR, T_M and T_G. Moreover, since T_G = T_MR − T_M, it is clear that given the joint distribution of any two of them, one can derive the remaining one. Thus, two of T_MR, T_M and T_G should be selected as the primary metrics, but which two?

By choosing T_MR and T_M as our primary metrics, we get the following convenient property that helps to compare failure detectors: if FD_1 is better than FD_2 in terms of both E(T_MR) and E(T_M) (the expected values of the primary metrics), then we can be sure that FD_1 is also better than FD_2 in terms of E(T_G) (the expected value of the other metric). We would not get this useful property if T_G were selected as one of the primary metrics.⁶
We now demonstrate parts (2) and (3) of Theorem 2.1 with the example of the previous section.

Example 2. Consider again the algorithm in Example 1. Since message losses are independent, the behavior of the failure detector does not depend on its past history, and so the distribution of failure detector histories in failure-free runs is ergodic.

T_G has a geometric distribution with parameter p_L, so we have E(T_G) = 1/p_L and V(T_G) = (1 − p_L)/p_L². Similarly, we have E(T_M) = 1/(1 − p_L), and E(T_MR) = 1/p_L + 1/(1 − p_L) = 1/[p_L(1 − p_L)]. With p_L ∈ (0, 1), we have 0 < E(T_MR) < ∞ and E(T_G) ≠ 0. Thus the conditions for Equalities (2.1)–(2.5) to hold are satisfied.
From Example 1, we know that λ_M = p_L(1 − p_L) and P_A = 1 − p_L. Since E(T_MR) = 1/[p_L(1 − p_L)] and E(T_G) = 1/p_L, Equalities (2.1) and (2.2) hold for this failure detector.
We now check Equality (2.3). Given any x ∈ [0, ∞), let n = ⌊x⌋ and r = x − n. From Example 1, we know that T_FG has the distribution such that with probability (1 − p_L)^i p_L, T_FG = T′ + i, for i ≥ 0, where T′ has a uniform distribution from 0 to 1. Then

    Pr(T_FG ≤ x) = Pr(T_FG < n) + Pr(n ≤ T_FG ≤ x)
                 = Σ_{i=0}^{n−1} (1 − p_L)^i p_L + (1 − p_L)^n p_L Pr(0 ≤ T′ ≤ r)
                 = 1 − (1 − p_L)^n + r p_L (1 − p_L)^n.
⁶For example, FD_1 may be better than FD_2 in terms of both E(T_G) and E(T_M), but worse than FD_2 in terms of E(T_MR).
On the other hand, for any y ∈ [i − 1, i), Pr(T_G > y) = Σ_{j=i}^∞ Pr(T_G = j) = Σ_{j=i}^∞ (1 − p_L)^{j−1} p_L = (1 − p_L)^{i−1}. Thus

    (1 / E(T_G)) ∫_0^x Pr(T_G > y) dy
        = p_L [ Σ_{i=1}^{n} ∫_{i−1}^{i} Pr(T_G > y) dy + ∫_{n}^{n+r} Pr(T_G > y) dy ]
        = p_L [ Σ_{i=1}^{n} (1 − p_L)^{i−1} + r (1 − p_L)^n ]
        = 1 − (1 − p_L)^n + r p_L (1 − p_L)^n.
Therefore, Equality (2.3) holds for this failure detector.
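The two sides of this verification can also be checked numerically (a sketch of ours; `p_l` stands for p_L):

```python
import math

def tfg_cdf_direct(x, p_l):
    # Pr(T_FG <= x) = 1 - (1-p_L)^n + r * p_L * (1-p_L)^n, n = floor(x), r = x - n.
    n = math.floor(x)
    r = x - n
    return 1 - (1 - p_l) ** n + r * p_l * (1 - p_l) ** n

def tfg_cdf_inversion(x, p_l):
    # Right-hand side of (2.3): (1/E(T_G)) * integral_0^x Pr(T_G > y) dy,
    # with Pr(T_G > y) = (1-p_L)^(i-1) on [i-1, i) and E(T_G) = 1/p_L.
    n = math.floor(x)
    r = x - n
    s = sum((1 - p_l) ** (i - 1) for i in range(1, n + 1))
    return p_l * (s + r * (1 - p_l) ** n)

for x in (0.3, 1.0, 2.7, 5.5):
    assert abs(tfg_cdf_direct(x, 0.25) - tfg_cdf_inversion(x, 0.25)) < 1e-12
```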
Equalities (2.4) and (2.5) are direct consequences of Equality (2.3), and we only check Equality (2.5) here. From the distribution of T_FG, we have

    E(T_FG) = Σ_{i=0}^∞ E(T′ + i)(1 − p_L)^i p_L = Σ_{i=0}^∞ (i + 1/2)(1 − p_L)^i p_L = (2 − p_L)/(2p_L).

On the other hand,

    (E(T_G)/2) (1 + V(T_G)/E(T_G)²) = (1/(2p_L)) (1 + ((1 − p_L)/p_L²)/(1/p_L²)) = (2 − p_L)/(2p_L).
Therefore, Equality (2.5) holds for this failure detector.

Note that for this failure detector, E(T_FG) = (2 − p_L)/(2p_L) while E(T_G) = 1/p_L, so E(T_FG) > E(T_G)/2.
2.4 Discussion
On the Probability of Premature Timeouts
For timeout-based failure detectors, the probability of premature timeouts is sometimes used as the accuracy measure: this is the probability that when the timer is set, it will prematurely time out on a process that is actually up. The problem with this measure, however, is that (a) it is implementation-specific, and (b) it is not useful to applications unless it is given together with other implementation-specific measures, e.g., how often timers are started, whether the timers are started at regular or variable intervals, whether the timeout periods are fixed or variable, etc. (many such variations exist in practice [Bra89, GM98, vRMH98]). Thus, the probability of premature timeouts is not a good metric for the specification of failure detectors; e.g., it cannot be used to compare the QoS of failure detectors that use timeouts in different ways. The six accuracy metrics that we identified in this chapter do not refer to implementation-specific features; in particular, they do not refer to timeouts at all.
Accuracy Metrics and Runs with Crashes

To measure the accuracy of a failure detector that monitors a process p, we considered runs in which p does not crash. In real systems, however, such runs rarely occur: p is likely to crash eventually. Are our accuracy metrics applicable to such systems? The answer is yes, as we now explain.

Note that the output of any failure detector implementation at a time t should not depend on what happens after time t, i.e., the implementation does not predict the future.⁷ Therefore, the steady-state behavior of a failure detector before a process p crashes is the same as its steady-state behavior in runs in which p does not crash. Thus, our accuracy metrics also measure the accuracy of a failure detector in runs in which p eventually crashes (provided that this crash occurs after the failure detector has reached its steady-state behavior).

⁷Our model can enforce this assumption by imposing some restriction on the sets of failure detector histories and their associated distributions (see Chapter 3).
Good Periods versus Stable Periods

Recall that a good period of a failure detector is defined in terms of runs in which p does not crash. It starts when the failure detector trusts p (makes a T-transition) and terminates when the failure detector erroneously suspects p (makes an S-transition).

In contrast, a stable period of a system starts when the failure detector trusts p and p is up, and terminates when either (a) the failure detector suspects p, or (b) p crashes. The length of stable periods is an important measure for many applications. This measure, however, cannot be part of the QoS specification of failure detectors: since a failure detector has no control over process crashes, it cannot by itself ensure "long" stable periods, even if it is very accurate.

To measure the length of a stable period in a system, one can use measures of the accuracy of the failure detector and of the likelihood of crashes. For example, let T_G be the random variable representing the length of a good period of the failure detector, and C be the random variable representing the lifetime of process p. Assume that C has an exponential distribution (so that at any given time at which p is still up, the remaining lifetime of p has the same distribution as C). Let S be the random variable representing the length of a stable period after the failure detector has reached steady state. Then the distribution of S is given by Pr(S ≤ x) = 1 − Pr(C > x) Pr(T_G > x). Intuitively, this is because a stable period terminates as soon as the failure detector makes a mistake or p crashes.
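As an illustration of this formula (our own sketch, with assumed parameter values): if the good periods happen to be exponentially distributed as well, then S is exponential with the sum of the two rates.

```python
import math

def stable_period_cdf(x, crash_rate, tg_survival):
    # Pr(S <= x) = 1 - Pr(C > x) * Pr(T_G > x),
    # where C is exponential with rate crash_rate and tg_survival(x) = Pr(T_G > x).
    return 1.0 - math.exp(-crash_rate * x) * tg_survival(x)

# Assume mean good period 100 and mean lifetime 1000 (both exponential):
# then S is exponential with rate 1/100 + 1/1000.
cdf = stable_period_cdf(10.0, 1 / 1000, lambda x: math.exp(-x / 100))
assert abs(cdf - (1 - math.exp(-10 * (1 / 100 + 1 / 1000)))) < 1e-12
```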
Chapter 3

Stochastic Modeling of Failure Detectors and Their Quality-of-Service Specifications
3.1 Introduction
In Chapter 2, we proposed a set of metrics for the QoS specification of failure detectors. The definitions of failure detectors and the QoS metrics were kept at an intuitive level to emphasize the main ideas of the QoS specification of failure detectors. In this chapter, we give a rigorous formalization of failure detectors and their QoS metrics based on stochastic modeling, in particular the theory of marked point processes (cf. [Sig95]). On a first reading, readers can skip this chapter and read Chapter 4 without any difficulty.

In the formalization, we first define random failure detector histories that model the probabilistic behaviors of failure detectors. We show that a random failure detector history is an extension of a (particular type of) random marked point process. We next define failure detectors as mappings from failure patterns to random failure detector histories. This is an extension of the failure detector model of [CT96]. We then define some particular random failure detector histories as the steady-state behaviors of a failure detector and use them to define the QoS metrics. Some of these random failure detector histories match the stationary versions of random marked point processes. Finally, we analyze the relation between the QoS metrics we defined. The analysis is based on results in the theory of marked point processes, such as Birkhoff's Ergodic Theorem for marked point processes and the empirical inversion formulas. The relations we present in this chapter are more general than the results in Theorem 2.1 of Chapter 2.

The rest of the chapter is organized as follows. In Section 3.2, we define the failure detector model, which includes the definitions of random failure detector histories, failure detectors, and the steady-state behaviors of failure detectors. In Section 3.3, we define the QoS metrics and analyze the relation between these metrics. In Appendix A, we summarize relevant definitions and results in the theory of marked point processes.
3.2 Failure Detector Model

As in Chapter 2, we consider a system of two processes p and q, and a failure detector at q that monitors p. We assume that q does not crash. Real time is continuous and ranges from 0 to ∞.
3.2.1 Failure Detector Definition

As described in Section 2.2.1, the output of the failure detector is denoted as either S or T, and it has two types of transitions: S-transitions and T-transitions. Roughly speaking, a failure detector history describes the output of the failure detector in an entire run, and it can be represented by the initial output and the times at which transitions occur.

More precisely, we define a failure detector history as follows. Let R, R_+, and Z_+ denote the sets of real numbers, nonnegative real numbers, and nonnegative integers, respectively. Let K := {S, T}. For x ∈ K, let x̄ denote the element of K other than x.
A failure detector history is given as ψ = ⟨k, {t_n : n ∈ I}⟩ such that

(1) k ∈ K;
(2) I = Z_+ or I = {0, 1, ..., m − 1} for some m ∈ Z_+ (if m = 0, then I = ∅);
(3) t_n ∈ R_+ for all n ∈ I;
(4) if I = Z_+, then 0 ≤ t_0 < t_1 < t_2 < ···, and lim_{n→∞} t_n = ∞;
(5) if |I| = m < ∞, then 0 ≤ t_0 < t_1 < t_2 < ··· < t_{m−1}.
In the representation of failure detector history ψ, k represents the output of the failure detector at time 0, and the increasing sequence {t_n} represents the times at which failure detector transitions occur. We call t_n the n-th transition time (so ψ starts with the zeroth transition). When t_0 > 0, ψ = ⟨k, {t_n}⟩ represents a run in which the failure detector outputs k in the period [0, t_0), makes a transition at time t_0, then outputs k̄ in the period [t_0, t_1), and then makes another transition at time t_1, and so on. When t_0 = 0, ψ = ⟨k, {t_n}⟩ represents a run in which the failure detector has a transition at time 0, then outputs k in the period [t_0, t_1), makes a transition at time t_1, and then outputs k̄ in the period [t_1, t_2), and so on. Allowing a transition at time 0 is to conform with the representation of marked point processes (as in [Sig95]), which is the basic tool we use to model failure detectors. Intuitively, it makes sense when there is output before time 0, and this can happen if the time line is shifted. The requirement lim_{n→∞} t_n = ∞ in (4) enforces that there are only a finite number of transitions in any bounded time interval.
For a failure detector history $\psi = \langle k, \{t_n : n \in I\}\rangle$, we define the n-th inter-transition time $T_n$ of $\psi$ as follows. If $|I| = \infty$, then $T_n \stackrel{\mathrm{def}}{=} t_{n+1} - t_n$ for all $n \ge 0$; if $|I| = m < \infty$, then (a) if $m = 0$, then $T_0 \stackrel{\mathrm{def}}{=} \infty$ and $T_n \stackrel{\mathrm{def}}{=} 0$ for $n \ge 1$; and (b) if $m \ge 1$, then $T_n \stackrel{\mathrm{def}}{=} t_{n+1} - t_n$ for $0 \le n \le m-2$, $T_{m-1} \stackrel{\mathrm{def}}{=} \infty$, and $T_n \stackrel{\mathrm{def}}{=} 0$ for $n \ge m$.
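The case analysis above is mechanical, and it may help to see it in code. The following Python sketch (not from the dissertation; it assumes a history with finitely many transitions, given as a strictly increasing list of times) computes the first few inter-transition times:

```python
import math

def inter_transition_times(ts, count):
    """Inter-transition times T_n of a finite history <k, {t_n}>.

    ts: strictly increasing transition times; count: how many T_n to return.
    With m >= 1 transitions, T_n = t_{n+1} - t_n for n <= m-2, T_{m-1} is
    infinite, and T_n = 0 for n >= m; with m = 0, T_0 is infinite and
    T_n = 0 for n >= 1.
    """
    m = len(ts)
    out = []
    for n in range(count):
        if m == 0:
            out.append(math.inf if n == 0 else 0.0)
        elif n <= m - 2:
            out.append(ts[n + 1] - ts[n])
        elif n == m - 1:
            out.append(math.inf)
        else:
            out.append(0.0)
    return out
```

Note that the inter-transition times depend only on the transition times, not on the initial output $k$.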
To model the probabilistic behavior of a failure detector, we need to define random failure detector histories with probability distributions over the set of failure detector histories. To do so, we first need to define which subsets of failure detector histories we can assign probability to. Formally, we need to define a σ-field which contains all measurable subsets of failure detector histories. This is done as follows.
Let $H$ be the set of all failure detector histories. Let $\overline{Z}^+ \stackrel{\mathrm{def}}{=} Z^+ \cup \{\infty\}$. For $m \in \overline{Z}^+$, let $H^{(m)}$ be the set of all failure detector histories with exactly $m$ transitions, i.e., $H^{(m)} \stackrel{\mathrm{def}}{=} \{\psi = \langle k, \{t_n : n \in I\}\rangle : |I| = m\}$. Thus $\{H^{(m)} : m \in \overline{Z}^+\}$ forms a partition of $H$. We next define the Borel σ-field $\mathcal{B}(H^{(m)})$ of $H^{(m)}$ for each $m \in \overline{Z}^+$.

When $m < \infty$, $H^{(m)}$ is a subset of $K \times R^m$, where $R^m$ is the m-dimensional Euclidean space with the Borel σ-field $\mathcal{B}(R^m)$. Let $\mathcal{B}(K \times R^m)$ be the product σ-field generated by $\{K' \times B : K' \subseteq K, B \in \mathcal{B}(R^m)\}$. It is easy to check that $H^{(m)} \in \mathcal{B}(K \times R^m)$. Then, we get the Borel σ-field $\mathcal{B}(H^{(m)}) \stackrel{\mathrm{def}}{=} \{E : E \in \mathcal{B}(K \times R^m), E \subseteq H^{(m)}\}$.
When $m = \infty$, $H^{(\infty)}$ is a subset of $K \times R^{Z^+}$, where $R^{Z^+}$ is the set of all countably infinite sequences of real numbers. It is known (see e.g. [Sig95]) that $R^{Z^+}$ is a complete separable metric space, and the Borel σ-field $\mathcal{B}(R^{Z^+})$ is well defined. Let $\mathcal{B}(K \times R^{Z^+})$ be the product σ-field generated by $\{K' \times B : K' \subseteq K, B \in \mathcal{B}(R^{Z^+})\}$. It is easy to check that $H^{(\infty)} \in \mathcal{B}(K \times R^{Z^+})$. Then, we get the Borel σ-field $\mathcal{B}(H^{(\infty)}) \stackrel{\mathrm{def}}{=} \{E : E \in \mathcal{B}(K \times R^{Z^+}), E \subseteq H^{(\infty)}\}$.

With the above definitions of $\mathcal{B}(H^{(m)})$ for all $m \in \overline{Z}^+$, we then define the Borel σ-field $\mathcal{B}(H)$ of $H$ to be $\{\bigcup_{m \in \overline{Z}^+} E_m : E_m \in \mathcal{B}(H^{(m)})\}$. Hence we have a measurable space $(H, \mathcal{B}(H))$ on the set of all failure detector histories.
Some simple examples of Borel sets in $H$ are: (1) $\{\psi \in H : k = S\}$, the set of failure detector histories in which the output at time 0 is S; (2) $\{\psi \in H : t_0 \le x\}$, the set of failure detector histories in which the zeroth transition occurs within $x$ time units; and (3) $\{\psi \in H : T_0 \le x\}$, the set of failure detector histories in which the zeroth inter-transition time is at most $x$ time units.

It is easy to verify that $\mathcal{B}(H)$ can also be generated from the following collection of sets:
\[
\{\psi = \langle k, \{t_n : n \in I\}\rangle \in H : |I| = m,\ k \in K',\ t_{n_0} \le x_0, \ldots, t_{n_l} \le x_l\}, \tag{3.1}
\]
where $m \in \overline{Z}^+$, $K' \subseteq K$, $l \in I$, $0 \le n_0 < \cdots < n_l < m$, and $x_i \in R^+$.
We define a random failure detector history to be a measurable mapping $\Psi : \Omega \to H$, with $(\Omega, \mathcal{F}, P)$ as the underlying probability space. With this definition, the random failure detector history $\Psi$ has the probability distribution $P(\Psi \in E) \stackrel{\mathrm{def}}{=} P(\{\omega \in \Omega : \Psi(\omega) \in E\})$ defined for all $E \in \mathcal{B}(H)$. For convenience, we use $\{\Psi \in E\}$ as a shorthand for $\{\omega \in \Omega : \Psi(\omega) \in E\}$. Let $\boldsymbol{\Psi}$ be the set of all random failure detector histories.

A failure pattern $F$ of process p is just a number $F \in [0, \infty]$, denoting the time $F$ at which p crashes; $F = \infty$ means that p does not crash, and we call this pattern the failure-free pattern. Let $\mathcal{F}$ denote the set of all failure patterns; thus $\mathcal{F} = [0, \infty]$.

A failure detector D is a mapping $D : \mathcal{F} \to \boldsymbol{\Psi}$. Intuitively, the random failure detector history $D(F)$ characterizes the probabilistic behavior of the failure detector output when process p crashes at time $F$. This is an extension of the failure detector definition in [CT96] to model the probabilistic behavior of the failure detector output.
3.2.2 Failure Detector Histories as Marked Point Processes

We now build the relation between failure detector histories and marked point processes.

Given a failure detector history $\psi = \langle k, \{t_n : n \in I\}\rangle$ where $I \ne \emptyset$, let $k_n$ be the output of the failure detector at time $t_n$ for all $n \in I$. Thus, the transition that occurs at time $t_n$ is a $k_n$-transition, and in the period $[t_n, t_{n+1})$ the failure detector output is $k_n$. The relation between $k$ and the $k_n$'s is: (1) if $t_0 = 0$, then $k_0 = k_2 = \cdots = k$ and $k_1 = k_3 = \cdots = \bar{k}$; and (2) if $t_0 > 0$, then $k_0 = k_2 = \cdots = \bar{k}$ and $k_1 = k_3 = \cdots = k$. For notational convenience, let $t_{-1} \stackrel{\mathrm{def}}{=} 0$ and $k_{-1} \stackrel{\mathrm{def}}{=} k$.

With the $k_n$'s, the failure detector history $\psi$ can be equivalently represented as $\psi = \{(t_n, k_n) : n \in I\}$. When $I = Z^+$, this representation coincides with the representation of a simple marked point process as given in [Sig95]. In fact, a failure detector history with an infinite number of transitions can be directly modeled as a simple marked point process, with transitions as events and $K$ as the mark space.
Therefore, definitions and results for marked point processes can be directly applied to failure detector histories with an infinite number of transitions. For consistency and convenience, we extend some of the definitions to include failure detector histories with only a finite number of transitions.

One important extension is the shift mapping on failure detector histories with a finite number of transitions, as given below. Suppose $\theta_s : H \to H$ is a shift mapping defined on all failure detector histories. Intuitively, for a failure detector history $\psi$, $\theta_s\psi$ is the failure detector history obtained from $\psi$ by shifting the origin to $s$, using the output at time $s$ as the initial output, re-labeling the transitions at and after $s$ as $t_0, t_1, \ldots$, and ignoring the portion of the failure detector history before $s$. More precisely, if $s = 0$, then $\theta_s$ is the identity mapping; if $\psi$ has an infinite number of transitions, then $\psi$ is also a simple marked point process, and thus $\theta_s\psi$ is defined as in Appendix A. Now suppose $s > 0$ and $\psi = \langle k, \{t_n : n \in I\}\rangle$ has only a finite number of transitions, i.e., $|I| = m < \infty$. If $t_{i-1} < s \le t_i$ for some $i \in I$, then $\theta_s\psi \stackrel{\mathrm{def}}{=} \langle k', \{t_{i+n} - s : 0 \le n \le m-1-i\}\rangle$, where $k'$ is the output at time $s$: $k' = k_i$ if $s = t_i$, and $k' = \bar{k}_i$ if $s < t_i$. If $s > t_{m-1}$, then $\theta_s\psi \stackrel{\mathrm{def}}{=} \langle k_{m-1}, \emptyset\rangle$.
We now define the shift mapping by event time $\theta^{(j)}$ for $j \ge 0$. Intuitively, for a failure detector history $\psi$, $\theta^{(j)}\psi$ is the failure detector history obtained from $\psi$ by shifting the origin to the time of the j-th transition in $\psi$; if $\psi$ does not have enough transitions, then the origin is shifted to the last transition of $\psi$. More precisely, if $\psi$ has at least $j+1$ transitions, then $\theta^{(j)}\psi \stackrel{\mathrm{def}}{=} \theta_{t_j}\psi$; if $\psi$ has fewer than $j+1$ transitions, then $\theta^{(j)}\psi \stackrel{\mathrm{def}}{=} \theta_{t_{m-1}}\psi$, where $m$ is the number of transitions in $\psi$. We then let
\[
\psi_s \stackrel{\mathrm{def}}{=} \theta_s\psi \quad \text{and} \quad \psi^{(j)} \stackrel{\mathrm{def}}{=} \theta^{(j)}\psi. \tag{3.2}
\]
Note that $\psi^{(j)}$ always has a transition at the origin, except when $\psi$ itself has no transitions at all.
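For finite histories, both shift mappings are easy to implement. The following Python sketch (illustrative only, not from the dissertation; outputs S and T are represented as the strings 'S' and 'T', and the history is assumed to have finitely many transitions) computes $\theta_s$ directly from the definition, and $\theta^{(j)}$ by shifting to the j-th (or last) transition time:

```python
import bisect

FLIP = {'S': 'T', 'T': 'S'}

def shift(k, ts, s):
    """theta_s for a finite history <k, {t_n}>: the output at time s becomes
    the new initial output, and transitions at or after s are re-timed to
    t - s; earlier transitions are dropped."""
    if s == 0:
        return k, list(ts)                 # theta_0 is the identity mapping
    # Output at time s: k flipped once per transition in (0, s]; a
    # transition at time 0 does not flip the initial output k.
    flips = bisect.bisect_right(ts, s)
    if ts and ts[0] == 0:
        flips -= 1
    k_s = FLIP[k] if flips % 2 else k
    return k_s, [t - s for t in ts if t >= s]

def shift_by_event(k, ts, j):
    """theta^(j): shift to the j-th transition time, or to the last
    transition if the history has fewer than j + 1 transitions.  For a
    history with no transitions we return it unchanged (a choice made for
    this sketch; the text leaves that case implicit)."""
    return shift(k, ts, ts[min(j, len(ts) - 1)]) if ts else (k, [])
```

For example, for the history ⟨S, {1, 3}⟩ (output S on [0,1), T on [1,3), S on [3,∞)), shifting to s = 2 yields ⟨T, {1}⟩, and shifting past the last transition yields ⟨S, ∅⟩, matching the case $s > t_{m-1}$ in the definition.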
For a random failure detector history $\Psi : \Omega \to H$, let $\Psi_s : \Omega \to H$ be the random failure detector history obtained from $\Psi$ by shifting the origin to time $s$, that is, $\Psi_s(\omega) = \Psi(\omega)_s$ for all $\omega \in \Omega$. Similarly, let $\Psi^{(j)} : \Omega \to H$ be the random failure detector history obtained from $\Psi$ by shifting the origin to the time of the j-th transition, that is, $\Psi^{(j)}(\omega) = \Psi(\omega)^{(j)}$ for all $\omega \in \Omega$. Intuitively, $\Psi_s$ represents what you see if you always start observing $\Psi$ at time $s$, and $\Psi^{(j)}$ represents what you see if you always start observing $\Psi$ at the j-th transition.

The shift mappings are an important tool in the study of the steady state behaviors of failure detectors, as we discuss in the next section.
3.2.3 The Steady State Behaviors of Failure Detectors

We consider failure detectors whose behaviors eventually reach a steady state. Roughly speaking, when a failure detector starts running, and for a while after, its behavior depends on the initial condition (such as whether initially q suspects p or not) and on how long it has been running. Typically, as time passes, the effect of the initial condition gradually diminishes and the behavior no longer depends on how long the failure detector has been running, i.e., eventually the failure detector behavior reaches equilibrium, or steady state.

Suppose that while p is still up, the behavior of the failure detector reaches a steady state. We consider two kinds of behaviors in this case. First, if p remains up, what is the behavior of the failure detector? Second, if p crashes, what is the behavior of the failure detector in response to the crash of p?
We now formally define several random failure detector histories that capture such steady state behaviors. Let D be the failure detector under consideration.

The Steady State Behavior If p Does Not Crash

Let $F = \infty$ be the failure-free pattern of p. Then $\Psi \stackrel{\mathrm{def}}{=} D(\infty)$ defines the behavior of the failure detector output under this failure-free pattern. Suppose the underlying probability space defining $\Psi$ is $(\Omega, \mathcal{F}, P)$. The steady state behavior of $\Psi$ is given by its event stationary version $\Psi^0$ and its time stationary version $\Psi^*$, if they exist. Formally, they are defined by the following distributions (assuming they exist), just as in the definitions in Section 2.3 of [Sig95]:¹
\[
\Pr(\Psi^0 \in E) \stackrel{\mathrm{def}}{=} \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} P(\Psi^{(j)} \in E), \quad \text{for all } E \in \mathcal{B}(H), \tag{3.3}
\]
\[
\Pr(\Psi^* \in E) \stackrel{\mathrm{def}}{=} \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi_s \in E)\, ds, \quad \text{for all } E \in \mathcal{B}(H). \tag{3.4}
\]
The event stationary version $\Psi^0$ is obtained by averaging the distribution of $\Psi^{(j)}$ over all transition times, and the time stationary version $\Psi^*$ is obtained by averaging the distribution of $\Psi_s$ over all times. Such average distributions are referred to as empirical distributions in [Sig95]. The intuitive meanings of (3.3) and (3.4) are: if we randomly pick a transition and start observing $\Psi$ after this transition, the random failure detector history we observe is given by the event stationary version $\Psi^0$; if we randomly pick a real time $s$ and then observe the behavior of $\Psi$ after $s$, the random failure detector history we observe is given by the time stationary version $\Psi^*$. Using the expressions in [Sig95], $\Psi^0$ is the version of $\Psi$ when we randomly observe $\Psi$ way out at a transition, and $\Psi^*$ is the version of $\Psi$ when we randomly observe $\Psi$ way out in time.

¹Note that in (3.3) and (3.4) we use the notation $\Pr(\cdot)$ to avoid the complication of specifying the underlying probability spaces for $\Psi^0$ and $\Psi^*$. These stationary versions can be defined on probability spaces different from that of $\Psi$, but there is no need to specify them here since we are only interested in the probability distributions of $\Psi^0$ and $\Psi^*$. We will use the notation $\Pr(\cdot)$ whenever it is convenient for us.
Intuitively, $\Psi^0$ is event stationary, i.e., its distribution does not change if $\Psi^0$ is shifted by transition times (see Appendix A for the definition), because after already randomly observing $\Psi$ way out at a transition to obtain $\Psi^0$, observing $\Psi$ several transitions later makes no difference to the distribution of $\Psi^0$. Similarly, $\Psi^*$ is time stationary, i.e., its distribution does not change if $\Psi^*$ is shifted by time, because after already randomly observing $\Psi$ way out in time to obtain $\Psi^*$, observing $\Psi$ some time units later makes no difference to the distribution of $\Psi^*$.

Lemma 3.1 $\Psi^0$ is event stationary and $\Psi^*$ is time stationary.

Proof. The proof is the same as the proof in [Sig95], p. 26, except that the definitions of the shift mappings are extended to include the case where the number of transitions is finite.
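To make the two empirical distributions in (3.3) and (3.4) concrete, the following Monte Carlo sketch (illustrative only, not from the dissertation) simulates one long failure-free run of a toy detector whose trust and suspicion periods are exponential with the given means, an assumption made purely for illustration. For the event "the output at the origin is T", it estimates the event stationary probability (observe at a random transition) and the time stationary probability (observe at a random time):

```python
import bisect
import random

def empirical_versions(mean_good=9.0, mean_bad=1.0, n_periods=100_000, seed=5):
    """Contrast the event stationary (3.3) and time stationary (3.4) views
    of a toy failure-free run for the event 'output at the origin is T'.
    Trust and suspicion periods are exponential (assumed distributions)."""
    rng = random.Random(seed)
    ts, out_after = [], []        # transition times; output right after each
    now, out = 0.0, 'T'           # the run starts in a trust period
    for _ in range(n_periods):
        now += rng.expovariate(1.0 / (mean_good if out == 'T' else mean_bad))
        out = 'S' if out == 'T' else 'T'
        ts.append(now)
        out_after.append(out)
    # (3.3): observe right after a randomly chosen transition
    event_pr = sum(o == 'T' for o in out_after) / n_periods
    # (3.4): observe at a randomly chosen time in [0, now)
    n_samples, hits = 100_000, 0
    for _ in range(n_samples):
        s = rng.uniform(0.0, now)
        i = bisect.bisect_right(ts, s)   # transitions that occurred in [0, s]
        hits += (i % 2 == 0)             # output at s is T iff i is even
    return event_pr, hits / n_samples
```

The two versions genuinely differ: right after a random transition the output is T about half the time (transitions alternate), while at a random time it is T about mean_good/(mean_good + mean_bad) of the time, since long periods are more likely to be hit by a random time point.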
We say that the behavior of the failure detector D reaches steady state in failure-free runs if the distributions defined in (3.3) and (3.4) exist. The accuracy metrics of the failure detector are defined with respect to the steady state behavior of the failure detector in failure-free runs, i.e., with respect to the stationary versions $\Psi^0$ and $\Psi^*$.
To further understand the stationary versions $\Psi^0$ and $\Psi^*$, we break down the events in $\mathcal{B}(H)$ into different categories and study them separately. From Section 3.2.1, we know that $\{H^{(m)} : m \in \overline{Z}^+\}$ is a partition of $H$, and $\mathcal{B}(H) = \{\bigcup_{m\in\overline{Z}^+} E_m : E_m \in \mathcal{B}(H^{(m)})\}$. Therefore, for any event $E = \bigcup_{m\in\overline{Z}^+} E_m$, if we know $\Pr(\Psi^0 \in E_m)$ for all $m \in \overline{Z}^+$, then by the additivity of the probability measure we know that $\Pr(\Psi^0 \in E) = \sum_{m\in\overline{Z}^+} \Pr(\Psi^0 \in E_m)$. The case for $\Pr(\Psi^* \in E)$ is similar. Thus, we now focus on $\Pr(\Psi^0 \in E)$ and $\Pr(\Psi^* \in E)$ with $E \in \mathcal{B}(H^{(m)})$, for each $m \in \overline{Z}^+$.
Let $E_S$ and $E_T$ be the sets of all failure detector histories in which eventually the output is always S or always T, respectively. Thus $E_S \cup E_T$ contains all failure detector histories with a finite number of transitions. Let $p^\Psi_S \stackrel{\mathrm{def}}{=} P(\Psi \in E_S)$ and $p^\Psi_T \stackrel{\mathrm{def}}{=} P(\Psi \in E_T)$. Thus $p^\Psi_S$ and $p^\Psi_T$ are the probabilities that eventually the output of the random failure detector history $\Psi$ is always S or always T, respectively. Let $p^\Psi_\infty \stackrel{\mathrm{def}}{=} P(\Psi \in H^{(\infty)})$, the probability that $\Psi$ has an infinite number of transitions.
Proposition 3.2 $p^\Psi_S + p^\Psi_T + p^\Psi_\infty = 1$.

Proof. This is direct from the fact that $E_S$, $E_T$, and $H^{(\infty)}$ are disjoint and $E_S \cup E_T \cup H^{(\infty)} = H$.
The probabilities $p^\Psi_S$ and $p^\Psi_T$ are used to characterize the steady state behavior of failure detector histories with only a finite number of transitions. Intuitively, if a run of the failure detector has only a finite number of transitions, then in steady state the failure detector should keep its final output value. In other words, when you randomly observe a failure detector history with a finite number of transitions way out in time or way out at a transition, with probability one what you observe is the portion in which the failure detector keeps its final output value. The probability that the output you observe is S or T is given by $p^\Psi_S$ or $p^\Psi_T$, respectively. This is formalized in the following lemma.
Lemma 3.3

(1) $\Pr(\Psi^* \in \{\langle S, \emptyset\rangle\}) = p^\Psi_S$, and $\Pr(\Psi^* \in \{\langle T, \emptyset\rangle\}) = p^\Psi_T$;

(2) for all $m \in Z^+ \setminus \{0\}$ and all $E \in \mathcal{B}(H^{(m)})$, $\Pr(\Psi^* \in E) = 0$;

(3) $\Pr(\Psi^0 \in \{\langle S, \emptyset\rangle, \langle S, \{0\}\rangle\}) = p^\Psi_S$, and $\Pr(\Psi^0 \in \{\langle T, \emptyset\rangle, \langle T, \{0\}\rangle\}) = p^\Psi_T$;

(4) for all $m \in Z^+ \setminus \{0\}$ and all $E \in \mathcal{B}(H^{(m)})$, if $E$ contains neither $\langle S, \{0\}\rangle$ nor $\langle T, \{0\}\rangle$, then $\Pr(\Psi^0 \in E) = 0$.
Proof. (1) Let $E = \{\langle S, \emptyset\rangle\}$. We have that if $s \le t$, then $\{\Psi_s \in E\} \subseteq \{\Psi_t \in E\}$, i.e., if a failure detector history keeps the output S from time $s$ on, then it of course keeps the output S from a later time $t$ on. Thus $P(\Psi_s \in E) \le P(\Psi_t \in E)$. It is clear that $\{\Psi_t \in E\} \uparrow \{\Psi \in E_S\}$, i.e., as $t \to \infty$, $\{\Psi_t \in E\}$ monotonically increases and tends to $\{\Psi \in E_S\}$ from below. Then for integer-valued $n$, $\{\Psi_n \in E\} \uparrow \{\Psi \in E_S\}$. Since the probability measure is continuous from below (see e.g. [Bil95], p. 25), we have $P(\Psi_n \in E) \uparrow P(\Psi \in E_S)$, and then it is also true that $P(\Psi_t \in E) \uparrow P(\Psi \in E_S)$.

From (3.4), we have
\[
\Pr(\Psi^* \in E) = \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi_s \in E)\, ds \le \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi_t \in E)\, ds = \lim_{t\to\infty} P(\Psi_t \in E) = P(\Psi \in E_S) = p^\Psi_S.
\]
On the other hand, from $P(\Psi_t \in E) \uparrow P(\Psi \in E_S)$, we have that for all $\epsilon > 0$, there exists $K$ such that for all $s \ge K$, $P(\Psi_s \in E) \ge p^\Psi_S - \epsilon$. Then
\[
\Pr(\Psi^* \in E) = \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi_s \in E)\, ds \ge \lim_{t\to\infty} \frac{1}{t} \int_K^t \left(p^\Psi_S - \epsilon\right) ds = p^\Psi_S - \epsilon.
\]
Letting $\epsilon \to 0$, we have $\Pr(\Psi^* \in E) \ge p^\Psi_S$. Therefore, $\Pr(\Psi^* \in E) = p^\Psi_S$. Similarly, we can prove that $\Pr(\Psi^* \in \{\langle T, \emptyset\rangle\}) = p^\Psi_T$.
(2) For all $E \in \mathcal{B}(H^{(m)})$ with $m \in Z^+ \setminus \{0\}$, since $E \cap (H^{(\infty)} \cup \{\langle S, \emptyset\rangle, \langle T, \emptyset\rangle\}) = \emptyset$, we have $\Pr(\Psi^* \in E) \le 1 - \Pr(\Psi^* \in H^{(\infty)}) - p^\Psi_S - p^\Psi_T$. Thus, to prove $\Pr(\Psi^* \in E) = 0$, it is enough to show that $\Pr(\Psi^* \in H^{(\infty)}) = p^\Psi_\infty$, since we know that $p^\Psi_S + p^\Psi_T + p^\Psi_\infty = 1$.

To prove $\Pr(\Psi^* \in H^{(\infty)}) = p^\Psi_\infty$, note that $\{\Psi_s \in H^{(\infty)}\} = \{\Psi \in H^{(\infty)}\}$, i.e., a failure detector history has an infinite number of transitions from time $s$ on if and only if it itself has an infinite number of transitions. Then we have
\[
\Pr(\Psi^* \in H^{(\infty)}) = \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi_s \in H^{(\infty)})\, ds = \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi \in H^{(\infty)})\, ds = P(\Psi \in H^{(\infty)}) = p^\Psi_\infty.
\]
(3) and (4) have proofs similar to those of (1) and (2).
We now look at failure detector histories with an infinite number of transitions. We know that with probability $p^\Psi_\infty$, $\Psi$ has an infinite number of transitions. If $p^\Psi_\infty = 0$, then $p^\Psi_S + p^\Psi_T = 1$, and in the stationary versions of $\Psi$, only the trivial histories that never change the failure detector output have a nonzero probability.
We now consider the case when $p^\Psi_\infty > 0$. In this case, we restrict $\Psi$ onto $H^{(\infty)}$. More precisely, we first define the restricted probability space $(\Omega_\Psi, \mathcal{F}_\Psi, P_\Psi)$ such that (1) $\Omega_\Psi = \Psi^{-1}(H^{(\infty)})$, (2) $\mathcal{F}_\Psi = \{B : B \in \mathcal{F}, B \subseteq \Omega_\Psi\}$, and (3) $P_\Psi(B) = P(B)/p^\Psi_\infty$ for all $B \in \mathcal{F}_\Psi$. We then define the restricted random failure detector history $\Psi^\infty$ as the measurable mapping from $\Omega_\Psi$ to $H^{(\infty)}$ such that $\Psi^\infty(\omega) = \Psi(\omega)$ for all $\omega \in \Omega_\Psi$. Since a failure detector history in $H^{(\infty)}$ is also a simple marked point process, $\Psi^\infty$ is also a random marked point process as defined in [Sig95].

$\Psi^\infty$, as a random marked point process, has its own event stationary version $\Psi^{\infty 0}$ and time stationary version $\Psi^{\infty *}$ (see the definitions in Appendix A). The following lemma gives the relation between the distributions of $\Psi^0$, $\Psi^*$ and $\Psi^{\infty 0}$, $\Psi^{\infty *}$.

Lemma 3.4 If $p^\Psi_\infty > 0$, then for all $E \in \mathcal{B}(H^{(\infty)})$, $\Pr(\Psi^0 \in E) = p^\Psi_\infty \Pr(\Psi^{\infty 0} \in E)$, and $\Pr(\Psi^* \in E) = p^\Psi_\infty \Pr(\Psi^{\infty *} \in E)$.

Proof. Direct from (3.3), (3.4), (A.5), (A.6) and the definition of the probability measure $P_\Psi$.
The Steady State Behavior after p Crashes

We now define a random failure detector history that represents the steady state behavior of the failure detector after p crashes. Formally, a post-crash version $\Psi_c$ of failure detector D is a random failure detector history defined by the following distribution (assuming it exists):
\[
\Pr(\Psi_c \in E) \stackrel{\mathrm{def}}{=} \lim_{t\to\infty} \frac{1}{t} \int_0^t \Pr(D(s)_s \in E)\, ds, \quad \text{for all } E \in \mathcal{B}(H). \tag{3.5}
\]
Intuitively, (3.5) means that if we randomly pick a time $s$ at which p crashes and then observe the behavior of the failure detector after time $s$, the random failure detector history we observe is given by the post-crash version $\Psi_c$. So, similarly to $\Psi^0$ and $\Psi^*$, we say that $\Psi_c$ is the version of the failure detector D when we randomly observe D way out at a time when p crashes. $\Psi_c$ is obtained by averaging the distribution of $D(s)_s$, the post-crash behavior of the failure detector D, over all crash times; thus it is also an empirical distribution. We say that the failure detector D has steady state behavior after p crashes if the distribution defined in (3.5) exists. One primary metric, the detection time, is defined with respect to the steady state behavior after p crashes, i.e., with respect to the post-crash version $\Psi_c$.
Non-Futuristic Property

Before p crashes, no failure detector implementation can tell whether p will crash later or not, i.e., the failure detector cannot predict the future. Therefore, the behavior of the failure detector up to any time t at which p is still up should be the same as the behavior of the failure detector in the same period in failure-free runs. We now formalize this idea.
For all $m \in \overline{Z}^+$ and for all $t \in R^+$, let $H^{(m)}_t$ be the subset of $H^{(m)}$ such that the time of the last transition of any failure detector history in $H^{(m)}_t$ is at most $t$, i.e., $H^{(m)}_t \stackrel{\mathrm{def}}{=} \{\psi \in H^{(m)} : t_{m-1} \le t\}$. So $H^{(m)}_t$ gives the set of failure detector history prefixes up to time $t$ that contain exactly $m$ transitions. Clearly $H^{(m)}_t \in \mathcal{B}(H^{(m)})$, and so we can define the Borel σ-field $\mathcal{B}(H^{(m)}_t) = \{E : E \in \mathcal{B}(H^{(m)}), E \subseteq H^{(m)}_t\}$. Let $H_t = \bigcup_{m \in Z^+} H^{(m)}_t$. $H_t$ contains all failure detector history prefixes up to time $t$. The Borel σ-field of $H_t$ is defined as $\mathcal{B}(H_t) = \{\bigcup_{m \in Z^+} E_m : E_m \in \mathcal{B}(H^{(m)}_t)\}$. We define a prefix mapping $f_t : H \to H_t$ such that for any failure detector history $\psi \in H$, $f_t(\psi)$ is the failure detector history prefix that contains only the transitions of $\psi$ up to time $t$. It is easy to verify that $f_t$ is a measurable mapping.

We now formally define the probabilistic behavior of a failure detector history up to some time $t$. Given a random failure detector history $\Psi : \Omega \to H$, the random failure detector history prefix up to time $t$ is the measurable mapping $\Psi^t : \Omega \to H_t$ such that $\Psi^t = f_t \circ \Psi$.
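For a concrete finite history, the prefix mapping is a simple truncation. A minimal Python sketch (illustrative, not from the dissertation; the initial output is kept and transitions after time t are dropped):

```python
def prefix(k, ts, t):
    """Prefix mapping f_t: keep the initial output of a finite history
    <k, {t_n}> and only the transitions that occur at or before time t."""
    return k, [x for x in ts if x <= t]
```

Note that the initial output is unchanged: truncating a history never alters what the detector output before time t, which is exactly why prefixes are the right objects for stating the non-futuristic property below.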
We say that a failure detector D is non-futuristic (or does not predict the future) if for all $t \in R^+$ and all $t_1, t_2 \in (t, \infty]$, $D(t_1)^t$ and $D(t_2)^t$ have the same distribution, i.e., for all $E \in \mathcal{B}(H_t)$, $\Pr(D(t_1)^t \in E) = \Pr(D(t_2)^t \in E)$. Intuitively, this means that as long as p has not crashed by time $t$, the probabilistic behavior of the failure detector up to time $t$ is the same no matter whether or when p may crash later. In other words, the failure detector does not provide hints on whether or when process p will crash in the future.
3.3 Failure Detector Specification Metrics

With the formal model of the failure detector given in the previous section, we are now ready to formally define the QoS metrics of the failure detector introduced in Chapter 2.

Let D be a failure detector. Let $\overline{R}^+ \stackrel{\mathrm{def}}{=} R^+ \cup \{\infty\}$.
3.3.1 Definitions of Metrics

Detection time ($T_D$): $T_D$ is defined from the post-crash version $\Psi_c$ of D. Suppose $\Psi_c : \Omega_c \to H$, with $(\Omega_c, \mathcal{F}_c, P_c)$ as the underlying probability space. We first define a measurable mapping $f_D : H \to \overline{R}^+$ such that for any failure detector history $\psi \in H$, $f_D(\psi)$ is: (a) 0, if $\psi$ has no transition and the output is always S; or (b) the time of the last transition, if $\psi$ has a finite number of transitions and the output after the last transition is always S; or (c) $\infty$ otherwise. Then $T_D : \Omega_c \to \overline{R}^+$ is the random variable such that $T_D = f_D \circ \Psi_c$. That is, given any particular post-crash history $\psi$, $T_D = f_D(\psi)$ is the time elapsed from the time of the crash to the time when the failure detector starts suspecting p permanently, and the distribution of $T_D$ is determined by the distribution of $\Psi_c$, which is defined in (3.5).
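The mapping $f_D$ is easy to evaluate on a concrete history. A Python sketch (illustrative only, not from the dissertation; it handles histories with finitely many transitions, with outputs encoded as 'S' and 'T'):

```python
import math

def detection_time(k, ts):
    """f_D evaluated on a finite post-crash history <k, {t_n}>: the time
    until the detector suspects p permanently (final output S), or infinity
    if the final output is T."""
    m = len(ts)
    # Final output: k flips once per transition, except that a transition
    # at time 0 does not flip the initial output (the t_0 = 0 convention).
    flips = m - 1 if (ts and ts[0] == 0) else m
    final = k if flips % 2 == 0 else ('T' if k == 'S' else 'S')
    if final != 'S':
        return math.inf              # case (c): output is not eventually S
    return 0.0 if m == 0 else ts[-1] # cases (a) and (b)
```

For instance, the post-crash history ⟨T, {2}⟩ (trusting until time 2, then suspecting forever) yields a detection time of 2.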
All accuracy metrics are defined with respect to the steady state behavior in failure-free runs, i.e., with respect to the stationary versions $\Psi^0$ and $\Psi^*$ of the random failure detector history $\Psi \stackrel{\mathrm{def}}{=} D(\infty)$. For the convenience of studying the relations between the accuracy metrics in the next section, we assume that $\Psi$, $\Psi^0$ and $\Psi^*$ use the same underlying probability space $(\Omega, \mathcal{F}, P)$ (one can always construct some common space supporting all of them).

The following three accuracy metrics are defined in terms of the event stationary version $\Psi^0$. Recall from Section 3.2.1 that $T_0$ and $T_1$ are defined as the zeroth and the first inter-transition times of a given failure detector history $\psi$. Thus $T_0$ and $T_1$ are actually measurable mappings from $H$ to $\overline{R}^+$.
Mistake recurrence time ($T_{MR}$): We define a measurable mapping $f_{MR} : H \to \overline{R}^+$ such that $f_{MR} = T_0 + T_1$. Then $T_{MR} : \Omega \to \overline{R}^+$ is the random variable such that $T_{MR} = f_{MR} \circ \Psi^0$. Intuitively, $T_{MR}$ is the length of the first two consecutive periods, one trust period and one suspicion period, of $\Psi^0$. We call any two consecutive periods a recurrence interval. Since $\Psi^0$ is event stationary, the distribution of the length of any recurrence interval is the same, and thus we take just the first recurrence interval of $\Psi^0$. $T_{MR}$ represents the length of the recurrence interval when we randomly observe the failure detector output way out at a transition in some failure-free run, and its distribution is determined by the distribution of $\Psi^0$, which is defined in (3.3). Note that when defining $T_{MR}$ we do not restrict the recurrence interval to start and end with S-transitions. This is because in steady state whether you observe at an S-transition or at a T-transition does not change the distribution of the length of the recurrence interval, and so for convenience we choose not to make this restriction.
Mistake duration ($T_M$): We define a measurable mapping $f_M : H \to \overline{R}^+$ such that for any failure detector history $\psi = \langle k, \{t_n\}\rangle \in H$, $f_M(\psi) = T_0(\psi)$ if $k = S$, and $f_M(\psi) = T_1(\psi)$ if $k = T$. Then $T_M : \Omega \to \overline{R}^+$ is the random variable such that $T_M = f_M \circ \Psi^0$. Recall that after being shifted by a transition time, any failure detector history has a transition at the origin (i.e., $t_0 = 0$), except histories with no transitions at all. Thus the definition of $f_M$ guarantees that it always takes the length of the first suspicion (mistake) period of the event stationary version $\Psi^0$. Therefore, $T_M$ represents the length of the mistake period when we randomly observe the failure detector output way out at an S-transition in some failure-free run, and its distribution is determined by the distribution of $\Psi^0$.
Good period duration ($T_G$): The definition of $T_G$ is symmetric to that of $T_M$. We define a measurable mapping $f_G : H \to \overline{R}^+$ such that for any failure detector history $\psi = \langle k, \{t_n\}\rangle \in H$, $f_G(\psi) = T_0(\psi)$ if $k = T$, and $f_G(\psi) = T_1(\psi)$ if $k = S$. Then $T_G : \Omega \to \overline{R}^+$ is the random variable such that $T_G = f_G \circ \Psi^0$. Intuitively, $T_G$ represents the length of the trust (good) period when we randomly observe the failure detector output way out at a T-transition in some failure-free run, and its distribution is determined by the distribution of $\Psi^0$.
The following three accuracy metrics are defined in terms of the time stationary version $\Psi^*$.

Query accuracy probability ($P_A$): Let $B_T$ be the set of failure detector histories with output T at time 0. Then $P_A \stackrel{\mathrm{def}}{=} P(\Psi^* \in B_T)$. Intuitively, when we randomly observe the failure detector output way out at a time $t$ in some failure-free run, the probability that the output at time $t$ is T is just the probability that the output of the time stationary version $\Psi^*$ at time 0 is T. Therefore, $P_A$ is the probability that, when queried at a random time in some failure-free run, the output of the failure detector is T (and thus is correct).
Average mistake rate ($\lambda_M$): We define a measurable mapping $N_S : H \to \overline{R}^+$ such that for any failure detector history $\psi$, $N_S(\psi)$ is the number of S-transitions in the period $(0, 1]$. Thus $N_S \circ \Psi^*$ is a random variable representing the number of S-transitions of $\Psi^*$ in the unit interval $(0, 1]$. Since $\Psi^*$ is time stationary, $N_S \circ \Psi^*$ is the number of S-transitions in any unit interval when we randomly observe the failure detector output way out in time in some failure-free run. Then $\lambda_M$ is defined as $E(N_S \circ \Psi^*)$, the expected value of $N_S \circ \Psi^*$.
Forward good period duration ($T_{FG}$): Roughly speaking, we define $T_{FG}$ as the time from the origin to the first transition of $\Psi^*$, conditioned on the output of $\Psi^*$ at the origin being T. Since $\Psi^*$ is obtained when we randomly observe the failure detector output way out in time in failure-free runs, $T_{FG}$ represents the time elapsed from a random time at which q trusts p to the time of the next S-transition.

We now formally define $T_{FG}$. If $P_A = 0$, then let $T_{FG} \equiv 0$, i.e., if the probability that q trusts p at a random time is 0, then $T_{FG}$ is always 0. If $P_A > 0$, then we define a random failure detector history $\Psi^*_T$ obtained by restricting $\Psi^*$ onto $B_T$. More precisely, we first define a restricted probability space $(\Omega_T, \mathcal{F}_T, P_T)$ such that (1) $\Omega_T = \{\Psi^* \in B_T\}$, (2) $\mathcal{F}_T = \{B : B \in \mathcal{F}, B \subseteq \Omega_T\}$, and (3) $P_T(B) = P(B)/P_A$ for all $B \in \mathcal{F}_T$. Then $\Psi^*_T : \Omega_T \to B_T$ is the random failure detector history such that $\Psi^*_T(\omega) = \Psi^*(\omega)$ for all $\omega \in \Omega_T$. Intuitively, $\Psi^*_T$ is the version of $\Psi^*$ conditioned on the output at the origin being T. Let $f_{FG} : B_T \to \overline{R}^+$ be the measurable mapping such that for all $\psi = \langle k, \{t_n\}\rangle \in B_T$, if $\psi$ has at least one transition then $f_{FG}(\psi) = t_0$, else $f_{FG}(\psi) = \infty$; i.e., $f_{FG}(\psi)$ is the time from the origin to the zeroth transition of $\psi$. Then $T_{FG} : \Omega_T \to \overline{R}^+$ is the random variable such that $T_{FG} = f_{FG} \circ \Psi^*_T$.
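For a single sampled history, the functionals $f_{MR}$, $f_M$ and $f_G$ underlying the first three accuracy metrics are straightforward to evaluate. A Python sketch (illustrative only, not from the dissertation; it handles finite histories with outputs 'S'/'T', typically event-shifted samples with a transition at the origin):

```python
import math

def metric_functionals(k, ts):
    """Evaluate f_MR = T_0 + T_1, f_M, and f_G on one finite history
    <k, {t_n}>, using the inter-transition times of Section 3.2.1."""
    def T(n):
        # n-th inter-transition time of a finite history
        m = len(ts)
        if n <= m - 2:
            return ts[n + 1] - ts[n]
        return math.inf if n == m - 1 or (m == 0 and n == 0) else 0.0
    t0, t1 = T(0), T(1)
    f_mr = t0 + t1                   # one trust period plus one suspicion period
    f_m = t0 if k == 'S' else t1     # length of the first suspicion period
    f_g = t0 if k == 'T' else t1     # length of the first trust period
    return f_mr, f_m, f_g
```

Applying this to many event-shifted samples of a failure-free run and averaging approximates the empirical distribution (3.3), and hence the distributions of $T_{MR}$, $T_M$ and $T_G$.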
We now give an example that is helpful for understanding the above definitions. It shows how these definitions are linked with the steady state distributions defined in Section 3.2.3, and how they match the intuition.

Example 1. Given a failure detector D, suppose we want to know the probability that its mistake recurrence time is at least $x$, i.e., $\Pr(T_{MR} \ge x)$, for some $x \in R^+$. Let $E \stackrel{\mathrm{def}}{=} \{\psi \in H : T_0(\psi) + T_1(\psi) \ge x\}$, i.e., the set of failure detector histories in which the length of the very first recurrence interval is at least $x$. Let $\Psi \stackrel{\mathrm{def}}{=} D(\infty)$ be the random failure detector history in failure-free runs, and let $\Psi^0$ be the event stationary version of $\Psi$. By the definition of $T_{MR}$, we have $\Pr(T_{MR} \ge x) = \Pr(f_{MR} \circ \Psi^0 \ge x) = \Pr(\Psi^0 \in E)$. From (3.3), we have
\[
\Pr(T_{MR} \ge x) = \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} \Pr(\Psi^{(j)} \in E). \tag{3.6}
\]
Note that $\Pr(\Psi^{(j)} \in E)$ is the probability that the length of the j-th recurrence interval is at least $x$. Thus $\Pr(T_{MR} \ge x)$ is obtained by averaging these probabilities over the first $n$ recurrence intervals, and then taking the limit as $n$ goes to infinity.

Equality (3.6) corresponds to what we would do if we wanted to obtain an estimate of $\Pr(T_{MR} \ge x)$ by experiments. We would run the failure detector a number of times such that each run contains a large number of recurrence intervals. We would then compute the ratio of the number of recurrence intervals that are at least $x$ time units long to the total number of recurrence intervals, and use this ratio as the estimate of $\Pr(T_{MR} \ge x)$. This ratio can be equivalently obtained by computing such ratios for the zeroth, first, second, ... recurrence intervals, and then averaging these ratios. This matches the intuitive idea behind equality (3.6).
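The experimental procedure just described can be sketched in a few lines of Python (illustrative only, not from the dissertation). Lacking a real detector, the sketch generates recurrence intervals from a toy model with exponential trust and suspicion periods, an assumption made purely for illustration; any source of observed intervals could be substituted:

```python
import random

def estimate_pr_tmr_at_least(x, n_intervals=20_000, seed=7,
                             mean_good=10.0, mean_bad=0.5):
    """Estimate Pr(T_MR >= x) as in the experiment behind (3.6): generate
    many recurrence intervals (toy model: one exponential trust period plus
    one exponential suspicion period each) and return the fraction whose
    length is at least x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_intervals):
        interval = (rng.expovariate(1.0 / mean_good)
                    + rng.expovariate(1.0 / mean_bad))
        if interval >= x:
            hits += 1
    return hits / n_intervals
```

With a fixed seed, the estimate is deterministic and, as expected, non-increasing in the threshold x.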
3.3.2 Relations between Accuracy Metrics

We now analyze the relations between the accuracy metrics defined in the previous section. The analysis is based on results in the theory of marked point processes, such as Birkhoff's Ergodic Theorem for marked point processes and the empirical inversion formulas.

Let $\Psi \stackrel{\mathrm{def}}{=} D(\infty)$ be the random failure detector history of some failure detector D under the failure-free pattern, and suppose that $\Psi$ has the event stationary version $\Psi^0$ and the time stationary version $\Psi^*$. Suppose that the underlying probability space for $\Psi$, $\Psi^0$ and $\Psi^*$ is $(\Omega, \mathcal{F}, P)$.
Lemma 3.5 $T_{MR} = T_M + T_G$.

Proof. This is immediate from the fact that $f_{MR} = f_M + f_G$, where $f_{MR}$, $f_M$ and $f_G$ are the measurable mappings used to define $T_{MR}$, $T_M$ and $T_G$, respectively.

Henceforth, we only consider the nondegenerate case in which $0 < E(T_{MR}) < \infty$. Intuitively, this means that the average time for a failure detector to make the next mistake is finite and nonzero.
Proposition 3.6 If $E(T_{MR}) < \infty$, then $p^\Psi_S = p^\Psi_T = 0$ and $p^\Psi_\infty = 1$.

Proof. Let $E \stackrel{\mathrm{def}}{=} \{\langle S, \emptyset\rangle, \langle S, \{0\}\rangle\}$. By Lemma 3.3 (3), $\Pr(\Psi^0 \in E) = p^\Psi_S$. For all $\omega \in \Omega$ such that $\Psi^0(\omega) \in E$, $T_0(\Psi^0(\omega)) = \infty$, and thus $\{\omega \in \Omega : \Psi^0(\omega) \in E\} \subseteq \{\omega \in \Omega : T_{MR}(\omega) = \infty\}$. Therefore $\Pr(T_{MR} = \infty) \ge \Pr(\Psi^0 \in E) = p^\Psi_S$. If $p^\Psi_S > 0$, then $E(T_{MR}) = \infty$, which contradicts the assumption that $E(T_{MR}) < \infty$. So $p^\Psi_S = 0$. Similarly we have $p^\Psi_T = 0$. By Proposition 3.2, we have $p^\Psi_\infty = 1$.
From this proposition and Lemma 3.3, we know that the probability that the stationary version $\Psi^0$ or $\Psi^*$ has a finite number of transitions is zero. Formally,

Corollary 3.7 Let $E \stackrel{\mathrm{def}}{=} H \setminus H^{(\infty)}$. If $E(T_{MR}) < \infty$, then $\Pr(\Psi^0 \in E) = \Pr(\Psi^* \in E) = 0$.

Henceforth, we treat $\Psi^0$ and $\Psi^*$ as mappings from $\Omega$ to $H^{(\infty)}$, since $\{\Psi^0 \in H \setminus H^{(\infty)}\}$ and $\{\Psi^* \in H \setminus H^{(\infty)}\}$ have measure zero. In this case, $\Psi^0$ and $\Psi^*$ are just simple random marked point processes, and so results from the theory of random marked point processes can be applied to $\Psi^0$ and $\Psi^*$ directly.
Let $\mathcal{I}$ be the invariant σ-field of $H^{(\infty)}$ (see Appendix A for the definition). Let $E_\mathcal{I}(X)$ denote the conditional expected value of $X$ given the σ-field $\mathcal{I}$ (see [Bil95], p. 445, for a definition of the conditional expected value given a σ-field).
Proposition 3.8 $E_\mathcal{I}(T_{MR}) = 2E_\mathcal{I}(T_0 \circ \Psi^0)$ a.s.

Proof. By definition, $E_\mathcal{I}(T_{MR}) = E_\mathcal{I}(f_{MR} \circ \Psi^0) = E_\mathcal{I}((T_0 + T_1) \circ \Psi^0)$. Since $E_\mathcal{I}((T_0 + T_1) \circ \Psi^0) = E_\mathcal{I}(T_0 \circ \Psi^0) + E_\mathcal{I}(T_1 \circ \Psi^0)$ a.s., it is enough to show that $E_\mathcal{I}(T_1 \circ \Psi^0) = E_\mathcal{I}(T_0 \circ \Psi^0)$ a.s. By (A.9) of Theorem A.5, we have
\[
E_\mathcal{I}(T_1 \circ \Psi^0) = \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} T_1 \circ \Psi^{(j)} = \lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^{n} T_0 \circ \Psi^{(j)}
= \lim_{n\to\infty} \left( \frac{n+1}{n} \cdot \frac{1}{n+1} \sum_{j=0}^{n} T_0 \circ \Psi^{(j)} - \frac{1}{n}\, T_0 \circ \Psi^{(0)} \right)
= \lim_{n\to\infty} \frac{1}{n+1} \sum_{j=0}^{n} T_0 \circ \Psi^{(j)} = E_\mathcal{I}(T_0 \circ \Psi^0) \quad \text{a.s.}
\]
We say that $\Psi$ is ergodic if $\Psi^0$ (or equivalently $\Psi^*$) is ergodic (see Appendix A for the definition). Informally, in this case we also say that the distribution of failure detector histories in failure-free runs is ergodic, or simply that the failure detector is ergodic.
Lemma 3.9
$$\lambda_M = E\!\left[\frac{1}{E_{\mathcal{I}}(T_{MR})}\right]. \tag{3.7}$$
If $\Psi$ is ergodic, then
$$\lambda_M = \frac{1}{E(T_{MR})}. \tag{3.8}$$
Proof. Let $\lambda \stackrel{\mathrm{def}}{=} E(N_1 \circ \Psi^*)$ be the arrival rate of $\Psi$. From (A.16) and (A.15), we have $\lambda = E(E_{\mathcal{I}}(N_1 \circ \Psi^*))$ and $E_{\mathcal{I}}(N_1 \circ \Psi^*) = \lim_{t\to\infty} \frac{N_t \circ \Psi^*}{t}$ a.s. Similarly, we have $\lambda_M = E(E_{\mathcal{I}}(N^S_1 \circ \Psi^*))$ and $E_{\mathcal{I}}(N^S_1 \circ \Psi^*) = \lim_{t\to\infty} \frac{N^S_t \circ \Psi^*}{t}$ a.s., where $N^S_t : H \to \mathbb{R}^+$ is a measurable mapping representing the number of S-transitions in the period $(0, t]$. Since in any period $(0, t]$, the numbers of S-transitions and T-transitions differ by at most one, we have for any $\psi \in H$, $2N^S_t(\psi) - 1 \le N_t(\psi) \le 2N^S_t(\psi) + 1$. Thus $\lim_{t\to\infty} \frac{N_t \circ \Psi^*}{t} = 2\lim_{t\to\infty} \frac{N^S_t \circ \Psi^*}{t}$, and so $\lambda = 2\lambda_M$. By (A.16) and (A.15), we know that $\lambda_M = \frac{1}{2} E(\{E_{\mathcal{I}}(T_0 \circ \Psi^0)\}^{-1})$. By Proposition 3.8, we have $\lambda_M = E(\{E_{\mathcal{I}}(T_{MR})\}^{-1})$. If $\Psi$ is ergodic, then by Proposition A.4, we know that $\lambda_M = \{E(T_{MR})\}^{-1}$.
Recall that $B_T$ is the set of failure detector histories with output T at time 0. Let $B_S$ be the set of failure detector histories with output S at time 0. Let $A_T \stackrel{\mathrm{def}}{=} \{\omega : \Psi^0(\omega) \in B_T\}$ and $A_S \stackrel{\mathrm{def}}{=} \{\omega : \Psi^0(\omega) \in B_S\}$. Let $X_T : \Omega \to \mathbb{R}^+$ be the random variable such that $X_T(\omega) = T_0(\Psi^0(\omega))$ for all $\omega \in A_T$, and $X_T(\omega) = 0$ for all $\omega \in A_S$. Let $X_S : \Omega \to \mathbb{R}^+$ be the random variable such that $X_S(\omega) = T_0(\Psi^0(\omega))$ for all $\omega \in A_S$, and $X_S(\omega) = 0$ for all $\omega \in A_T$.
Proposition 3.10 $E_{\mathcal{I}}(T_G) = 2E_{\mathcal{I}}(X_T)$ a.s., and $E_{\mathcal{I}}(T_M) = 2E_{\mathcal{I}}(X_S)$ a.s.
Proof. Define the random variable $X'_T : \Omega \to \mathbb{R}^+$ such that $X'_T(\omega) = 0$ for all $\omega \in A_T$, and $X'_T(\omega) = T_1(\Psi^0(\omega))$ for all $\omega \in A_S$. Then by definition $T_G = X_T + X'_T$. Thus to prove $E_{\mathcal{I}}(T_G) = 2E_{\mathcal{I}}(X_T)$ a.s., it is enough to show that $E_{\mathcal{I}}(X'_T) = E_{\mathcal{I}}(X_T)$ a.s.

Let $f : H^{(\infty)} \to \mathbb{R}^+$ be the measurable mapping such that for all $\psi = \langle k, \{t_n\}\rangle$, $f(\psi) = T_0(\psi)$ if $k = T$, and $f(\psi) = 0$ if $k = S$. Similarly, let $f' : H^{(\infty)} \to \mathbb{R}^+$ be the measurable mapping such that for all $\psi = \langle k, \{t_n\}\rangle$, $f'(\psi) = 0$ if $k = T$, and $f'(\psi) = T_1(\psi)$ if $k = S$. Thus $X_T = f \circ \Psi^0$ and $X'_T = f' \circ \Psi^0$. It is important to note that $f' \circ \Psi^{(j)} = f \circ \Psi^{(j+1)}$ for all $j \ge 0$. Using equality (A.9), we have
\begin{align*}
E_{\mathcal{I}}(X'_T) &= E_{\mathcal{I}}(f' \circ \Psi^0) = \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} f' \circ \Psi^{(j)} = \lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^{n} f \circ \Psi^{(j)} \\
&= \lim_{n\to\infty} \left( \frac{n+1}{n} \cdot \frac{1}{n+1} \sum_{j=0}^{n} f \circ \Psi^{(j)} - \frac{1}{n}\, f \circ \Psi^{(0)} \right) = E_{\mathcal{I}}(f \circ \Psi^0) = E_{\mathcal{I}}(X_T) \quad \text{a.s.}
\end{align*}
We thus have $E_{\mathcal{I}}(T_G) = 2E_{\mathcal{I}}(X_T)$ a.s. Similarly we can prove that $E_{\mathcal{I}}(T_M) = 2E_{\mathcal{I}}(X_S)$ a.s.
Lemma 3.11 If $0 < E_{\mathcal{I}}(T_{MR}) < \infty$ a.s., then
$$P_A = E\!\left[\frac{E_{\mathcal{I}}(T_G)}{E_{\mathcal{I}}(T_{MR})}\right]. \tag{3.9}$$
If $\Psi$ is ergodic and $0 < E(T_{MR}) < \infty$, then
$$P_A = \frac{E(T_G)}{E(T_{MR})}. \tag{3.10}$$
Proof. By definition, $P_A \stackrel{\mathrm{def}}{=} \Pr(\Psi^* \in B_T)$. Since $0 < E_{\mathcal{I}}(T_{MR}) < \infty$ a.s., by Proposition 3.8, we know that $0 < E_{\mathcal{I}}(T_0 \circ \Psi^0) < \infty$ a.s. Then by the empirical inversion formula (A.19) we have
$$\Pr(\Psi^* \in B_T) = E\!\left[\frac{E_{\mathcal{I}}\!\left[\int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds\right]}{E_{\mathcal{I}}(T_0 \circ \Psi^0)}\right]. \tag{3.11}$$
We claim that $X_T = \int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds$ a.s. In fact, from Proposition A.1, we know that with probability one $\Psi^0$ has a transition at time 0, i.e. $\Pr(t_0 \circ \Psi^0 = 0) = 1$. Then with probability one, during the entire period $(0, T_0(\Psi^0(\omega)))$, the output of $\Psi^0$ is the same as the output of $\Psi^0$ at the origin. In other words, with probability one, if $\Psi^0(\omega) \in B_T$, then $I_{B_T}(\Psi^0_s(\omega)) = 1$ for all $s \in (0, T_0(\Psi^0(\omega)))$; if $\Psi^0(\omega) \in B_S$, then $I_{B_T}(\Psi^0_s(\omega)) = 0$ for all $s \in (0, T_0(\Psi^0(\omega)))$. Therefore, with probability one, $\int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds = T_0 \circ \Psi^0$ if $\Psi^0(\omega) \in B_T$, and $\int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds = 0$ if $\Psi^0(\omega) \in B_S$. Thus $X_T = \int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds$ a.s.

By Proposition 3.10, we have $E_{\mathcal{I}}(T_G) = 2E_{\mathcal{I}}(X_T) = 2E_{\mathcal{I}}\!\left[\int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds\right]$ a.s. By Proposition 3.8, we have $E_{\mathcal{I}}(T_{MR}) = 2E_{\mathcal{I}}(T_0 \circ \Psi^0)$ a.s. Therefore, from (3.11) we have
$$P_A = E\!\left[\frac{E_{\mathcal{I}}(T_G)}{E_{\mathcal{I}}(T_{MR})}\right].$$
If $\Psi$ is ergodic, then by Proposition A.4, we have
$$P_A = \frac{E(T_G)}{E(T_{MR})}.$$
By definition, if $P_A = 0$, then $T_{FG} = 0$. We now study the case $P_A > 0$.

Lemma 3.12 If $P_A > 0$ and $0 < E_{\mathcal{I}}(T_{MR}) < \infty$ a.s., then for all $x \in \mathbb{R}^+$,
$$\Pr(T_{FG} \le x) = \frac{1}{P_A}\, E\!\left[\frac{E_{\mathcal{I}}(\min(T_G, x))}{E_{\mathcal{I}}(T_{MR})}\right]. \tag{3.12}$$
Proof. Let $f : H^{(\infty)} \to \mathbb{R}^+$ be the measurable mapping such that for all $\psi \in H^{(\infty)}$, $f(\psi) = f_{FG}(\psi)$ if $\psi \in B_T$, and $f(\psi) = 0$ if $\psi \in B_S$. Let $Y \stackrel{\mathrm{def}}{=} f \circ \Psi^*$. Thus under the condition $\{\Psi^* \in B_T\}$, $Y = T_{FG}$, and under the condition $\{\Psi^* \in B_S\}$, $Y = 0$. For all $x \in \mathbb{R}^+$, we have
\begin{align*}
\Pr(Y \le x) &= \Pr(Y \le x \mid \{\Psi^* \in B_T\}) \Pr(\Psi^* \in B_T) + \Pr(Y \le x \mid \{\Psi^* \in B_S\}) \Pr(\Psi^* \in B_S) \\
&= \Pr(T_{FG} \le x)\, P_A + (1 - P_A).
\end{align*}
Thus if $P_A > 0$, we have for all $x \in \mathbb{R}^+$,
$$\Pr(T_{FG} \le x) = \frac{\Pr(Y \le x) - (1 - P_A)}{P_A}. \tag{3.13}$$
Let $E = \{\psi = \langle k, \{t_n\}\rangle \in H^{(\infty)} : k = S, \text{ or } k = T \text{ and } t_0 \le x\}$. Then $\{Y \le x\} = \{f \circ \Psi^* \le x\} = \{\Psi^* \in E\}$. Since $0 < E_{\mathcal{I}}(T_{MR}) < \infty$ a.s., by Proposition 3.8, we know that $0 < E_{\mathcal{I}}(T_0 \circ \Psi^0) < \infty$ a.s. Then by the empirical inversion formula (A.19) we have
$$\Pr(Y \le x) = \Pr(\Psi^* \in E) = E\!\left[\frac{E_{\mathcal{I}}\!\left[\int_0^{T_0 \circ \Psi^0} I_E(\Psi^0_s)\, ds\right]}{E_{\mathcal{I}}(T_0 \circ \Psi^0)}\right]. \tag{3.14}$$
Let $X' : \Omega \to \mathbb{R}^+$ be the random variable such that for all $\omega \in A_T$, $X'(\omega) = \min(x, T_0(\Psi^0(\omega)))$, and for all $\omega \in A_S$, $X'(\omega) = 0$. We claim that $X_S + X' = \int_0^{T_0 \circ \Psi^0} I_E(\Psi^0_s)\, ds$ a.s. In fact, from Proposition A.1, we know that with probability one $\Psi^0$ has a transition at time 0, i.e. $\Pr(t_0 \circ \Psi^0 = 0) = 1$. Thus, with probability one, if $\omega \in A_S$ then $I_E(\Psi^0_s(\omega)) = 1$ for all $s \in (0, T_0(\Psi^0(\omega)))$. So we have that with probability one, if $\omega \in A_S$ then $\int_0^{T_0(\Psi^0(\omega))} I_E(\Psi^0_s(\omega))\, ds = T_0(\Psi^0(\omega))$.

If $\omega \in A_T$, then $I_E(\Psi^0_s(\omega)) = 1$ iff $\Psi^0_s(\omega) \in E$, which means that starting from time $s$ at which the output is $T$, the time to the next transition in $\Psi^0(\omega)$ is at most $x$. So, with probability one, if $\omega \in A_T$, then $I_E(\Psi^0_s(\omega)) = 1$ iff $T_0(\Psi^0(\omega)) - s \le x$. There are two possible cases: (a) $T_0(\Psi^0(\omega)) \le x$, in which case $I_E(\Psi^0_s(\omega)) = 1$ for all $s \in (0, T_0(\Psi^0(\omega)))$, and so $\int_0^{T_0(\Psi^0(\omega))} I_E(\Psi^0_s(\omega))\, ds = T_0(\Psi^0(\omega))$; or (b) $T_0(\Psi^0(\omega)) > x$, in which case for all $s \in (0, T_0(\Psi^0(\omega)) - x)$, $I_E(\Psi^0_s(\omega)) = 0$, and for all $s \in [T_0(\Psi^0(\omega)) - x, T_0(\Psi^0(\omega)))$, $I_E(\Psi^0_s(\omega)) = 1$, and so $\int_0^{T_0(\Psi^0(\omega))} I_E(\Psi^0_s(\omega))\, ds = \int_{T_0(\Psi^0(\omega)) - x}^{T_0(\Psi^0(\omega))} 1\, ds = x$. Combining cases (a) and (b), we have with probability one, if $\omega \in A_T$, then $\int_0^{T_0(\Psi^0(\omega))} I_E(\Psi^0_s(\omega))\, ds = \min(x, T_0(\Psi^0(\omega)))$.

From the above separate cases for $\omega \in A_S$ and $\omega \in A_T$, we thus have $X_S + X' = \int_0^{T_0 \circ \Psi^0} I_E(\Psi^0_s)\, ds$ a.s.
We now show that $E_{\mathcal{I}}(\min(T_G, x)) = 2E_{\mathcal{I}}(X')$ a.s. The proof is similar to the proofs of Propositions 3.8 and 3.10. Define the random variable $X'' : \Omega \to \mathbb{R}^+$ such that $X''(\omega) = 0$ for all $\omega \in A_T$, and $X''(\omega) = \min(x, T_1(\Psi^0(\omega)))$ for all $\omega \in A_S$. Then by definition $\min(T_G, x) = X' + X''$. Thus it is enough to show that $E_{\mathcal{I}}(X'') = E_{\mathcal{I}}(X')$ a.s. Let $f'$ and $f''$ be the corresponding measurable mappings such that $X' = f' \circ \Psi^0$ and $X'' = f'' \circ \Psi^0$. Note that $f'' \circ \Psi^{(j)} = f' \circ \Psi^{(j+1)}$ for all $j \ge 0$. Using equality (A.9), we have
\begin{align*}
E_{\mathcal{I}}(X'') &= E_{\mathcal{I}}(f'' \circ \Psi^0) = \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} f'' \circ \Psi^{(j)} = \lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^{n} f' \circ \Psi^{(j)} \\
&= \lim_{n\to\infty} \left( \frac{n+1}{n} \cdot \frac{1}{n+1} \sum_{j=0}^{n} f' \circ \Psi^{(j)} - \frac{1}{n}\, f' \circ \Psi^{(0)} \right) = E_{\mathcal{I}}(f' \circ \Psi^0) = E_{\mathcal{I}}(X') \quad \text{a.s.}
\end{align*}
We thus have $E_{\mathcal{I}}(\min(T_G, x)) = 2E_{\mathcal{I}}(X')$ a.s.
Therefore, from (3.14) and Propositions 3.8 and 3.10, we have
\begin{align*}
\Pr(Y \le x) &= E\!\left[\frac{E_{\mathcal{I}}(X_S) + E_{\mathcal{I}}(X')}{E_{\mathcal{I}}(T_0 \circ \Psi^0)}\right] = E\!\left[\frac{E_{\mathcal{I}}(T_M) + E_{\mathcal{I}}(\min(T_G, x))}{E_{\mathcal{I}}(T_{MR})}\right] \\
&= 1 - P_A + E\!\left[\frac{E_{\mathcal{I}}(\min(T_G, x))}{E_{\mathcal{I}}(T_{MR})}\right].
\end{align*}
The last equality is due to Lemmata 3.5 and 3.11. Plugging the above result into (3.13), we then have (3.12).
Corollary 3.13 If $\Psi$ is ergodic and if $0 < E(T_{MR}) < \infty$ and $E(T_G) \ne 0$, then
$$\text{for all } x \in \mathbb{R}^+, \quad \Pr(T_{FG} \le x) = \frac{1}{E(T_G)} \int_0^x \Pr(T_G > y)\, dy, \tag{3.15}$$
$$E(T_{FG}^k) = \frac{E(T_G^{k+1})}{(k+1)\, E(T_G)}. \tag{3.16}$$
In particular,
$$E(T_{FG}) = \frac{E(T_G^2)}{2E(T_G)} = \frac{E(T_G)}{2}\left(1 + \frac{V(T_G)}{E(T_G)^2}\right). \tag{3.17}$$
Proof. By Lemma 3.11, if $\Psi$ is ergodic and $0 < E(T_{MR}) < \infty$, then $P_A = E(T_G)/E(T_{MR})$. If $E(T_G) \ne 0$, then $P_A > 0$. Then by Proposition A.4 and Lemma 3.12, we have
$$\Pr(T_{FG} \le x) = \frac{E(\min(T_G, x))}{E(T_G)}.$$
We now use the fact (see e.g. [Bil95] p. 275) that for any nonnegative random variable $X$,
$$E(X) = \int_0^\infty \Pr(X > t)\, dt. \tag{3.18}$$
We have $E(\min(T_G, x)) = \int_0^\infty \Pr(\min(T_G, x) > y)\, dy = \int_0^x \Pr(\min(T_G, x) > y)\, dy = \int_0^x \Pr(T_G > y)\, dy$. Thus
$$\Pr(T_{FG} \le x) = \frac{1}{E(T_G)} \int_0^x \Pr(T_G > y)\, dy.$$
To prove (3.16), we substitute $X$ in (3.18) with $X^k$ to obtain
$$E(X^k) = \int_0^\infty \Pr(X^k > t)\, dt = \int_0^\infty \Pr(X > t^{1/k})\, dt = \int_0^\infty \Pr(X > t)\, d(t^k) = \int_0^\infty k t^{k-1} \Pr(X > t)\, dt.$$
Then together with (3.15) we have
\begin{align*}
E(T_{FG}^k) &= \int_0^\infty k t^{k-1} \Pr(T_{FG} > t)\, dt = \int_0^\infty k t^{k-1} \left(1 - \frac{1}{E(T_G)} \int_0^t \Pr(T_G > y)\, dy\right) dt \\
&= \frac{1}{E(T_G)} \int_0^\infty k t^{k-1} \left(E(T_G) - \int_0^t \Pr(T_G > y)\, dy\right) dt \\
&= \frac{1}{E(T_G)} \int_0^\infty k t^{k-1} \left(\int_t^\infty \Pr(T_G > y)\, dy\right) dt \\
&= \frac{1}{E(T_G)} \int_0^\infty \Pr(T_G > y) \left(\int_0^y k t^{k-1}\, dt\right) dy \\
&= \frac{1}{E(T_G)} \int_0^\infty \Pr(T_G > y)\, y^k\, dy = \frac{E(T_G^{k+1})}{(k+1)\, E(T_G)}.
\end{align*}
(3.17) is obtained from (3.16) by setting $k = 1$ and using the fact that $E(X^2) = E(X)^2 + V(X)$.
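Equation (3.17) is the classical inspection-paradox formula: a random query instant is more likely to fall in a long good period than a short one, so the expected forward good time can exceed half of $E(T_G)$. The following Monte Carlo sketch checks (3.17) empirically; the uniform distribution of the good periods is our own illustrative assumption, not part of the analysis.

```python
import random
from bisect import bisect_left
from itertools import accumulate

random.seed(1)

# Good-period lengths T_G drawn i.i.d.; uniform(0, 2) is an assumed example.
periods = [random.uniform(0.0, 2.0) for _ in range(100_000)]
n = len(periods)
m1 = sum(periods) / n                      # estimate of E(T_G)
m2 = sum(t * t for t in periods) / n       # estimate of E(T_G^2)
predicted = m2 / (2 * m1)                  # E(T_FG) = E(T_G^2) / (2 E(T_G))

# Lay the periods end to end and pick uniformly random time instants; the
# remaining time of the surrounding period emulates T_FG for an ergodic
# detector (a length-biased pick, hence the inspection paradox).
ends = list(accumulate(periods))
total = ends[-1]

def forward_good_time():
    u = random.uniform(0.0, total)
    i = bisect_left(ends, u)               # period containing the instant u
    return ends[i] - u                     # time left until that period ends

empirical = sum(forward_good_time() for _ in range(50_000)) / 50_000
print(predicted, empirical)                # the two values nearly agree
```

For uniform(0, 2) good periods, $E(T_G) = 1$ and $E(T_G^2) = 4/3$, so both values are close to $2/3$ rather than the naive $1/2$.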
We now summarize all the results in the following theorem.
Theorem 3.14 For any failure detector D, the following results hold:

(1) $T_{MR} = T_M + T_G$.

(2) Suppose $0 < E_{\mathcal{I}}(T_{MR}) < \infty$ a.s. Then
$$\lambda_M = E\!\left[\frac{1}{E_{\mathcal{I}}(T_{MR})}\right], \tag{3.19}$$
$$P_A = E\!\left[\frac{E_{\mathcal{I}}(T_G)}{E_{\mathcal{I}}(T_{MR})}\right]. \tag{3.20}$$
If $P_A = 0$ then $T_{FG} = 0$; if $P_A > 0$ then
$$\Pr(T_{FG} \le x) = \frac{1}{P_A}\, E\!\left[\frac{E_{\mathcal{I}}(\min(T_G, x))}{E_{\mathcal{I}}(T_{MR})}\right]. \tag{3.21}$$

(3) Suppose $\Psi \stackrel{\mathrm{def}}{=} D(\infty)$ is ergodic and $0 < E(T_{MR}) < \infty$. Then
$$\lambda_M = \frac{1}{E(T_{MR})}, \tag{3.22}$$
$$P_A = \frac{E(T_G)}{E(T_{MR})}. \tag{3.23}$$
If $E(T_G) = 0$ then $T_{FG} = 0$; if $E(T_G) \ne 0$, then
$$\text{for all } x \in \mathbb{R}^+, \quad \Pr(T_{FG} \le x) = \frac{1}{E(T_G)} \int_0^x \Pr(T_G > y)\, dy, \tag{3.24}$$
$$E(T_{FG}^k) = \frac{E(T_G^{k+1})}{(k+1)\, E(T_G)}. \tag{3.25}$$
In particular,
$$E(T_{FG}) = \frac{E(T_G^2)}{2E(T_G)} = \frac{E(T_G)}{2}\left(1 + \frac{V(T_G)}{E(T_G)^2}\right). \tag{3.26}$$
Theorem 2.1 in Chapter 2 is just parts (1) and (3) of the above theorem.
Chapter 4
The Design and Analysis of a New Failure Detector Algorithm
4.1 Introduction
In Chapter 2, we proposed a set of specification metrics to measure the QoS provided by failure detectors. These metrics address the failure detector's speed (how fast it detects process crashes) and its accuracy (how well it avoids erroneous detections).

In this chapter, we design a new failure detector algorithm for distributed systems with probabilistic behaviors. We analyze the QoS of the new algorithm and derive closed formulas for its QoS metrics. We show that, among a large class of failure detector algorithms, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. Finally, we simulate both the new algorithm and a simple algorithm commonly used in practice, compare the simulation results, and show that the new algorithm provides better QoS than the simple algorithm.
We consider a simple system of two processes p and q connected through a communication link. Process p may fail by crashing, and the link between p and q may delay or drop messages. Message delays and message losses follow some probabilistic distributions. Process q has a failure detector that monitors p. As in Chapter 2, the failure detector at q outputs either "I suspect that p has crashed" or "I trust that p is up" ("suspect p" and "trust p" in short, respectively).
4.1.1 A Common Failure Detection Algorithm and its Drawbacks

We first consider the following simple failure detector algorithm commonly used in practice: at regular time intervals, process p sends heartbeat messages to q; when q receives a more recent heartbeat message, it trusts p and starts a timer with a fixed timeout value TO; if the timer expires before q receives a more recent heartbeat message from p, q starts suspecting p.
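As a concrete illustration, here is a minimal sketch of this simple detector in Python (the class and method names are ours, not part of the thesis); time is passed in explicitly so the logic stays deterministic:

```python
class SimpleDetector:
    """Simple heartbeat detector: trust on a fresher heartbeat, suspect
    once a fixed timeout TO elapses with no fresher heartbeat."""

    def __init__(self, timeout):
        self.timeout = timeout      # the fixed timeout value TO
        self.last_seq = 0           # largest heartbeat sequence number seen
        self.last_recv = None       # receipt time of that heartbeat

    def receive(self, seq, now):
        if seq > self.last_seq:     # a more recent heartbeat: trust p,
            self.last_seq = seq     # and (re)start the timer
            self.last_recv = now

    def output(self, now):
        if self.last_recv is None:
            return "S"              # nothing received yet: suspect
        # suspect iff the timer started at last_recv has expired
        return "S" if now >= self.last_recv + self.timeout else "T"
```

Note that the timer for the next heartbeat is implicitly started by the receipt of the previous one, which is exactly the dependency criticized below.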
This algorithm has two undesirable characteristics; one regards its accuracy and the other its detection time, as we now explain. Consider the i-th heartbeat message $m_i$. Intuitively, the probability of a premature timeout on $m_i$ should depend solely on $m_i$, and in particular on $m_i$'s delay. With the simple algorithm, however, the probability of a premature timeout on $m_i$ also depends on the heartbeat $m_{i-1}$ that precedes $m_i$! In fact, the timer for $m_i$ is started upon the receipt of $m_{i-1}$, and so if $m_{i-1}$ is "fast", the timer for $m_i$ starts early and this increases the probability of a premature timeout on $m_i$. This dependency on past heartbeats is undesirable.
To see the second problem, suppose p sends a heartbeat just before it crashes, and let d be the delay of this last heartbeat. In the simple algorithm, q would permanently suspect p only d + TO time units after p crashes. Thus, the worst-case detection time for this algorithm is the maximum message delay plus TO. This is impractical because in many systems the maximum message delay is orders of magnitude larger than the average message delay (i.e., they have large variations of message delays).

The source of the above problems is that even though the heartbeats are sent at regular intervals, the timers to "catch" them expire at irregular times, namely the receipt times of the heartbeats plus a fixed TO. The algorithm that we propose eliminates this problem. As a result, the probability of a premature timeout on heartbeat $m_i$ does not depend on the behavior of the heartbeats that precede $m_i$, and the detection time does not depend on the maximum message delay.
4.1.2 The New Algorithm and its QoS Analysis

In this chapter, we design a new failure detector algorithm that has good worst-case detection time and good accuracy.

In the new failure detector algorithm, process p sends heartbeat messages to q periodically, as in the simple algorithm. Suppose $m_1, m_2, m_3, \ldots$ are the heartbeat messages and $\eta$ is the intersending time between two consecutive messages. The new algorithm differs from the simple algorithm in the procedure that q uses to decide whether to suspect p or not. In the new algorithm, q has a sequence of time points $\tau_1, \tau_2, \tau_3, \ldots$, called freshness points. Each freshness point $\tau_i$ is set to $\sigma_i + \delta$, where $\sigma_i$ is the time when $m_i$ is sent and $\delta$ is a fixed parameter of the algorithm. That is, the freshness points are obtained by shifting the sending times of the heartbeat messages forward in time by a fixed $\delta$ time units. These freshness points are used to determine the failure detector output. Roughly speaking, at any time $t \in [\tau_i, \tau_{i+1})$, only messages $m_i, m_{i+1}, m_{i+2}, \ldots$ can affect the failure detector output, and we say that only these messages are still fresh (at time t), while messages $m_1, \ldots, m_{i-1}$ are stale (at time t). At any time t, process q trusts p if and only if some message that q received is still fresh at time t. A detailed description of the algorithm is given in Section 4.3.1.
We analyze the algorithm in terms of the QoS metrics proposed in Chapter 2. The analysis uses the theory of stochastic processes, and provides closed formulas for the QoS metrics of the new algorithm. We then show the following optimality result: among all failure detector algorithms that send messages at the same rate and satisfy the same upper bound on the worst-case detection time, our algorithm is optimal with respect to the query accuracy probability. This shows that the new algorithm guarantees good worst-case detection time while providing good accuracy. We then show that, given a set of QoS requirements by an application, we can use the closed formulas we derived to compute the parameters of the new algorithm to meet these requirements. We first do so assuming that one knows the probabilistic behavior of the system (i.e., the probability distributions of message delays and message losses). We then drop this assumption, and show how to configure the failure detector to meet the QoS requirements of an application even when the probabilistic behavior of the system is not known.
The first version of our algorithm (described above) assumes that p and q have synchronized clocks. This assumption is not unrealistic, even in large networks. For example, GPS clocks are becoming cheap, and they can provide clocks that are very closely synchronized (see e.g. [VR00]). When synchronized clocks are not available, we propose a modification to this algorithm that performs equally well in practice, as shown by our simulations. The basic idea is to use past heartbeat messages to obtain accurate estimates of the expected arrival times of future heartbeats, and then use these estimates to find the freshness points. This computation uses the same heartbeat messages used for failure detection, so it does not involve additional system cost.

The modified algorithm has another parameter, namely n, which is the number of messages used to estimate the expected arrival times of the heartbeat messages. As n varies from 1 to $\infty$, we obtain a spectrum of algorithms. An important observation is that one end point of this spectrum ($n = 1$) corresponds to the simple algorithm, and the other end point ($n = \infty$) corresponds to the new algorithm with known expected arrival times of all heartbeat messages. As n increases, the new algorithm moves away from the simple algorithm and gets closer to the new algorithm with known expected arrival times. This demonstrates that the problem of the simple algorithm is that it does not use enough of the information available (it only uses the most recently received heartbeat message); by using more of the information available (using more messages received), the new algorithm is able to provide better QoS than the simple algorithm.
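One natural way to carry out such an estimation, sketched here under our own assumptions rather than as the exact estimator of Section 4.4, is to keep the last n arrival offsets $A_i - i\eta$ (receipt time minus the known sending offset) and average them:

```python
from collections import deque

class ArrivalEstimator:
    """Estimate expected arrival times of future heartbeats from the
    receipt times of the last n heartbeats (assumed estimator: average
    the receipt times normalized by the known sending offsets i*eta)."""

    def __init__(self, n, eta):
        self.eta = eta
        self.samples = deque(maxlen=n)   # recent values of A_i - i*eta

    def record(self, i, recv_time):
        # heartbeat m_i is sent at i*eta on p's clock; the normalized
        # offset estimates (clock shift between p and q) + (mean delay)
        self.samples.append(recv_time - i * self.eta)

    def expected_arrival(self, j):
        # estimated expected receipt time of a future heartbeat m_j
        return j * self.eta + sum(self.samples) / len(self.samples)
```

With $\eta = 1$ and receipt times 1.3, 2.1, 3.2 for $m_1, m_2, m_3$, the estimated expected arrival time of $m_4$ is $4 + \mathrm{mean}(0.3, 0.1, 0.2) = 4.2$; the estimated freshness point is then this estimate shifted by the algorithm's slack parameter.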
Finally, we run simulations of both the new algorithm and the simple algorithm, and provide a detailed analysis of the simulation results. The conclusions we draw from the simulation results are: (a) the simulation results of the new algorithm are consistent with our mathematical analysis of the QoS metrics; (b) the modified new algorithm for systems with unsynchronized clocks provides essentially the same QoS as the algorithm with synchronized clocks; and (c) when the new algorithm and the simple algorithm send messages at the same rate and satisfy the same upper bound on the worst-case detection time, the new algorithm provides (in some cases orders of magnitude) better accuracy than the simple algorithm.
4.1.3 Related Work

Heartbeat-style failure detectors are commonly used in practice. To keep both good detection time and good accuracy, many implementations rely on special features of the operating system and communication system to try to deliver heartbeat messages as regularly as possible (see the discussion in Section 12.9 of [Pfi98]). This is not easy even for closely-connected computer clusters, and it is very hard in wide-area networks.

Some other failure detector algorithms and their analyses can be found in [vRMH98, GM98, RT99]. The gossip-style heartbeat protocol in [vRMH98] focuses on the scalability of the protocol, and it falls into the category of the simple algorithm as given in Section 4.1.1. In this protocol, nodes in the network randomly pick some other node to forward a heartbeat message, so heartbeat messages generated by a source node may in some cases reach a destination node directly, while in other cases they may be forwarded by many intermediate nodes before reaching the same destination node. Thus the protocol has a large variation of end-to-end message delays, and therefore it has the problem of the simple algorithm pointed out in Section 4.1.1. Algorithms presented in [GM98] are different from the one-way heartbeat algorithms we discuss in this chapter, and they are used in a more limited setting in which a single suspicion will terminate the protocol. The group membership failure detection algorithm presented in [RT99] detects member failures in a group: if some process detects a failure in the group (perhaps a false detection), then all processes report a group failure and the protocol terminates. The algorithm uses a heartbeat-style protocol, and its timeout mechanism is the same as that of the simple algorithm described in Section 4.1.1.

The probabilistic network model used in this chapter is similar to the ones used in [Cri89, Arv94] for probabilistic clock synchronization. The method of estimating the expected arrival times of heartbeat messages is close to the method of remote clock reading of [Arv94].
The rest of the chapter is organized as follows. In Section 4.2, we define the probabilistic network model. In Section 4.3, we present the new failure detector algorithm, analyze it in terms of the QoS metrics, show the optimality result, and show how to configure the new algorithm to satisfy given QoS requirements. In Section 4.4 we show how to configure the failure detector algorithm when the probabilistic behavior of the messages is not known, and how to modify the algorithm so that it works when the local clocks are not synchronized. We present the simulation results in Section 4.5, and conclude the chapter with some discussions in Section 4.6.
4.2 The Probabilistic Network Model

We assume that processes p and q are connected by a link that does not create or duplicate messages,¹ but may delay or drop messages. Note that the link here represents an end-to-end connection and does not necessarily correspond to a physical link.

We assume that the message loss and message delay behavior of any message sent through the link is probabilistic, and is characterized by the following two parameters: (a) the message loss probability $p_L$, which is the probability that a message is dropped by the link; and (b) the message delay time $D$, which is a random variable with range $(0, \infty)$ representing the delay from the time a message is sent to the time it is received, under the condition that the message is not dropped by the link. We assume that the expected value $E(D)$ and the variance $V(D)$ of $D$ are finite. Note that our model does not assume that the message delay time $D$ follows any particular distribution, and thus it is applicable to many practical systems.
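In code, this link model is just a two-parameter sampler. The sketch below assumes, for illustration only, exponentially distributed delays; the model itself places no such restriction on D:

```python
import random

def send_through_link(send_time, p_loss=0.01, mean_delay=0.02, rng=random):
    """Return the receipt time of a message sent at send_time, or None if
    the link drops it. The delay D is sampled from an exponential
    distribution purely as an example; the model allows any delay
    distribution with finite mean and variance."""
    if rng.random() < p_loss:
        return None                          # message dropped (prob. p_L)
    return send_time + rng.expovariate(1.0 / mean_delay)
```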
Processes p and q have access to their own local clocks. For simplicity, we assume that there is no clock drift, i.e., local clocks run at the same speed as real time (our results can be easily generalized to the case where local clocks have bounded drifts). In Section 4.3, we further assume that clocks are synchronized. We explain how to remove this assumption in Section 4.4.2.

For simplicity we assume that the probabilistic behavior of the network does not change over time. In Section 4.6 we briefly discuss some issues related to the change of network behavior, and explain how our algorithm can adapt to such changes.

¹Message duplication can be easily taken care of: whenever we refer to a message being received, we change it to the first copy of the message being received. With this modification, all definitions and analyses in this chapter go through, and in particular, our results remain correct without any change.

[Figure 4.1: Three scenarios of the failure detector output in one interval $[\tau_i, \tau_{i+1})$]
4.3 The New Failure Detector Algorithm and Its Analysis

4.3.1 The Algorithm

In the new algorithm, the task of process p is the same as in the simple algorithm: p periodically sends heartbeat messages $m_1, m_2, m_3, \ldots$ to q every $\eta$ time units, where $\eta$ is a parameter of the algorithm. Heartbeat message $m_i$ is tagged with its sequence number i. Let $\sigma_i$ be the sending time of message $m_i$.

The new algorithm differs from the simple algorithm in the task of process q. In the new algorithm, q has a sequence of time points $\tau_1 < \tau_2 < \tau_3 < \ldots$, such that $\tau_i$ is obtained by shifting the sending time $\sigma_i$ forward in time by $\delta$ time units (i.e. $\tau_i = \sigma_i + \delta$), where $\delta$ is a fixed parameter of the algorithm. Time points $\tau_i$'s, together
with the arrival times of the heartbeat messages, are used to determine the output of the failure detector at q, as we now explain. Consider the time period $[\tau_i, \tau_{i+1})$. At time $\tau_i$, the failure detector at q checks whether q has received some message $m_j$ with $j \ge i$. If so, the failure detector trusts p in the period $[\tau_i, \tau_{i+1})$ (Fig. 4.1 (a)). If not, the failure detector starts suspecting p at time $\tau_i$. During the period $[\tau_i, \tau_{i+1})$, if q receives some message $m_j$ with $j \ge i$, then the failure detector at q starts trusting p when the message is received, and keeps trusting p until time $\tau_{i+1}$ (Fig. 4.1 (b)). If by time $\tau_{i+1}$ q has not received any message $m_j$ with $j \ge i$, then the failure detector suspects p in the entire period $[\tau_i, \tau_{i+1})$ (Fig. 4.1 (c)). This procedure is repeated for every period.

Note that from time $\tau_i$ to $\tau_{i+1}$, only messages $m_j$ with $j \ge i$ can affect the output of the failure detector. For this reason, $\tau_i$ is called a freshness point: from time $\tau_i$ to $\tau_{i+1}$, messages $m_j$ with $j \ge i$ are still fresh (useful), and messages $m_j$ with $j < i$ are stale (not useful). The core property of the algorithm is that q trusts p at time t if and only if some message that q received is still fresh at time t.

The detailed algorithm, denoted NFD-S, is given in Fig. 4.2.²
Process p:
1  for some constant $\eta$, send to q heartbeat messages $m_1, m_2, m_3, \ldots$ at regular time points $\eta, 2\eta, 3\eta, \ldots$, respectively;

Process q:
2  Initialization:
3    for all $i \ge 1$, set $\tau_i = \sigma_i + \delta$;  {$\sigma_i = i\eta$ is the sending time of $m_i$}
4    output ← S;  {suspect p initially}
5  at every freshness point $\tau_i$:
6    if no message $m_j$ with $j \ge i$ has been received then
7      output ← S;  {suspect p if no fresh message is received}
8  upon receive message $m_j$ at time $t \in [\tau_i, \tau_{i+1})$:
9    if $j \ge i$ then output ← T;  {trust p when some fresh message is received}

Figure 4.2: The new failure detector algorithm NFD-S, with synchronized clocks, and with parameters $\eta$ and $\delta$

4.3.2 The Analysis

We now analyze the QoS metrics of the algorithm. For the analysis, we assume that the link from p to q satisfies the following message independence property: (a) the message loss and message delay behavior of any message sent by p is independent of whether or when p crashes later; and (b) there exists a known constant $\Delta$ such that the message loss and message delay behaviors of any two messages sent at least $\Delta$ time units apart are independent. We assume that the intersending time $\eta$ is chosen such that $\eta \ge \Delta$, so that all heartbeat messages have independent delay and loss behaviors. For simplicity, we assume that the actions in lines 5–7 and lines 8–9 are executed instantaneously without interruption.

We adopt the following convention about transitions of a failure detector's output:³ when an S-transition occurs at time t, the output at time t is S, and a symmetric convention is taken for T-transitions. With this convention, the output is right-continuous: namely, if the output at a time t is $X \in \{T, S\}$, then there exists $\epsilon > 0$ such that the output is also X in the period $(t, t + \epsilon)$.

Henceforth, let $\tau_0 \stackrel{\mathrm{def}}{=} 0$, and $\tau_i$, $i \ge 1$, are given as in line 3. The following core lemma states precisely our intuitive ideas about freshness points and fresh messages. All subsequent analyses are based on this lemma.

²This version of the algorithm is convenient for illustrating the main idea and for performing the analysis. One can easily derive an equivalent version that is more efficient in practice.

³This convention is already included in the formal model defined in Chapter 3.
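The freshness-point rule of Fig. 4.2 can also be rendered as a small offline function over the heartbeat receipt times (a sketch with names of our own choosing; it assumes synchronized clocks and $\sigma_i = i\eta$):

```python
import math

def nfd_s_output(t, recv_times, eta, delta):
    """Output of NFD-S at time t. recv_times[j-1] is the receipt time of
    heartbeat m_j (None if the message was lost). m_i is sent at i*eta,
    so freshness point tau_i = i*eta + delta. Returns 'T' or 'S'."""
    # find i such that t lies in [tau_i, tau_{i+1}), taking tau_0 = 0
    i = max(0, math.floor((t - delta) / eta))
    # q trusts p at time t iff some still-fresh message m_j (j >= i)
    # has been received by time t
    return "T" if any(
        r is not None and r <= t
        for j, r in enumerate(recv_times, start=1) if j >= i
    ) else "S"
```

For example, with $\eta = 1$, $\delta = 0.5$, and receipt times 1.2, lost, 3.1 for $m_1, m_2, m_3$, the output is S before $m_1$ arrives, T from 1.2 up to $\tau_2 = 2.5$, S on $[2.5, 3.1)$ because $m_1$ has gone stale and $m_2$ is lost, and T again once $m_3$ arrives.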
Lemma 4.1 For all $i \ge 0$ and all times $t \in [\tau_i, \tau_{i+1})$, q trusts p at time t if and only if q has received some message $m_j$ with $j \ge i$ by time t.

Proof. Fix an $i \ge 0$ and a time $t \in [\tau_i, \tau_{i+1})$. Suppose first that q has received some message $m_j$ with $j \ge i$ by time t. Let $t' \le t$ be the time when $m_j$ is received. Choose $i'$ such that $t' \in [\tau_{i'}, \tau_{i'+1})$. Thus $i' \le i \le j$. According to line 9, q trusts p at time $t'$. For every freshness point $\tau_{i''}$ in the period $(t', t]$, since $m_j$ is received at $t'$ and $i'' \le i \le j$, the output of the failure detector does not change to S, according to lines 5–7. Therefore, q trusts p at time t.

Suppose now that q has not received any message $m_j$ with $j \ge i$ by time t. Then at time $\tau_i$, q suspects p according to lines 5–7. During the period $(\tau_i, t]$, since no message $m_j$ with $j \ge i$ is received, the output of the failure detector does not change to T. So q suspects p at time t.
The following definitions are used in the analysis, and they are all with respect to failure-free runs.⁴ Note that even though i appears in these definitions, the actual values of i are irrelevant. This is made clear in Proposition 4.2.

Definition 4.1

(1) For any $i \ge 1$, let k be the smallest integer such that for all $j \ge i + k$, $m_j$ is sent at or after time $\tau_i$.

(2) For any $i \ge 1$, let $p_j(x)$ be the probability that q does not receive message $m_{i+j}$ by time $\tau_i + x$, for every $j \ge 0$ and every $x \ge 0$; let $p_0 = p_0(0)$.

(3) For any $i \ge 2$, let $q_0$ be the probability that q receives message $m_{i-1}$ before time $\tau_i$.

(4) For any $i \ge 1$, let $u(x)$ be the probability that q suspects p at time $\tau_i + x$, for every $x \in [0, \eta)$.

(5) For any $i \ge 2$, let $p_S$ be the probability that an S-transition occurs at time $\tau_i$.

⁴Recall that a failure-free run is a run in which p does not crash, as defined in Section 2.2.1.
Proposition 4.2

(1) $k = \lceil \delta/\eta \rceil$.

(2) For all $j \ge 0$ and for all $x \ge 0$, $p_j(x) = p_L + (1 - p_L) \Pr(D > \delta + x - j\eta)$.

(3) $q_0 = (1 - p_L) \Pr(D < \delta + \eta)$.

(4) For all $x \in [0, \eta)$, $u(x) = \prod_{j=0}^{k} p_j(x)$.

(5) $p_S = q_0 \cdot u(0)$.

Upon a first reading, readers may skip the rest of the analysis and go directly to Theorem 4.11.
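To make the proposition concrete, these quantities can be evaluated numerically once a delay distribution is fixed. The sketch below assumes, for illustration only, exponentially distributed delays (the analysis itself holds for any distribution of D):

```python
import math

def nfd_s_quantities(p_L, mean_delay, eta, delta):
    """Evaluate k, p_0, q_0, u(0) and p_S of Proposition 4.2 for NFD-S,
    assuming (for this example only) delay D ~ exponential(mean_delay)."""
    def tail(d):                 # Pr(D > d); equals 1 for d <= 0 since D > 0
        return math.exp(-max(d, 0.0) / mean_delay)

    def p_j(j, x):               # prob. that m_{i+j} is not received by tau_i + x
        return p_L + (1 - p_L) * tail(delta + x - j * eta)

    k = math.ceil(delta / eta)   # smallest k with k*eta >= delta
    p_0 = p_j(0, 0.0)
    q_0 = (1 - p_L) * (1 - tail(delta + eta))   # Pr(m_{i-1} arrives before tau_i)
    u_0 = math.prod(p_j(j, 0.0) for j in range(k + 1))  # prob. q suspects p at tau_i
    p_S = q_0 * u_0              # probability of an S-transition at tau_i
    return k, p_0, q_0, u_0, p_S
```

For instance, with $p_L = 0.01$, $E(D) = 0.02$, $\eta = 0.1$ and $\delta = 0.3$, we get $k = 3$ and an S-transition probability $p_S$ on the order of $10^{-6}$ per freshness point.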
We now analyze the accuracy metrics of the algorithm NFD-S, and to do so we assume that p does not crash.
Proposition 4.3 (1) An S-transition can only occur at time $\tau_i$ for some $i \ge 2$, and it occurs at $\tau_i$ if and only if message $m_{i-1}$ is received by q before time $\tau_i$ and no message $m_j$ with $j \ge i$ is received by q by time $\tau_i$; (2) Lemma 4.1 remains true if $j \ge i$ in the statement is replaced by $i \le j \le i + k$; (3) part (1) above remains true if $j \ge i$ in the statement is replaced by $i \le j < i + k$.

Proof. From the algorithm, it is clear that an S-transition can only occur at time $\tau_i$ with $i \ge 1$. An S-transition cannot occur at time $\tau_1$, because if so, q suspects p at time $\tau_1$, which implies from Lemma 4.1 that q does not receive $m_i$ by time $\tau_1$ for all $i \ge 1$. Then q must also suspect p during the period $[0, \tau_1)$, a contradiction.
An S-transition occurs at time $\tau_i$ if and only if (a) q suspects p at time $\tau_i$, and (b) for some $t' \in (\tau_{i-1}, \tau_i)$, q trusts p at time $t'$. Then by Lemma 4.1, (a) is true if and only if no message $m_j$ with $j \ge i$ is received by q by time $\tau_i$, while (b) is true if and only if some message $m_j$ with $j \ge i - 1$ is received by q by time $t' < \tau_i$. Combining (a) and (b), we know that an S-transition occurs at time $\tau_i$ if and only if message $m_{i-1}$ is received by q before time $\tau_i$ and no message $m_j$ with $j \ge i$ is received by q by time $\tau_i$.

(2) and (3) follow from the definition of k.
Note that part (1) of the above proposition guarantees that during any bounded time period, there are only a finite number of transitions of the failure detector output.
Proof of Proposition 4.2. (1) is immediate from the fact that $m_j$ is sent at time $\tau_i - \delta + (j - i)\eta$ for all $i \ge 1$.

(2) directly follows from the fact that $p_j(x)$ is the probability that either $m_{i+j}$ is lost, or $m_{i+j}$ is not lost but is delayed by more than $\sigma_i + \delta + x - (\sigma_i + j\eta) = \delta + x - j\eta$ time units.

(3) directly follows from the fact that $q_0$ is the probability that $m_{i-1}$ is not lost and is delayed less than $\delta + \eta$ time units.

(4) By Proposition 4.3 (2), $u(x)$ is the probability that q does not receive any message $m_j$ with $i \le j \le i + k$ by time $\tau_i + x$. Then by the definition of $p_j(x)$ and the message independence property, we have $u(x) = \prod_{j=0}^{k} p_j(x)$.

(5) By Proposition 4.3 (1), $p_S$ is the probability that (a) message $m_{i-1}$ is received by q before time $\tau_i$, and (b) no message $m_j$ with $j \ge i$ is received by q by time $\tau_i$. By the message independence property, (a) and (b) are independent, and by Lemma 4.1, (b) is also the event that q suspects p at time $\tau_i$. Thus by the definitions of $q_0$ and $u(x)$ we have $p_S = q_0 \cdot u(0)$.
Proposition 4.4 u(0) ≥ p_0^k, and for all x ∈ [0, η), u(0) ≥ u(x).

Proof. By Proposition 4.2, p_j(0) ≥ p_0(0) = p_0, p_k(0) = 1, and p_j(0) ≥ p_j(x) for x ∈ [0, η). So u(0) = ∏_{j=0}^{k} p_j(0) ≥ ∏_{j=0}^{k−1} p_0 = p_0^k, and u(0) = ∏_{j=0}^{k} p_j(0) ≥ ∏_{j=0}^{k} p_j(x) = u(x).
Lemma 4.5 (1) If p_0 = 0, then with probability one q trusts p forever after time τ_1; (2) if q_0 = 0, then with probability one q suspects p forever; (3) if p_0 > 0 and q_0 > 0, then with probability one the failure detector at q has an infinite number of transitions.

Proof. (1) By definition, p_0 = 0 means that for all i ≥ 1, the probability that q does not receive m_i by time τ_i is 0. Thus by Lemma 4.1, the probability that q keeps trusting p in the period [τ_i, τ_{i+1}) is 1. Therefore, with probability one q trusts p forever after time τ_1.

(2) By definition, q_0 = 0 means that for all i ≥ 2, the probability that q receives m_{i−1} before time τ_i is 0. For every j ≥ i, message m_j is sent after m_{i−1}, so the probability that q receives m_j before time τ_i is also 0. This implies that the probability that q receives some m_j with j ≥ i − 1 before time τ_i is 0. By Lemma 4.1, we have that for all i ≥ 0, the probability that q keeps suspecting p in the period [τ_i, τ_{i+1}) is 1. Thus with probability one q suspects p forever.
(3) Suppose p_0 > 0 and q_0 > 0. For all i ≥ 2, let A_i be the event that there is an S-transition at time τ_i. By Proposition 4.3 (3), A_i is also the event that message m_{i−1} is received before time τ_i but no message m_j with i ≤ j < i + k is received by time τ_i. Hence A_i only depends on the messages m_j with i − 1 ≤ j < i + k, which implies that {A_{i(k+1)}, i ≥ 2} are independent. By definition and Proposition 4.4, we have Pr(A_i) = q_0 · u(0) ≥ q_0 · p_0^k > 0. Therefore, with probability one, the failure detector at q has an infinite number of transitions.
The above lemma factors out the special case in which p_0 = 0 or q_0 = 0. We call this special case the degenerated case. From now on, we only consider the nondegenerated case, in which p_0 > 0 and q_0 > 0, and we only consider runs in which the output of the failure detector has an infinite number of transitions.
Lemma 4.6 P_A = 1 − (1/η) ∫_0^η u(x) dx.
Proof. For all i ≥ 1, let P_i be the probability that at a random time T ∈ [τ_i, τ_{i+1}), q is suspecting p. Note that T is uniformly distributed on [τ_i, τ_{i+1}) with density 1/(τ_{i+1} − τ_i) = 1/η. Thus

    P_i = (1/η) ∫_{τ_i}^{τ_{i+1}} u(x − τ_i) dx = (1/η) ∫_0^η u(x) dx.

Note that the value of P_i does not depend on i. Let this value be P. Thus we have that P_A, the probability that q trusts p at a random time, is 1 − P. This shows the lemma.
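Lemma 4.6 is easy to evaluate numerically once a delay distribution is fixed. The sketch below is an illustration, not part of the thesis: it assumes exponentially distributed delays with an arbitrary mean, computes u(x) from Proposition 4.2, and obtains P_A by numerical integration.

```python
import math

def u(x, eta, delta, p_loss, mean_delay):
    # u(x) = prod_{j=0}^{k} p_j(x) with k = ceil(delta/eta), where p_j(x) is
    # the probability that m_{i+j} is lost or delayed more than
    # delta + x - j*eta (Proposition 4.2).
    # Pr(D > t) = exp(-t/mean_delay) is an assumed delay model.
    k = math.ceil(delta / eta)
    pr_gt = lambda t: 1.0 if t <= 0 else math.exp(-t / mean_delay)
    prod = 1.0
    for j in range(k + 1):
        prod *= p_loss + (1 - p_loss) * pr_gt(delta + x - j * eta)
    return prod

def query_accuracy(eta, delta, p_loss, mean_delay, steps=10000):
    # P_A = 1 - (1/eta) * integral_0^eta u(x) dx (Lemma 4.6), midpoint rule.
    h = eta / steps
    integral = h * sum(u((m + 0.5) * h, eta, delta, p_loss, mean_delay)
                       for m in range(steps))
    return 1 - integral / eta

print(query_accuracy(eta=1.0, delta=2.0, p_loss=0.01, mean_delay=0.5))
```

Note that u is nonincreasing on [0, η) here, consistent with Proposition 4.4's u(0) ≥ u(x).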
We now analyze the average mistake recurrence time E(T_MR) of the failure detector. We will show that

Lemma 4.7 E(T_MR) = η/p_S.

If, at each time point τ_i with i ≥ 2, the test of whether an S-transition occurs were an independent Bernoulli trial, then the above result would be very easy to obtain: p_S is the probability of success in one Bernoulli trial, i.e., an S-transition occurs at τ_i, and η is the time between two Bernoulli trials, so η/p_S is the expected time between two successful Bernoulli trials, which is just the expected time between two S-transitions. Unfortunately, this is not the case, because the tests of whether S-transitions occur at the τ_i's are not independent. In fact, by Proposition 4.3, the event that an S-transition occurs at τ_i depends on the behavior of messages m_{i−1}, . . . , m_{i+k−1}. Thus two such events may depend on the behavior of common messages, and so they are not independent in general.
To deal with this, we use some results in renewal theory, a branch of the theory of stochastic processes. Besides proving Lemma 4.7, the analysis also reveals an important property of the failure detector output: each recurrence interval between two consecutive S-transitions is independent of the other recurrence intervals.

The analysis proceeds as follows. We first introduce the concept of a renewal process. A more formal account can be found in any standard textbook on stochastic processes (see for example Chapter 3 of [Ros83]). Let {(T_n, R_n), n = 1, 2, . . .} be a sequence of random variable pairs such that (1) a nonnegative T_n denotes the time between the (n − 1)-th and the n-th occurrences of some recurrent event A, i.e., event A occurs at times t_1 = T_1, t_2 = T_1 + T_2, t_3 = T_1 + T_2 + T_3, . . .; and (2) R_n can be interpreted as the reward associated with the n-th occurrence of event A. A delayed renewal reward process is such a sequence satisfying: (1) the pairs (T_n, R_n), n ≥ 1, are mutually independent; and (2) the pairs (T_n, R_n), n ≥ 2, are identically distributed. If {R_n} is omitted, then the process {T_n, n ≥ 1} is called a delayed renewal process. Such processes are well studied in the literature, and are known to have some nice properties.

Now consider the S-transitions of the failure detector output as the recurrent events. Let T_{MR,n} be the random variable representing the time that elapses from the (n − 1)-th S-transition to the n-th S-transition (as a convention, consider time 0 to be the time at which the 0-th S-transition occurs). Let T_{M,n} be the random variable representing the time that elapses from the (n − 1)-th S-transition to the n-th T-transition. Thus T_{M,n} ≤ T_{MR,n} for all n ≥ 1.

Lemma 4.8 {(T_{MR,n}, T_{M,n}), n = 1, 2, . . .} is a delayed renewal reward process.
We need the following technical result before proving this lemma.

Proposition 4.9 Let {A_i, i ≥ 1} be an event partition (i.e., the A_i are disjoint and cover all events). Two random variables X and Y are independent if for all A_i: (1) X is independent of A_i, that is, if Pr(A_i) > 0 then for all x ∈ [−∞, ∞], Pr(X ≤ x) = Pr(X ≤ x | A_i); and (2) X and Y are independent when conditioned on A_i, that is, if Pr(A_i) > 0 then for all x, y ∈ [−∞, ∞], Pr(X ≤ x, Y ≤ y | A_i) = Pr(X ≤ x | A_i) Pr(Y ≤ y | A_i).

Proof. For all x, y ∈ [−∞, ∞],

    Pr(X ≤ x, Y ≤ y) = Σ_{i≥1} Pr(X ≤ x, Y ≤ y | A_i) Pr(A_i)
                     = Σ_{Pr(A_i)>0} Pr(X ≤ x | A_i) Pr(Y ≤ y | A_i) Pr(A_i)
                     = Pr(X ≤ x) Σ_{Pr(A_i)>0} Pr(Y ≤ y | A_i) Pr(A_i)
                     = Pr(X ≤ x) Pr(Y ≤ y).

Thus X and Y are independent.

Note that we can replace X and Y in the above proposition with random vectors and the result still holds.
Proof of Lemma 4.8. For all n ≥ 1, by Proposition 4.3 (1), the n-th S-transition can only occur at time τ_i for some i ≥ 2. Let A^n_i be the event that the n-th S-transition occurs at time τ_i. Thus for each n ≥ 1, {A^n_i, i ≥ 2} is an event partition. Let B^n_i be the event consisting of all the runs in which the messages m_j with j < i behave in the same way as in some run in A^n_i. Let C_i be the event that no message m_j with j ≥ i is received by time τ_i. Since C_i and B^n_i are determined by completely different sets of messages, C_i is independent of B^n_i.

To complete the proof of the lemma, we now show the following five claims.

Claim 1. For all n ≥ 1 and for all i ≥ 2, A^n_i = B^n_i ∩ C_i.

Proof of Claim 1. By definition, A^n_i ⊆ B^n_i. By Proposition 4.3 (1), A^n_i implies that no message m_j with j ≥ i arrives by time τ_i, and thus A^n_i ⊆ C_i. So A^n_i ⊆ B^n_i ∩ C_i. For any run r_1 in B^n_i ∩ C_i, by the definition of B^n_i, there is a run r_2 in A^n_i such that the messages m_j with j < i behave exactly in the same way in both runs. Since r_1 ∈ C_i, we know from the definition of C_i that in r_1 no message m_j with j ≥ i is received by time τ_i. Since r_2 ∈ A^n_i, we know from the definition of A^n_i and Proposition 4.3 (1) that in r_2 no message m_j with j ≥ i is received by time τ_i. Thus in both runs r_1 and r_2, the failure detector outputs up to time τ_i are the same. Therefore, in r_1 the n-th S-transition occurs at τ_i just as in r_2, which means r_1 ∈ A^n_i. Thus Claim 1 holds.
Claim 2. For all n, n′ ≥ 1 and for all i, i′ ≥ 2, if Pr(A^n_i) > 0 and Pr(A^{n′}_{i′}) > 0, then for all x, y ∈ [−∞, ∞],

    Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y | A^n_i) = Pr(T_{MR,n′+1} ≤ x, T_{M,n′+1} ≤ y | A^{n′}_{i′}).    (4.1)

Proof of Claim 2. Suppose Pr(A^n_i) > 0 and Pr(A^{n′}_{i′}) > 0. Let t_T and t_S be two random variables representing the times at which the first T-transition and the first S-transition occur after time τ_i, respectively. Since A^n_i represents the event that the n-th S-transition occurs at time τ_i, we have Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y | A^n_i) = Pr(t_S − τ_i ≤ x, t_T − τ_i ≤ y | A^n_i). Let D_i be the event {t_S − τ_i ≤ x, t_T − τ_i ≤ y}. Equality (4.1) is thus equivalent to Pr(D_i | A^n_i) = Pr(D_{i′} | A^{n′}_{i′}).
By Lemma 4.1, the output of the failure detector after τ_i is completely determined by the messages m_j with j ≥ i. Thus we know that D_i is completely determined by the messages m_j with j ≥ i. Since C_i is completely determined by the messages m_j with j ≥ i while B^n_i is completely determined by the messages m_j with j < i, we have that both C_i and C_i ∩ D_i are independent of B^n_i. We claim that Pr(D_i | A^n_i) = Pr(D_i | C_i). Indeed,

    Pr(D_i | A^n_i) = Pr(D_i | B^n_i ∩ C_i) = Pr(D_i ∩ B^n_i ∩ C_i)/Pr(B^n_i ∩ C_i)
                   = [Pr(D_i ∩ B^n_i ∩ C_i)/Pr(B^n_i)] / [Pr(B^n_i ∩ C_i)/Pr(B^n_i)]
                   = Pr(D_i ∩ C_i | B^n_i)/Pr(C_i | B^n_i)
                   = Pr(D_i ∩ C_i)/Pr(C_i)
                   = Pr(D_i | C_i).

Similarly we have Pr(D_{i′} | A^{n′}_{i′}) = Pr(D_{i′} | C_{i′}). Thus we only need to show that Pr(D_i | C_i) = Pr(D_{i′} | C_{i′}).

Pr(D_i | C_i) is the probability that, given that no message m_j with j ≥ i is received by time τ_i, the first S-transition after τ_i occurs within x time units after τ_i and the first T-transition after τ_i occurs within y time units. Since the occurrences of these transitions are all determined by the messages m_j with j ≥ i, and messages are sent at regular intervals, it is easy to verify that this probability is the same for every i ≥ 2. Thus Pr(D_i | C_i) = Pr(D_{i′} | C_{i′}), and Claim 2 holds.
Claim 3. The pairs (T_{MR,n}, T_{M,n}), n ≥ 2, are identically distributed.

Proof of Claim 3. This is a direct consequence of Claim 2. In fact, for all n ≥ 2 and for all x, y ∈ [−∞, ∞], let p(x, y) = Pr(T_{MR,n} ≤ x, T_{M,n} ≤ y | A^{n−1}_i) for any i with Pr(A^{n−1}_i) > 0. This is well defined by Claim 2. Then

    Pr(T_{MR,n} ≤ x, T_{M,n} ≤ y) = Σ_{i≥2} Pr(T_{MR,n} ≤ x, T_{M,n} ≤ y | A^{n−1}_i) Pr(A^{n−1}_i)
                                  = Σ_{Pr(A^{n−1}_i)>0} p(x, y) Pr(A^{n−1}_i) = p(x, y).
Claim 4. For all n ≥ 1 and i ≥ 2, (T_{MR,n+1}, T_{M,n+1}) is independent of A^n_i.

Proof of Claim 4. This is another direct consequence of Claim 2. Suppose that we fix i and n and that Pr(A^n_i) > 0. Then we have for all x, y ∈ [−∞, ∞],

    Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y) = Σ_{j≥2} Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y | A^n_j) Pr(A^n_j)
                                      = Σ_{Pr(A^n_j)>0} Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y | A^n_i) Pr(A^n_j)
                                      = Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y | A^n_i).

This shows that (T_{MR,n+1}, T_{M,n+1}) is independent of A^n_i.
Claim 5. For all n ≥ 1 and for all i ≥ 2, when conditioned on A^n_i, (T_{MR,n+1}, T_{M,n+1}) is independent of {(T_{MR,j}, T_{M,j}), j ≤ n}.

Proof of Claim 5. We already know that, when conditioned on A^n_i, (T_{MR,n+1}, T_{M,n+1}) is completely determined by the distribution of the messages m_j with j ≥ i. On the other hand, when conditioned on A^n_i, the occurrence of any S-transition or T-transition before τ_i is determined only by the messages m_j with j < i, because A^n_i implies that no message m_j with j ≥ i arrives at q by time τ_i. Since the occurrences of transitions before and after τ_i are determined by disjoint sets of messages, and messages are independent of each other, Claim 5 holds.
From Claims 4 and 5 and Proposition 4.9, we know that (T_{MR,n+1}, T_{M,n+1}) is independent of {(T_{MR,j}, T_{M,j}), j ≤ n}. Thus the pairs (T_{MR,n}, T_{M,n}), n ≥ 1, are mutually independent. From Claim 3, we know that the pairs (T_{MR,n}, T_{M,n}), n ≥ 2, are identically distributed. Therefore, {(T_{MR,n}, T_{M,n}), n = 1, 2, . . .} is a delayed renewal reward process.

It is immediate from the above lemma that for all n ≥ 2, T_MR = T_{MR,n}, T_M = T_{M,n} and T_G = T_{MR,n} − T_{M,n}. This provides more direct ways to analyze the distributions of these variables. Moreover, any delayed renewal reward process is ergodic (see for example Section 2.6 of [Sig95]), so Theorem 2.1 of Chapter 2 is applicable to our failure detector.
Proof of Lemma 4.7. For all i ≥ 2, let A_i be the event that an S-transition occurs at time τ_i. By definition and Proposition 4.4, we have that Pr(A_i) = p_S = q_0 · u(0) ≥ q_0 · p_0^k. Since in the nondegenerated case q_0 > 0 and p_0 > 0, we have Pr(A_i) > 0. By Proposition 4.3 (3), A_i is also the event that m_{i−1} is received before time τ_i but no message m_j with i ≤ j < i + k is received by time τ_i. This implies that A_i only depends on the messages m_j with i − 1 ≤ j < i + k, which in turn implies that for every j ∈ {2, . . . , k + 2}, the events A_{i(k+1)+j}, i ≥ 0, are independent.

For j ∈ {2, . . . , k + 2}, let B_j be the set of time points {τ_{i(k+1)+j} : i = 0, 1, . . .}. Obviously, the B_j form a partition of all the time points τ_i, i ≥ 2. Let N_j(t) be the random variable representing the number of S-transitions that occur at times in B_j by time t, and let N(t) be the random variable representing the total number of S-transitions by time t. Thus N(t) = Σ_{j=2}^{k+2} N_j(t).

Consider N_j(t) for some j ∈ {2, . . . , k + 2}. For t ≥ τ_j, the number of time points in B_j that are no greater than t is ⌊(t − τ_j)/((k + 1)η)⌋ + 1. From the above, we know that at each of these time points there is an independent probability p_S that an S-transition occurs. Therefore, the average number of S-transitions at these time points by time t ≥ τ_j is given by

    E(N_j(t)) = p_S (⌊(t − τ_j)/((k + 1)η)⌋ + 1).

Hence, we have for t ≥ τ_{k+2},

    E(N(t)) = Σ_{j=2}^{k+2} p_S (⌊(t − τ_j)/((k + 1)η)⌋ + 1).

By Lemma 4.8, {T_{MR,n}, n ≥ 1} is a delayed renewal process. Then by the Elementary Renewal Theorem (see for example [Ros83] p. 61),

    E(T_MR) = lim_{t→∞} t/E(N(t))
            = lim_{t→∞} t / Σ_{j=2}^{k+2} p_S (⌊(t − τ_j)/((k + 1)η)⌋ + 1)
            = 1 / Σ_{j=2}^{k+2} p_S lim_{t→∞} (⌊(t − τ_j)/((k + 1)η)⌋ + 1)/t
            = 1 / Σ_{j=2}^{k+2} (p_S/((k + 1)η))
            = η/p_S.
By Lemma 4.7, we know that 0 < E(T_MR) < ∞. We can therefore apply Theorem 2.1 of Chapter 2 to obtain results on the other metrics from our results on P_A and E(T_MR).

This completes the analysis of the accuracy metrics of the new failure detector. We now give the bound on the worst-case detection time T_D.
Lemma 4.10 T_D ≤ δ + η. Moreover, the inequality is tight when q_0 > 0, and T_D is always 0 when q_0 = 0.

Proof. Suppose that process p crashes at time t. Let m_i be the last heartbeat message sent by p before p crashes. By definition, m_i is sent at time σ_i, and σ_i ≤ t. Since no messages with sequence number greater than i are sent by p, q does not receive such messages. Thus by Lemma 4.1, for all t′ ∈ [τ_{i+1}, ∞), q suspects p at time t′. So the detection time is at most τ_{i+1} − t = σ_i + δ + η − t ≤ δ + η.

When q_0 > 0, with nonzero probability m_i is received before τ_{i+1} and thus q trusts p just before τ_{i+1}.⁵ In these cases, the detection time is σ_i + δ + η − t. Since the time t at which p crashes can be arbitrarily close to σ_i, the bound δ + η is tight. When q_0 = 0, similarly to Lemma 4.5 (2), we can see that in runs in which p crashes, q also suspects p forever. Therefore T_D is always 0.
All the above analytical results are summarized in Theorem 4.11.

Theorem 4.11 The failure detector NFD-S given in Fig. 4.2 has the following properties:

(1) T_D ≤ δ + η. Moreover, if q_0 > 0, then the inequality is tight, and if q_0 = 0, then T_D is always 0.

(2) If p_0 > 0 and q_0 > 0 (the nondegenerated case), then we have

    E(T_MR) = 1/λ_M = η/p_S,    (4.2)
    E(T_M) = (1 − P_A) · E(T_MR) = (∫_0^η u(x) dx)/p_S,    (4.3)
    P_A = 1 − (1/η) ∫_0^η u(x) dx,    (4.4)
    E(T_G) = P_A · E(T_MR) = (η − ∫_0^η u(x) dx)/p_S,    (4.5)
    E(T_FG) ≥ (1/2) P_A · E(T_MR) = (η − ∫_0^η u(x) dx)/(2p_S).    (4.6)

If p_0 = 0 or q_0 = 0 (the degenerated case), then we have: in failure-free runs, (a) if p_0 = 0, then with probability one q trusts p forever after time τ_1; (b) if q_0 = 0, then with probability one q suspects p forever.

⁵Even though q_0 is defined with respect to runs in which p does not crash, it also applies to runs in which p crashes, by part (a) of the message independence property.
From these closed formulas, we can derive many useful properties of the QoS of the new failure detector. For example, we can derive bounds on the accuracy metrics E(T_MR), E(T_M), P_A, E(T_G), and E(T_FG), as we later do in Theorem 4.17. From these bounds, it is easy to check that when δ increases or η decreases, P_A increases exponentially fast towards 1, and E(T_MR), E(T_G) and E(T_FG) increase exponentially fast towards ∞, while E(T_M) remains bounded by a relatively small value. The tradeoff is that (a) when δ increases, the detection time increases linearly, and (b) when η decreases, the network bandwidth used by the failure detector increases linearly. Therefore, with a small (linear) increase in the detection time or in the network cost, we can get a large (exponential) increase in the accuracy of the new failure detector.
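This linear-cost, exponential-benefit tradeoff is easy to see numerically from equality (4.2). The sketch below is illustrative only: it assumes exponentially distributed delays, and all parameter values are made up. It evaluates E(T_MR) = η/p_S for increasing δ:

```python
import math

def expected_mistake_recurrence(eta, delta, p_loss, mean_delay):
    # E(T_MR) = eta / p_S with p_S = q_0 * u(0)  (equality (4.2) and
    # Proposition 4.2).  Exponential delays are an assumed model:
    # Pr(D > t) = exp(-t/mean_delay).
    pr_gt = lambda t: 1.0 if t <= 0 else math.exp(-t / mean_delay)
    k = math.ceil(delta / eta)
    u0 = 1.0
    for j in range(k + 1):
        u0 *= p_loss + (1 - p_loss) * pr_gt(delta - j * eta)
    q0 = (1 - p_loss) * (1 - pr_gt(delta + eta))
    return eta / (q0 * u0)

for delta in (1.0, 2.0, 3.0, 4.0):
    print(delta, expected_mistake_recurrence(1.0, delta, 0.01, 0.5))
```

Each unit increase in δ (a linear increase in detection time) multiplies E(T_MR) by a large factor in this setting.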
In Sections 4.3.4, 4.4.1 and 4.4.3, we will show how these closed formulas are used to compute the failure detector parameters to satisfy given QoS requirements.
4.3.3 An Optimality Result

Besides the properties given in Theorem 4.11, the new algorithm has the following important optimality property: among all failure detectors that send messages at the same rate and satisfy the same upper bound on the worst-case detection time, the new algorithm provides the best query accuracy probability.

More precisely, let C be the class of failure detector algorithms A such that in every run of A process p sends messages to q every η time units and A satisfies T_D ≤ T^U_D for some constant T^U_D. Let A* be the instance of the new failure detector algorithm NFD-S with parameters η and δ = T^U_D − η (δ can be negative). By part (1) of Theorem 4.11, we know that A* ∈ C. We show that

Theorem 4.12 For any A ∈ C, let P_A be the query accuracy probability of A, and let P_{A*} be the query accuracy probability of A*. Then P_{A*} ≥ P_A.

The core idea behind the theorem is the following important property of algorithm A*: roughly speaking, if in some failure-free run r* of A* process q suspects p at time t, then for any A ∈ C, in any failure-free run r′ of A in which the message delay and loss behaviors are exactly the same as in run r*, q also suspects p at time t. With this property, it is easy to see that the probability that q trusts p at a random time in A* must be at least as high as the probability that q trusts p at a random time in any A ∈ C. We now give the more detailed proof.
A message delay pattern P_D is a sequence {d_1, d_2, d_3, . . .} with d_i ∈ (0, ∞] representing the delay of the i-th message sent by p; d_i = ∞ means that the i-th message is lost. The distribution of message delay patterns is governed by the message loss probability p_L and the message delay time D, and thus it is the same for all algorithms in C.

We first consider the subclass C′ of C such that for any algorithm A ∈ C′, in any run of A process p sends messages to q at times η, 2η, 3η, . . ., just as in A*. For any algorithm in C′, a message delay pattern completely determines the times and the order at which q receives messages in failure-free runs. For A*, this means that a message delay pattern uniquely determines a failure-free run of A*. For some other algorithm A ∈ C′, if A is nondeterministic, then A may have different failure-free runs with the same message delay pattern.
Lemma 4.13 Given any message delay pattern P_D, let r* be the failure-free run of A* with P_D, and let r be a failure-free run of some algorithm A ∈ C′ with P_D. Then for every time t ≥ T^U_D, if q suspects p at time t in run r*, then q suspects p at time t in run r.

Proof. Suppose that in run r* of A*, q suspects p at time t ≥ T^U_D. Note that T^U_D = η + δ = τ_1, so t ≥ τ_1. Suppose t ∈ [τ_i, τ_{i+1}) for some i ≥ 1. By Lemma 4.1, in run r* q does not receive any message m_j with j ≥ i by time t. Since in run r p sends messages at the same times as in run r*, and both runs have the same message delay pattern P_D, in run r, by time t, q does not receive any message sent by p at a time jη with j ≥ i.

Consider first the case t ∈ (τ_i, τ_{i+1}). Suppose for a contradiction that q trusts p at time t in run r. Let ε = t − τ_i, and let t′ = (i − 1)η + ε/2. Thus ε > 0. Consider another run r′ of A in which p crashes at time t′, and the messages sent before t′ (those sent at times jη with j < i) have the same loss and delay behaviors as in run r (this is possible by part (a) of the message independence property). In both runs r and r′, up to time t, q receives the same messages at the same times. If A is nondeterministic, we let A make the same nondeterministic choices up to time t in both runs. Thus q cannot distinguish run r′ from run r at time t, and so q trusts p at time t in run r′. The detection time in run r′, however, is at least t − t′ = (τ_i + ε) − ((i − 1)η + ε/2) = η + δ + ε/2 = T^U_D + ε/2 > T^U_D, contradicting the assumption that A satisfies T_D ≤ T^U_D.

Now suppose t = τ_i. Since the failure detector output is right continuous, there exists ε > 0 such that q suspects p in the period (t, t + ε) in run r*. Then by the above argument, q suspects p in the period (t, t + ε) in run r. By the right continuity again, we have that q suspects p at time t in run r.
Corollary 4.14 For any A ∈ C′, let P_A be the query accuracy probability of A, and let P_{A*} be the query accuracy probability of A*. Then P_{A*} ≥ P_A.

Proof (Sketch). We first fix a message delay pattern P_D. For the run r* of A* and any run r of A with message delay pattern P_D, Lemma 4.13 shows that for any time t ≥ T^U_D, if q suspects p in r* at time t, then q suspects p in r at time t. Thus, given a fixed message delay pattern P_D, at any random time t, the probability that q trusts p at time t in algorithm A* is at least as high as the probability that q trusts p at time t in algorithm A. So P_{A*} ≥ P_A given a fixed message delay pattern P_D. Summing (or integrating) both sides of the inequality over all message delay patterns according to their distribution, we have P_{A*} ≥ P_A.
The above corollary shows that the new algorithm A* has the best query accuracy probability in C′, the class of failure detector algorithms in which p sends messages at exactly the same times as in A*. We now generalize this result to the class C, where p still sends messages every η time units, but may do so at times different from those in A*.

A message sending pattern P_S is a sequence of time points {σ_1, σ_2, σ_3, . . .} at which p sends messages. The message sending pattern is determined by the algorithm. For any algorithm A ∈ C, its message sending pattern is of the form {s, s + η, s + 2η, . . .} for some s ∈ [0, ∞). Different runs of algorithm A may have different message sending patterns due to the possible nondeterminism of A. Let A*_s be the algorithm in which p sends heartbeat messages according to the sending pattern {s, s + η, s + 2η, . . .}, and q behaves in the same way as in A*. Thus A*_s is a shifted version of A*, and so the behavior of the failure detector output in A*_s is also a shifted version of that in A*. Since the behaviors of the two failure detectors differ only in some initial period, their steady-state behaviors are the same. Therefore the QoS metrics of A*_s and A* are the same; in particular, they have the same query accuracy probability.
Proof of Theorem 4.12 (Sketch). We first fix a message sending pattern P_S = {s, s + η, s + 2η, . . .}. For any algorithm A ∈ C, consider the runs of A with the sending pattern P_S. In these runs p sends messages at exactly the same times as in algorithm A*_s. By an argument similar to those of Lemma 4.13 and Corollary 4.14, we can show that the query accuracy probability of A*_s is at least as high as the query accuracy probability of A given the message sending pattern P_S. Since A*_s and A* have the same query accuracy probability, we have P_{A*} ≥ P_A given the message sending pattern P_S. Since P_S is arbitrary, we thus have P_{A*} ≥ P_A.
4.3.4 Configuring the Failure Detector to Satisfy QoS Requirements

Suppose we are given a set of failure detector QoS requirements (these QoS requirements could be given by an application). We now show how to compute the parameters η and δ of the failure detector algorithm so that these requirements are satisfied. We first assume that (a) the local clocks of processes are synchronized, and (b) one knows the probabilistic behavior of the messages, i.e., the message loss probability p_L and the distribution of message delays Pr(D ≤ x). In Section 4.4, we show how to remove these assumptions.

We assume that the QoS requirements are expressed using the primary metrics. More precisely, a set of QoS requirements is a tuple (T^U_D, T^L_MR, T^U_M), where T^U_D is an upper bound on the worst-case detection time, T^L_MR is a lower bound on the average mistake recurrence time, and T^U_M is an upper bound on the average mistake duration. In other words, the requirements are that:⁶

    T_D ≤ T^U_D,    E(T_MR) ≥ T^L_MR,    E(T_M) ≤ T^U_M.    (4.7)
In addition, we would like η to be as large as possible, to save network bandwidth. Using Theorem 4.11, this can be stated as a mathematical programming problem:

    maximize η
    subject to  δ + η ≤ T^U_D    (4.8)
                η/p_S ≥ T^L_MR    (4.9)
                (∫_0^η u(x) dx)/p_S ≤ T^U_M    (4.10)
                η ≤ ∆    (4.11)

where the values of u(x) and p_S are given by Proposition 4.2. Constraint (4.11) ensures that the heartbeat messages are independent, so that Theorem 4.11 can be applied. Computing the optimal solution of this problem, which means finding the largest η and some δ that satisfy constraints (4.8)–(4.11), seems to be hard. Instead, we give a simple procedure that computes η and δ such that they satisfy constraints (4.8)–(4.11), but η may not be the largest possible value. This is done by replacing constraint (4.10) with a simpler and stronger constraint to obtain a modified problem, and computing the optimal solution of this modified problem.

⁶Note that the bounds on the primary metrics E(T_MR) and E(T_M) also impose bounds on the derived metrics, according to Theorem 2.1 of Chapter 2. More precisely, we have λ_M ≤ 1/T^L_MR, P_A ≥ (T^L_MR − T^U_M)/T^L_MR, E(T_G) ≥ T^L_MR − T^U_M, and E(T_FG) ≥ (T^L_MR − T^U_M)/2.
The configuration procedure is as follows:

Step 1: Compute q_0 = (1 − p_L) Pr(D < T^U_D), and let η_max = q_0 T^U_M.

Step 2: Let

    f(η) = η / (q_0 ∏_{j=1}^{⌈T^U_D/η⌉−1} [p_L + (1 − p_L) Pr(D > T^U_D − jη)]).    (4.12)

Find the largest η ≤ η_max such that f(η) ≥ T^L_MR. It is easy to check that when η decreases, f(η) increases exponentially fast towards infinity, so some simple numerical method (such as binary search) can be used to calculate η.

Step 3: If η ≤ ∆, then set δ = T^U_D − η; otherwise, the procedure does not find appropriate η and δ.
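The three steps can be sketched as follows. This is an illustration, not the thesis's code: the exponential delay CDF and all numeric values are assumptions, and the check η ≤ ∆ of Step 3 is omitted.

```python
import math

def configure_nfd_s(t_d, t_mr, t_m, p_loss, delay_cdf, iters=40):
    # Inputs: QoS requirements (T^U_D, T^L_MR, T^U_M), loss probability p_L,
    # and the delay CDF Pr(D <= x).  Returns (eta, delta) or None.
    # Step 1: q_0 = (1 - p_L) Pr(D < T^U_D), eta_max = q_0 * T^U_M.
    q0 = (1 - p_loss) * delay_cdf(t_d)
    eta_max = q0 * t_m
    if eta_max <= 0:
        return None

    def f(eta):  # equation (4.12)
        prod = 1.0
        for j in range(1, math.ceil(t_d / eta)):
            prod *= p_loss + (1 - p_loss) * (1 - delay_cdf(t_d - j * eta))
        return eta / (q0 * prod)

    # Step 2: largest eta <= eta_max with f(eta) >= T^L_MR.  f grows very
    # fast as eta shrinks, so binary search suffices (as the thesis suggests;
    # we ignore the small non-monotonicity introduced by the ceiling).
    if f(eta_max) >= t_mr:
        eta = eta_max
    else:
        lo, hi = eta_max / 1e6, eta_max
        for _ in range(iters):
            mid = (lo + hi) / 2
            if f(mid) >= t_mr:
                lo = mid
            else:
                hi = mid
        eta = lo
    # Step 3: delta = T^U_D - eta (the check eta <= Delta is omitted here).
    return eta, t_d - eta

# Example (made-up numbers): exponential delays with mean 0.02 time units,
# 1% loss; require T_D <= 1, E(T_MR) >= 3600, E(T_M) <= 0.5.
cdf = lambda x: 0.0 if x <= 0 else 1 - math.exp(-x / 0.02)
print(configure_nfd_s(t_d=1.0, t_mr=3600.0, t_m=0.5, p_loss=0.01, delay_cdf=cdf))
```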
We now show that the parameters computed by the procedure are appropriate.

Proposition 4.15 If p_0 > 0 and q_0 > 0 (the nondegenerated case), then E(T_M) ≤ η/q_0.

Proof. By Proposition 4.4, we have u(0) ≥ u(x) for all x ∈ [0, η). Thus by equality (4.3), we have

    E(T_M) = (∫_0^η u(x) dx)/p_S ≤ (∫_0^η u(0) dx)/(q_0 u(0)) = η/q_0.
Theorem 4.16 Consider a system in which clocks are synchronized and the probabilistic behavior of messages is known. With the parameters η and δ computed by the above procedure, the failure detector algorithm NFD-S of Fig. 4.2 satisfies the QoS requirements given in (4.7).

Proof. Suppose that the procedure finds parameters η and δ. Then by Step 3 we have T^U_D = η + δ. By part (1) of Theorem 4.11, T_D ≤ T^U_D is satisfied. By Step 1 and Proposition 4.2, the value computed in Step 1 is exactly q_0 = (1 − p_L) Pr(D < η + δ) (note that q_0 > 0: otherwise η_max = 0 and the procedure could not have found η). Consider first the case p_0 > 0. By Proposition 4.15, E(T_M) ≤ η/q_0 ≤ η_max/q_0 = q_0 T^U_M/q_0 = T^U_M. So E(T_M) ≤ T^U_M is satisfied. Note that

    ∏_{j=1}^{⌈T^U_D/η⌉−1} [p_L + (1 − p_L) Pr(D > T^U_D − jη)]
        = ∏_{j=1}^{⌈(η+δ)/η⌉−1} [p_L + (1 − p_L) Pr(D > η + δ − jη)]
        = ∏_{j=0}^{⌈δ/η⌉−1} [p_L + (1 − p_L) Pr(D > δ − jη)]
        = ∏_{j=0}^{⌈δ/η⌉} [p_L + (1 − p_L) Pr(D > δ − jη)] = u(0),

where the last step holds because the factor for j = ⌈δ/η⌉ equals 1. Thus f(η) = η/(q_0 u(0)) = η/p_S = E(T_MR), by equality (4.2). By Step 2, f(η) ≥ T^L_MR, and so E(T_MR) ≥ T^L_MR is satisfied.

Consider now the case p_0 = 0. By Theorem 4.11, in failure-free runs the failure detector keeps trusting p after time τ_1, and so E(T_MR) = ∞ and E(T_M) = 0. Thus the requirements in (4.7) are also satisfied.
4.4 Dealing with Unknown System Behavior and Unsynchronized Clocks

So far, we assumed that (a) the local clocks of processes are synchronized, and (b) the probabilistic behavior of the messages (i.e., the probability of message loss and the distribution of message delays) is known. These assumptions are not unrealistic, but in some systems assumption (a) or (b) may not hold. To widen the applicability of our algorithm, we now show how to remove these assumptions.

4.4.1 Configuring the Failure Detector NFD-S When the Probabilistic Behavior of the Messages is Not Known

In Section 4.3.4, our procedure for computing η and δ to meet some QoS requirements assumed that one knows the probabilistic behavior of the messages (i.e., the probability p_L of message loss and the probability distribution Pr(D ≤ x) of the message delay). If this probabilistic behavior is not known, we can still compute η and δ as follows. We first assume that the message loss probability p_L, the expected value E(D) and the variance V(D) of the message delay D are known, and show how to compute η and δ from p_L, E(D) and V(D) alone. We then show how to estimate p_L, E(D) and V(D) using the heartbeat messages. Note that in this section we still assume that local clocks are synchronized.
With E(D) and V(D), we have an upper bound on Pr(D > x), given by the following One-Sided Inequality of probability theory (e.g., see [All90] p. 79): for any random variable D with a finite expected value and a finite variance,

    Pr(D > x) ≤ V(D)/(V(D) + (x − E(D))²),  for all x > E(D).    (4.13)
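As a quick sanity check of (4.13), one can compare the bound with an exact tail. The snippet below is illustrative: the exponential distribution and its mean are arbitrary choices, with E(D) = 0.1 and V(D) = 0.01.

```python
import math

mean, var = 0.1, 0.01          # E(D) and V(D) for D ~ Exp with mean 0.1
for x in (0.15, 0.2, 0.5, 1.0):
    actual = math.exp(-x / mean)               # exact Pr(D > x)
    bound = var / (var + (x - mean) ** 2)      # right-hand side of (4.13)
    assert actual <= bound
    print(f"x={x}: Pr(D>x)={actual:.5f} <= {bound:.5f}")
```

The bound is loose near E(D) and tightens as x grows, which is all the configuration procedure below needs.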
With the One-Sided Inequality, we derive the following bounds on the QoS metrics of algorithm NFD-S.

Theorem 4.17 Assume δ > E(D). For algorithm NFD-S, in the nondegenerated case of Theorem 4.11, we have

    E(T_MR) ≥ η/β,    (4.14)
    E(T_M) ≤ η/γ,    (4.15)
    P_A ≥ 1 − β,    (4.16)
    E(T_G) ≥ ((1 − β)/β) η,    (4.17)
    E(T_FG) ≥ ((1 − β)/(2β)) η,    (4.18)

where

    β = ∏_{j=0}^{k_0} [V(D) + p_L (δ − E(D) − jη)²] / [V(D) + (δ − E(D) − jη)²],    k_0 = ⌈(δ − E(D))/η⌉ − 1,

and

    γ = (1 − p_L)(δ − E(D) + η)² / [V(D) + (δ − E(D) + η)²].
Proof. Note that for all j such that 0 ≤ j ≤ k_0, we have δ − jη > E(D). Then by the One-Sided Inequality (4.13), for all j such that 0 ≤ j ≤ k_0,

    p_j(0) = p_L + (1 − p_L) Pr(D > δ − jη)
           ≤ p_L + (1 − p_L) V(D)/(V(D) + (δ − E(D) − jη)²)
           = [V(D) + p_L (δ − E(D) − jη)²]/[V(D) + (δ − E(D) − jη)²].

Thus ∏_{j=0}^{k_0} p_j(0) ≤ β.

By Proposition 4.2 (4) and (5) and Proposition 4.4, and the fact that k_0 ≤ k − 1, we have u(x) ≤ u(0) ≤ ∏_{j=0}^{k_0} p_j(0) ≤ β, and p_S ≤ u(0) ≤ β. Therefore, from equality (4.2), E(T_MR) = η/p_S ≥ η/β. Similarly, applying u(x) ≤ β and p_S ≤ β to equalities (4.4), (4.5) and (4.6), we obtain inequalities (4.16), (4.17) and (4.18), respectively.

To show inequality (4.15), first note that we can replace Pr(D > x) in the One-Sided Inequality (4.13) with Pr(D ≥ x), and the inequality remains true. In fact, for all ε ∈ (0, x − E(D)),

    Pr(D ≥ x) ≤ Pr(D > x − ε) ≤ V(D)/(V(D) + (x − ε − E(D))²).

Letting ε → 0, we obtain the result. Then from Proposition 4.2 (3) we have

    q_0 = (1 − p_L)(1 − Pr(D ≥ δ + η)) ≥ (1 − p_L)(1 − V(D)/(V(D) + (δ − E(D) + η)²)) = γ.

Therefore, by Proposition 4.15, E(T_M) ≤ η/q_0 ≤ η/γ.
Note that in Theorem 4.17 we assume that δ > E(D). This assumption is reasonable: if the parameter δ of NFD-S were set below E(D), there would be a false suspicion every time a heartbeat message takes more than the average message delay, and so a failure detector with such a δ would make very frequent mistakes and would not be useful.
Computing Failure Detector Parameters η and δ

The bounds given in Theorem 4.17 can be used to compute the parameters η and δ of the failure detector NFD-S, so that it satisfies the QoS requirements given in (4.7). The configuration procedure is given below. This procedure assumes that T_D^U > E(D), i.e., the required detection time is greater than the average message delay, which is a reasonable assumption.
Step 1: Compute γ′ = (1 − p_L)(T_D^U − E(D))² / [V(D) + (T_D^U − E(D))²] and let η_max = min(γ′·T_M^U, T_D^U − E(D)).

Step 2: Let

f(η) = η · ∏_{j=1}^{⌈(T_D^U − E(D))/η⌉ − 1} [V(D) + (T_D^U − E(D) − jη)²] / [V(D) + p_L(T_D^U − E(D) − jη)²].  (4.19)

Find the largest η ≤ η_max such that f(η) ≥ T_MR^L.

Step 3: If such an η exists, then set δ = T_D^U − η; otherwise, the procedure does not find appropriate η and δ.
Theorem 4.18 Consider a system in which clocks are synchronized, and the probabilistic behavior of messages is not known. With parameters η and δ computed by the above procedure, the failure detector algorithm NFD-S of Fig. 4.2 satisfies the QoS requirements given in (4.7), provided that T_D^U > E(D).

The proof of the theorem is straightforward.
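The three-step configuration procedure above is mechanical enough to sketch in code. The following Python sketch is not from the thesis: it approximates "find the largest η" by scanning downward from η_max with a fixed step (f is not monotone in general, so this is a heuristic), and all parameter names are illustrative.

```python
import math

def configure_nfd_s(p_L, ED, VD, T_D_U, T_MR_L, T_M_U, step=1e-4):
    """Sketch of the NFD-S configuration procedure (Steps 1-3).
    p_L: message loss probability, ED = E(D), VD = V(D).
    Returns (eta, delta), or None if no suitable eta is found."""
    assert T_D_U > ED, "procedure assumes T_D^U > E(D)"
    d = T_D_U - ED
    # Step 1: gamma' and the cap eta_max on the intersending time.
    gamma_p = (1 - p_L) * d**2 / (VD + d**2)
    eta_max = min(gamma_p * T_M_U, d)

    def f(eta):
        # f(eta) = eta * prod_{j=1}^{ceil(d/eta)-1}
        #          (V(D)+(d-j*eta)^2) / (V(D)+p_L*(d-j*eta)^2)
        prod = 1.0
        for j in range(1, math.ceil(d / eta)):
            x = d - j * eta
            prod *= (VD + x**2) / (VD + p_L * x**2)
        return eta * prod

    # Step 2: largest eta <= eta_max with f(eta) >= T_MR_L (downward scan).
    eta = eta_max
    while eta > 0:
        if f(eta) >= T_MR_L:
            return eta, T_D_U - eta   # Step 3: delta = T_D^U - eta
        eta -= step
    return None
```

With the numbers used later in the simulations (p_L = 0.01, E(D) = 0.02, V(D) = 4 × 10⁻⁴) and T_D^U = 2, the procedure returns an η slightly below 1, which is consistent with the thesis's choice of η = 1 heartbeats.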
Estimating p_L, E(D) and V(D)

The configuration procedure given above assumes that p_L, E(D) and V(D) are known. In practice, we can use heartbeat messages to compute close estimates of these quantities, as we now explain.
Estimating p_L is easy. For example, one can use the sequence numbers of the heartbeat messages to count the number of "missing" heartbeats, and then divide this count by the highest sequence number received so far.

Since local clocks are synchronized, E(D) and V(D) can also be easily estimated using the heartbeat messages of the algorithm. Suppose that when p sends a heartbeat m, p timestamps m with the sending time S, and when q receives m, q records the receipt time A. Thus the delay of message m is A − S. Therefore, by taking the average and the variance of A − S over the heartbeat messages, we obtain estimates of E(D) and V(D).
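These estimators can be sketched as follows. The record format, a (sequence number, send timestamp, receipt timestamp) triple per received heartbeat, is a hypothetical choice made for illustration.

```python
def estimate_link_params(records):
    """Estimate p_L, E(D) and V(D) from received heartbeats.
    records: list of (seq, S, A) triples -- sequence number, sending
    time S, receipt time A -- one per heartbeat that actually arrived."""
    seqs = {s for s, _, _ in records}
    highest = max(seqs)
    # p_L: count of missing sequence numbers over the highest seen so far.
    p_L = (highest - len(seqs)) / highest
    delays = [a - s for _, s, a in records]   # A - S per message
    n = len(delays)
    ED = sum(delays) / n                          # sample mean of A - S
    VD = sum((d - ED) ** 2 for d in delays) / n   # sample variance of A - S
    return p_L, ED, VD
```

Note that with synchronized clocks A − S is the actual delay, so the sample mean and variance estimate E(D) and V(D) directly.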
4.4.2 Dealing with Unsynchronized Clocks

The algorithm NFD-S in Fig. 4.2 assumes that the local clocks are synchronized, so that q can set the freshness points τ_i by shifting the sending times of the heartbeats by a constant. If the local clocks are not synchronized, q cannot set the τ_i's in this way. To circumvent this problem, we modify the algorithm so that q obtains the τ_i's by shifting the expected arrival times of the heartbeats, as we now explain.

We assume that local clocks do not drift with respect to real time, i.e., they accurately measure time intervals. Let σ_i denote the sending time of m_i with respect to q's local clock time. Then, the expected arrival time EA_i of m_i at q is EA_i = σ_i + E(D), where E(D) is the expected message delay. We will show shortly how q can accurately estimate the EA_i's by using past heartbeat messages.

Suppose for the moment that q knows the EA_i's. Then q can set the τ_i's by shifting the EA_i's forward in time by α time units, i.e., τ_i = EA_i + α (where α is a new failure detector parameter replacing δ). We denote the algorithm with this modification as
Process p: {using p's local clock time}
1  for some constant η, send to q heartbeat messages m_1, m_2, m_3, ... at regular time points η, 2η, 3η, ... respectively;

Process q: {using q's local clock time}
2  Initialization:
3    for all i ≥ 1, set τ_i = EA_i + α;  {EA_i is the expected arrival time of m_i}
4    output = S;  {suspect p initially}
5  at every freshness point τ_i:
6    if no message m_j with j ≥ i has been received then
7      output ← S;  {suspect p if no fresh message is received}
8  upon receive message m_j at time t ∈ [τ_i, τ_{i+1}):
9    if j ≥ i then output ← T;  {trust p when some fresh message is received}

Figure 4.3: The new failure detector algorithm NFD-U, with unsynchronized clocks and known expected arrival times, and with parameters η and α
NFD-U, and it is given in Fig. 4.3. Intuitively, EA_i is the time when m_i is expected to be received, and α is a slack added to EA_i to accommodate the possible extra delay of message m_i. Thus an appropriately set α gives a high probability that q receives m_i before the freshness point τ_i, so that there is no failure detector mistake in the period [τ_i, τ_{i+1}) (see Fig. 4.1 (a)). If α is large enough, it also allows subsequent messages m_{i+1}, m_{i+2}, ... to be received before time τ_i, so that there is no failure detector mistake in [τ_i, τ_{i+1}) even if m_i is lost. Of course α cannot be too large because it adds to the detection time.

Note that in algorithm NFD-U, τ_i = σ_i + E(D) + α. Therefore, it is easy to see that if we let δ = E(D) + α, and consider all times referred to in the analysis of the algorithm NFD-S to be with respect to q's local clock time, then the analysis of NFD-S also applies to the algorithm NFD-U. In particular, the only changes of the
results are: (a) In Proposition 4.2, we now have

k = ⌈(E(D) + α)/η⌉,  (4.20)
p_j(x) = p_L + (1 − p_L)·Pr(D > E(D) + α + x − jη),  (4.21)
q_0 = (1 − p_L)·Pr(D < E(D) + α + η).  (4.22)

(b) In part (1) of Theorem 4.11, the inequality is now

T_D ≤ E(D) + α + η.  (4.23)

Process p: {using p's local clock time}
1  for some constant η, send to q heartbeat messages m_1, m_2, m_3, ... at regular time points η, 2η, 3η, ... respectively;

Process q: {using q's local clock time}
2  Initialization:
3    τ_0 = 0;
4    ℓ = −1;  {ℓ keeps the largest sequence number in all messages q received so far}
5  upon τ_{ℓ+1} = the current time:  {if the current time reaches τ_{ℓ+1}, then all messages received are stale}
6    output ← S;  {suspect p since all messages received are stale at this time}
7  upon receive message m_j at time t:
8    if j > ℓ then  {received a message with a higher sequence number}
9      ℓ ← j;
10     compute ÊA_{ℓ+1};  {estimate the expected arrival time of m_{ℓ+1} using formula (4.24)}
11     τ_{ℓ+1} ← ÊA_{ℓ+1} + α;
12     if t < τ_{ℓ+1} then output ← T;  {trust p since m_ℓ is still fresh at time t}

Figure 4.4: The new failure detector algorithm NFD-E, with unsynchronized clocks and estimated expected arrival times, and with parameters η and α
Estimating the Expected Arrival Times

The expected arrival times can be estimated using heartbeat messages. The idea is to use the n most recently received heartbeat messages to estimate the expected arrival time of the next heartbeat message. To do so, we first modify the structure of the failure detector algorithm NFD-U in Fig. 4.3 to show exactly when q needs to estimate the expected arrival time of the next heartbeat.
The new version of the algorithm with estimated expected arrival times is given in Fig. 4.4 and is denoted by NFD-E. In NFD-E, process q uses a variable ℓ to keep the largest heartbeat sequence number received so far, and τ_{ℓ+1} refers to the "next" freshness point. Note that when q updates ℓ, it also changes τ_{ℓ+1}. If the local clock of q ever reaches time τ_{ℓ+1} (an event which might never happen), then at this time all the heartbeats received are stale, and so q starts suspecting p (lines 5–6). When q receives m_j, it checks whether this is a new heartbeat (j > ℓ) and in this case, (1) q updates ℓ, (2) q computes the estimate ÊA_{ℓ+1} of the expected arrival time of m_{ℓ+1} (the next heartbeat), (3) q sets the next freshness point τ_{ℓ+1} to ÊA_{ℓ+1} + α, and (4) q trusts p if the current time is less than τ_{ℓ+1} (lines 9–12).

Note that the algorithm NFD-E satisfies the same core property that q trusts p at time t if and only if some message that q received is still fresh at time t. Therefore, except for the fact that it needs to estimate the expected arrival times, algorithm NFD-E is equivalent to algorithm NFD-U.
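As an illustration, lines 2–12 of Fig. 4.4 can be rendered as an event-driven class. This is a hypothetical sketch, not the thesis's code: the freshness check of lines 5–6 is evaluated lazily inside output() instead of by a timer, and the estimation window of formula (4.24) is folded into on_receive().

```python
class NFDE:
    """Sketch of NFD-E (Fig. 4.4).  All times are on q's local clock."""
    def __init__(self, eta, alpha, n=32):
        self.eta, self.alpha, self.n = eta, alpha, n
        self.ell = -1            # largest sequence number received (line 4)
        self.tau_next = 0.0      # tau_{ell+1}, initially tau_0 = 0 (line 3)
        self.window = []         # n most recent (seq, arrival) pairs
        self.trusted = False

    def on_receive(self, j, t):  # lines 7-12
        self.window = (self.window + [(j, t)])[-self.n:]
        if j > self.ell:                      # line 8: fresher heartbeat
            self.ell = j                      # line 9
            m = len(self.window)
            # line 10: estimator (4.24) over the current window
            ea = (sum(a for _, a in self.window) / m
                  + sum((self.ell + 1 - s) * self.eta
                        for s, _ in self.window) / m)
            self.tau_next = ea + self.alpha   # line 11
            if t < self.tau_next:             # line 12: m_ell still fresh
                self.trusted = True

    def output(self, now):       # lines 5-6, evaluated on demand
        if now >= self.tau_next:              # all received messages stale
            self.trusted = False
        return "T" if self.trusted else "S"
```

For example, with η = 1 and α = 0.5, receiving m_1 at time 1.02 sets τ_2 = 1.02 + 1 + 0.5 = 2.52; q trusts p until 2.52 and suspects it afterwards unless a fresher heartbeat arrives.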
We now show how to estimate the expected arrival time of m_{ℓ+1} from the most recent n heartbeat messages that q received. Let m′_1, m′_2, ..., m′_n be these n messages. Note that m′_i is not m_i in general, and so the sequence number of m′_i is not necessarily i. Moreover, the sequence numbers of m′_1, m′_2, ..., m′_n may not be consecutive or monotonically increasing, because the heartbeat messages may be lost or received out of order. Let s_1, s_2, ..., s_n be the sequence numbers of m′_1, m′_2, ..., m′_n respectively. Let A_i be the actual arrival time of m′_i with respect to q's local clock time.
Let the expected arrival time of m′_i at q be EA′_i. Let ε_i = A_i − EA′_i. Thus ε_i is the deviation of the actual arrival time of m′_i from its expected arrival time at q. Let D_i be the actual delay of message m′_i. Then we have ε_i = A_i − EA′_i = D_i − E(D).

For the expected arrival time EA_{ℓ+1} of m_{ℓ+1}, we have that for every i = 1, 2, ..., n, EA_{ℓ+1} = EA′_i + (ℓ + 1 − s_i)η = A_i − ε_i + (ℓ + 1 − s_i)η. By summing over all i on both sides of the equality and then dividing both sides by n, we have

EA_{ℓ+1} = (1/n)·Σ_{i=1}^{n} A_i − (1/n)·Σ_{i=1}^{n} ε_i + (1/n)·Σ_{i=1}^{n} (ℓ + 1 − s_i)η.
By the choice of η, the D_i's are independent and identically distributed as D. Thus (1/n)·Σ_{i=1}^{n} ε_i = (1/n)·Σ_{i=1}^{n} D_i − E(D), which is close to zero when n is large. Therefore, we obtain the following estimator of EA_{ℓ+1}:

ÊA_{ℓ+1} = (1/n)·Σ_{i=1}^{n} A_i + (1/n)·Σ_{i=1}^{n} (ℓ + 1 − s_i)η.  (4.24)

This is the formula used in line 10 of the algorithm NFD-E in Fig. 4.4 to compute the estimate of the expected arrival time of m_{ℓ+1}.
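Estimator (4.24) is a short computation over the sliding window of the n most recent heartbeats; a sketch follows, where representing each window entry as an (s_i, A_i) pair is an assumption made for illustration.

```python
def estimate_next_arrival(window, ell, eta):
    """Estimator (4.24).  window: list of (s_i, A_i) pairs for the n most
    recently received heartbeats (sequence number, arrival time on q's
    clock); ell: largest sequence number received so far; eta: heartbeat
    intersending time."""
    n = len(window)
    avg_arrival = sum(a for _, a in window) / n
    # average forward shift (ell + 1 - s_i) * eta from each sample
    avg_shift = sum((ell + 1 - s) * eta for s, _ in window) / n
    return avg_arrival + avg_shift
```

For instance, with η = 1 and heartbeats 1..5 each arriving 0.02 after its sending time, the estimate for m_6 is 6.02, i.e., the next sending time plus the (unknown to q) average delay.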
How large should the value of n be to obtain a reasonably good estimate? Note that ÊA_{ℓ+1} − EA_{ℓ+1} = (1/n)·Σ_{i=1}^{n} ε_i = (1/n)·Σ_{i=1}^{n} D_i − E(D), where the D_i's are independent and identical to D. Thus the quality of the estimator ÊA_{ℓ+1} is the same as the quality of the estimator (1/n)·Σ_{i=1}^{n} D_i for estimating E(D). By the Sampling Theorem in statistics (see, e.g., [All90] p. 432), we know that (1/n)·Σ_{i=1}^{n} D_i is an unbiased estimator of E(D), and when n is large, (1/n)·Σ_{i=1}^{n} D_i has approximately the normal distribution with mean E(D) and standard deviation σ(D)/√n. When it is close to a normal distribution, about 95% of the estimated values are within 2σ(D)/√n of the true value. The actual n that makes the estimator close to a normal distribution depends on the distribution of D. A widely used rule of thumb is that n be at least 30 ([All90] p. 434).
For example, we simulate algorithm NFD-E with 32 messages for the estimation and D having an exponential distribution (Section 4.5.2). The simulation results show that NFD-E provides essentially the same QoS as NFD-S (the new algorithm with synchronized clocks), so the estimation does not compromise the QoS of the new failure detector.

With n as a parameter varying from 1 towards ∞, NFD-E is actually a spectrum of algorithms. The larger the value of n is, the better the estimates are. The algorithm NFD-U corresponds to one end point of the spectrum, when n = ∞.
The other end point of this spectrum, namely the algorithm with n = 1, is worth some further discussion. When n = 1, formula (4.24) becomes ÊA_{ℓ+1} = A_1 + (ℓ + 1 − s_1)η. According to the algorithm in Fig. 4.4, when ÊA_{ℓ+1} is computed at line 10, the most recent message q received is m_ℓ. Thus s_1 = ℓ, ÊA_{ℓ+1} = A_1 + η and τ_{ℓ+1} = A_1 + η + α. This means that whenever a new heartbeat message with a higher sequence number is received, q sets a new freshness point τ_{ℓ+1}, which is a fixed η + α time units away from the current time, such that if no new heartbeat message is received by time τ_{ℓ+1}, then q starts suspecting p. This is just the simple algorithm!
Therefore, when n varies from 1 towards ∞, the algorithm NFD-E spans a spectrum that includes the simple algorithm at one end (n = 1), and the new algorithm NFD-U, in which q knows the expected arrival times of the heartbeat messages, at the other end (n = ∞). When the number of heartbeat messages used in the estimation increases, the new algorithm moves away from the simple algorithm and gets closer to the algorithm NFD-U. This demonstrates that the problem of the simple algorithm is that it does not use enough of the available information (it only uses the most recently received heartbeat message); by using more of the available information (more received messages), the new algorithm is able to provide a better QoS than the simple algorithm.
4.4.3 Configuring the Failure Detector When Local Clocks are Not Synchronized and the Probabilistic Behavior of the Messages is Not Known

We now consider systems in which local clocks are not synchronized and the probabilistic behavior of the messages is not known. Since local clocks are not synchronized, we cannot use algorithm NFD-S. In this section, we show how to compute the parameters η and α of algorithm NFD-U to meet the QoS requirements in such systems. For algorithm NFD-E, when the number n of messages used to estimate the expected arrival times is large, the estimates are very accurate, and thus for practical purposes the computation of η and α for NFD-U also applies to NFD-E. The method used here is based on the one given in Section 4.4.1.
We first need to point out that in such systems, one cannot estimate E(D) using only one-way heartbeat messages. This is because in such settings one cannot distinguish a system with small message delays but a large clock skew from another system with large message delays but a small clock skew, as we now further explain. Suppose that in a system S with message delay time D, one obtains an estimate μ of E(D) using only one-way messages from p to q. Suppose that in system S, q's local clock is s time units ahead of p's local clock. Now consider another system S′ with message delay time D′ = D + c, where c is a constant. That is, each message in S′ is delayed by c time units longer than in S. Suppose that in S′, q's local clock is s − c time units ahead of p's local clock. Thus, in both systems S and S′, a message sent by p at the same p's local clock time is received at the same q's local clock time. Therefore, with unknown clock skews and only one-way messages from p to q, one cannot distinguish the two systems S and S′, and so in S′ the estimate of E(D′) obtained is also μ. But μ cannot be an estimate of both E(D) and E(D′), since E(D′) = E(D) + c for an arbitrary constant c. This shows that E(D) cannot be estimated when local clocks are not synchronized and only one-way messages from p to q are used.
Fortunately, we do not need to estimate E(D). Since the analysis of NFD-S applies to NFD-U if δ is replaced with α + E(D), with this replacement we obtain the following theorem from Theorem 4.17. From this theorem, it is clear that we only need p_L and V(D) to bound the QoS metrics of NFD-U.⁷

Theorem 4.19 Assume α > 0. For algorithm NFD-U, in the nondegenerate cases of Theorem 4.11, we have

E(T_MR) ≥ η/β,  (4.25)
E(T_M) ≤ η/γ,  (4.26)
P_A ≥ 1 − β,  (4.27)
E(T_G) ≥ ((1 − β)/β)·η,  (4.28)
E(T_FG) ≥ ((1 − β)/(2β))·η,  (4.29)

where

β = ∏_{j=0}^{k_0} [V(D) + p_L(α − jη)²] / [V(D) + (α − jη)²],   k_0 = ⌈α/η⌉ − 1,

and

γ = (1 − p_L)(α + η)² / [V(D) + (α + η)²].

⁷ This is one reason why it is convenient to set the freshness points with respect to the expected arrival times as opposed to some other reference points (e.g. the median arrival times).
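The bounds of Theorem 4.19 are straightforward to evaluate numerically; the following sketch does so (the dictionary keys are labels chosen here for illustration, not notation from the thesis).

```python
import math

def nfd_u_bounds(p_L, VD, alpha, eta):
    """QoS bounds of Theorem 4.19 for NFD-U (assumes alpha > 0).
    Only p_L and V(D) are needed; E(D) does not appear."""
    assert alpha > 0
    k0 = math.ceil(alpha / eta) - 1
    beta = 1.0
    for j in range(k0 + 1):
        x = alpha - j * eta
        beta *= (VD + p_L * x**2) / (VD + x**2)
    gamma = (1 - p_L) * (alpha + eta)**2 / (VD + (alpha + eta)**2)
    return {
        "E[T_MR] >=": eta / beta,                    # (4.25)
        "E[T_M] <=": eta / gamma,                    # (4.26)
        "P_A >=": 1 - beta,                          # (4.27)
        "E[T_G] >=": (1 - beta) / beta * eta,        # (4.28)
        "E[T_FG] >=": (1 - beta) / (2 * beta) * eta, # (4.29)
    }
```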
We first assume that p_L and V(D) are known, and show how to compute the parameters η and α of NFD-U using Theorem 4.17 to satisfy the QoS requirements. We then show how to estimate p_L and V(D).
We consider a set of QoS requirements of the form:

T_D ≤ T_D^U + E(D),   E(T_MR) ≥ T_MR^L,   E(T_M) ≤ T_M^U.  (4.30)

These requirements are identical to the ones in (4.7), except that the upper bound requirement on the detection time is not just T_D^U, but rather T_D^U plus the unknown average message delay E(D). This is justified as follows. First, it is not surprising that the detection time includes E(D): it is not reasonable to require a failure detector to detect a crash faster than the average delay of a heartbeat. Second, when local clocks are not synchronized and only one-way messages are used, an absolute bound T_D ≤ T_D^U cannot be enforced by any failure detector. The reason is the same as the reason why E(D) cannot be estimated in such settings: one cannot distinguish a system with small message delays but a large clock skew from another system with large message delays but a small clock skew.
The following is the configuration procedure for algorithm NFD-U, modified from the one in Section 4.4.1.

Step 1: Compute γ′ = (1 − p_L)(T_D^U)² / [V(D) + (T_D^U)²] and let η_max = min(γ′·T_M^U, T_D^U).

Step 2: Let

f(η) = η · ∏_{j=1}^{⌈T_D^U/η⌉ − 1} [V(D) + (T_D^U − jη)²] / [V(D) + p_L(T_D^U − jη)²].  (4.31)

Find the largest η ≤ η_max such that f(η) ≥ T_MR^L.

Step 3: If such an η exists, then set α = T_D^U − η; otherwise, the procedure does not find appropriate η and α.
Theorem 4.20 Consider a system with unsynchronized, drift-free clocks, where the probabilistic behavior of messages is not known. With parameters η and α computed by the above procedure, the failure detector algorithm NFD-U of Fig. 4.3 satisfies the QoS requirements given in (4.30).
Estimating p_L and V(D)

When local clocks are not synchronized, the message loss probability p_L and the variance V(D) of the message delay can still be estimated using the heartbeat messages, in exactly the same way as in Section 4.4.1. For p_L, this is because we use only the sequence numbers of the heartbeat messages to estimate p_L, and so the estimate is not affected by whether the clocks are synchronized or not. For V(D), we still use the variance of A − S of the heartbeat messages as the estimate of V(D), where A is the time (with respect to q's local clock) when q receives a message m, and S is the time (with respect to p's local clock) when p sends m. This estimation method still works because here A − S is the actual delay of m plus a constant, namely the skew between the clocks of p and q. Thus the variance of A − S is the same as the variance V(D) of the message delay.
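This shift-invariance argument can be checked numerically; the following sketch uses hypothetical numbers (a constant skew of 37 time units, exponential delays with E(D) = 0.02) purely for illustration.

```python
import random
import statistics

# With a constant clock skew, A - S is the delay plus a constant,
# so Var(A - S) equals Var(D) exactly; only the mean is shifted.
rng = random.Random(0)
skew = 37.0                                            # unknown constant offset
delays = [rng.expovariate(1 / 0.02) for _ in range(100_000)]  # E(D) = 0.02
a_minus_s = [d + skew for d in delays]                 # observed A - S values
print(statistics.pvariance(a_minus_s))                 # close to V(D) = 4e-4
```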
4.5 Simulation Results
We simulate both the new failure detector algorithm that we developed and the simple algorithm commonly used in practice. In particular, (a) we simulate the algorithm NFD-S (the one that sets the freshness points using the sending times of the heartbeat messages and synchronized clocks), and show that the simulation results are consistent with our QoS analysis of NFD-S in Section 4.3.2; (b) we simulate the algorithm NFD-E (the one that sets freshness points with respect to the expected arrival times of the heartbeat messages), show how the QoS of the algorithm changes as the number n of messages used for estimating the expected arrival times increases, and show that, with appropriately chosen n, NFD-E provides essentially the same QoS as NFD-S; and (c) we simulate the simple algorithm and compare it to the different versions of the new algorithms, and show that when all algorithms send messages at the same rate and satisfy the same upper bound on the worst-case detection time, the new algorithms provide much better accuracy than the simple algorithm.
The settings of the simulations are as follows. For the purpose of comparison, we fix the intersending time η of heartbeat messages in both the new algorithm and the simple algorithm to be 1. The message loss probability p_L is set to 0.01. The message delay time D follows the exponential distribution (i.e., Pr(D ≤ x) = 1 − e^{−x/E(D)} for all x ≥ 0). We choose the exponential distribution for two reasons: first, it has the characteristic that a large portion of messages have fairly short delays while a small portion of messages have large delays, which is also the characteristic of message delays in many practical systems [Cri89]; second, it has a simple analytical representation, which allows us to compare the simulation results with the analytical results given in Theorem 4.11. The average message delay time E(D) is set to 0.02, which is a small value compared to the intersending time η. This corresponds to the practical situation in which message delays are on the order of tens of milliseconds (typical for messages transmitted over the Internet), while heartbeat messages are sent every few seconds. Note that since D follows an exponential distribution, we have that the standard deviation σ(D) = E(D) = 0.02, and the variance V(D) = σ(D)² = 4 × 10⁻⁴.
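Under these settings, the behavior of NFD-S can be reproduced with a small Monte-Carlo sketch. This is a simplified re-implementation written for illustration, not the thesis's simulator; it counts S-transitions of the output over a long crash-free run to estimate the average mistake recurrence time. With δ = 1, for instance, mistakes are dominated by message losses and the result comes out near 1/p_L = 100.

```python
import random

def avg_mistake_recurrence(delta, eta=1.0, p_L=0.01, ED=0.02,
                           n=200_000, seed=1):
    """Monte-Carlo sketch of NFD-S: heartbeat m_i is sent at i*eta, lost
    with probability p_L, otherwise delayed by an Exp(E(D)) amount.
    q suspects p at freshness point tau_i = i*eta + delta iff no m_j with
    j >= i has arrived by tau_i.  Returns run length / number of mistakes."""
    rng = random.Random(seed)
    INF = float("inf")
    # arr[i] = arrival time of m_i at q (INF if the message is lost)
    arr = [INF] * (n + 2)
    for i in range(1, n + 1):
        if rng.random() >= p_L:
            arr[i] = i * eta + rng.expovariate(1.0 / ED)
    # e[i] = earliest arrival among m_i, m_{i+1}, ...
    e = [INF] * (n + 2)
    for i in range(n, 0, -1):
        e[i] = min(arr[i], e[i + 1])
    mistakes = 0
    for i in range(2, n + 1):
        tau_i = i * eta + delta
        # S-transition at tau_i: nothing fresh at tau_i, but q trusted p just
        # before, i.e., some m_j with j >= i-1 arrived before tau_i.
        if e[i] >= tau_i and e[i - 1] < tau_i:
            mistakes += 1
    return n * eta / mistakes if mistakes else INF
```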
We compare the accuracy of different algorithms when they all satisfy the same bound T_D^U on the worst-case detection time. To do so, we run simulations for each algorithm as follows: (a) We first configure the algorithm using the given bound T_D^U. (b) We then run simulations to verify that the configuration is indeed correct, i.e., that the given bound T_D^U is satisfied. More specifically, we simulate the algorithm in 10,000 runs in which process p crashes at some nondeterministic times, obtain the maximum detection time observed among all these runs, and check that this observed maximum detection time is close to but does not exceed the given bound T_D^U. (c) Finally, we obtain the average mistake recurrence time by simulating the algorithm in runs in which p does not crash, and then taking the average of the lengths of 500 mistake recurrence intervals. We found that the average mistake recurrence time is representative for the purpose of comparing the accuracy of the algorithms we simulate, and thus we omit the simulation results on other accuracy metrics here. We vary the bound T_D^U from 1 to 3.5, i.e., from exactly one intersending time of heartbeat messages to three and a half times the intersending time, and show how the simulation results vary accordingly.
[Plot: x-axis: required bound T_D^U on the worst-case detection time (1 to 3.5); y-axis: maximum detection time observed in the simulations; curves: reference line, NFD-S]

Figure 4.5: The maximum detection times observed in the simulations of NFD-S (shown by +)
4.5.1 Simulation Results of NFD-S

It is easy to configure the parameters of NFD-S to meet the given upper bound T_D^U on the worst-case detection time. In fact, since the intersending time η is fixed (to 1), only the parameter δ is configurable, and by Theorem 4.11 (1), we set δ = T_D^U − η = T_D^U − 1.
Figure 4.5 shows the simulation results that check the correctness of our configurations of NFD-S. The reference line represents the situation in which a failure detector is perfectly configured: the maximum detection time is equal to the desired bound T_D^U. Figure 4.5 shows that all the maximum detection times observed in the simulations of NFD-S are very close to the reference line. Therefore NFD-S is correctly configured.

[Plot: x-axis: required bound T_D^U on the worst-case detection time; y-axis (log scale, 10¹ to 10⁷): average mistake recurrence time obtained from the simulations; curves: analytical, NFD-S]

Figure 4.6: The average mistake recurrence times obtained from the simulations of NFD-S (shown by +), with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —).
Figure 4.6 shows the simulation results on the average mistake recurrence times of algorithm NFD-S, together with the plot of the analytical formula for E(T_MR) that we derived in Section 4.3.2 (formula (4.2) of Theorem 4.11). The immediate conclusion from Fig. 4.6 is that the simulation results of algorithm NFD-S match the analytical formula for E(T_MR), i.e., formula (4.2) of Theorem 4.11.
Furthermore, note that the y-axis is in log scale, which means that when T_D^U increases linearly, the overall tendency of the average mistake recurrence time is to increase exponentially fast. This increase, however, is not continuous: as T_D^U increases, the average mistake recurrence time alternates between rapidly increasing periods and flat (nonincreasing) periods, just like a step function. We now explain why the curve resembles the curve of a step function.

We separate the curve into the following periods according to the value of T_D^U, and explain them one by one.
1. When T_D^U = 1, the parameter δ is set to T_D^U − η = 0. Recall that the freshness point τ_i is set to σ_i + δ, where σ_i is the sending time of m_i. So, in this case the freshness point τ_i is the same as the sending time σ_i. But it is impossible for message m_i to arrive before time τ_i, so q suspects p at every freshness point τ_i. During the interval (τ_i, τ_{i+1}), q is likely to receive the message m_i (recall that the average message delay is only 0.02 and the message loss probability is only 0.01), and thus q trusts p again. Therefore, when T_D^U = 1, the average mistake recurrence time is close to 1.

2. As T_D^U increases from 1 to around 1.16, δ = T_D^U − η increases from 0 to 0.16 and the freshness points τ_i are shifted forward in time accordingly. In this period, the probability that message m_i arrives after time τ_i decreases very fast, from 1 to e^{−8} ≈ 0.0003. Thus a small increase in δ significantly reduces the probability that m_i arrives late, and therefore significantly increases the time between consecutive mistakes.

3. When T_D^U = 1.16, δ = T_D^U − η = 0.16, i.e., the τ_i's are shifted forward in time 0.16 time units from the σ_i's. This shift distance is 8 times the average message delay time, and thus if m_i is not lost, there is a very high probability that m_i is indeed received before τ_i (in fact, the probability is 1 − e^{−8} ≈ 0.9997). Since the message loss probability is 0.01, we know that at this point the main cause of a failure detector mistake is that a message is lost. Since on average one out of every 100 messages is lost, the average mistake recurrence time is close to 100, as shown in Fig. 4.6.

4. From T_D^U = 1.16 to T_D^U = 2.0, δ increases from 0.16 to 1. In this period, a message is very unlikely to be delayed by more than δ time units, while a single message loss is enough to cause a failure detector mistake. Therefore, an increase in δ does not help much to gain a better mistake recurrence time, and the curve is almost flat.
The case is similar when T_D^U increases from 2 to 3: (a) From 2 to around 2.16, a failure detector mistake is mainly caused by the loss of a message m_i followed by the delay of message m_{i+1}. Thus an increase in δ increases the probability that message m_{i+1} is received before time τ_i, so that the failure detector does not make a mistake even if m_i is lost. Therefore in this period the average mistake recurrence time increases sharply. (b) From 2.16 to 3, a failure detector mistake is mainly due to the loss of two consecutive messages, and thus an increase in δ does not help much to gain better accuracy, and the average mistake recurrence time stays at about 10⁴. Other periods can be explained similarly.
In summary, the flat portions of the curve correspond to the failure detector configurations in which failure detector mistakes are mainly due to consecutive message losses, while the ascending portions of the curve correspond to the configurations in which failure detector mistakes are mainly due to a sequence of consecutive message losses followed by the delay of the last message before the suspicion.
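This explanation can be checked against the exponential delay model: the probability that a heartbeat is delayed by more than the slack δ is e^{−δ/E(D)}, so once δ reaches a few multiples of E(D), late arrivals become far rarer than losses (p_L = 0.01) and losses dominate the mistake rate. A quick numeric sketch:

```python
import math

def late_prob(delta, ED=0.02):
    """Pr(D > delta) for an exponentially distributed delay with mean ED."""
    return math.exp(-delta / ED)

# delta = 0.16 (T_D^U = 1.16): a late heartbeat is already much rarer than
# a lost one, so losses dominate and the curve flattens near 1/p_L = 100.
print(late_prob(0.16))   # e^-8, about 3.35e-4
print(late_prob(0.02))   # e^-1, about 0.368
```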
In Fig. 4.6, we show only the average mistake recurrence times obtained from the simulations. To further show that these simulation results are reliable, i.e., that they are not just by chance very close to the theoretical analysis, we show their corresponding confidence intervals in Fig. 4.7. In this figure, we show the 99% confidence intervals for the expected values of the mistake recurrence times of NFD-S, together with the plot of the analytical formula for E(T_MR) of NFD-S. The confidence intervals are computed using standard techniques (see, e.g., [All90] p. 445). The figure illustrates that all the confidence intervals are very small and surround the theoretical results. This shows that the simulation results are accurate and were not obtained by chance. The confidence intervals of the simulation results of the other algorithms show similar properties, and thus we do not include them in the thesis.
4.5.2 Simulation Results of NFD-E

Algorithm NFD-E has a parameter n (the number of messages used for estimating the expected arrival times of the heartbeats) that also affects the QoS. To show this, we first run simulations in which the parameter α is fixed and n takes the values 1, 4, 8, 12, 16, 20, 24, 28, 32 respectively, and see how the maximum detection times and the average mistake recurrence times change accordingly.
[Plot: x-axis: required bound T_D^U on the worst-case detection time; y-axis (log scale, 10⁰ to 10⁷): average mistake recurrence time obtained from the simulations]

Figure 4.7: The 99% confidence intervals for the expected values of mistake recurrence times of NFD-S, with the plot of the analytical formula for E(T_MR) of NFD-S.
Figure 4.8 shows the simulation results for α = 1.90. From the figure, we see that when n increases, the average mistake recurrence times show no obvious change, while the maximum detection time observed decreases from above 3.00 (when n = 1) to about 2.93 (when n = 32). Note that according to the analytical results on algorithm NFD-U (the one that knows all the expected arrival times), we have T_D ≤ E(D) + α + η. Thus with α = 1.90, we have T_D ≤ 2.92 for algorithm NFD-U. So from n = 1 to n = 32, the maximum detection time observed changes from more than 0.08 (4 times E(D)) above the bound 2.92, to within 0.01 (half of E(D)) above the bound 2.92. This suggests that when n = 32, the algorithm NFD-E is very close to the algorithm NFD-U. Simulations with other α values show similar results, and so we choose n = 32 for the algorithm NFD-E.

Since NFD-E with n = 32 is very close to NFD-U, we use the bound T_D ≤ E(D) + α + η of the algorithm NFD-U to compute the parameter α for the algorithm NFD-E. In particular, we set α = T_D^U − E(D) − η = T_D^U − 1.02.
Figure 4.9 shows the simulation results that check the correctness of our configurations of NFD-E. Since all simulation results are very close to the reference line, the algorithm NFD-E is correctly configured.

Figure 4.10 shows the simulation results on the average mistake recurrence times of algorithm NFD-E, together with the plot of the analytical formula for E(T_MR) that we derived for algorithm NFD-S in Section 4.3.2 (formula (4.2) of Theorem 4.11). From this figure, we see that with appropriately chosen n, the accuracy of algorithms NFD-S and NFD-E is essentially the same, when both algorithms send heartbeat messages at the same rate and satisfy the same upper bound on the worst-case detection time.
[Plot: x-axis: number of messages used for estimating expected arrival times; y-axis: maximum detection time observed in the simulations (2.92 to 3.01)]
(a) The maximum detection time observed decreases when n increases.

[Plot: x-axis: number of messages used for estimating expected arrival times; y-axis (log scale, 10⁰ to 10⁵): average mistake recurrence time obtained from the simulations]
(b) The average mistake recurrence time stays about the same when n increases.

Figure 4.8: The change of the QoS of NFD-E when n increases. Parameter α = 1.90.
[Plot: x-axis: required bound T_D^U on the worst-case detection time; y-axis: maximum detection time observed in the simulations; curves: reference line, NFD-E]

Figure 4.9: The maximum detection times observed in the simulations of NFD-E (shown by ×)
[Plot: x-axis: required bound T_D^U on the worst-case detection time; y-axis (log scale, 10¹ to 10⁷): average mistake recurrence time obtained from the simulations; curves: analytical, NFD-E]

Figure 4.10: The average mistake recurrence times obtained from the simulations of NFD-E (shown by ×), with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —).
4.5.3 Simulation Results of the Simple Algorithm

As discussed in the Introduction of this chapter, the worst-case detection time of the simple algorithm is the maximum message delay time plus the timeout value TO. This means that for many practical systems that have no upper bound on the message delay time, as well as for our simulation setting, the worst-case detection time of the simple algorithm is unbounded. Thus in these situations the simple algorithm as it stands is not suitable to satisfy QoS requirements that require an upper bound on the worst-case detection time.

In this section, we apply a straightforward modification to the simple algorithm so that it is able to provide an upper bound on the worst-case detection time. Since the unbounded worst-case detection time of the simple algorithm is caused by the messages with very large delays, we modify the algorithm such that these messages are discarded. More precisely, the modified algorithm has another parameter, the cutoff time c, such that all messages delayed by more than c time units are discarded. We call messages delayed by at most c time units fast messages, and messages delayed by more than c time units slow messages. We assume that the simple algorithm is able to distinguish slow messages from fast messages.^8
With this modification, it is easy to see that the simple algorithm now has a worst-case detection time of c + TO. Given a bound T^U_D on the worst-case detection time, there is a tradeoff in setting the cutoff time c and the timeout value TO: the larger the cutoff time c, the smaller the number of messages being discarded, but the shorter the timeout value TO, and vice versa. In the simulations, we choose two

^8 This is not easy when local clocks are not synchronized. A fail-aware datagram service [FC97] may be used for this purpose.
[Plot omitted. Figure 4.11: The maximum detection times observed in the simulations of SFD-L and SFD-S, plotted with a reference line against the required bound T^U_D on the worst-case detection time.]
cutoff times c = 0.16 and c = 0.08, i.e., cutoff times of 8 and 4 times the mean message delay time, respectively. The timeout value TO is then set to T^U_D − c. The algorithm with c = 0.16 is denoted by SFD-L, and the one with c = 0.08 is denoted by SFD-S.

Figure 4.11 shows the simulation results of the observed maximum detection times of SFD-L and SFD-S. Since all simulation results are very close to the reference line at which the maximum detection time observed equals T^U_D, algorithms SFD-L and SFD-S are correctly configured.
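The timeout-with-cutoff logic described above can be sketched as a small trace-driven simulation. This is an illustrative reconstruction, not the thesis's pseudocode; the function name, the offline replay structure, and the initial-timer convention are assumptions:

```python
def simple_fd_with_cutoff(send_times, delays, c, TO, horizon):
    """Sketch of the modified simple algorithm: heartbeats delayed by more
    than the cutoff c are discarded (treated as lost); after each accepted
    heartbeat the timer is reset to expire TO time units later.
    delays[i] is the delay of the i-th heartbeat, or None if it was lost.
    Returns the intervals during which the monitored process is suspected."""
    # Arrival times of the "fast" messages only (delay <= c).
    arrivals = sorted(s + d for s, d in zip(send_times, delays)
                      if d is not None and d <= c)
    suspected = []
    deadline = send_times[0] + c + TO   # first timer (assumed convention)
    for a in arrivals:
        if a > deadline:                # timer expired before this arrival
            suspected.append((deadline, a))
        deadline = a + TO               # reset timer on every accepted message
    if deadline < horizon:
        suspected.append((deadline, horizon))
    return suspected
```

Replaying a heartbeat trace through such a function yields the suspicion intervals from which detection times and mistake recurrence can be measured.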
[Plot omitted. Figure 4.12: The average mistake recurrence times obtained from the simulations of SFD-L and SFD-S, with the plot of the analytical formula for E(T_MR) of NFD-S, plotted against the required bound T^U_D on the worst-case detection time.]
Figure 4.12 shows the simulation results on the average mistake recurrence times of SFD-L and SFD-S, together with the plot of the analytical formula for E(T_MR) of the new algorithm NFD-S (formula (4.2) of Theorem 4.11), which is the same plot as given in Fig. 4.6 and 4.10. From Fig. 4.6 and 4.10 we know that this plot also closely represents the simulation results of the two versions of the new algorithm, NFD-S and NFD-E.

We have the following observations from Fig. 4.12.
1. The curves of SFD-L and SFD-S resemble the curves of some step functions. The reason is similar to the one that we give for algorithm NFD-S.

2. The flat portions of SFD-L are very close to those of NFD-S, but the flat portions of SFD-S are much lower than those of the other two curves, and the gap is orders of magnitude wide and growing.
The reason is as follows. We already explained that the flat portions of a curve correspond to the cases in which failure detector mistakes are mainly due to message losses. More precisely, the first flat portion of each curve corresponds to the cases in which a mistake is mainly due to a single message loss; the second flat portion corresponds to the cases in which a mistake is mainly due to two consecutive message losses, and so on.

For the modified simple algorithm, slow messages are equivalent to lost messages since they are discarded by the algorithm. In SFD-L with cutoff time c = 0.16, the probability that a message is slow is very small compared to the probability that a message is really lost (in fact, it is e^{−c/E(D)} = 3.4 × 10^{−4}, compared to the message loss probability 0.01). In SFD-S, however, the cutoff time is c = 0.08 and the probability that a message is slow is 0.018, which is significant. For this algorithm, the combined message loss probability is p_L + 0.018 = 0.028. Under this message loss probability, a single message loss occurs about every 35 messages, and the event of two consecutive message losses occurs about every 1200 messages. This explains why the vertical position of the first flat portion of SFD-S is between 30 and 40 and the vertical position of the second flat portion is between 1000 and 2000. Since the difference in the probability of consecutive message losses between the algorithm SFD-S and the other two algorithms is increasing, the gap between SFD-S and the other two algorithms is increasing accordingly.
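These figures can be reproduced directly. The sketch below assumes, as in the simulation setting of this chapter, exponentially distributed message delays with mean E(D) = 0.02 and message loss probability p_L = 0.01:

```python
import math

E_D, p_L = 0.02, 0.01      # mean message delay and loss probability (simulation setting)

results = {}
for c in (0.16, 0.08):     # cutoff times of SFD-L and SFD-S
    p_slow = math.exp(-c / E_D)     # P(delay > c) for exponential delays
    p_combined = p_L + p_slow       # slow messages count as lost
    results[c] = (p_slow, p_combined)
    print(f"c={c}: p_slow={p_slow:.2g}, combined loss={p_combined:.3f}, "
          f"one loss every {1 / p_combined:.0f} msgs, "
          f"two consecutive every {1 / p_combined ** 2:.0f} msgs")
```

For c = 0.16 the slow-message probability is about 3.4 × 10^{−4}, negligible next to p_L; for c = 0.08 the combined loss probability is about 0.028, giving one loss roughly every 35 messages and two consecutive losses roughly every 1200.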
3. Regarding the ascending portions of the curves, as T^U_D increases, an ascending portion of NFD-S always starts first, followed by an ascending portion of SFD-S, and finally by an ascending portion of SFD-L. In these ascending portions, under the same value T^U_D, the average mistake recurrence time of the new algorithm could be orders of magnitude better than those of the simple algorithms.
This can be explained by the following example. Consider the point when T^U_D = 1.08. For the new algorithm NFD-S, δ = T^U_D − η = 0.08, which means that the freshness points are shifted forward in time by 4 times the mean message delay. Under this setting, a message (if not lost) is very likely to be received before the corresponding freshness point and thus avoid a failure detector mistake (the exact probability is 0.982). For the simple algorithm SFD-S with c = 0.08, we have TO = T^U_D − c = 1. This means that after a message is received, the timer will expire one time unit later. Since on average the next message will arrive one time unit later than the receipt of the previous message, TO = 1 means that about half of the messages will arrive after the timer expires and thus cause failure detector mistakes. Thus the accuracy of SFD-S is not good compared with NFD-S. For the simple algorithm SFD-L with c = 0.16, we have TO = T^U_D − c = 0.92. Under this setting, the timeout is too short, such that almost no message can arrive before the timer expires. Thus the accuracy of SFD-L is even worse at this point. Similar explanations can be applied to other points of T^U_D in the period from 1 to around 1.2, from 2 to around 2.2, etc.
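The arithmetic in this example can be checked in a few lines; exponential delays with mean E(D) = 0.02 and heartbeat interval η = 1 are the simulation-setting assumptions:

```python
import math

E_D, eta = 0.02, 1.0       # mean delay and heartbeat interval (simulation setting)
TUD = 1.08                 # required bound on the worst-case detection time

# NFD-S: freshness points are shifted by delta = TUD - eta beyond expected arrivals.
delta = TUD - eta                          # = 0.08, i.e., 4 times the mean delay
p_on_time = 1 - math.exp(-delta / E_D)     # P(delay < delta), about 0.982
print(f"NFD-S: delta = {delta:.2f}, "
      f"P(message beats its freshness point) = {p_on_time:.3f}")

# SFD-S and SFD-L: the timer is reset to TO = TUD - c after each accepted message.
timeouts = {c: TUD - c for c in (0.08, 0.16)}
print(f"timeouts: {timeouts}")
```

The computed values match the text: P ≈ 0.982 for NFD-S, TO = 1 for SFD-S, and TO = 0.92 for SFD-L.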
Therefore, in general, under the same requirement of T^U_D, the configuration of the new algorithm always gives a lower probability of a failure detector mistake, caused either by message delay or by message loss, than the configuration of the simple algorithm. For the simple algorithm, the larger the cutoff time, the smaller the timeout value, and thus the higher the probability of a failure detector mistake caused by message delay. On the other hand, if the cutoff time gets smaller, more messages are discarded (which effectively increases the probability that a message is lost), and this increases the probability of a failure detector mistake caused by message losses.

From the above observations, it is not hard to see that when the cutoff time of the modified simple algorithm increases, its curve is shifted further to the right; when the cutoff time decreases, its curve is pressed further down towards the x-axis. In all cases, the curve of the simple algorithm is always under the curve of the new algorithm.
To summarize, the simulation results show that, when both algorithms send heartbeat messages at the same rate and satisfy the same upper bound on the worst-case detection time, the accuracy of the new algorithm (with or without synchronized clocks) always dominates the accuracy of the simple algorithm, and in some cases it is orders of magnitude better.
4.6 Concluding Remarks
On the Adaptiveness of the New Failure Detector

In this chapter, our network model assumes that the probabilistic behavior of the network does not change over time. In practice, the network behavior may change over time gradually. For example, during working days, a corporate network typically experiences heavier traffic, which means longer message delays and more message losses, while during nights and weekends, the network traffic is usually much lighter. However, for a short period of time, e.g., one hour, the change of network behavior is relatively small, and our model is a good approximation for such relatively short periods.

For the gradual changes of the network behavior over a long time period, our new failure detector algorithm has the ability to adapt to the changes and behave accordingly. This is because we can configure the failure detector so that it only uses recent heartbeat messages to estimate the relevant system parameters such as p_L, E(D) and V(D), and the expected arrival times of the heartbeats if necessary. Therefore, the algorithm can automatically adapt to the recent behavior of the network, and thus the QoS of the failure detector can be guaranteed even if the network behavior changes gradually over time.
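A minimal sketch of such a windowed estimator follows. The window size, the use of heartbeat sequence numbers to count losses, and the class interface are illustrative assumptions rather than the thesis's algorithm:

```python
from collections import deque

class WindowedEstimator:
    """Estimate p_L, E(D) and V(D) from the last `window` heartbeats.
    Heartbeats carry sequence numbers, so gaps reveal losses."""
    def __init__(self, window=100):
        self.delays = deque(maxlen=window)   # delays of received heartbeats
        self.lost = deque(maxlen=window)     # 0/1 loss indicator per expected message
        self.last_seq = None

    def on_heartbeat(self, seq, delay):
        if self.last_seq is not None:
            for _ in range(seq - self.last_seq - 1):
                self.lost.append(1)          # missing sequence numbers were lost
        self.lost.append(0)
        self.last_seq = seq
        self.delays.append(delay)

    def estimates(self):
        n = len(self.delays)
        mean = sum(self.delays) / n
        var = sum((d - mean) ** 2 for d in self.delays) / n
        p_loss = sum(self.lost) / len(self.lost)
        return p_loss, mean, var
```

Because the deques keep only the most recent observations, the estimates track the network's recent behavior rather than its long-term history.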
On the QoS Requirements

In Sections 4.3.4, 4.4.1 and 4.4.3, we consider some simple QoS requirements that take the form of bounds on some QoS metrics. Applications may also have requirements in other forms. For example, an application may specify some objective function in terms of the QoS metrics, and require that the failure detector be configured such that the objective function is maximized. To deal with such more general QoS requirements, a decision-theoretic approach may be used in the configuration of the failure detector. Decision theory [Res87] provides mathematical tools for making decisions, and there has been some research that applies decision theory in certain areas of computer science such as networking, distributed computing, and database systems (e.g., [MHW96, BBS98, BS98, CH98, CHS99]). A study on the QoS of failure detectors using decision theory is an interesting research direction, but it is beyond the scope of this thesis.
On n-Process Systems

In this thesis, we focus on two-process systems: a failure detector at a process q monitors a process p. Many practical systems consist of more than two processes, and failure detection is required between every pair of processes. Our work on two-process systems can be used as a basis for the study of n-process systems. For example, in an n-process system, one may be interested in the time elapsed from the time when a process p crashes to the time when all other processes detect the crash of p. For this purpose, we can use our QoS metric, the detection time, of the failure detector on every process q that monitors p, and then take the maximum of all these detection times to obtain the value we want. Of course, n-process systems present more complicated cases than two-process systems, and more careful and creative study is necessary. We hope that this thesis can provide some helpful directions for the study of failure detection in more complicated distributed systems.
Appendix A

Theory of Marked Point Processes

Most of the notations, terminologies, and results concerning the theory of marked point processes are taken from [Sig95].
Marked Point Processes

Let R_+ and Z_+ denote the sets of nonnegative real numbers and integers, respectively. Let K denote a complete separable metric space called the mark space.

A simple marked point process (mpp) on the nonnegative time line R_+ is a sequence

    ψ = {(t_n, k_n) : n ∈ Z_+, t_n ∈ R_+, k_n ∈ K},    (A.1)

such that 0 ≤ t_0 < t_1 < t_2 < ··· and lim_{n→+∞} t_n = +∞. We call t_n an event time and k_n a mark associated with event time t_n. By simple we mean that the event times are all different. Let M = M_K denote the collection of all simple mpp's with mark space K. The set of all Borel measurable subsets of M is denoted as B(M).
We sometimes use the following interevent time sequence representation that is equivalent to (A.1):

    {t_0, {(T_n, k_n) : n ∈ Z_+, T_n ≝ t_{n+1} − t_n}},    (A.2)

where T_n denotes the n-th interevent time.

Note that t_n, T_n, and k_n are actually measurable mappings from M to R_+ or K.
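A finite prefix of a simple mpp, together with its interevent representation (A.2), can be sketched concretely; the list-of-pairs encoding is an illustration, not Sigman's formalism:

```python
# A finite prefix of a simple mpp: a list of (event_time, mark) pairs
# with strictly increasing event times.
mpp = [(0.4, "a"), (1.1, "b"), (2.7, "a"), (3.0, "c")]

def is_simple(psi):
    """Check the defining property: 0 <= t_0 < t_1 < t_2 < ..."""
    times = [t for t, _ in psi]
    return all(t >= 0 for t in times) and all(
        t1 < t2 for t1, t2 in zip(times, times[1:]))

def interevent_times(psi):
    """The equivalent representation (A.2): T_n = t_{n+1} - t_n."""
    times = [t for t, _ in psi]
    return [t2 - t1 for t1, t2 in zip(times, times[1:])]
```

For the sample prefix above, the interevent times are 0.7, 1.6 and 0.3.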
Shift Mappings

A shift mapping θ_s : M → M is a mapping that shifts an mpp ψ to the left by s time units. More precisely, if s = 0, then θ_s is the identity mapping; if s > 0, then for ψ = {(t_n, k_n)}, suppose t_{i−1} < s ≤ t_i for some i ∈ Z_+ (denote t_{−1} = 0 for convenience). We then have

    θ_s ψ ≝ {(t_{i+n} − s, k_{i+n}) : n ≥ 0}.    (A.3)

That is, θ_s ψ is the mpp obtained from ψ by shifting the origin to s, re-labeling event times at and after s as t_0, t_1, . . ., and ignoring the events before time s. Let θ_(j) ≝ θ_{t_j} denote shifting by the event time t_j, j ≥ 0. We then let

    ψ_s ≝ θ_s ψ  and  ψ_(j) ≝ θ_(j) ψ.    (A.4)

Let θ_t^{−1} E ≝ {ψ ∈ M : θ_t ψ ∈ E}, and θ_(j)^{−1} E ≝ {ψ ∈ M : θ_(j) ψ ∈ E}.
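On a list-of-(t, k) encoding of an mpp prefix, the shift mappings of (A.3) and (A.4) can be sketched as follows (the encoding itself is an illustrative assumption):

```python
def shift(psi, s):
    """theta_s: drop events before time s and re-origin at s, as in (A.3).
    psi is a list of (event_time, mark) pairs with increasing times."""
    return [(t - s, k) for t, k in psi if t >= s]

def shift_to_event(psi, j):
    """theta_(j): shift by the j-th event time t_j, as in (A.4)."""
    t_j = psi[j][0]
    return shift(psi, t_j)
```

Note that shifting to an event time places that event exactly at the origin, matching Proposition A.1's description of event stationary processes.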
Random Marked Point Processes

Let (Ω, F, P) be a probability space. A random marked point process (rmpp) Ψ is a measurable mapping Ψ : Ω → M. Ψ has the distribution P(Ψ ∈ E) ≝ P({ω ∈ Ω : Ψ(ω) ∈ E}) defined for all E ∈ B(M). Ψ_s is the rmpp obtained from Ψ by shifting the origin to time s, that is, for all ω ∈ Ω, Ψ_s(ω) = Ψ(ω)_s. Similarly, Ψ_(j) is the rmpp obtained from Ψ by shifting the origin to the time of the j-th event, that is, for all ω ∈ Ω, Ψ_(j)(ω) = Ψ(ω)_(j).
Stationary Versions

A rmpp Ψ is event stationary if P(Ψ_(j) ∈ E) = P(Ψ ∈ E) for all j ∈ Z_+ and all E ∈ B(M). Ψ is time stationary if P(Ψ_s ∈ E) = P(Ψ ∈ E) for all s ∈ R_+ and all E ∈ B(M).

The event stationary version Ψ⁰ and the time stationary version Ψ* of rmpp Ψ are two rmpp's defined by the following distributions (assuming they exist):

    Pr(Ψ⁰ ∈ E) ≝ lim_{n→∞} (1/n) Σ_{j=0}^{n−1} P(Ψ_(j) ∈ E), for all E ∈ B(M),    (A.5)

and

    Pr(Ψ* ∈ E) ≝ lim_{t→∞} (1/t) ∫_0^t P(Ψ_s ∈ E) ds, for all E ∈ B(M).    (A.6)

As shown in [Sig95], Ψ⁰ is event stationary and Ψ* is time stationary.

Proposition A.1 Any event stationary Ψ has, with probability one, the event time t_0 at the origin, i.e., Pr(t_0 ∘ Ψ = 0) = 1. Any time stationary Ψ has, with probability one, no event at the origin, i.e., Pr(t_0 ∘ Ψ = 0) = 0.
Invariant σ-Field

The invariant σ-field of M with respect to the flow of shifts, {θ_t : t ≥ 0}, is denoted by I and defined by I ≝ {E ∈ B(M) : θ_t^{−1} E = E, ∀t ≥ 0}. The invariant σ-field of M with respect to the event shifts, {θ_(j) : j ≥ 0}, is denoted by I_e and defined by I_e ≝ {E ∈ B(M) : θ_(j)^{−1} E = E, ∀j ≥ 0}.

Proposition A.2 I_e = I.

Because of the above proposition, we use I to denote the one and only invariant σ-field of M.
If Ψ is defined on a probability space (Ω, F, P), then the invariant σ-field on M can be lifted onto F by taking the inverse image: I^Ψ ≝ Ψ^{−1} I = {Ψ^{−1} E : E ∈ I}, where Ψ^{−1} E ≝ {ω ∈ Ω : Ψ(ω) ∈ E}. We omit the superscript Ψ whenever the context is clear. For example, for the notation of conditional expected value, we use E_I(f ∘ Ψ) instead of E_{I^Ψ}(f ∘ Ψ) (f is a measurable mapping from M to R_+).
Ergodicity

An event stationary Ψ⁰ is called ergodic if for any two events E_1, E_2 ∈ B(M),

    lim_{n→∞} (1/n) Σ_{j=0}^{n−1} Pr(Ψ⁰ ∈ E_1, Ψ⁰_(j) ∈ E_2) = Pr(Ψ⁰ ∈ E_1) Pr(Ψ⁰ ∈ E_2).    (A.7)

Similarly, a time stationary Ψ* is called ergodic if for any two events E_1, E_2 ∈ B(M),

    lim_{t→∞} (1/t) ∫_0^t Pr(Ψ* ∈ E_1, Ψ*_s ∈ E_2) ds = Pr(Ψ* ∈ E_1) Pr(Ψ* ∈ E_2).    (A.8)

As suggested by Sigman [Sig95], ergodicity should be regarded as a condition describing a kind of loss of memory as the event (or time) parameter tends to ∞. "For Ψ⁰ this means that if you start with Ψ⁰ and then randomly observe it way out at an event, then what you observe is an independent copy of Ψ⁰ itself" ([Sig95], p. 38). The same holds for Ψ* when you randomly observe it way out in time. The following proposition shows that ergodicity can be equivalently defined by using the invariant σ-field I.
Proposition A.3 Ψ⁰ is ergodic if and only if the invariant σ-field I is 0-1 with respect to Ψ⁰, i.e., iff for all E ∈ I, Pr(Ψ⁰ ∈ E) ∈ {0, 1}. Ψ* is ergodic if and only if the invariant σ-field I is 0-1 with respect to Ψ*, i.e., iff for all E ∈ I, Pr(Ψ* ∈ E) ∈ {0, 1}.

Proposition A.4 For any measurable f : M → R_+, if Ψ⁰ is ergodic, then E_I(f ∘ Ψ⁰) = E(f ∘ Ψ⁰) a.s., and if Ψ* is ergodic, then E_I(f ∘ Ψ*) = E(f ∘ Ψ*) a.s.^1
The following is the version of the important Birkhoff's Ergodic Theorem for random marked point processes. Henceforth, we assume that Ψ, Ψ⁰ and Ψ* use the same underlying probability space (Ω, F, P) (one can always construct some common space supporting all of them).

Theorem A.5
(1) If Ψ has the event stationary version Ψ⁰, then for any measurable mapping f : M → R_+,

    lim_{n→∞} (1/n) Σ_{j=0}^{n−1} f ∘ Ψ_(j) = E_I(f ∘ Ψ⁰) a.s.    (A.9)

In particular, if Ψ⁰ is ergodic, then

    lim_{n→∞} (1/n) Σ_{j=0}^{n−1} f ∘ Ψ_(j) = E(f ∘ Ψ⁰) a.s.    (A.10)

^1 The notation a.s. stands for "almost surely", which means that the equation is true with probability one.
(2) If Ψ has the time stationary version Ψ*, then for any measurable mapping f : M → R_+ such that ∫_0^t f ∘ Ψ_s ds < ∞, ∀t ≥ 0, a.s.,

    lim_{t→∞} (1/t) ∫_0^t f ∘ Ψ_s ds = E_I(f ∘ Ψ*) a.s.    (A.11)

In particular, if Ψ* is ergodic, then

    lim_{t→∞} (1/t) ∫_0^t f ∘ Ψ_s ds = E(f ∘ Ψ*) a.s.    (A.12)

Note that f ∘ Ψ⁰ and f ∘ Ψ* are measurable mappings from the underlying sample space Ω to R_+, and so they are random variables. So are lim_{n→∞} (1/n) Σ_{j=0}^{n−1} f ∘ Ψ_(j) and lim_{t→∞} (1/t) ∫_0^t f ∘ Ψ_s ds. Similar mathematical expressions are used in the following theorems.
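As an informal numerical illustration of (A.10), consider a renewal process with i.i.d. exponential interevent times and take f to be the first interevent time of the observed process, so that f ∘ Ψ_(j) is simply the j-th interevent time; the event average should then converge to E(f ∘ Ψ⁰), the mean interevent time. The simulation parameters below are arbitrary assumptions:

```python
import random

random.seed(1)
mean_T = 0.5                 # assumed mean interevent time E(T_0) of the renewal process
n = 200_000

# One long run: for this renewal process, f(Psi_(j)) with f = "first interevent
# time" is just the j-th interevent time T_j, so the event average in (A.10)
# is the sample mean of the T_j's.
samples = [random.expovariate(1 / mean_T) for _ in range(n)]
event_average = sum(samples) / n     # left-hand side of (A.10)

print(f"event average = {event_average:.3f}, E(f o Psi^0) = {mean_T}")
```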
From the above theorem, we can have the following characterization of the ergodicities of Ψ⁰ and Ψ*. Let I_E be the indicator function for some event E ∈ B(M), i.e., for all ψ ∈ M, I_E(ψ) = 1 if ψ ∈ E, and I_E(ψ) = 0 if ψ ∉ E.

Proposition A.6 Suppose that Ψ has event stationary version Ψ⁰ and time stationary version Ψ*. Ψ⁰ is ergodic if and only if for all E ∈ B(M),

    Pr(Ψ⁰ ∈ E) = lim_{n→∞} (1/n) Σ_{j=0}^{n−1} I_E ∘ Ψ_(j) a.s.    (A.13)

Ψ* is ergodic if and only if for all E ∈ B(M),

    Pr(Ψ* ∈ E) = lim_{t→∞} (1/t) ∫_0^t I_E ∘ Ψ_s ds a.s.    (A.14)
Proof. Suppose Ψ⁰ is ergodic. Then (A.13) is obtained by substituting f in (A.10) with I_E. Now suppose (A.13) holds. Then for any E ∈ I, we claim that I_E ∘ Ψ_(j) = I_E ∘ Ψ. In fact, for all ω ∈ Ω, where Ω is the underlying sample space for Ψ, I_E ∘ Ψ_(j)(ω) = 1 iff Ψ(ω)_(j) ∈ E iff Ψ(ω) ∈ θ_(j)^{−1} E iff Ψ(ω) ∈ E iff I_E ∘ Ψ(ω) = 1. Thus from (A.13) we have Pr(Ψ⁰ ∈ E) = I_E ∘ Ψ a.s., which implies that Pr(Ψ⁰ ∈ E) ∈ {0, 1}. By Proposition A.3, we know that Ψ⁰ is ergodic. The proof for Ψ* is similar.
Proposition A.6 suggests that the event stationary version Ψ⁰ of some rmpp Ψ is ergodic if and only if the distribution of Ψ⁰, i.e., Pr(Ψ⁰ ∈ E), can be obtained (with probability one) from any single run Ψ(ω) of Ψ as follows: observe Ψ(ω) at every event time t_j (to obtain Ψ(ω)_(j)), check whether the event E is true when observed at t_j (i.e., whether I_E(Ψ(ω)_(j)) = 1 or not), and then use the ratio of the number of event times t_j's at which E is true over the total number of event times as Pr(Ψ⁰ ∈ E). The distribution of Ψ* can be obtained in a similar way.

The following lemma shows that the ergodicities of Ψ⁰ and Ψ* are equivalent.

Lemma A.7 Suppose that Ψ has event stationary version Ψ⁰ and time stationary version Ψ*. Then Ψ⁰ is ergodic if and only if Ψ* is ergodic.
Arrival Rates

Let N_t : M → R_+ be the measurable mapping such that for all ψ ∈ M, N_t(ψ) is the number of event times of ψ in the period (0, t]. Suppose rmpp Ψ has the event stationary version Ψ⁰ and the time stationary version Ψ*. Let λ ≝ E(N_1 ∘ Ψ*); λ is called the arrival rate or intensity of Ψ. Intuitively, λ is the average number of event times or arrivals in a unit period in the time stationary version Ψ*. Let λ_I ≝ E_I(N_1 ∘ Ψ*); λ_I is called the conditional arrival rate or conditional intensity of Ψ.

Recall that T_n : M → R_+ is the measurable mapping such that T_n(ψ) ≝ t_{n+1}(ψ) − t_n(ψ) represents the n-th interevent time of mpp ψ.
Lemma A.8 For the conditional arrival rate λ_I, we have

    λ_I = lim_{t→∞} N_t ∘ Ψ / t = { lim_{n→∞} (1/n) Σ_{j=0}^{n−1} T_j ∘ Ψ }^{−1} = { E_I(T_0 ∘ Ψ⁰) }^{−1} a.s.    (A.15)

For the arrival rate λ, we have

    λ = E(λ_I).    (A.16)

Moreover, if Ψ⁰ is ergodic (and so is Ψ*), then we have

    λ_I = λ = { E(T_0 ∘ Ψ⁰) }^{−1} a.s.    (A.17)

Equalities (A.15) mean that the conditional arrival rate λ_I is a random variable, and it can be obtained either from the long run number of events per unit time (lim_{t→∞} N_t ∘ Ψ / t), or from the reciprocal of the long run average interevent time ({lim_{n→∞} (1/n) Σ_{j=0}^{n−1} T_j ∘ Ψ}^{−1}). Equality (A.16) shows that the arrival rate is the expected value of the random variable λ_I. Equalities (A.17) mean that, if the stationary versions of Ψ are ergodic, then the conditional arrival rate λ_I is almost surely the constant λ, which is also the reciprocal of the expected value of the very first interevent time of Ψ⁰.
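Equalities (A.15) and (A.17) can be checked on a simulated ergodic renewal run: the events-per-unit-time estimate and the reciprocal of the average interevent time coincide and approach λ. The exponential interevent times and the parameter values are assumptions of this sketch:

```python
import random

random.seed(7)
lam = 4.0                                   # assumed arrival rate of the simulated process
T = [random.expovariate(lam) for _ in range(100_000)]   # i.i.d. interevent times

horizon = sum(T)                            # time of the last simulated event
rate_from_counts = len(T) / horizon         # lim N_t o Psi / t
rate_from_gaps = 1 / (sum(T) / len(T))      # {long run average interevent time}^{-1}

print(f"N_t/t = {rate_from_counts:.2f}, 1/avg(T_j) = {rate_from_gaps:.2f}, "
      f"lambda = {lam}")
```

On a finite run the two estimators are algebraically identical, which is exactly the content of the first two equalities in (A.15).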
Empirical Inversion Formulas

The empirical inversion formulas are used to connect the event stationary version Ψ⁰ with the time stationary version Ψ*. Roughly speaking, the results show that (a) a random marked point process Ψ has the event stationary version Ψ⁰ if and only if it has the time stationary version Ψ*; (b) Ψ⁰ is the event stationary version of both Ψ and Ψ*, and Ψ* is the time stationary version of both Ψ and Ψ⁰; (c) the distributions of Ψ⁰ and Ψ* are related by some inversion formulas. We now state these results formally.

Theorem A.9 Ψ has the event stationary version Ψ⁰ and 0 < E_I(T_0 ∘ Ψ⁰) < ∞ a.s., if and only if Ψ has the time stationary version Ψ* and 0 < E_I(N_1 ∘ Ψ*) < ∞ a.s. In this case, we have E_I(N_1 ∘ Ψ*) = {E_I(T_0 ∘ Ψ⁰)}^{−1} = λ_I, Ψ⁰ is also the event stationary version of Ψ*, and Ψ* is also the time stationary version of Ψ⁰.
Theorem A.10 If Ψ has the event stationary version Ψ⁰ and 0 < E_I(T_0 ∘ Ψ⁰) < ∞ a.s. (or equivalently Ψ has the time stationary version Ψ* and 0 < E_I(N_1 ∘ Ψ*) < ∞ a.s.), then for all E ∈ B(M), we have the following empirical inversion formulas:

    Pr(Ψ⁰ ∈ E) = E( E_I[ Σ_{j=0}^{N_1∘Ψ* − 1} I_E ∘ Ψ*_(j) ] / E_I(N_1 ∘ Ψ*) ),    (A.18)

    Pr(Ψ* ∈ E) = E( E_I[ ∫_0^{T_0∘Ψ⁰} I_E ∘ Ψ⁰_s ds ] / E_I(T_0 ∘ Ψ⁰) ).    (A.19)

For all measurable f : M → R_+, we have the following conditional empirical inversion formulas:

    E_I(f ∘ Ψ⁰) = E_I[ Σ_{j=0}^{N_1∘Ψ* − 1} f ∘ Ψ*_(j) ] / E_I(N_1 ∘ Ψ*) a.s.,    (A.20)

and if in addition ∫_0^t f ∘ Ψ⁰_s ds < ∞, ∀t ≥ 0, a.s., then

    E_I(f ∘ Ψ*) = E_I[ ∫_0^{T_0∘Ψ⁰} f ∘ Ψ⁰_s ds ] / E_I(T_0 ∘ Ψ⁰) a.s.    (A.21)

The following corollary states the empirical inversion formulas under the ergodicity condition.
Corollary A.11 If Ψ has the event stationary version Ψ⁰, Ψ⁰ is ergodic, and 0 < E(T_0 ∘ Ψ⁰) < ∞ (or equivalently Ψ has the time stationary version Ψ*, Ψ* is ergodic, and 0 < E(N_1 ∘ Ψ*) < ∞), then for all E ∈ B(M), we have the following ergodic empirical inversion formulas:

    Pr(Ψ⁰ ∈ E) = E[ Σ_{j=0}^{N_1∘Ψ* − 1} I_E ∘ Ψ*_(j) ] / E(N_1 ∘ Ψ*),    (A.22)

    Pr(Ψ* ∈ E) = E[ ∫_0^{T_0∘Ψ⁰} I_E ∘ Ψ⁰_s ds ] / E(T_0 ∘ Ψ⁰).    (A.23)

For all measurable f : M → R_+, we have

    E(f ∘ Ψ⁰) = E[ Σ_{j=0}^{N_1∘Ψ* − 1} f ∘ Ψ*_(j) ] / E(N_1 ∘ Ψ*),    (A.24)

and if in addition ∫_0^t f ∘ Ψ⁰_s ds < ∞, ∀t ≥ 0, a.s., then

    E(f ∘ Ψ*) = E[ ∫_0^{T_0∘Ψ⁰} f ∘ Ψ⁰_s ds ] / E(T_0 ∘ Ψ⁰).    (A.25)
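As a concrete use of (A.25), take f = t_0, the time of the first event: for the event stationary version (which has t_0 = 0 and first interevent time T_0), t_0 ∘ Ψ⁰_s = T_0 − s for 0 < s ≤ T_0, so the numerator integral equals T_0²/2 and (A.25) gives E(t_0 ∘ Ψ*) = E(T_0²) / (2 E(T_0)), the mean forward recurrence time (the inspection paradox). The sketch below checks this for Uniform(0, 1) interevent times, where the formula gives (1/3) / (2 · 1/2) = 1/3; the sampling scheme and parameters are illustrative assumptions:

```python
import bisect
import random

random.seed(3)

# Renewal process with Uniform(0, 1) interevent times:
# E(T_0) = 1/2 and E(T_0^2) = 1/3.
events, now = [], 0.0
while now < 50_000:
    now += random.random()
    events.append(now)
horizon = events[-1]

# Time average of f = t_0 (forward recurrence time), estimated by sampling
# uniformly random instants s and measuring the wait until the next event.
n_samples = 200_000
total = 0.0
for _ in range(n_samples):
    s = random.uniform(0, horizon - 1e-6)
    nxt = events[bisect.bisect_right(events, s)]   # first event time after s
    total += nxt - s
time_average = total / n_samples

# Right-hand side of (A.25) with f = t_0: E(T_0^2) / (2 E(T_0)) = 1/3.
predicted = (1 / 3) / (2 * (1 / 2))

print(f"simulated E(t_0 o Psi*) = {time_average:.3f}, formula = {predicted:.3f}")
```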
Bibliography

[ACT] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. On quiescent reliable communication. SIAM Journal on Computing. To appear. Part of the paper appeared in Proceedings of the 11th International Workshop on Distributed Algorithms, September 1997, 126-140.

[ACT99] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Using the heartbeat failure detector for quiescent reliable communication and consensus in partitionable networks. Theoretical Computer Science, 220(1):3-30, June 1999.

[ACT00] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Failure detection and consensus in the crash-recovery model. Distributed Computing, 2000. To appear. An extended abstract appeared in Proceedings of the 12th International Symposium on Distributed Computing, September 1998, 231-245.

[ADKM92] Yair Amir, Danny Dolev, Shlomo Kramer, and Dalia Malki. Transis: a communication sub-system for high availability. In Proceedings of the 22nd Annual International Symposium on Fault-Tolerant Computing, pages 76-84, Boston, July 1992.

[All90] Arnold O. Allen. Probability, Statistics, and Queueing Theory with Computer Science Applications. Academic Press, 2nd edition, 1990.

[Arv94] K. Arvind. Probabilistic clock synchronization in distributed systems. IEEE Transactions on Parallel and Distributed Systems, 5(5):475-487, May 1994.

[BBS98] Sandeep Bajaj, Lee Breslau, and Scott Shenker. Uniform versus priority dropping for layered video. In Proceedings of SIGCOMM '98, the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 131-143, August 1998.

[BDGB94] Özalp Babaoğlu, Renzo Davoli, Luigi-Alberto Giachini, and Mary Gray Baker. Relacs: a communications infrastructure for constructing reliable applications in large-scale distributed systems, 1994. BROADCAST Project deliverable report, Department of Computing Science, University of Newcastle upon Tyne, UK.

[Bil95] Patrick Billingsley. Probability and Measure. John Wiley & Sons, 3rd edition, 1995.

[Bra89] R. Braden, editor. Requirements for Internet Hosts - Communication Layers. RFC 1122, October 1989.

[BS98] Lee Breslau and Scott Shenker. Best-effort versus reservations: A simple comparative analysis. In Proceedings of SIGCOMM '98, the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 3-16, August 1998.

[BvR93] Kenneth P. Birman and Robbert van Renesse, editors. Reliable Distributed Computing with the Isis Toolkit. IEEE Computer Society Press, 1993.

[CH98] Francis C. Chu and Joseph Y. Halpern. A decision-theoretic approach to reliable message delivery. In Proceedings of the 12th International Symposium on Distributed Computing, Lecture Notes in Computer Science, pages 89-103. Springer-Verlag, September 1998.

[CHS99] Francis C. Chu, Joseph Y. Halpern, and Praveen Seshadri. Least expected cost query optimization: An exercise in utility. In Proceedings of the 18th ACM Symposium on Principles of Database Systems, pages 138-147, May 1999.

[CHT96] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685-722, July 1996. An extended abstract appeared in Proceedings of the 11th ACM Symposium on Principles of Distributed Computing, August 1992, 147-158.

[Cri89] Flaviu Cristian. Probabilistic clock synchronization. Distributed Computing, 3:146-158, 1989.

[CT96] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225-267, March 1996. A preliminary version appeared in Proceedings of the 10th ACM Symposium on Principles of Distributed Computing, August 1991, 325-340.

[DFKM96] Danny Dolev, Roy Friedman, Idit Keidar, and Dahlia Malkhi. Failure detectors in omission failure environments. Technical Report 96-1608, Department of Computer Science, Cornell University, Ithaca, New York, September 1996.

[FC96] Christof Fetzer and Flaviu Cristian. Fail-aware failure detectors. In Proceedings of the 15th Symposium on Reliable Distributed Systems, pages 200-209, October 1996.

[FC97] Christof Fetzer and Flaviu Cristian. A fail-aware datagram service. In 2nd Annual Workshop on Fault-Tolerant Parallel and Distributed Systems, April 1997.

[GLS95] Rachid Guerraoui, Michael Larrea, and André Schiper. Non blocking atomic commitment with an unreliable failure detector. In Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, pages 13-15, 1995.

[GM98] Mohamed G. Gouda and Tommy M. McGuire. Accelerated heartbeat protocols. In Proceedings of the 18th International Conference on Distributed Computing Systems, May 1998.

[Hay98] Mark Garland Hayden. The Ensemble System. Ph.D. dissertation, Department of Computer Science, Cornell University, January 1998.

[MHW96] Armin R. Mikler, Vasant Honavar, and Johnny S. K. Wong. Analysis of utility-theoretic heuristics for intelligent adaptive network routing. In Proceedings of the 13th National Conference on Artificial Intelligence, volume 1, pages 96-101, 1996.

[MMSA+96] Louise E. Moser, P. M. Melliar-Smith, Deborah A. Agarwal, Ravi K. Budhia, and Colleen A. Lingley-Papadopoulos. Totem: A fault-tolerant multicast group communication system. Communications of the ACM, 39(4):54-63, April 1996.

[Pfi98] Gregory F. Pfister. In Search of Clusters. Prentice-Hall, Inc., 2nd edition, 1998.

[Res87] Michael D. Resnik. Choices: An Introduction to Decision Theory. University of Minnesota Press, 1987.

[Ros83] Sheldon M. Ross. Stochastic Processes. John Wiley & Sons, 1983.

[RT99] Michel Raynal and Frédéric Tronel. Group membership failure detection: a simple protocol and its probabilistic analysis. Distributed Systems Engineering Journal, 6(3):95-102, 1999.

[Sig95] Karl Sigman. Stationary Marked Point Processes, an Intuitive Approach. Chapman & Hall, 1995.

[VR00] Paulo Veríssimo and Michel Raynal. Time, clocks and temporal order. In Sacha Krakowiak and Santosh K. Shrivastava, editors, Recent Advances in Distributed Systems, chapter 1. Springer-Verlag, 2000. To appear.

[vRBM96] Robbert van Renesse, Kenneth P. Birman, and Silvano Maffeis. Horus: a flexible group communication system. Communications of the ACM, 39(4):76-83, April 1996.

[vRMH98] Robbert van Renesse, Yaron Minsky, and Mark Hayden. A gossip-style failure detection service. In Proceedings of Middleware '98, September 1998.
... Classical implementations of failure detectors require the periodic transmission of heartbeat messages by each monitored process to all others [14,16,35]. This strategy is efficient when the distributed system runs on a single physical network based on broadcast, as it requires only a single message to send each heartbeat to all processes. ...
... The model itself can be extended in several ways, for example, to allow the recovery of processes and network partitions. Furthermore, the implementation and empirical evaluation of the proposed detectors based on the quality of service metrics [35] and the adoption of machine learning [37] are also planned as future work. Also planned as applications of the proposed model and algorithms for failure detection in large-scale asynchronous systems, cloud computing [38], and the Internet of Things [39]. ...
Article
Full-text available
Reliable systems require effective monitoring techniques for fault identification. System-level diagnosis was originally proposed in the 1960s as a test-based approach to monitor and identify faulty components of a general system. Over the last decades, several diagnosis models and strategies have been proposed, based on different fault models, and applied to the most diverse types of computer systems. In the 1990s, unreliable failure detectors emerged as an abstraction to enable consensus in asynchronous systems subject to crash faults. Since then, failure detectors have become the de facto standard for monitoring distributed systems. The purpose of the present work is to fill a conceptual gap by presenting a distributed diagnosis model that is consistent with unreliable failure detectors. Properties are proven for the number of tests/monitoring messages required, latency for event detection, as well as completeness and accuracy. Three different failure detectors compliant with the proposed model are presented, including vRing and vCube, which provide scalable alternatives to the traditional all-monitor-all strategy adopted by most existing failure detectors.
... Aplicações podem ter restrições temporais, e um detector que possui um atraso muito grande na detecção de falhas pode não ser suficiente. Por este motivo, [Chen et al. 2002] propõe métricas para a qualidade de serviço (quality of service), ou simplesmente QoS, de detectores de falhas. De maneira geral, as métricas de QoS para um detector de falhas buscam descrever a velocidade (speed) e a exatidão (accuracy) da detecção. ...
... In other words, the metrics define how fast the detector detects a failure and how well it avoids mistakes. Also in [Chen et al. 2002], the authors propose a failure detector, called NFD-E, which can be configured according to the QoS parameters required by the application at hand. This detector targets the probabilistic system model proposed by the authors. ...
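The NFD-E-style detector described above computes, for each expected heartbeat, a "freshness point": the estimated arrival time of the next heartbeat plus a safety margin α that trades detection speed against accuracy. A minimal illustrative sketch (the class name, window size, and averaging scheme are simplifications for illustration, not the authors' exact algorithm):

```python
class ChenEstimator:
    """Sketch of an NFD-E-style timeout: the next freshness point is the
    estimated arrival time of the next heartbeat plus a safety margin alpha."""

    def __init__(self, interval, alpha, window=100):
        self.interval = interval   # heartbeat sending period (eta)
        self.alpha = alpha         # safety margin: speed vs. accuracy trade-off
        self.window = window       # number of recent heartbeats kept
        self.arrivals = []         # (sequence number, arrival time) pairs

    def record(self, seq, arrival_time):
        """Record the arrival of heartbeat number `seq`."""
        self.arrivals.append((seq, arrival_time))
        if len(self.arrivals) > self.window:
            self.arrivals.pop(0)

    def freshness_point(self):
        """Estimated arrival time of the next heartbeat, plus alpha."""
        if not self.arrivals:
            return None
        # Average the per-heartbeat offset A_i - i*interval, then project
        # it forward to the next expected sequence number.
        base = sum(a - s * self.interval for s, a in self.arrivals) / len(self.arrivals)
        next_seq = self.arrivals[-1][0] + 1
        return base + next_seq * self.interval + self.alpha

    def suspects(self, now):
        """Suspect the monitored process once the freshness point passes."""
        tau = self.freshness_point()
        return tau is not None and now > tau
```

A larger α lowers the mistake rate at the cost of a longer worst-case detection time, which is exactly the QoS trade-off the metrics above quantify.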
Conference Paper
Failure detectors are abstractions that, depending on the properties they provide, allow consensus to be solved in asynchronous distributed systems. This work presents a failure detection service based on epidemic (gossip) dissemination. The service was implemented for the JXTA platform. To allow evaluation with a larger number of processes, a simulator was also implemented. Experimental results are presented for processor and memory usage, detection time, and the detector's mistake rate, as well as its use in leader election. The experimental and simulation results indicate that the service scales with the number of processes and show that the epidemic dissemination strategy has significant advantages in groups with a large number of processes.
Conference Paper
A failure detector is an essential component in the construction of reliable distributed systems, and its design depends strongly on the distributed system model, which has demanded solutions to handle node movement in mobile ad hoc networks (MANETs). This work presents an unreliable asynchronous gossip-based failure detector that distinguishes faulty nodes from mobile ones by maintaining information about the received signal strength at the system's nodes, mapped into a small history of regions. The evaluations show improvements in the detector's quality of service when compared with the traditional gossip algorithm.
Chapter
Partially synchronous models are often assumed for designing distributed protocols because they capture realistic timing assumptions, such as the asynchronous and synchronous periods that the system can experience. In some of these models, protocols need to estimate network delays. Some protocols fix the global message delay bound for all executions, which leads to sub-optimal solutions in terms of latency, because this bound must be chosen conservatively. And other protocols employ delay estimation mechanisms that only give an upper bound on the delay without quantifying the estimation error. The performance of these protocols depends on how close their estimations are in relation to the actual network delay. For instance, some Byzantine consensus protocols use timeouts based on this estimation. We formalize this problem as the Global Delay Bound Estimation (GDBE) and address it by introducing a distributed oracle that enriches partially synchronous models. This oracle produces estimates of the channel delays that allow processes to derive an efficient global bounded estimate. Oracles and global bounded estimates provide a framework that facilitates the design of protocols for partially synchronous models and the analysis of their time complexity. We formalize the properties of the oracle and the proposed framework and show that it can be implemented in the presence of crash failures. In contrast, we prove that GDBE cannot be solved in the Byzantine failure model, and show how to circumvent this impossibility using an extra assumption. Finally, we show how to use our framework to implement a view synchronizer, thus obtaining an efficient solution for Byzantine consensus. Keywords: Oracle, Global delay, Timeout, Consensus, Crash failure, Byzantine failure, Channel delay, Synchronizer, Fixed delay, Partial synchrony, One-way delay.
Article
Failure detection is one of the basic functions in building a reliable disaster recovery backup system. Addressing the failure detection problem in application-level disaster recovery backup, this paper analyzes the remote disaster recovery center architecture and the failure detection hierarchy, and predicts the arrival time of cross-domain heartbeat information through a back-propagation neural network based on particle swarm optimization (PSO-BP). When the predicted timeout is actually reached, active Auxiliary Detection (AD) is used to improve the correctness of failure detection; finally, the effectiveness of the PSO-BP-AD method is verified through simulation.
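The core idea of predicting the next heartbeat's arrival time from the history of past arrivals can be illustrated with a much simpler stand-in for the paper's PSO-BP neural predictor: an exponentially weighted moving average over recent inter-arrival gaps (the function name and the smoothing factor `beta` are illustrative assumptions, not from the paper):

```python
def ewma_arrival_predictor(arrivals, beta=0.2):
    """Predict the next heartbeat arrival time from past arrival times.

    Uses an exponentially weighted moving average of the inter-arrival
    gaps -- a deliberately simple stand-in for the PSO-BP predictor
    described in the paper. `arrivals` is a list of timestamps in
    increasing order.
    """
    if len(arrivals) < 2:
        raise ValueError("need at least two arrivals to estimate a gap")
    gap = arrivals[1] - arrivals[0]
    # Smooth over the remaining gaps, weighting recent ones more heavily.
    for prev, cur in zip(arrivals[1:], arrivals[2:]):
        gap = (1 - beta) * gap + beta * (cur - prev)
    return arrivals[-1] + gap
```

A learned predictor such as PSO-BP can track non-stationary delay patterns that a fixed smoothing factor cannot, which is the motivation for the paper's approach.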
Conference Paper
Unreliable failure detectors (FDs) are used as a basic building block in the specification and implementation of fault tolerance in asynchronous distributed systems. A typical example of a large-scale asynchronous distributed system is the Internet. In this context, traditional FDs present problems, since they were designed for controlled networks (LANs). One problem to be addressed is message explosion: in large-scale systems, where the number of processes and the delays are unpredictable, message explosion can compromise the performance of the failure detection service and the scalability of the application. In this sense, this article addresses the message explosion problem by proposing a generic and practical approach that reuses application messages in place of control messages in FDs.
Article
Distributed systems that span large geographic distances or manage large numbers of objects are already commonplace. In such systems, programming applications with even modest reliability requirements to run correctly and efficiently is a difficult task due to asynchrony and the possibility of complex failure scenarios. We describe the architecture of the RELACS communication subsystem that constitutes the microkernel of a layered approach to reliable computing in large-scale distributed systems. RELACS is designed to be highly portable and implements a very small number of abstractions and primitives that should be sufficient for building a variety of interesting higher-level paradigms.
Article
Summary. We argue that the tools of decision theory should be taken more seriously in the specification and analysis of systems. We illustrate this by considering a simple problem involving reliable communication, showing how considerations of utility and probability can be used to decide when it is worth sending heartbeat messages and, if they are sent, how often they should be sent.
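The decision-theoretic view sketched above can be made concrete with a toy cost model (illustrative only, not the authors' formulation): sending heartbeats at period p costs msg_cost per message, while a crash (occurring at rate failure_rate per unit time) goes undetected for about p/2 on average, costing delay_cost per unit of detection delay. The best period minimizes the expected cost per unit time:

```python
def best_heartbeat_period(periods, msg_cost, failure_rate, delay_cost):
    """Pick the heartbeat period minimizing expected cost per unit time.

    Toy decision-theoretic model (an assumption for illustration, not the
    cited paper's exact utilities): messaging cost is msg_cost / p, and the
    expected detection-delay cost is failure_rate * delay_cost * (p / 2),
    since a crash waits about half a period before the next missed beat.
    """
    def cost(p):
        return msg_cost / p + failure_rate * delay_cost * (p / 2)
    return min(periods, key=cost)
```

In this model the two terms pull in opposite directions, so the optimum sits near p* = sqrt(2 * msg_cost / (failure_rate * delay_cost)); whether sending heartbeats is worth it at all depends on how these utilities compare, which is precisely the paper's point.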
Article
Summary. We study the problems of failure detection and consensus in asynchronous systems in which processes may crash and recover, and links may lose messages. We first propose new failure detectors that are particularly suitable to the crash-recovery model. We next determine under what conditions stable storage is necessary to solve consensus in this model. Using the new failure detectors, we give two consensus algorithms that match these conditions: one requires stable storage and the other does not. Both algorithms tolerate link failures and are particularly efficient in the runs that are most likely in practice – those with no failures or failure detector mistakes. In such runs, consensus is achieved within 3δ time and with 4n messages, where δ is the maximum message delay and n is the number of processes in the system.
Book
Probability. Measure. Integration. Random Variables and Expected Values. Convergence of Distributions. Derivatives and Conditional Probability. Stochastic Processes. Appendix. Notes on the Problems. Bibliography. List of Symbols. Index.