On the quality of service of failure detectors

Abstract

We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies: (1) how fast the failure detector detects actual failures and (2) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyze its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. We then present some simulation results that show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we suggest some ways to make our failure detector adaptive to changes in the probabilistic behavior of the network.
ON THE QUALITY OF SERVICE OF
FAILURE DETECTORS
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Wei Chen
May 2000
© Wei Chen 2000
ALL RIGHTS RESERVED
ON THE QUALITY OF SERVICE OF FAILURE DETECTORS
Wei Chen, Ph.D.
Cornell University 2000
Failure detectors are basic building blocks of fault-tolerant distributed systems
and are used in a wide variety of settings. They are also the basis of a paradigm for
solving several fundamental problems in fault-tolerant distributed computing such
as consensus, atomic broadcast, leader election, etc.
In this thesis, we study the quality of service (QoS) of failure detectors. By QoS,
we mean a specification that quantifies (a) how fast the failure detector detects actual
failures, and (b) how well it avoids false detections. To the best of our knowledge,
this is the first comprehensive and systematic study of the QoS of failure detectors
that provides both a rigorous mathematical foundation and practical solutions.
We first study the QoS specification of failure detectors. In particular, we propose
a set of QoS metrics that are especially suited for specifying failure detectors with
probabilistic behaviors. We then provide a rigorous mathematical foundation based
on stochastic modeling to support our QoS specification.
Next, we develop a new failure detector algorithm for systems with probabilistic
behaviors (i.e., the behaviors of message delays and message losses follow some prob-
ability distributions). We perform quantitative analysis and derive closed formulas
on the QoS metrics of the new algorithm. We show that among a large class of fail-
ure detectors, the new algorithm is optimal with respect to some of the QoS metrics.
We then show how to configure the new failure detector algorithm to satisfy QoS
requirements given by an application. In order to put the algorithm into practice, we
further explain how to modify the algorithm so that it works when the local clocks
of processes are not synchronized, and how to configure the failure detector even if
the probabilistic behaviors of the system are not known. Finally, we run simulations
of both the new algorithm and a simple failure detector algorithm commonly used in
practice. The simulation results demonstrate that the new failure detector algorithm
provides better QoS than the simple algorithm.
Biographical Sketch
Wei Chen was born on May 2, 1968 in Beijing, China. During most of his first
twenty years, he lived with his parents in a lovely neighborhood north of Long
Tan Lake and three bus stops away from the famous Temple of Heaven, near which his
wife Jian Han was brought up. At an early age, his mother fostered his interest in
mathematics, while his father sent him to a nearby amateur sports school to receive
regular soccer training. Since then, mathematics and soccer have been two of his
long-lasting interests, giving him much joy and excitement.
After six years at No. 26 Middle School (later renamed Hui Wen Middle School
during the years when Jian was studying there), where he wrote his first program
on an Apple II computer, he entered Tsinghua University in 1986 and selected
Computer Science as his major. He received his Bachelor of Engineering degree in
July 1991 with the honor of “Excellent Graduate”, and then continued at Tsinghua
for graduate study and received his Master of Engineering degree in March 1993.
Only around this time did he finally meet Jian, even though they had been brought up
in nearby neighborhoods, and had attended the same elementary and middle schools.
After graduation, he worked in the Department of Computer Science and Tech-
nology, Tsinghua University as a Teaching and Research Associate. In August 1994,
he came to the States to pursue his doctoral degree at the Department of Com-
puter Science, Cornell University. One year later, he married Jian, who since then
has accompanied and supported him throughout his study at Cornell, and in the
meantime pursued her own graduate degree in management science.
To my parents, Chen Chengda, Wang Zhengli
and my wife, Jian
Acknowledgements
More than any other person, I am indebted to my wife, Jian. Her love, understanding,
and support have made my five years at Cornell much more joyful and much less
frustrating than they could have been.
I am extremely grateful to my advisor, Sam Toueg, who has guided me through
my research work. Sam has taught me everything important to high-quality aca-
demic research, from exploring new ideas, to formulating the results, to writing every
single sentence of a paper. I cannot imagine how I could have reached this point
without his help and guidance.
I have also benefited a lot from the collaboration with Marcos Kawazoe Aguilera.
Working with Marcos is always a pleasant and informative experience.
I extend my gratitude to my other committee members Robbert van Renesse,
Joseph Halpern, David Shmoys, and Jon Kleinberg (as the proxy of Professor Shmoys
at my defense), who carefully reviewed my thesis work and provided helpful input. I
have also greatly benefited from interactions throughout the years with many peo-
ple in and outside the department. Among others, they include Ken Birman, Tushar
Chandra, Francis Chu, Vassos Hadzilacos, Narahari U. Prabhu, Michel Raynal, and
Jean-Marie Sulmont.
I would like to offer my special thanks to my friend and soccer teammate Fang
Xue, who has supplied me with much needed background knowledge in probability
theory and stochastic processes. I would also like to thank Thomas Wan and Brian
James, who read part of the thesis and helped me improve my thesis presentation.
I would like to thank my teachers and advisors at Tsinghua, in particular Pro-
fessors Dai Yiqi, Lin Xingliang, Lu Kaicheng, Huang Liansheng, and Lu Zhongwan,
who introduced me to computer science research.
This research work is partially supported by NSF grants CCR-9402896 and CCR-
9711403, and by ARPA/ONR grant N00014-96-1-1014. Any opinions, findings, or
recommendations presented in this thesis, however, are my own and do not neces-
sarily reflect the views of any of the organizations mentioned in this paragraph.
Last but not least, I thank my family, and my wife’s family, for their constant
support of my graduate study at Cornell.
Table of Contents
1 Introduction 1
1.1 On the QoS Specification of Failure Detectors . . . 4
1.2 The Design and Analysis of a New Failure Detector Algorithm . . . 6
1.3 Summary of Other Research Works . . . 8
1.3.1 Failure Detection and Consensus in the Crash-Recovery Model . . . 8
1.3.2 Achieving Quiescence with the Heartbeat Failure Detector . . . 9
1.4 Thesis Organization . . . 11
2 On the Quality-of-Service Specification of Failure Detectors 12
2.1 Introduction . . . 12
2.1.1 Background and Motivation . . . 13
2.1.2 Related Work . . . 16
2.2 Failure Detector Specification . . . 18
2.2.1 The Failure Detector Model . . . 18
2.2.2 Primary Metrics . . . 19
2.2.3 Derived Metrics . . . 21
2.3 Relations between Accuracy Metrics . . . 25
2.4 Discussion . . . 28
3 Stochastic Modeling of Failure Detectors and Their Quality-of-Service Specifications 31
3.1 Introduction . . . 31
3.2 Failure Detector Model . . . 32
3.2.1 Failure Detector Definition . . . 33
3.2.2 Failure Detector Histories as Marked Point Processes . . . 36
3.2.3 The Steady State Behaviors of Failure Detectors . . . 38
3.3 Failure Detector Specification Metrics . . . 46
3.3.1 Definitions of Metrics . . . 46
3.3.2 Relations between Accuracy Metrics . . . 50
4 The Design and Analysis of a New Failure Detector Algorithm 60
4.1 Introduction . . . 60
4.1.1 A Common Failure Detection Algorithm and its Drawbacks . . . 61
4.1.2 The New Algorithm and its QoS Analysis . . . 62
4.1.3 Related Work . . . 65
4.2 The Probabilistic Network Model . . . 67
4.3 The New Failure Detector Algorithm and Its Analysis . . . 68
4.3.1 The Algorithm . . . 68
4.3.2 The Analysis . . . 69
4.3.3 An Optimality Result . . . 84
4.3.4 Configuring the Failure Detector to Satisfy QoS Requirements . . . 88
4.4 Dealing with Unknown System Behavior and Unsynchronized Clocks . . . 92
4.4.1 Configuring the Failure Detector NFD-S When the Probabilistic Behavior of the Messages is Not Known . . . 92
4.4.2 Dealing with Unsynchronized Clocks . . . 96
4.4.3 Configuring the Failure Detector When Local Clocks are Not Synchronized and the Probabilistic Behavior of the Messages is Not Known . . . 102
4.5 Simulation Results . . . 106
4.5.1 Simulation Results of NFD-S . . . 108
4.5.2 Simulation Results of NFD-E . . . 112
4.5.3 Simulation Results of the Simple Algorithm . . . 118
A Theory of Marked Point Processes 127
Bibliography 137
List of Figures
2.1 Detection time T_D . . . 14
2.2 FD_1 and FD_2 have the same query accuracy probability of .75, but the mistake rate of FD_2 is four times that of FD_1 . . . 14
2.3 FD_1 and FD_2 have the same mistake rate 1/16, but the query accuracy probabilities of FD_1 and FD_2 are .75 and .50, respectively . . . 15
2.4 Mistake duration T_M, good period duration T_G, and mistake recurrence time T_MR . . . 20
4.1 Three scenarios of the failure detector output in one interval [τ_i, τ_{i+1}) . . . 68
4.2 The new failure detector algorithm NFD-S, with synchronized clocks, and with parameters η and δ . . . 70
4.3 The new failure detector algorithm NFD-U, with unsynchronized clocks and known expected arrival times, and with parameters η and α . . . 97
4.4 The new failure detector algorithm NFD-E, with unsynchronized clocks and estimated expected arrival times, and with parameters η and α . . . 98
4.5 The maximum detection times observed in the simulations of NFD-S (shown by +) . . . 108
4.6 The average mistake recurrence times obtained from the simulations of NFD-S (shown by +), with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —) . . . 109
4.7 The 99% confidence intervals for the expected values of mistake recurrence times of NFD-S, with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —) . . . 113
4.8 The change of the QoS of NFD-E when n increases. Parameter α = 1.90 . . . 115
4.9 The maximum detection times observed in the simulations of NFD-E (shown by ×) . . . 116
4.10 The average mistake recurrence times obtained from the simulations of NFD-E (shown by ×), with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —) . . . 117
4.11 The maximum detection times observed in the simulations of SFD-L and SFD-S . . . 119
4.12 The average mistake recurrence times obtained from the simulations of SFD-L and SFD-S, with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —) . . . 120
Chapter 1
Introduction
Fault-tolerant distributed systems are designed to provide reliable and continuous
service despite the failures of some of their components. A basic building block
in these systems is the failure detector. Failure detectors are used in a wide vari-
ety of settings, such as network communication protocols [Bra89], computer clus-
ter management [Pfi98], group membership protocols [ADKM92, BvR93, BDGB94,
vRBM96, MMSA+96, Hay98], etc.
Roughly speaking, a failure detector consists of distributed modules such that
each process has access to a local failure detector module that provides (possibly
erroneous) information about which processes have crashed. This information is
typically given in the form of a list of suspects. In general, due to the nondeter-
minism present in distributed systems, such as message delays and losses caused by
network congestion, failure detectors are not reliable: a process that has crashed is
not necessarily suspected, and a process may be erroneously suspected even though
it has not crashed.
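This suspect-list interface can be sketched in a few lines. The sketch below is only an illustration of the informal description above (the class and method names are ours, not the thesis's), using the common timeout-based rule that makes such detectors unreliable in exactly the ways just described:

```python
class FailureDetector:
    """Minimal local failure detector module: remembers the last time a
    message was received from each monitored process, and suspects any
    process that has been silent for longer than a fixed timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}  # process id -> time of the last message

    def on_message(self, process, now):
        # Any message from a process is evidence that it is still up;
        # a late message also removes an earlier (mistaken) suspicion.
        self.last_heard[process] = now

    def suspects(self, now):
        # Possibly erroneous output: a slow or lossy link can make a
        # live process look crashed, and a crashed process is suspected
        # only after its silence exceeds the timeout.
        return [p for p, t in self.last_heard.items()
                if now - t > self.timeout]
```

For example, with a timeout of 3 time units, a process last heard from at time 0 is not suspected at time 2 but is suspected at time 5, whether or not it has actually crashed.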
Chandra and Toueg [CT96] provide the first formal specification of unreliable
failure detectors and show how they can be used to solve some fundamental problems
in distributed computing, such as consensus and atomic broadcast. This approach
was later used and/or generalized in other works, e.g., [GLS95, DFKM96, FC96,
ACT, ACT00, ACT99].
In all of the above works, the failure detector specifications are defined in terms of
the eventual behaviors of failure detectors (e.g., a process that crashes is eventually
suspected). These specifications are appropriate for purely asynchronous systems
in which there is no timing assumption whatsoever. Practical distributed systems,
however, usually do have certain timing constraints. In these systems, applications
require more than just properties on the eventual behaviors of failure detectors. For
example, a failure detector that starts suspecting a process one hour after the process
crashes may still satisfy the properties necessary for solving asynchronous consensus,
but it can hardly satisfy the requirements of any application in practice. Therefore,
in practice, one needs to know the quality of service (QoS) of failure detectors. By
QoS, we mean a specification that quantifies the behavior of a failure detector. More
precisely, it specifies (a) how fast the failure detector detects actual failures, and (b)
how well it avoids false detections.
In this thesis, we focus on the QoS of failure detectors. More specifically:
1. We study how to specify the QoS of failure detectors. In particular:
(a) We propose a set of QoS metrics that are especially suited for specifying
failure detectors with probabilistic behaviors.
(b) We provide a rigorous mathematical foundation based on stochastic
modeling to support our QoS specification.
2. We develop a new failure detector algorithm, and study the QoS it provides.
In particular:
(a) We perform a quantitative analysis and derive closed formulas on the QoS
metrics of the new algorithm.
(b) We show that among a large class of failure detectors the new algorithm
is optimal with respect to some of the QoS metrics.
(c) We show how to configure the algorithm so that it meets the QoS required
by an application. More precisely, given the QoS requirements of an ap-
plication, we show how to use the closed formulas we derived to compute
the parameters of the new algorithm to satisfy the requirements.
(d) To widen the applicability of the new algorithm, we further explain how
to configure the failure detector even if the probabilistic behavior of the
system is not known, and how to modify the algorithm so that it works
when the local clocks of processes are not synchronized.
(e) We run simulations of both the new algorithm and a simple algorithm
commonly used in practice, and from the simulation results we demon-
strate that the new algorithm is better than the simple algorithm with
respect to some QoS metrics.
To the best of our knowledge, this is the first comprehensive and systematic study
of the QoS of failure detectors that provides both a rigorous mathematical foundation
and practical solutions.
1.1 On the QoS Specification of Failure Detectors
How should one specify the QoS of a failure detector? As pointed out above, a
failure detector may be slow in detecting a crash, and it may make mistakes, i.e., it
may suspect some processes that are actually up. Thus the QoS specification should
be given by a set of metrics that describes the failure detector’s speed (how fast it
detects crashes) and its accuracy (how well it avoids mistakes). Note that, when
specifying the QoS of a failure detector, we should consider the failure detector as a
“black box”: the QoS metrics should refer only to the external behavior of the failure
detector, and not to various aspects of its internal implementation.
A failure detector’s speed is easy to measure: this is the time elapsed from the mo-
ment when a process crashes to the time when the failure detector starts suspecting
the process permanently. We call this QoS metric the detection time.
The accuracy metrics should measure how well a failure detector avoids erroneous
suspicions of processes that are actually up. Therefore, when measuring the accuracy
of failure detectors, we assume that the processes being monitored do not crash. It
turns out that determining a good set of accuracy metrics is a subtle task. The
subtleties are due to the variety of the accuracy aspects that applications might be
interested in. For example, consider an application that at random times queries
a failure detector about a process being monitored. For such an application, a
natural measure of accuracy is the probability that, when queried at a random time,
the failure detector does not suspect the process, i.e., the failure detector output
is correct. We call this QoS metric the query accuracy probability. This metric,
however, is not sufficient to fully describe the accuracy of a failure detector. In fact,
it is easy to find two failure detectors that have the same query accuracy probability,
but one makes mistakes more frequently than the other. In some applications, every
mistake of the failure detector causes a costly interrupt, and for such applications the
mistake rate is an important accuracy metric. Mistake rate alone, however, cannot
fully characterize the accuracy either: one can find two failure detectors that have
the same mistake rate but different query accuracy probabilities. Moreover, even when
used together, these two metrics are still not sufficient. It is easy to find two failure
detectors such that one is better in both mistake rate and query accuracy probability,
but the other is better in some other aspect of the accuracy.
These subtleties show that there are several different aspects of accuracy that may
be important to applications, and each aspect has a corresponding accuracy metric.
We identify six accuracy metrics, and then use the theory of stochastic processes to
determine their relations. Based on these relations, we select two accuracy metrics
as the primary ones in the sense that (a) they are not redundant (one cannot be
derived from the other), and (b) together, they can be used to derive the other four
accuracy metrics. These two accuracy metrics, together with the detection time,
provide the QoS specification of failure detectors.
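To make the two accuracy metrics just named concrete, both can be estimated from a recorded output history. The helper below is a hypothetical illustration, not taken from the thesis: it reads the detector's trust/suspect transitions over an observation window in which the monitored process never crashes, and returns the query accuracy probability (the fraction of time a random query would see the correct "trust" output) and the mistake rate (new erroneous suspicions per unit time):

```python
def accuracy_metrics(transitions, horizon):
    """transitions: time-ordered (time, state) pairs with state "T"
    (trust) or "S" (suspect); the output is "T" from time 0 until the
    first transition.  horizon: total observation time.  Assumes the
    monitored process stays up, so every suspicion is a mistake."""
    trust_time = 0.0
    mistakes = 0
    prev_t, prev_state = 0.0, "T"
    for t, state in transitions:
        if prev_state == "T":
            trust_time += t - prev_t       # time spent trusting so far
        if state == "S" and prev_state == "T":
            mistakes += 1                  # a new mistake begins here
        prev_t, prev_state = t, state
    if prev_state == "T":
        trust_time += horizon - prev_t     # trusting until the horizon
    return trust_time / horizon, mistakes / horizon
```

For example, a detector that is wrongly suspicious during [8, 10) of a 16-unit window has query accuracy probability 14/16 = 0.875 and mistake rate 1/16, the style of quantities compared in Figures 2.2 and 2.3.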
The QoS metrics we proposed are especially suited for specifying failure detectors
with probabilistic behaviors (such probabilistic behaviors may be due to the fact that
(a) message losses and delays follow a certain probability distribution, or (b) the
failure detector algorithm itself uses randomization, as in [vRMH98]). We provide a
solid mathematical foundation based on stochastic modeling to formally model the
probabilistic behaviors of failure detectors and their QoS. More precisely, we use the
theory of marked point processes to formally define the failure detector model and
the QoS metrics proposed, and then we perform a rigorous analysis on the relations
between the accuracy metrics under this formal model.
1.2 The Design and Analysis of a New Failure
Detector Algorithm
When designing a failure detector algorithm, one should strive to achieve both good
speed and good accuracy. However, these are two conflicting objectives. To see this,
note that in practice a failure detector typically works as follows: the failure detector
waits for messages from the process being monitored, and if it does not receive any
message from the process for a while, it starts suspecting the process. This suspicion
could be a mistake since the messages from the process may be delayed or lost. If
the failure detector waits for a longer period of time before suspecting the process,
it reduces the chance of making a mistake, but it increases the detection time if
the process actually crashes. Conversely, if the failure detector waits for a shorter
period of time before suspecting the process, it reduces the detection time if the
process actually crashes, but increases the chance of making a mistake. Thus, to
design a good algorithm, one should find the right balance between these two
conflicting objectives.
We first examine a simple failure detector algorithm commonly used in practice,
and notice that when the variation of the message delays is large, this algorithm
cannot achieve both good speed and good accuracy. We then design a new failure
detector algorithm that overcomes the problem of the simple algorithm.
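The trade-off just described can be seen in a toy Monte Carlo experiment. The sketch below is our own illustration, not the thesis's simulation setup: heartbeats are sent once per time unit, suffer exponentially distributed delays and occasional losses (both distributions are arbitrary choices made here), and we count how often a timeout expires even though the sender never crashes:

```python
import random

def false_suspicions(timeout, period=1.0, n=10000, loss=0.01):
    """Count timeout expirations (i.e., mistakes) in a run where the
    monitored process never crashes.  Heartbeat i is sent at time
    i*period, is lost with probability `loss`, and otherwise arrives
    after an exponential delay with mean 0.5 (illustrative choices)."""
    arrivals = sorted(i * period + random.expovariate(2.0)
                      for i in range(n) if random.random() > loss)
    # A mistake occurs whenever the gap between consecutive arrivals
    # exceeds the timeout: the detector wrongly suspects the process.
    return sum(1 for a, b in zip(arrivals, arrivals[1:]) if b - a > timeout)

random.seed(42)
few = false_suspicions(timeout=4.0)   # patient: accurate, but slow to detect
many = false_suspicions(timeout=1.2)  # hasty: fast detection, many mistakes
assert few < many
```

Lengthening the timeout from 1.2 to 4.0 cuts the mistakes dramatically, but any real crash would now go undetected for up to four extra time units, which is exactly the tension the text describes.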
We analyze the QoS of the new algorithm in distributed systems with probabilistic
behaviors (i.e., the behaviors of message delays and message losses follow some prob-
ability distributions). We use the theory of stochastic processes in the analysis, and
derive closed formulas on the QoS metrics of the new algorithm. We then show the
following optimality result: Roughly speaking, among all failure detectors that send
messages at the same rate and satisfy the same upper bound on the worst-case de-
tection time, the new failure detector algorithm is optimal with respect to the query
accuracy probability. This shows that the new failure detector algorithm provides
both good speed and good accuracy. We then show that, given a set of QoS require-
ments by an application, we can use the closed formulas we derived to compute the
parameters of the new algorithm to meet these requirements.
Next, we explain how to make the new algorithm applicable to more general
settings. This involves the following two modifications: (a) When configuring the new
failure detector algorithm to meet an application’s QoS requirements, the original
configuration procedure requires knowledge of the probabilistic behaviors of the
system (i.e., the probability distributions of message delays and message losses). We
show how to configure the new failure detector even if the probabilistic behavior of
the system is not known. (b) The new failure detector algorithm is first given with
the assumption that the local clocks of processes are synchronized. We show how to
modify the new algorithm so that this assumption is no longer necessary.
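For modification (a), the underlying idea can be illustrated simply: the detector can estimate the quantities that the configuration procedure needs from the heartbeats it actually receives. The sketch below is a simplified illustration of that idea, not the thesis's estimation procedure; it assumes synchronized clocks (so a message's delay is its arrival time minus its send time), and the function name and record format are ours:

```python
def estimate_message_behavior(received):
    """received: (seq, send_time, arrival_time) for each heartbeat that
    arrived, with sequence numbers assigned consecutively from 0 by the
    sender.  Returns sample estimates of the message loss probability
    and of the mean and variance of the message delay."""
    delays = [arr - snd for _, snd, arr in received]
    n = len(delays)
    mean_delay = sum(delays) / n
    var_delay = sum((d - mean_delay) ** 2 for d in delays) / n
    # Gaps in the sequence numbers reveal heartbeats that never arrived.
    highest_seq = max(seq for seq, _, _ in received)
    loss_prob = 1.0 - n / (highest_seq + 1)
    return loss_prob, mean_delay, var_delay
```

Estimates of this kind can then be plugged into closed QoS formulas in place of the true, unknown distribution parameters, which is the spirit of configuring the detector without knowing the system's probabilistic behavior.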
Finally, we run simulations of both the new algorithm and the simple algorithm,
and provide a detailed analysis of the simulation results. The conclusions we draw
from these simulations are: (a) the simulation results of the new algorithm are con-
sistent with our mathematical analysis of the QoS metrics; (b) the new algorithm
that does not assume synchronized clocks provides similar QoS as the algorithm that
assumes synchronized clocks; and (c) when comparing the new algorithm with the
simple algorithm under the condition that both algorithms send messages at the
same rate and satisfy the same bound on the worst-case detection time, the new
algorithm provides (in some cases orders of magnitude) better accuracy than the
simple algorithm.
1.3 Summary of Other Research Works
Our research on the QoS of failure detectors aims to provide both a solid foundation
and useful solutions to practical systems. In the same spirit, our other research works
emphasize extending previous theoretical works to more practical computing models.
These works have appeared or will appear as the following journal papers [ACT00,
ACT, ACT99]. We only briefly summarize the main results of these research works
here.
1.3.1 Failure Detection and Consensus in the Crash-Recovery Model
The problem of solving consensus in asynchronous systems with unreliable failure
detectors was first investigated in [CT96, CHT96]. These works established the
paradigm of using failure detection to solve some fundamental problems in fault-
tolerant computing. However, these works only considered systems where process
crashes are permanent and links are reliable (i.e., they do not lose messages). In
practical distributed systems, processes may recover after crashing and links may
lose messages.
In [ACT00], we study the problems of failure detection and consensus in asyn-
chronous systems in which processes may crash and recover, and links may lose
messages. We first propose new failure detectors that are particularly suited for the
crash-recovery model. We next determine the conditions under which stable storage
is necessary to solve consensus in this model. Using the new failure detectors, we give
two consensus algorithms that match these conditions: one requires stable storage
and the other does not. Both algorithms tolerate link failures and are particularly
efficient in the runs that are most likely in practice, namely those with no failures or fail-
ure detector mistakes. In such runs, consensus is achieved within 3δ time units and
with 4n messages, where δ is the maximum message delay and n is the number of
processes in the system.
1.3.2 Achieving Quiescence with the Heartbeat Failure Detector
An algorithm is quiescent if it eventually stops sending messages. Quiescence is an
important property of an algorithm, but in asynchronous systems subject to both
process crashes and message losses, quiescence is not easy to achieve.
In [ACT], we study the problem of achieving reliable communication with quies-
cent algorithms in asynchronous systems with process crashes and lossy links. We
first show that it is impossible to solve this problem in purely asynchronous sys-
tems (with no failure detectors). We then show that, among failure detectors that
output lists of suspects, the weakest one that can be used to solve this problem is
◊P, a failure detector that cannot be implemented. To overcome this difficulty, we
introduce an implementable failure detector called Heartbeat and show that it can be
used to achieve quiescent reliable communication. Heartbeat is novel: in contrast to
typical failure detectors, it does not output lists of suspects and it is implementable
without timeouts. With Heartbeat, many existing algorithms that tolerate only pro-
cess crashes can be transformed into quiescent algorithms that tolerate both process
crashes and message losses. This can be applied to consensus, atomic broadcast,
k-set agreement, atomic commitment, etc.
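The contrast with suspect-list detectors can be made concrete. A Heartbeat-style detector outputs a vector of unbounded counters and consults no timeouts; interpreting the counters is left to the algorithm that uses the detector. The sketch below is our illustrative reading of that idea, not the construction from [ACT]:

```python
class HeartbeatDetector:
    """Outputs a counter per monitored process instead of a suspect
    list.  A process's counter keeps increasing as long as its
    heartbeats keep arriving; a crashed process is one whose counter
    eventually stops growing.  No timeout is ever consulted, which is
    why a detector of this style is implementable."""

    def __init__(self, processes):
        self.counts = {p: 0 for p in processes}

    def on_heartbeat(self, sender):
        self.counts[sender] += 1

    def output(self):
        # A snapshot of the counter vector; never a list of suspects.
        return dict(self.counts)
```

An algorithm using such a detector typically keeps retransmitting to a process only while that process's counter is still increasing, which is, roughly, how quiescence is obtained despite crashes and message losses.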
In [ACT99], we show how to achieve quiescent reliable communication and quies-
cent consensus in partitionable networks, in which not only may processes crash and
messages be lost, but the network may also be partitioned into disconnected
components. We first tackle the problem of reliable communication for partitionable
networks by extending the results in [ACT]. In particular, we generalize the speci-
fication of the heartbeat failure detector, show how to implement it, and show how
to use it to achieve quiescent reliable communication. We then turn our attention
to the problem of consensus for partitionable networks. We first show that, even
though this problem can be solved using a natural extension of failure detector
◊S (the one used in [CT96] to solve consensus), such solutions are not quiescent;
in other words, ◊S alone is not sufficient to achieve quiescent consensus in parti-
tionable networks. We then solve this problem using ◊S and the quiescent reliable
communication primitives that we developed.
1.4 Thesis Organization
In Chapter 2, we propose a set of metrics for the QoS specification of failure detectors.
In Chapter 3, we present the formalization of the failure detector model and the QoS
specification. In Chapter 4, we develop a new failure detector algorithm, analyze its
QoS, show its optimality result, show how to configure the algorithm to satisfy QoS
requirements given by an application, show how to make the algorithm applicable
to more general settings, and show the simulation results that provide an empirical
comparison between the new algorithm and the simple algorithm. In Appendix A,
we summarize relevant definitions and results in the theory of marked point processes
that are used in Chapter 3.
Chapter 2
On the Quality-of-Service Specification of Failure Detectors
2.1 Introduction
In this chapter, we study how to specify the quality of service (QoS) of failure de-
tectors. In particular, we propose a set of QoS metrics that are especially suited for
specifying failure detectors with probabilistic behaviors (such probabilistic behaviors
may be due to the fact that (a) message losses and delays follow a certain probabil-
ity distribution, or (b) the failure detector algorithm itself uses randomization, as in
[vRMH98]).
2.1 .1 Background and Motivation
We consider message-passing distributed systems in which processes may fail by crashing, and messages may be delayed or dropped by communication links.¹ In such systems, failure detectors typically provide a list of processes that are suspected to have crashed so far. A failure detector can be slow, i.e., it may take a long time to suspect a process that has crashed, and it can make mistakes, i.e., it may erroneously suspect some processes that are actually up (such a mistake is not necessarily permanent: the failure detector may later remove this process from its list of suspects). To be useful, a failure detector has to be reasonably fast and accurate.
In this chapter, we propose a set of metrics for the QoS specification of failure detectors. In general, these QoS metrics should be able to describe the failure detector's speed (how fast it detects crashes) and its accuracy (how well it avoids mistakes). Note that speed is with respect to processes that crash, while accuracy is with respect to processes that do not crash.

A failure detector's speed is easy to measure: this is simply the time that elapses from the moment when a process p crashes to the time when the failure detector starts suspecting p permanently. This QoS metric, called detection time, is illustrated in Fig. 2.1.
How do we measure a failure detector's accuracy? It turns out that determining a good set of accuracy metrics is a delicate task. To illustrate some of the subtleties involved, consider a system of two processes p and q connected by a lossy communication link, and suppose that the failure detector at q monitors process p. The output of the failure detector at q is either "I suspect that p has crashed" or "I trust that p is up", and it may alternate between these two outputs from time to time. For the purpose of measuring the accuracy of the failure detector at q, suppose that p does not crash.

¹We assume that process crashes are permanent, or, equivalently, that a process that recovers from a crash assumes a new identity.

Figure 2.1: Detection time T_D
Consider an application that queries q's failure detector at random times. For such an application, a natural measure of accuracy is the probability that, when queried at a random time, the output of the failure detector at q is "I trust that p is up", which is correct. This QoS metric is the query accuracy probability. For example, in Fig. 2.2, the query accuracy probability of FD_1 at q is 12/(12+4) = 0.75.
Figure 2.2: FD_1 and FD_2 have the same query accuracy probability of 0.75, but the mistake rate of FD_2 is four times that of FD_1.
Figure 2.3: FD_1 and FD_2 have the same mistake rate 1/16, but the query accuracy probabilities of FD_1 and FD_2 are 0.75 and 0.50, respectively.

The query accuracy probability, however, is not sufficient to fully describe the
accuracy of a failure detector. To see this, we show in Fig. 2.2 two failure detectors FD_1 and FD_2 such that (a) they have the same query accuracy probability, but (b) FD_2 makes mistakes more frequently than FD_1.² In some applications, every mistake causes a costly interrupt, and for such applications the mistake rate is an important accuracy metric.

Note, however, that the mistake rate alone is not sufficient to characterize accuracy: as shown in Fig. 2.3, two failure detectors can have the same mistake rate, but different query accuracy probabilities.
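The comparisons made in Figs. 2.2 and 2.3 can be reproduced with a few lines of code. The following sketch is ours (the helper names are not from the text): it computes the query accuracy probability and the mistake rate from the alternating durations of a failure detector's trust and suspect periods in a failure-free run.

```python
def query_accuracy(trust, suspect):
    # Fraction of time the output is "trust p", i.e. correct when p is up.
    return sum(trust) / (sum(trust) + sum(suspect))

def mistake_rate(trust, suspect):
    # One S-transition (one mistake) per trust/suspect cycle.
    return len(suspect) / (sum(trust) + sum(suspect))

# Fig. 2.2: FD_1 alternates 12 units of trust with 4 of suspicion,
# FD_2 alternates 3 with 1: same P_A, but FD_2's mistake rate is 4x higher.
assert query_accuracy([12] * 3, [4] * 3) == query_accuracy([3] * 12, [1] * 12) == 0.75
assert mistake_rate([3] * 12, [1] * 12) == 4 * mistake_rate([12] * 3, [4] * 3)

# Fig. 2.3: FD_1 alternates 12/4, FD_2 alternates 8/8: same mistake
# rate 1/16, but query accuracy probabilities 0.75 vs 0.50.
assert mistake_rate([12] * 3, [4] * 3) == mistake_rate([8] * 3, [8] * 3) == 1 / 16
assert query_accuracy([8] * 3, [8] * 3) == 0.5
```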
Even when used together, the above two accuracy metrics are still not sufficient. In fact, it is easy to find two failure detectors FD_1 and FD_2 such that (a) FD_1 is better than FD_2 in both measures (i.e., it has a higher query accuracy probability and a lower mistake rate), but (b) FD_2 is better than FD_1 in another respect: specifically, whenever FD_2 makes a mistake, it corrects this mistake faster than FD_1; in other words, the mistake durations in FD_2 are smaller than in FD_1. Having small mistake durations may be important to some applications.

²The failure detector at q makes a mistake every time its output changes from "trust" to "suspect" while p is actually up.
As can be seen from the above, there are several different aspects of accuracy that may be important to applications, and each aspect has a corresponding accuracy metric.

In this chapter, we first identify six accuracy metrics (since the behavior of a failure detector is probabilistic, most of these metrics are random variables). We then use the theory of stochastic processes to determine their precise relation. This analysis allows us to select two accuracy metrics as the primary ones in the sense that: (a) they are not redundant (one cannot be derived from the other), and (b) together, they can be used to derive the other four accuracy metrics.

In summary, we show that the QoS specification of failure detectors can be given in terms of three basic metrics, namely, the detection time and the two primary accuracy metrics that we identified. Taken together, these metrics can be used to characterize and compare the QoS of failure detectors.
2.1.2 Related Work
There is not much previous work on the QoS specification of failure detectors.

In [CT96], unreliable failure detectors were introduced as an abstraction that can be used to solve some fundamental problems of fault-tolerant distributed computing, such as consensus, in asynchronous systems. This approach was later used and/or generalized in other works, e.g., [GLS95, DFKM96, FC96, ACT, ACT00, ACT99]. In all of these works, the failure detector specifications are defined in terms of the eventual behaviors of failure detectors (e.g., a process that crashes is eventually suspected). The eventual behavior, however, does not describe the QoS of failure detectors (e.g., how fast a process that crashes becomes suspected).
In [GM98], Gouda and McGuire measure the performance of some failure detector protocols under the assumption that the protocol stops as soon as some process is suspected to have crashed (even if this suspicion is a mistake). This class of failure detectors is less general than the one that we study here: in our work, a failure detector can alternate between suspicion and trust many times.

In [vRMH98], van Renesse et al. propose a gossip-style randomized failure detector protocol. They measure the accuracy of this protocol in terms of the probability of premature timeouts.³ The probability of premature timeouts, however, is not an appropriate metric for the specification of failure detectors in general: it is implementation-specific and it cannot be used to compare failure detectors that use timeouts in different ways. This point is further explained in Section 2.4.
In [VR00], Veríssimo and Raynal study timing failure detectors, which detect timing failures (such as the delay of a message or the execution time of a task exceeding a given time bound). The class of timing failure detectors is more general than the class of failure detectors that detect crash failures. They also study QoS, but their work differs significantly from ours: what they study is QoS-FD failure detectors that detect quality-of-service failures. More precisely, they study failure detectors that output some index (and other derived information) to indicate the quality of service of some system services (e.g., network connectivity). What we study here is the quality of service of (crash) failure detectors, i.e., how good a failure detector is in terms of detecting process crashes, and how to configure the failure detector to satisfy QoS requirements given by an application.

³This is called "the probability of mistakes" in [vRMH98].
The rest of the chapter is organized as follows. In Section 2.2, we propose a set of QoS metrics for failure detectors. We quantify the relation between these metrics in Section 2.3, and conclude the chapter with a brief discussion in Section 2.4.

In this chapter, we keep our presentation at an intuitive level. The formal definitions of our model and of our QoS metrics are developed using the theory of stochastic processes, and are given in Chapter 3.
2.2 Failure Detector Specification
We consider a system of two processes p and q. We assume that the failure detector at q monitors p, and that q does not crash. Henceforth, real time is continuous and ranges from 0 to ∞.
2.2.1 The Failure Detector Model
The output of the failure detector at q at time t is either S or T, which means that q suspects or trusts p at time t, respectively. A transition occurs when the output of the failure detector at q changes: an S-transition occurs when the output at q changes from T to S; a T-transition occurs when the output at q changes from S to T. We assume that there are only a finite number of transitions during any finite time interval. A failure detector history describes the output of the failure detector in an entire run.
A failure pattern of process p is just a number F ∈ [0, ∞], denoting the time at which p crashes; F = ∞ means that p does not crash. A run in which p does not crash is called a failure-free run. For each failure pattern, there is a corresponding set of possible failure detector histories (as in [CT96]), and this set has a probability distribution. To understand this, consider the following example. Let F be the failure pattern in which p crashes at time 5. The failure detector at q may detect this crash at time 6, or 6.74, or 9, etc. This is the set H of failure detector histories corresponding to the failure pattern F. In a probabilistic system, some of the failure detector histories in H are more likely than others, and this is given by the probability distribution on H. With this probability distribution, quantities like "the probability that the crash of p is detected before time 8" or "the expected detection time" are now meaningful.
We consider only failure detectors whose behavior eventually reaches steady state, as we now explain.⁴ When a failure detector starts running, and for a while after, its behavior depends on the initial condition (such as whether initially q suspects p or not) and on how long it has been running. Typically, as time passes, the effect of the initial condition gradually diminishes and its behavior no longer depends on how long it has been running; i.e., eventually the failure detector's behavior reaches equilibrium, or steady state. In steady state, the probability law governing the behavior of the failure detector does not change over time. The QoS metrics that we propose refer to the behavior of a failure detector after it reaches steady state.
2.2.2 Primary Metrics
We propose three primary metrics for the QoS specification of failure detectors. The first one measures the speed of a failure detector. It is defined with respect to the runs in which p crashes.

⁴We omit the formal definition of steady state here; this definition is based on the theory of stochastic processes.

Figure 2.4: Mistake duration T_M, good period duration T_G, and mistake recurrence time T_MR
Detection time (T_D): Informally, T_D is the time that elapses from p's crash to the time when q starts suspecting p permanently. More precisely, T_D is a random variable representing the time that elapses from the time that p crashes to the time when the final S-transition (of the failure detector at q) occurs and there are no transitions afterwards (Fig. 2.1). If there is no such final S-transition, then T_D = ∞; if such an S-transition occurs before p crashes, then T_D = 0.
The next two metrics can be used to specify the accuracy of a failure detector. They are defined with respect to failure-free runs.⁵

Mistake recurrence time (T_MR): this measures the time between two consecutive mistakes. More precisely, T_MR is a random variable representing the time that elapses from an S-transition to the next one (Fig. 2.4). If no new S-transition occurs, then T_MR = ∞.

Mistake duration (T_M): this measures the time it takes the failure detector to correct a mistake. More precisely, T_M is a random variable representing the time that elapses from an S-transition to the next T-transition (Fig. 2.4). If no S-transition occurs, then T_M = 0; if no T-transition occurs after an S-transition, then T_M = ∞.

⁵In Section 2.4, we explain why these metrics also measure the failure detector accuracy in runs in which p crashes.
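As a small illustration (this sketch is ours, not part of the model), samples of T_M and T_MR can be extracted from the S-transition and T-transition times of a failure-free run:

```python
def mistake_samples(s_times, t_times):
    # Assumes the transitions of a failure-free run alternate S, T, S, T, ...
    # (each suspicion is eventually corrected, as in Fig. 2.4).
    t_m = [t - s for s, t in zip(s_times, t_times)]       # S-transition -> next T-transition
    t_mr = [b - a for a, b in zip(s_times, s_times[1:])]  # S-transition -> next S-transition
    return t_m, t_mr

# A trace with suspicions at times 10 and 26, each corrected 4 units later.
t_m, t_mr = mistake_samples([10, 26], [14, 30])
assert t_m == [4, 4] and t_mr == [16]
```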
As we discussed in the introduction, there are many aspects of failure detector accuracy that may be important to applications. Thus, in addition to T_MR and T_M, we propose four other accuracy metrics in the next section. We selected T_MR and T_M as the primary metrics because, given these two, one can compute the other four (this will be shown in Section 2.3).
2.2.3 Derived Metrics
We propose the following four additional accuracy metrics (they are defined with respect to failure-free runs).

Average mistake rate (λ_M): this measures the rate at which a failure detector makes mistakes, i.e., it is the average number of S-transitions per time unit. This metric is important to long-lived applications such as group membership and cluster management, where each mistake (each S-transition) results in a costly interrupt.

Query accuracy probability (P_A): this is the probability that the failure detector's output is correct at a random time. This metric is important to applications that interact with the failure detector by querying it at random times.
Many applications are slowed down by failure detector mistakes. Such applications prefer a failure detector with long good periods, i.e., periods in which the failure detector makes no mistakes. This observation leads to the following two metrics.

Good period duration (T_G): this measures the length of a good period. More precisely, T_G is a random variable representing the time that elapses from a T-transition to the next S-transition (Fig. 2.4). If no T-transition occurs, then T_G = 0; if no S-transition occurs after a T-transition, then T_G = ∞.
For short-lived applications, however, a closely related metric may be more relevant. Suppose that an application is started at a random time, and that this happens to occur somewhere inside a good period. In this case, we are interested in measuring the remaining portion of this good period: if it is long enough, the short-lived application will be able to complete its task within this period. The corresponding metric is as follows.

Forward good period duration (T_FG): this is a random variable representing the time that elapses from a random time at which q trusts p, to the time of the next S-transition. If no such S-transition occurs, then T_FG = ∞. If the probability that q trusts p at a random time is 0 (i.e., P_A = 0), then T_FG is always 0.

At first sight, it may seem that, on average, T_FG is just half of T_G (the length of a good period). But this is incorrect, and in Section 2.3 we give the actual relation between T_FG and T_G.
We now give a simple example to illustrate how these definitions work.

Example 1. Consider the following simple failure detector algorithm A: process p sends a heartbeat message to q every one time unit; process q suspects p initially; every time q receives a heartbeat message from p, q trusts p for one time unit, and by the end of the unit, if q has not received any new heartbeat message, then q starts suspecting p.

Suppose that algorithm A runs in the following simplified network environment: every heartbeat message is either lost, or is delivered instantaneously; each heartbeat message has an independent probability p_L ∈ (0, 1) of being lost.
We now analyze all seven metrics of this failure detector. In this system, if p does not crash, then q keeps trusting p if and only if the heartbeat messages are not lost. Once a message is lost, q starts suspecting p immediately and the suspicion is kept until a new heartbeat message is received.
For the detection time T_D, let T be the time elapsed between the time t when p sends its last message and the time t′ when p crashes. T has a uniform distribution between 0 and 1. If the last heartbeat message is lost, then q starts suspecting p permanently at time t, before p crashes, and so T_D = 0 in this case. If the last message is not lost, then q starts suspecting p permanently at time t + 1, and so T_D = 1 − T in this case. Therefore, the distribution of T_D is such that with probability p_L, T_D = 0, and with probability 1 − p_L, T_D = 1 − T, where T has a uniform distribution between 0 and 1. Thus we have

    Pr(T_D ≤ x) = 0                     if x < 0
    Pr(T_D ≤ x) = p_L                   if x = 0
    Pr(T_D ≤ x) = p_L + x(1 − p_L)      if 0 < x ≤ 1
    Pr(T_D ≤ x) = 1                     if x > 1
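The derived distribution can be transcribed directly as a function (a sketch of ours; `p_loss` stands for p_L):

```python
def detection_time_cdf(x, p_loss):
    # Pr(T_D <= x) for Example 1: an atom of mass p_L at 0 (the last
    # heartbeat is lost), plus uniform density 1 - p_L over (0, 1].
    if x < 0:
        return 0.0
    if x <= 1:
        return p_loss + x * (1 - p_loss)
    return 1.0

assert detection_time_cdf(-1.0, 0.2) == 0.0
assert detection_time_cdf(0.0, 0.2) == 0.2
assert detection_time_cdf(0.5, 0.2) == 0.2 + 0.5 * 0.8
assert detection_time_cdf(2.0, 0.2) == 1.0
```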
For the accuracy metrics, suppose that p does not crash.
For the mistake duration T_M, suppose that a message m_j is lost, which causes an S-transition of the failure detector. Then after m_j, the first message that q receives causes the next T-transition. Since message losses are independent, the probability that m_{j+i} is the first message that q receives after m_j is p_L^{i−1}(1 − p_L), for all i ≥ 1. Since messages are sent every one unit of time and are delivered instantaneously if not lost, we know that the distribution of T_M is such that with probability p_L^{i−1}(1 − p_L), T_M = i, for i ≥ 1. This is a geometric distribution with parameter 1 − p_L.
The good period duration T_G has a distribution symmetric to that of T_M. Suppose that message m_j is received and it causes a T-transition. Then after m_j, the first message that is lost causes the next S-transition. Thus the distribution of T_G is such that with probability (1 − p_L)^{i−1} p_L, T_G = i, for i ≥ 1. This is a geometric distribution with parameter p_L.
For the mistake recurrence time T_MR, starting at an S-transition, the first message that is received causes the next T-transition, and then the first message that is lost causes the next S-transition. The length of the suspicion period is independent of the length of the trust period that follows, due to the independence of message losses. Thus T_MR is the sum of two independent random variables X and Y, where X and Y have geometric distributions with parameters 1 − p_L and p_L, respectively.
For the average mistake rate λ_M, in any unit time interval in steady state, there is either no S-transition or exactly one S-transition. Thus the average number of S-transitions in the unit interval is just the probability that one S-transition occurs in the interval. An S-transition occurs in the interval if and only if the message sent in the interval is lost and the previous message is not lost. Thus the probability that an S-transition occurs in the interval is p_L(1 − p_L). Therefore, λ_M = p_L(1 − p_L).
For the query accuracy probability P_A, q trusts p at a random time t if and only if the message sent before t is not lost. Therefore, P_A = 1 − p_L.
For the forward good period duration T_FG, suppose that q trusts p at a random time t. Let T′ be the time elapsed from t to the time when the next heartbeat message is sent. Then T′ has a uniform distribution from 0 to 1. From time t on, an S-transition occurs when a heartbeat message is lost. Therefore, T_FG has the distribution such that with probability (1 − p_L)^i p_L, T_FG = T′ + i, for i ≥ 0.
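The closed-form values of λ_M and P_A derived above can be checked with a small Monte Carlo simulation of algorithm A (a sketch of ours, under the simplified network model of this example):

```python
import random

def simulate_algorithm_a(p_loss, n_msgs=200_000, seed=7):
    # Heartbeats are sent once per time unit; each is lost independently
    # with probability p_loss, otherwise delivered instantaneously.
    rng = random.Random(seed)
    received = [rng.random() >= p_loss for _ in range(n_msgs)]
    # q trusts p during unit interval i iff heartbeat i was received.
    p_a = sum(received) / n_msgs
    # An S-transition occurs in interval i iff heartbeat i is lost
    # and heartbeat i-1 was received.
    s_transitions = sum(prev and not cur for prev, cur in zip(received, received[1:]))
    lam = s_transitions / n_msgs
    return p_a, lam

p_a, lam = simulate_algorithm_a(0.1)
# The analysis predicts P_A = 1 - p_L = 0.9 and lambda_M = p_L(1 - p_L) = 0.09;
# the estimates above should agree to within Monte Carlo error.
```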
2.3 Relations between Accuracy Metrics
In Theorem 2.1 below, we state the relation between the six accuracy metrics that we defined in the previous sections. We then use this theorem to justify our choice of the primary accuracy metrics.

Henceforth, Pr(A) denotes the probability of event A, and E(X), V(X), and σ(X) denote the expected value (or mean), the variance, and the standard deviation of random variable X, respectively.
Parts (2) and (3) of Theorem 2.1 assume that in failure-free runs, the probability distribution of failure detector histories is ergodic. Roughly speaking, this means that in failure-free runs, the failure detector slowly "forgets" its past history: from any given time on, its future behavior may depend only on its recent behavior. We call failure detectors satisfying this ergodicity condition ergodic failure detectors. In Chapter 3, we formally define the ergodicity condition, prove the following theorem, and also show the relations between our accuracy metrics in the case that ergodicity does not hold.
Theorem 2.1 For any ergodic failure detector, the following results hold:

(1) T_G = T_MR − T_M.

(2) If 0 < E(T_MR) < ∞, then:

    λ_M = 1 / E(T_MR),                                              (2.1)

    P_A = E(T_G) / E(T_MR) = (E(T_MR) − E(T_M)) / E(T_MR).          (2.2)

(3) If 0 < E(T_MR) < ∞ and E(T_G) = 0, then T_FG is always 0. If 0 < E(T_MR) < ∞ and E(T_G) ≠ 0, then:

    for all x ∈ [0, ∞), Pr(T_FG ≤ x) = (1 / E(T_G)) ∫_0^x Pr(T_G > y) dy,    (2.3)

    E(T_FG^k) = E(T_G^{k+1}) / ((k + 1) E(T_G)).                    (2.4)

In particular,

    E(T_FG) = E(T_G²) / (2 E(T_G)) = (E(T_G) / 2) (1 + V(T_G) / E(T_G)²).    (2.5)
The fact that T_G = T_MR − T_M holds is immediate by definition. Equalities (2.1) and (2.2) are intuitive, but (2.3), (2.4) and (2.5), which describe the relation between T_G and T_FG, are more complex. Moreover, (2.5) is counter-intuitive: one may think that E(T_FG) = E(T_G)/2, but (2.5) says that E(T_FG) is in general larger than E(T_G)/2 (this is a version of the "waiting time paradox" in the theory of stochastic processes [All90]).
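A quick way to see the waiting-time paradox is to compute E(T_FG) empirically from a list of good-period lengths (a sketch of ours): a random query time lands in a given good period with probability proportional to that period's length, and the remaining time is then uniform within the period, which yields E(T_FG) = E(T_G²)/(2 E(T_G)) as in (2.5).

```python
def mean_forward_good_period(periods):
    # Length-biased sampling: a period of length p is hit with probability
    # proportional to p, and the expected remainder within it is p / 2.
    return sum(p * p for p in periods) / (2 * sum(periods))

periods = [1.0, 1.0, 1.0, 9.0]             # mean good period E(T_G) = 3
e_tg = sum(periods) / len(periods)
e_tfg = mean_forward_good_period(periods)  # (1 + 1 + 1 + 81) / 24 = 3.5
assert e_tfg > e_tg / 2                    # 3.5 > 1.5: far more than half
```

Intuitively, a random query is far more likely to land inside the one long period than inside any given short one, which drags E(T_FG) above E(T_G)/2.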
We now explain how Theorem 2.1 guided our selection of the primary accuracy metrics. Equalities (2.1), (2.2) and (2.3) show that λ_M, P_A and T_FG can be derived from T_MR, T_M and T_G. This suggests that the primary metrics should be selected among T_MR, T_M and T_G. Moreover, since T_G = T_MR − T_M, it is clear that given the joint distribution of any two of them, one can derive the remaining one. Thus, two of T_MR, T_M and T_G should be selected as the primary metrics, but which two?

By choosing T_MR and T_M as our primary metrics, we get the following convenient property that helps to compare failure detectors: if FD_1 is better than FD_2 in terms of both E(T_MR) and E(T_M) (the expected values of the primary metrics), then we can be sure that FD_1 is also better than FD_2 in terms of E(T_G) (the expected value of the other metric). We would not get this useful property if T_G were selected as one of the primary metrics.⁶
We now demonstrate parts (2) and (3) of Theorem 2.1 with the example of the previous section.

Example 2. Consider again the algorithm in Example 1. Since message losses are independent, the behavior of the failure detector does not depend on its past history, and so the distribution of failure detector histories in failure-free runs is ergodic.

T_G has a geometric distribution with parameter p_L, so we have E(T_G) = 1/p_L and V(T_G) = (1 − p_L)/p_L². Similarly, we have E(T_M) = 1/(1 − p_L), and E(T_MR) = 1/p_L + 1/(1 − p_L) = 1/[p_L(1 − p_L)]. With p_L ∈ (0, 1), we have 0 < E(T_MR) < ∞ and E(T_G) ≠ 0. Thus the conditions for Equalities (2.1)–(2.5) to hold are satisfied.
From Example 1, we know that λ_M = p_L(1 − p_L) and P_A = 1 − p_L. Since E(T_MR) = 1/[p_L(1 − p_L)] and E(T_G) = 1/p_L, Equalities (2.1) and (2.2) hold for this failure detector.
We now check Equality (2.3). Given any x ∈ [0, ∞), let n = ⌊x⌋ and r = x − n. From Example 1, we know that T_FG has the distribution such that with probability (1 − p_L)^i p_L, T_FG = T′ + i, for i ≥ 0, where T′ has a uniform distribution from 0 to 1. Then

    Pr(T_FG ≤ x) = Pr(T_FG < n) + Pr(n ≤ T_FG ≤ x)
                 = Σ_{i=0}^{n−1} (1 − p_L)^i p_L + (1 − p_L)^n p_L Pr(0 ≤ T′ ≤ r)
                 = 1 − (1 − p_L)^n + r p_L (1 − p_L)^n.
⁶For example, FD_1 may be better than FD_2 in terms of both E(T_G) and E(T_M), but worse than FD_2 in terms of E(T_MR).
On the other hand, for any y ∈ [i − 1, i), Pr(T_G > y) = Σ_{j=i}^∞ Pr(T_G = j) = Σ_{j=i}^∞ (1 − p_L)^{j−1} p_L = (1 − p_L)^{i−1}. Thus

    (1 / E(T_G)) ∫_0^x Pr(T_G > y) dy
        = p_L [ Σ_{i=1}^{n} ∫_{i−1}^{i} Pr(T_G > y) dy + ∫_{n}^{n+r} Pr(T_G > y) dy ]
        = p_L [ Σ_{i=1}^{n} (1 − p_L)^{i−1} + r (1 − p_L)^n ]
        = 1 − (1 − p_L)^n + r p_L (1 − p_L)^n.
Therefore, Equality (2.3) holds for this failure detector.
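The two sides of this verification can also be checked numerically (a sketch of ours; `p_l` stands for p_L):

```python
import math

def tfg_cdf_direct(x, p_l):
    # Pr(T_FG <= x) = 1 - (1-p_L)^n + r * p_L * (1-p_L)^n, n = floor(x), r = x - n.
    n = math.floor(x)
    r = x - n
    return 1 - (1 - p_l) ** n + r * p_l * (1 - p_l) ** n

def tfg_cdf_inversion(x, p_l):
    # Right-hand side of (2.3): (1/E(T_G)) * integral_0^x Pr(T_G > y) dy,
    # with Pr(T_G > y) = (1-p_L)^(i-1) on [i-1, i) and E(T_G) = 1/p_L.
    n = math.floor(x)
    r = x - n
    s = sum((1 - p_l) ** (i - 1) for i in range(1, n + 1))
    return p_l * (s + r * (1 - p_l) ** n)

for x in (0.3, 1.0, 2.7, 5.5):
    assert abs(tfg_cdf_direct(x, 0.25) - tfg_cdf_inversion(x, 0.25)) < 1e-12
```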
Equalities (2.4) and (2.5) are direct consequences of Equality (2.3), and we only check Equality (2.5) here. From the distribution of T_FG, we have

    E(T_FG) = Σ_{i=0}^∞ E(T′ + i)(1 − p_L)^i p_L = Σ_{i=0}^∞ (i + 1/2)(1 − p_L)^i p_L = (2 − p_L)/(2p_L).

On the other hand,

    (E(T_G)/2) (1 + V(T_G)/E(T_G)²) = (1/(2p_L)) (1 + ((1 − p_L)/p_L²)/(1/p_L²)) = (2 − p_L)/(2p_L).
Therefore, Equality (2.5) holds for this failure detector.

Note that for this failure detector, E(T_FG) = (2 − p_L)/(2p_L) while E(T_G) = 1/p_L, so E(T_FG) > E(T_G)/2.
2.4 Discussion
On the Probability of Premature Timeouts
For timeout-based failure detectors, the probability of premature timeouts is sometimes used as the accuracy measure: this is the probability that when the timer is set, it will prematurely time out on a process that is actually up. The problem with this measure, however, is that (a) it is implementation-specific, and (b) it is not useful to applications unless it is given together with other implementation-specific measures, e.g., how often timers are started, whether the timers are started at regular or variable intervals, whether the timeout periods are fixed or variable, etc. (many such variations exist in practice [Bra89, GM98, vRMH98]). Thus, the probability of premature timeouts is not a good metric for the specification of failure detectors; e.g., it cannot be used to compare the QoS of failure detectors that use timeouts in different ways. The six accuracy metrics that we identified in this chapter do not refer to implementation-specific features; in particular, they do not refer to timeouts at all.
Accuracy Metrics and Runs with Crashes

To measure the accuracy of a failure detector that monitors a process p, we considered runs in which p does not crash. In real systems, however, such runs rarely occur: p is likely to crash eventually. Are our accuracy metrics applicable to such systems? The answer is yes, as we now explain.

Note that the output of any failure detector implementation at a time t should not depend on what happens after time t, i.e., the implementation does not predict the future.⁷ Therefore, the steady-state behavior of a failure detector before a process p crashes is the same as its steady-state behavior in runs in which p does not crash. Thus, our accuracy metrics also measure the accuracy of a failure detector in runs in which p eventually crashes (provided that this crash occurs after the failure detector has reached its steady-state behavior).

⁷Our model can enforce this assumption by imposing some restriction on the sets of failure detector histories and their associated distributions (see Chapter 3).
Good Periods versus Stable Periods

Recall that a good period of a failure detector is defined in terms of runs in which p does not crash. It starts when the failure detector trusts p (makes a T-transition) and terminates when the failure detector erroneously suspects p (makes an S-transition).

In contrast, a stable period of a system starts when the failure detector trusts p and p is up, and terminates when either (a) the failure detector suspects p, or (b) p crashes. The length of stable periods is an important measure for many applications. This measure, however, cannot be part of the QoS specification of failure detectors: since a failure detector has no control over process crashes, it cannot by itself ensure "long" stable periods, even if it is very accurate.

To measure the length of a stable period in a system, one can use measures of the accuracy of the failure detector and of the likelihood of crashes. For example, let T_G be the random variable representing the length of a good period of the failure detector, and C be the random variable representing the lifetime of process p. Assume that C has an exponential distribution (so that at any given time at which p is still up, the remaining lifetime of p has the same distribution as C). Let S be the random variable representing the length of a stable period after the failure detector has reached steady state. Then the distribution of S is given by Pr(S ≤ x) = 1 − Pr(C > x) Pr(T_G > x). Intuitively, this is because a stable period terminates as soon as the failure detector makes a mistake or p crashes.
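As an illustration of this formula (our own sketch, with assumed parameter values): if the good periods happen to be exponentially distributed as well, then S is exponential with the sum of the two rates.

```python
import math

def stable_period_cdf(x, crash_rate, tg_survival):
    # Pr(S <= x) = 1 - Pr(C > x) * Pr(T_G > x),
    # where C is exponential with rate crash_rate and tg_survival(x) = Pr(T_G > x).
    return 1.0 - math.exp(-crash_rate * x) * tg_survival(x)

# Assume mean good period 100 and mean lifetime 1000 (both exponential):
# then S is exponential with rate 1/100 + 1/1000.
cdf = stable_period_cdf(10.0, 1 / 1000, lambda x: math.exp(-x / 100))
assert abs(cdf - (1 - math.exp(-10 * (1 / 100 + 1 / 1000)))) < 1e-12
```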
Chapter 3

Stochastic Modeling of Failure Detectors and Their Quality-of-Service Specifications
3.1 Introduction
In Chapter 2, we proposed a set of metrics for the QoS specification of failure detectors. The definitions of failure detectors and the QoS metrics were kept at an intuitive level to emphasize the main ideas of the QoS specification of failure detectors. In this chapter, we give a rigorous formalization of failure detectors and their QoS metrics based on stochastic modeling, in particular the theory of marked point processes (cf. [Sig95]). On a first reading, readers can skip this chapter and read Chapter 4 without any difficulty.

In the formalization, we first define random failure detector histories that model the probabilistic behaviors of failure detectors. We show that a random failure detector history is an extension of a (particular type of) random marked point process. We next define failure detectors as mappings from failure patterns to random failure detector histories. This is an extension of the failure detector model of [CT96]. We then define some particular random failure detector histories as the steady-state behaviors of a failure detector and use them to define the QoS metrics. Some of these random failure detector histories match the stationary versions of random marked point processes. Finally, we analyze the relation between the QoS metrics we defined. The analysis is based on results in the theory of marked point processes, such as Birkhoff's Ergodic Theorem for marked point processes and the empirical inversion formulas. The relations we present in this chapter are more general than the results in Theorem 2.1 of Chapter 2.

The rest of the chapter is organized as follows. In Section 3.2, we define the failure detector model, which includes the definitions of random failure detector histories, failure detectors, and the steady-state behaviors of failure detectors. In Section 3.3, we define the QoS metrics and analyze the relation between these metrics. In Appendix A, we summarize relevant definitions and results in the theory of marked point processes.
3.2 Failure Detector Model

As in Chapter 2, we consider a system of two processes p and q, and a failure detector at q that monitors p. We assume that q does not crash. Real time is continuous and ranges from 0 to ∞.
3.2.1 Failure Detector Definition

As described in Section 2.2.1, the output of the failure detector is denoted as either S or T, and it has two types of transitions: S-transitions and T-transitions. Roughly speaking, a failure detector history describes the output of the failure detector in an entire run, and it can be represented by the initial output and the times at which transitions occur.

More precisely, we define a failure detector history as follows. Let R, R_+, and Z_+ denote the sets of real numbers, nonnegative real numbers, and nonnegative integers, respectively. Let K := {S, T}. For x ∈ K, let x̄ denote the element of K other than x.
A failure detector history is given as ψ = ⟨k, {t_n : n ∈ I}⟩ such that

(1) k ∈ K;
(2) I = Z_+ or I = {0, 1, ..., m − 1} for some m ∈ Z_+ (if m = 0, then I = ∅);
(3) t_n ∈ R_+ for all n ∈ I;
(4) if I = Z_+, then 0 ≤ t_0 < t_1 < t_2 < ···, and lim_{n→∞} t_n = ∞;
(5) if |I| = m < ∞, then 0 ≤ t_0 < t_1 < t_2 < ··· < t_{m−1}.
In the representation of failure detector history ψ, k represents the output of the failure detector at time 0, and the increasing sequence {t_n} represents the times at which failure detector transitions occur. We call t_n the n-th transition time (so ψ starts with the zeroth transition). When t_0 > 0, ψ = ⟨k, {t_n}⟩ represents a run in which the failure detector outputs k in the period [0, t_0), makes a transition at time t_0, then outputs k̄ in the period [t_0, t_1), and then makes another transition at time t_1, and so on. When t_0 = 0, ψ = ⟨k, {t_n}⟩ represents a run in which the failure detector has a transition at time 0, then outputs k in the period [t_0, t_1), makes a transition at time t_1, and then outputs k̄ in the period [t_1, t_2), and so on. Allowing a transition at time 0 is to conform with the representation of marked point processes (as in [Sig95]), which is the basic tool we use to model failure detectors. Intuitively, it makes sense when there is output before time 0, and this can happen if the time line is shifted. The requirement lim_{n→∞} t_n = ∞ in (4) enforces that there are only a finite number of transitions in any bounded time interval.
For a failure detector history $\psi = \langle k, \{t_n : n \in I\}\rangle$, we define the n-th inter-transition time $T_n$ of $\psi$ as follows. If $|I| = \infty$, then $T_n \stackrel{\mathrm{def}}{=} t_{n+1} - t_n$ for all $n \ge 0$; if $|I| = m < \infty$, then (a) if $m = 0$, then $T_0 \stackrel{\mathrm{def}}{=} \infty$ and $T_n \stackrel{\mathrm{def}}{=} 0$ for $n \ge 1$; and (b) if $m \ge 1$, then $T_n \stackrel{\mathrm{def}}{=} t_{n+1} - t_n$ for $0 \le n \le m-2$, $T_{m-1} \stackrel{\mathrm{def}}{=} \infty$, and $T_n \stackrel{\mathrm{def}}{=} 0$ for $n \ge m$.
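The case analysis above is mechanical, and it may help to see it in code. The following Python sketch (not from the dissertation; it assumes a history with finitely many transitions, given as a strictly increasing list of times) computes the first few inter-transition times:

```python
import math

def inter_transition_times(ts, count):
    """Inter-transition times T_n of a finite history <k, {t_n}>.

    ts: strictly increasing transition times; count: how many T_n to return.
    With m >= 1 transitions, T_n = t_{n+1} - t_n for n <= m-2, T_{m-1} is
    infinite, and T_n = 0 for n >= m; with m = 0, T_0 is infinite and
    T_n = 0 for n >= 1.
    """
    m = len(ts)
    out = []
    for n in range(count):
        if m == 0:
            out.append(math.inf if n == 0 else 0.0)
        elif n <= m - 2:
            out.append(ts[n + 1] - ts[n])
        elif n == m - 1:
            out.append(math.inf)
        else:
            out.append(0.0)
    return out
```

Note that the inter-transition times depend only on the transition times, not on the initial output $k$.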
To model the probabilistic behavior of a failure detector, we need to define random failure detector histories with probability distributions over the set of failure detector histories. To do so, we first need to define which subsets of failure detector histories we can assign probability to. Formally, we need to define a σ-field which contains all measurable subsets of failure detector histories. This is done as follows.
Let $H$ be the set of all failure detector histories. Let $\overline{Z}^+ \stackrel{\mathrm{def}}{=} Z^+ \cup \{\infty\}$. For $m \in \overline{Z}^+$, let $H^{(m)}$ be the set of all failure detector histories with exactly $m$ transitions, i.e., $H^{(m)} \stackrel{\mathrm{def}}{=} \{\psi = \langle k, \{t_n : n \in I\}\rangle : |I| = m\}$. Thus $\{H^{(m)} : m \in \overline{Z}^+\}$ forms a partition of $H$. We next define the Borel σ-field $\mathcal{B}(H^{(m)})$ of $H^{(m)}$ for each $m \in \overline{Z}^+$.

When $m < \infty$, $H^{(m)}$ is a subset of $K \times R^m$, where $R^m$ is the m-dimensional Euclidean space with the Borel σ-field $\mathcal{B}(R^m)$. Let $\mathcal{B}(K \times R^m)$ be the product σ-field generated by $\{K' \times B : K' \subseteq K, B \in \mathcal{B}(R^m)\}$. It is easy to check that $H^{(m)} \in \mathcal{B}(K \times R^m)$. Then, we get the Borel σ-field $\mathcal{B}(H^{(m)}) \stackrel{\mathrm{def}}{=} \{E : E \in \mathcal{B}(K \times R^m), E \subseteq H^{(m)}\}$.
When $m = \infty$, $H^{(\infty)}$ is a subset of $K \times R^{Z^+}$, where $R^{Z^+}$ is the set of all countably infinite sequences of real numbers. It is known (see e.g. [Sig95]) that $R^{Z^+}$ is a complete separable metric space, and the Borel σ-field $\mathcal{B}(R^{Z^+})$ is well defined. Let $\mathcal{B}(K \times R^{Z^+})$ be the product σ-field generated by $\{K' \times B : K' \subseteq K, B \in \mathcal{B}(R^{Z^+})\}$. It is easy to check that $H^{(\infty)} \in \mathcal{B}(K \times R^{Z^+})$. Then, we get the Borel σ-field $\mathcal{B}(H^{(\infty)}) \stackrel{\mathrm{def}}{=} \{E : E \in \mathcal{B}(K \times R^{Z^+}), E \subseteq H^{(\infty)}\}$.

With the above definitions of $\mathcal{B}(H^{(m)})$ for all $m \in \overline{Z}^+$, we then define the Borel σ-field $\mathcal{B}(H)$ of $H$ to be $\{\bigcup_{m \in \overline{Z}^+} E_m : E_m \in \mathcal{B}(H^{(m)})\}$. Hence we have a measurable space $(H, \mathcal{B}(H))$ on the set of all failure detector histories.
Some simple examples of Borel sets in $H$ are: (1) $\{\psi \in H : k = S\}$, the set of failure detector histories in which the output at time 0 is S; (2) $\{\psi \in H : t_0 \le x\}$, the set of failure detector histories in which the zeroth transition occurs within $x$ time units; and (3) $\{\psi \in H : T_0 \le x\}$, the set of failure detector histories in which the zeroth inter-transition time is at most $x$ time units.

It is easy to verify that $\mathcal{B}(H)$ can also be generated from the following collection of sets:
\[
\{\psi = \langle k, \{t_n : n \in I\}\rangle \in H : |I| = m,\ k \in K',\ t_{n_0} \le x_0, \ldots, t_{n_l} \le x_l\}, \tag{3.1}
\]
where $m \in \overline{Z}^+$, $K' \subseteq K$, $l \in I$, $0 \le n_0 < \cdots < n_l < m$, and $x_i \in R^+$.
We define a random failure detector history to be a measurable mapping $\Psi : \Omega \to H$, with $(\Omega, \mathcal{F}, P)$ as the underlying probability space. With this definition, the random failure detector history $\Psi$ has the probability distribution $P(\Psi \in E) \stackrel{\mathrm{def}}{=} P(\{\omega \in \Omega : \Psi(\omega) \in E\})$ defined for all $E \in \mathcal{B}(H)$. For convenience, we use $\{\Psi \in E\}$ as a shorthand for $\{\omega \in \Omega : \Psi(\omega) \in E\}$. Let $\boldsymbol{\Psi}$ be the set of all random failure detector histories.

A failure pattern $F$ of process p is just a number $F \in [0, \infty]$, denoting the time $F$ at which p crashes; $F = \infty$ means that p does not crash, and we call this pattern the failure-free pattern. Let $\mathcal{F}$ denote the set of all failure patterns; thus $\mathcal{F} = [0, \infty]$.

A failure detector D is a mapping $D : \mathcal{F} \to \boldsymbol{\Psi}$. Intuitively, the random failure detector history $D(F)$ characterizes the probabilistic behavior of the failure detector output when process p crashes at time $F$. This is an extension of the failure detector definition in [CT96] to model the probabilistic behavior of the failure detector output.
3.2.2 Failure Detector Histories as Marked Point Processes

We now build the relation between failure detector histories and marked point processes.

Given a failure detector history $\psi = \langle k, \{t_n : n \in I\}\rangle$ where $I \ne \emptyset$, let $k_n$ be the output of the failure detector at time $t_n$ for all $n \in I$. Thus, the transition that occurs at time $t_n$ is a $k_n$-transition, and in the period $[t_n, t_{n+1})$ the failure detector output is $k_n$. The relation between $k$ and the $k_n$'s is: (1) if $t_0 = 0$, then $k_0 = k_2 = \cdots = k$ and $k_1 = k_3 = \cdots = \bar{k}$; and (2) if $t_0 > 0$, then $k_0 = k_2 = \cdots = \bar{k}$ and $k_1 = k_3 = \cdots = k$. For notational convenience, let $t_{-1} \stackrel{\mathrm{def}}{=} 0$ and $k_{-1} \stackrel{\mathrm{def}}{=} k$.

With the $k_n$'s, the failure detector history $\psi$ can be equivalently represented as $\psi = \{(t_n, k_n) : n \in I\}$. When $I = Z^+$, this representation coincides with the representation of a simple marked point process as given in [Sig95]. In fact, a failure detector history with an infinite number of transitions can be directly modeled as a simple marked point process, with transitions as events and $K$ as the mark space.
Therefore, definitions and results for marked point processes can be directly applied to failure detector histories with an infinite number of transitions. For consistency and convenience, we extend some of the definitions to include failure detector histories with only a finite number of transitions.

One important extension is the shift mapping on failure detector histories with a finite number of transitions, as given below. Suppose $\theta_s : H \to H$ is a shift mapping defined on all failure detector histories. Intuitively, for a failure detector history $\psi$, $\theta_s\psi$ is the failure detector history obtained from $\psi$ by shifting the origin to $s$, using the output at time $s$ as the initial output, re-labeling the transitions at and after $s$ as $t_0, t_1, \ldots$, and ignoring the portion of the failure detector history before $s$. More precisely, if $s = 0$, then $\theta_s$ is the identity mapping; if $\psi$ has an infinite number of transitions, then $\psi$ is also a simple marked point process, and thus $\theta_s\psi$ is defined as in Appendix A. Now suppose $s > 0$ and $\psi = \langle k, \{t_n : n \in I\}\rangle$ has only a finite number of transitions, i.e., $|I| = m < \infty$. If $t_{i-1} < s \le t_i$ for some $i \in I$, then $\theta_s\psi \stackrel{\mathrm{def}}{=} \langle k', \{t_{i+n} - s : 0 \le n \le m-1-i\}\rangle$, where $k'$ is the output at time $s$: $k' = k_i$ if $s = t_i$, and $k' = \bar{k}_i$ if $s < t_i$. If $s > t_{m-1}$, then $\theta_s\psi \stackrel{\mathrm{def}}{=} \langle k_{m-1}, \emptyset\rangle$.
We now define the shift mapping by event time $\theta^{(j)}$ for $j \ge 0$. Intuitively, for a failure detector history $\psi$, $\theta^{(j)}\psi$ is the failure detector history obtained from $\psi$ by shifting the origin to the time of the j-th transition in $\psi$; if $\psi$ does not have enough transitions, then the origin is shifted to the last transition of $\psi$. More precisely, if $\psi$ has at least $j+1$ transitions, then $\theta^{(j)}\psi \stackrel{\mathrm{def}}{=} \theta_{t_j}\psi$; if $\psi$ has fewer than $j+1$ transitions, then $\theta^{(j)}\psi \stackrel{\mathrm{def}}{=} \theta_{t_{m-1}}\psi$, where $m$ is the number of transitions in $\psi$. We then let
\[
\psi_s \stackrel{\mathrm{def}}{=} \theta_s\psi \quad \text{and} \quad \psi^{(j)} \stackrel{\mathrm{def}}{=} \theta^{(j)}\psi. \tag{3.2}
\]
Note that $\psi^{(j)}$ always has a transition at the origin, except when $\psi$ itself has no transitions at all.
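For finite histories, both shift mappings are easy to implement. The following Python sketch (illustrative only, not from the dissertation; outputs S and T are represented as the strings 'S' and 'T', and the history is assumed to have finitely many transitions) computes $\theta_s$ directly from the definition, and $\theta^{(j)}$ by shifting to the j-th (or last) transition time:

```python
import bisect

FLIP = {'S': 'T', 'T': 'S'}

def shift(k, ts, s):
    """theta_s for a finite history <k, {t_n}>: the output at time s becomes
    the new initial output, and transitions at or after s are re-timed to
    t - s; earlier transitions are dropped."""
    if s == 0:
        return k, list(ts)                 # theta_0 is the identity mapping
    # Output at time s: k flipped once per transition in (0, s]; a
    # transition at time 0 does not flip the initial output k.
    flips = bisect.bisect_right(ts, s)
    if ts and ts[0] == 0:
        flips -= 1
    k_s = FLIP[k] if flips % 2 else k
    return k_s, [t - s for t in ts if t >= s]

def shift_by_event(k, ts, j):
    """theta^(j): shift to the j-th transition time, or to the last
    transition if the history has fewer than j + 1 transitions.  For a
    history with no transitions we return it unchanged (a choice made for
    this sketch; the text leaves that case implicit)."""
    return shift(k, ts, ts[min(j, len(ts) - 1)]) if ts else (k, [])
```

For example, for the history ⟨S, {1, 3}⟩ (output S on [0,1), T on [1,3), S on [3,∞)), shifting to s = 2 yields ⟨T, {1}⟩, and shifting past the last transition yields ⟨S, ∅⟩, matching the case $s > t_{m-1}$ in the definition.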
For a random failure detector history $\Psi : \Omega \to H$, let $\Psi_s : \Omega \to H$ be the random failure detector history obtained from $\Psi$ by shifting the origin to time $s$, that is, $\Psi_s(\omega) = \Psi(\omega)_s$ for all $\omega \in \Omega$. Similarly, let $\Psi^{(j)} : \Omega \to H$ be the random failure detector history obtained from $\Psi$ by shifting the origin to the time of the j-th transition, that is, $\Psi^{(j)}(\omega) = \Psi(\omega)^{(j)}$ for all $\omega \in \Omega$. Intuitively, $\Psi_s$ represents what you see if you always start observing $\Psi$ at time $s$, and $\Psi^{(j)}$ represents what you see if you always start observing $\Psi$ at the j-th transition.

The shift mappings are an important tool in the study of the steady state behaviors of failure detectors, as we discuss in the next section.
3.2.3 The Steady State Behaviors of Failure Detectors

We consider failure detectors whose behaviors eventually reach a steady state. Roughly speaking, when a failure detector starts running, and for a while after, its behavior depends on the initial condition (such as whether initially q suspects p or not) and on how long it has been running. Typically, as time passes, the effect of the initial condition gradually diminishes and the behavior no longer depends on how long the failure detector has been running, i.e., eventually the failure detector behavior reaches equilibrium, or steady state.

Suppose that while p is still up, the behavior of the failure detector reaches a steady state. We consider two kinds of behaviors in this case. First, if p remains up, what is the behavior of the failure detector? Second, if p crashes, what is the behavior of the failure detector in response to the crash of p?
We now formally define several random failure detector histories that capture such steady state behaviors. Let D be the failure detector under consideration.

The Steady State Behavior If p Does Not Crash

Let $F = \infty$ be the failure-free pattern of p. Then $\Psi \stackrel{\mathrm{def}}{=} D(\infty)$ defines the behavior of the failure detector output under this failure-free pattern. Suppose the underlying probability space defining $\Psi$ is $(\Omega, \mathcal{F}, P)$. The steady state behavior of $\Psi$ is given by its event stationary version $\Psi^0$ and its time stationary version $\Psi^*$, if they exist. Formally, they are defined by the following distributions (assuming they exist), just as in the definitions in Section 2.3 of [Sig95]:¹
\[
\Pr(\Psi^0 \in E) \stackrel{\mathrm{def}}{=} \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} P(\Psi^{(j)} \in E), \quad \text{for all } E \in \mathcal{B}(H), \tag{3.3}
\]
\[
\Pr(\Psi^* \in E) \stackrel{\mathrm{def}}{=} \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi_s \in E)\, ds, \quad \text{for all } E \in \mathcal{B}(H). \tag{3.4}
\]
The event stationary version $\Psi^0$ is obtained by averaging the distribution of $\Psi^{(j)}$ over all transition times, and the time stationary version $\Psi^*$ is obtained by averaging the distribution of $\Psi_s$ over all times. Such average distributions are referred to as empirical distributions in [Sig95]. The intuitive meanings of (3.3) and (3.4) are: if we randomly pick a transition and start observing $\Psi$ after this transition, the random failure detector history we observe is given by the event stationary version $\Psi^0$; if we randomly pick a real time $s$ and then observe the behavior of $\Psi$ after $s$, the random failure detector history we observe is given by the time stationary version $\Psi^*$. Using the expressions in [Sig95], $\Psi^0$ is the version of $\Psi$ when we randomly observe $\Psi$ way out at a transition, and $\Psi^*$ is the version of $\Psi$ when we randomly observe $\Psi$ way out in time.

¹Note that in (3.3) and (3.4) we use the notation $\Pr(\cdot)$ to avoid the complication of specifying the underlying probability spaces for $\Psi^0$ and $\Psi^*$. These stationary versions can be defined on probability spaces different from that of $\Psi$, but there is no need to specify them here since we are only interested in the probability distributions of $\Psi^0$ and $\Psi^*$. We will use the notation $\Pr(\cdot)$ whenever it is convenient for us.
Intuitively, $\Psi^0$ is event stationary, i.e., its distribution does not change if $\Psi^0$ is shifted by transition times (see Appendix A for the definition), because after already randomly observing $\Psi$ way out at a transition to obtain $\Psi^0$, observing $\Psi$ several transitions later makes no difference to the distribution of $\Psi^0$. Similarly, $\Psi^*$ is time stationary, i.e., its distribution does not change if $\Psi^*$ is shifted by time, because after already randomly observing $\Psi$ way out in time to obtain $\Psi^*$, observing $\Psi$ some time units later makes no difference to the distribution of $\Psi^*$.

Lemma 3.1 $\Psi^0$ is event stationary and $\Psi^*$ is time stationary.

Proof. The proof is the same as the proof in [Sig95], p. 26, except that the definitions of the shift mappings are extended to include the case where the number of transitions is finite.
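To make the two empirical distributions in (3.3) and (3.4) concrete, the following Monte Carlo sketch (illustrative only, not from the dissertation) simulates one long failure-free run of a toy detector whose trust and suspicion periods are exponential with the given means, an assumption made purely for illustration. For the event "the output at the origin is T", it estimates the event stationary probability (observe at a random transition) and the time stationary probability (observe at a random time):

```python
import bisect
import random

def empirical_versions(mean_good=9.0, mean_bad=1.0, n_periods=100_000, seed=5):
    """Contrast the event stationary (3.3) and time stationary (3.4) views
    of a toy failure-free run for the event 'output at the origin is T'.
    Trust and suspicion periods are exponential (assumed distributions)."""
    rng = random.Random(seed)
    ts, out_after = [], []        # transition times; output right after each
    now, out = 0.0, 'T'           # the run starts in a trust period
    for _ in range(n_periods):
        now += rng.expovariate(1.0 / (mean_good if out == 'T' else mean_bad))
        out = 'S' if out == 'T' else 'T'
        ts.append(now)
        out_after.append(out)
    # (3.3): observe right after a randomly chosen transition
    event_pr = sum(o == 'T' for o in out_after) / n_periods
    # (3.4): observe at a randomly chosen time in [0, now)
    n_samples, hits = 100_000, 0
    for _ in range(n_samples):
        s = rng.uniform(0.0, now)
        i = bisect.bisect_right(ts, s)   # transitions that occurred in [0, s]
        hits += (i % 2 == 0)             # output at s is T iff i is even
    return event_pr, hits / n_samples
```

The two versions genuinely differ: right after a random transition the output is T about half the time (transitions alternate), while at a random time it is T about mean_good/(mean_good + mean_bad) of the time, since long periods are more likely to be hit by a random time point.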
We say that the behavior of the failure detector D reaches steady state in failure-free runs if the distributions defined in (3.3) and (3.4) exist. The accuracy metrics of the failure detector are defined with respect to the steady state behavior of the failure detector in failure-free runs, i.e., with respect to the stationary versions $\Psi^0$ and $\Psi^*$.
To further understand the stationary versions $\Psi^0$ and $\Psi^*$, we break down the events in $\mathcal{B}(H)$ into different categories and study them separately. From Section 3.2.1, we know that $\{H^{(m)} : m \in \overline{Z}^+\}$ is a partition of $H$, and $\mathcal{B}(H) = \{\bigcup_{m\in\overline{Z}^+} E_m : E_m \in \mathcal{B}(H^{(m)})\}$. Therefore, for any event $E = \bigcup_{m\in\overline{Z}^+} E_m$, if we know $\Pr(\Psi^0 \in E_m)$ for all $m \in \overline{Z}^+$, then by the additivity of the probability measure we know that $\Pr(\Psi^0 \in E) = \sum_{m\in\overline{Z}^+} \Pr(\Psi^0 \in E_m)$. The case for $\Pr(\Psi^* \in E)$ is similar. Thus, we now focus on $\Pr(\Psi^0 \in E)$ and $\Pr(\Psi^* \in E)$ with $E \in \mathcal{B}(H^{(m)})$, for each $m \in \overline{Z}^+$.
Let $E_S$ and $E_T$ be the sets of all failure detector histories in which eventually the output is always S or always T, respectively. Thus $E_S \cup E_T$ contains all failure detector histories with a finite number of transitions. Let $p^\Psi_S \stackrel{\mathrm{def}}{=} P(\Psi \in E_S)$ and $p^\Psi_T \stackrel{\mathrm{def}}{=} P(\Psi \in E_T)$. Thus $p^\Psi_S$ and $p^\Psi_T$ are the probabilities that eventually the output of the random failure detector history $\Psi$ is always S or always T, respectively. Let $p^\Psi_\infty \stackrel{\mathrm{def}}{=} P(\Psi \in H^{(\infty)})$, the probability that $\Psi$ has an infinite number of transitions.
Proposition 3.2 $p^\Psi_S + p^\Psi_T + p^\Psi_\infty = 1$.

Proof. This is direct from the fact that $E_S$, $E_T$, and $H^{(\infty)}$ are disjoint and $E_S \cup E_T \cup H^{(\infty)} = H$.
The probabilities $p^\Psi_S$ and $p^\Psi_T$ are used to characterize the steady state behavior of failure detector histories with only a finite number of transitions. Intuitively, if a run of the failure detector has only a finite number of transitions, then in steady state the failure detector should keep its final output value. In other words, when you randomly observe a failure detector history with a finite number of transitions way out in time or way out at a transition, with probability one what you observe is the portion in which the failure detector keeps its final output value. The probability that the output you observe is S or T is given by $p^\Psi_S$ or $p^\Psi_T$, respectively. This is formalized in the following lemma.
Lemma 3.3

(1) $\Pr(\Psi^* \in \{\langle S, \emptyset\rangle\}) = p^\Psi_S$, and $\Pr(\Psi^* \in \{\langle T, \emptyset\rangle\}) = p^\Psi_T$;

(2) for all $m \in Z^+ \setminus \{0\}$ and all $E \in \mathcal{B}(H^{(m)})$, $\Pr(\Psi^* \in E) = 0$;

(3) $\Pr(\Psi^0 \in \{\langle S, \emptyset\rangle, \langle S, \{0\}\rangle\}) = p^\Psi_S$, and $\Pr(\Psi^0 \in \{\langle T, \emptyset\rangle, \langle T, \{0\}\rangle\}) = p^\Psi_T$;

(4) for all $m \in Z^+ \setminus \{0\}$ and all $E \in \mathcal{B}(H^{(m)})$, if $E$ contains neither $\langle S, \{0\}\rangle$ nor $\langle T, \{0\}\rangle$, then $\Pr(\Psi^0 \in E) = 0$.
Proof. (1) Let $E = \{\langle S, \emptyset\rangle\}$. We have that if $s \le t$, then $\{\Psi_s \in E\} \subseteq \{\Psi_t \in E\}$, i.e., if a failure detector history keeps the output S from time $s$ on, then it of course keeps the output S from a later time $t$ on. Thus $P(\Psi_s \in E) \le P(\Psi_t \in E)$. It is clear that $\{\Psi_t \in E\} \uparrow \{\Psi \in E_S\}$, i.e., as $t \to \infty$, $\{\Psi_t \in E\}$ monotonically increases and tends to $\{\Psi \in E_S\}$ from below. Then for integer-valued $n$, $\{\Psi_n \in E\} \uparrow \{\Psi \in E_S\}$. Since the probability measure is continuous from below (see e.g. [Bil95], p. 25), we have $P(\Psi_n \in E) \uparrow P(\Psi \in E_S)$, and then it is also true that $P(\Psi_t \in E) \uparrow P(\Psi \in E_S)$.

From (3.4), we have
\[
\Pr(\Psi^* \in E) = \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi_s \in E)\, ds \le \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi_t \in E)\, ds = \lim_{t\to\infty} P(\Psi_t \in E) = P(\Psi \in E_S) = p^\Psi_S.
\]
On the other hand, from $P(\Psi_t \in E) \uparrow P(\Psi \in E_S)$, we have that for all $\epsilon > 0$, there exists $K$ such that for all $s \ge K$, $P(\Psi_s \in E) \ge p^\Psi_S - \epsilon$. Then
\[
\Pr(\Psi^* \in E) = \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi_s \in E)\, ds \ge \lim_{t\to\infty} \frac{1}{t} \int_K^t \left(p^\Psi_S - \epsilon\right) ds = p^\Psi_S - \epsilon.
\]
Letting $\epsilon \to 0$, we have $\Pr(\Psi^* \in E) \ge p^\Psi_S$. Therefore, $\Pr(\Psi^* \in E) = p^\Psi_S$. Similarly, we can prove that $\Pr(\Psi^* \in \{\langle T, \emptyset\rangle\}) = p^\Psi_T$.
(2) For all $E \in \mathcal{B}(H^{(m)})$ with $m \in Z^+ \setminus \{0\}$, since $E \cap (H^{(\infty)} \cup \{\langle S, \emptyset\rangle, \langle T, \emptyset\rangle\}) = \emptyset$, we have $\Pr(\Psi^* \in E) \le 1 - \Pr(\Psi^* \in H^{(\infty)}) - p^\Psi_S - p^\Psi_T$. Thus, to prove $\Pr(\Psi^* \in E) = 0$, it is enough to show that $\Pr(\Psi^* \in H^{(\infty)}) = p^\Psi_\infty$, since we know that $p^\Psi_S + p^\Psi_T + p^\Psi_\infty = 1$.

To prove $\Pr(\Psi^* \in H^{(\infty)}) = p^\Psi_\infty$, note that $\{\Psi_s \in H^{(\infty)}\} = \{\Psi \in H^{(\infty)}\}$, i.e., a failure detector history has an infinite number of transitions from time $s$ on if and only if it itself has an infinite number of transitions. Then we have
\[
\Pr(\Psi^* \in H^{(\infty)}) = \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi_s \in H^{(\infty)})\, ds = \lim_{t\to\infty} \frac{1}{t} \int_0^t P(\Psi \in H^{(\infty)})\, ds = P(\Psi \in H^{(\infty)}) = p^\Psi_\infty.
\]
(3) and (4) have proofs similar to those of (1) and (2).
We now look at failure detector histories with an infinite number of transitions. We know that with probability $p^\Psi_\infty$, $\Psi$ has an infinite number of transitions. If $p^\Psi_\infty = 0$, then $p^\Psi_S + p^\Psi_T = 1$, and in the stationary versions of $\Psi$, only the trivial histories that never change the failure detector output have a nonzero probability.
We now consider the case when $p^\Psi_\infty > 0$. In this case, we restrict $\Psi$ onto $H^{(\infty)}$. More precisely, we first define the restricted probability space $(\Omega_\Psi, \mathcal{F}_\Psi, P_\Psi)$ such that (1) $\Omega_\Psi = \Psi^{-1}(H^{(\infty)})$, (2) $\mathcal{F}_\Psi = \{B : B \in \mathcal{F}, B \subseteq \Omega_\Psi\}$, and (3) $P_\Psi(B) = P(B)/p^\Psi_\infty$ for all $B \in \mathcal{F}_\Psi$. We then define the restricted random failure detector history $\Psi^\infty$ as the measurable mapping from $\Omega_\Psi$ to $H^{(\infty)}$ such that $\Psi^\infty(\omega) = \Psi(\omega)$ for all $\omega \in \Omega_\Psi$. Since a failure detector history in $H^{(\infty)}$ is also a simple marked point process, $\Psi^\infty$ is also a random marked point process as defined in [Sig95].

$\Psi^\infty$, as a random marked point process, has its own event stationary version $\Psi^{\infty 0}$ and time stationary version $\Psi^{\infty *}$ (see the definitions in Appendix A). The following lemma gives the relation between the distributions of $\Psi^0$, $\Psi^*$ and $\Psi^{\infty 0}$, $\Psi^{\infty *}$.

Lemma 3.4 If $p^\Psi_\infty > 0$, then for all $E \in \mathcal{B}(H^{(\infty)})$, $\Pr(\Psi^0 \in E) = p^\Psi_\infty \Pr(\Psi^{\infty 0} \in E)$, and $\Pr(\Psi^* \in E) = p^\Psi_\infty \Pr(\Psi^{\infty *} \in E)$.

Proof. Direct from (3.3), (3.4), (A.5), (A.6) and the definition of the probability measure $P_\Psi$.
The Steady State Behavior after p Crashes

We now define a random failure detector history that represents the steady state behavior of the failure detector after p crashes. Formally, a post-crash version $\Psi_c$ of failure detector D is a random failure detector history defined by the following distribution (assuming it exists):
\[
\Pr(\Psi_c \in E) \stackrel{\mathrm{def}}{=} \lim_{t\to\infty} \frac{1}{t} \int_0^t \Pr(D(s)_s \in E)\, ds, \quad \text{for all } E \in \mathcal{B}(H). \tag{3.5}
\]
Intuitively, (3.5) means that if we randomly pick a time $s$ at which p crashes and then observe the behavior of the failure detector after time $s$, the random failure detector history we observe is given by the post-crash version $\Psi_c$. So, similarly to $\Psi^0$ and $\Psi^*$, we say that $\Psi_c$ is the version of the failure detector D when we randomly observe D way out at a time when p crashes. $\Psi_c$ is obtained by averaging the distribution of $D(s)_s$, the post-crash behavior of the failure detector D, over all crash times; thus it is also an empirical distribution. We say that the failure detector D has steady state behavior after p crashes if the distribution defined in (3.5) exists. One primary metric, the detection time, is defined with respect to the steady state behavior after p crashes, i.e., with respect to the post-crash version $\Psi_c$.
Non-Futuristic Property

Before p crashes, no failure detector implementation can tell whether p will crash later or not, i.e., the failure detector cannot predict the future. Therefore, the behavior of the failure detector up to any time t at which p is still up should be the same as the behavior of the failure detector in the same period in failure-free runs. We now formalize this idea.
For all $m \in \overline{Z}^+$ and for all $t \in R^+$, let $H^{(m)}_t$ be the subset of $H^{(m)}$ such that the time of the last transition of any failure detector history in $H^{(m)}_t$ is at most $t$, i.e., $H^{(m)}_t \stackrel{\mathrm{def}}{=} \{\psi \in H^{(m)} : t_{m-1} \le t\}$. So $H^{(m)}_t$ gives the set of failure detector history prefixes up to time $t$ that contain exactly $m$ transitions. Clearly $H^{(m)}_t \in \mathcal{B}(H^{(m)})$, and so we can define the Borel σ-field $\mathcal{B}(H^{(m)}_t) = \{E : E \in \mathcal{B}(H^{(m)}), E \subseteq H^{(m)}_t\}$. Let $H_t = \bigcup_{m \in Z^+} H^{(m)}_t$. $H_t$ contains all failure detector history prefixes up to time $t$. The Borel σ-field of $H_t$ is defined as $\mathcal{B}(H_t) = \{\bigcup_{m \in Z^+} E_m : E_m \in \mathcal{B}(H^{(m)}_t)\}$. We define a prefix mapping $f_t : H \to H_t$ such that for any failure detector history $\psi \in H$, $f_t(\psi)$ is the failure detector history prefix that contains only the transitions of $\psi$ up to time $t$. It is easy to verify that $f_t$ is a measurable mapping.

We now formally define the probabilistic behavior of a failure detector history up to some time $t$. Given a random failure detector history $\Psi : \Omega \to H$, the random failure detector history prefix up to time $t$ is the measurable mapping $\Psi^t : \Omega \to H_t$ such that $\Psi^t = f_t \circ \Psi$.
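For a concrete finite history, the prefix mapping is a simple truncation. A minimal Python sketch (illustrative, not from the dissertation; the initial output is kept and transitions after time t are dropped):

```python
def prefix(k, ts, t):
    """Prefix mapping f_t: keep the initial output of a finite history
    <k, {t_n}> and only the transitions that occur at or before time t."""
    return k, [x for x in ts if x <= t]
```

Note that the initial output is unchanged: truncating a history never alters what the detector output before time t, which is exactly why prefixes are the right objects for stating the non-futuristic property below.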
We say that a failure detector D is non-futuristic (or does not predict the future) if for all $t \in R^+$ and all $t_1, t_2 \in (t, \infty]$, $D(t_1)^t$ and $D(t_2)^t$ have the same distribution, i.e., for all $E \in \mathcal{B}(H_t)$, $\Pr(D(t_1)^t \in E) = \Pr(D(t_2)^t \in E)$. Intuitively, this means that as long as p has not crashed by time $t$, the probabilistic behavior of the failure detector up to time $t$ is the same no matter whether or when p may crash later. In other words, the failure detector does not provide hints on whether or when process p will crash in the future.
3.3 Failure Detector Specification Metrics

With the formal model of the failure detector given in the previous section, we are now ready to formally define the QoS metrics of the failure detector introduced in Chapter 2.

Let D be a failure detector. Let $\overline{R}^+ \stackrel{\mathrm{def}}{=} R^+ \cup \{\infty\}$.
3.3.1 Definitions of Metrics

Detection time ($T_D$): $T_D$ is defined from the post-crash version $\Psi_c$ of D. Suppose $\Psi_c : \Omega_c \to H$, with $(\Omega_c, \mathcal{F}_c, P_c)$ as the underlying probability space. We first define a measurable mapping $f_D : H \to \overline{R}^+$ such that for any failure detector history $\psi \in H$, $f_D(\psi)$ is: (a) 0, if $\psi$ has no transition and the output is always S; or (b) the time of the last transition, if $\psi$ has a finite number of transitions and the output after the last transition is always S; or (c) $\infty$ otherwise. Then $T_D : \Omega_c \to \overline{R}^+$ is the random variable such that $T_D = f_D \circ \Psi_c$. That is, given any particular post-crash history $\psi$, $T_D = f_D(\psi)$ is the time elapsed from the time of the crash to the time when the failure detector starts suspecting p permanently, and the distribution of $T_D$ is determined by the distribution of $\Psi_c$, which is defined in (3.5).
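The mapping $f_D$ is easy to evaluate on a concrete history. A Python sketch (illustrative only, not from the dissertation; it handles histories with finitely many transitions, with outputs encoded as 'S' and 'T'):

```python
import math

def detection_time(k, ts):
    """f_D evaluated on a finite post-crash history <k, {t_n}>: the time
    until the detector suspects p permanently (final output S), or infinity
    if the final output is T."""
    m = len(ts)
    # Final output: k flips once per transition, except that a transition
    # at time 0 does not flip the initial output (the t_0 = 0 convention).
    flips = m - 1 if (ts and ts[0] == 0) else m
    final = k if flips % 2 == 0 else ('T' if k == 'S' else 'S')
    if final != 'S':
        return math.inf              # case (c): output is not eventually S
    return 0.0 if m == 0 else ts[-1] # cases (a) and (b)
```

For instance, the post-crash history ⟨T, {2}⟩ (trusting until time 2, then suspecting forever) yields a detection time of 2.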
All accuracy metrics are defined with respect to the steady state behavior in failure-free runs, i.e., with respect to the stationary versions $\Psi^0$ and $\Psi^*$ of the random failure detector history $\Psi \stackrel{\mathrm{def}}{=} D(\infty)$. For the convenience of studying the relations between the accuracy metrics in the next section, we assume that $\Psi$, $\Psi^0$ and $\Psi^*$ use the same underlying probability space $(\Omega, \mathcal{F}, P)$ (one can always construct some common space supporting all of them).

The following three accuracy metrics are defined in terms of the event stationary version $\Psi^0$. Recall from Section 3.2.1 that $T_0$ and $T_1$ are defined as the zeroth and the first inter-transition times of a given failure detector history $\psi$. Thus $T_0$ and $T_1$ are actually measurable mappings from $H$ to $\overline{R}^+$.
Mistake recurrence time ($T_{MR}$): We define a measurable mapping $f_{MR} : H \to \overline{R}^+$ such that $f_{MR} = T_0 + T_1$. Then $T_{MR} : \Omega \to \overline{R}^+$ is the random variable such that $T_{MR} = f_{MR} \circ \Psi^0$. Intuitively, $T_{MR}$ is the length of the first two consecutive periods, one trust period and one suspicion period, of $\Psi^0$. We call any two consecutive periods a recurrence interval. Since $\Psi^0$ is event stationary, the distribution of the length of any recurrence interval is the same, and thus we take just the first recurrence interval of $\Psi^0$. $T_{MR}$ represents the length of the recurrence interval when we randomly observe the failure detector output way out at a transition in some failure-free run, and its distribution is determined by the distribution of $\Psi^0$, which is defined in (3.3). Note that when defining $T_{MR}$ we do not restrict the recurrence interval to start and end with S-transitions. This is because in steady state whether you observe at an S-transition or at a T-transition does not change the distribution of the length of the recurrence interval, and so for convenience we choose not to make this restriction.
Mistake duration ($T_M$): We define a measurable mapping $f_M : H \to \overline{R}^+$ such that for any failure detector history $\psi = \langle k, \{t_n\}\rangle \in H$, $f_M(\psi) = T_0(\psi)$ if $k = S$, and $f_M(\psi) = T_1(\psi)$ if $k = T$. Then $T_M : \Omega \to \overline{R}^+$ is the random variable such that $T_M = f_M \circ \Psi^0$. Recall that after being shifted by a transition time, any failure detector history has a transition at the origin (i.e., $t_0 = 0$), except histories with no transitions at all. Thus the definition of $f_M$ guarantees that it always takes the length of the first suspicion (mistake) period of the event stationary version $\Psi^0$. Therefore, $T_M$ represents the length of the mistake period when we randomly observe the failure detector output way out at an S-transition in some failure-free run, and its distribution is determined by the distribution of $\Psi^0$.
Good period duration ($T_G$): The definition of $T_G$ is symmetric to that of $T_M$. We define a measurable mapping $f_G : H \to \overline{R}^+$ such that for any failure detector history $\psi = \langle k, \{t_n\}\rangle \in H$, $f_G(\psi) = T_0(\psi)$ if $k = T$, and $f_G(\psi) = T_1(\psi)$ if $k = S$. Then $T_G : \Omega \to \overline{R}^+$ is the random variable such that $T_G = f_G \circ \Psi^0$. Intuitively, $T_G$ represents the length of the trust (good) period when we randomly observe the failure detector output way out at a T-transition in some failure-free run, and its distribution is determined by the distribution of $\Psi^0$.
The following three accuracy metrics are defined in terms of the time stationary version $\Psi^*$.

Query accuracy probability ($P_A$): Let $B_T$ be the set of failure detector histories with output T at time 0. Then $P_A \stackrel{\mathrm{def}}{=} P(\Psi^* \in B_T)$. Intuitively, when we randomly observe the failure detector output way out at a time $t$ in some failure-free run, the probability that the output at time $t$ is T is just the probability that the output of the time stationary version $\Psi^*$ at time 0 is T. Therefore, $P_A$ is the probability that, when queried at a random time in some failure-free run, the output of the failure detector is T (and thus is correct).
Average mistake rate ($\lambda_M$): We define a measurable mapping $N_S : H \to \overline{R}^+$ such that for any failure detector history $\psi$, $N_S(\psi)$ is the number of S-transitions in the period $(0, 1]$. Thus $N_S \circ \Psi^*$ is a random variable representing the number of S-transitions of $\Psi^*$ in the unit interval $(0, 1]$. Since $\Psi^*$ is time stationary, $N_S \circ \Psi^*$ is the number of S-transitions in any unit interval when we randomly observe the failure detector output way out in time in some failure-free run. Then $\lambda_M$ is defined as $E(N_S \circ \Psi^*)$, the expected value of $N_S \circ \Psi^*$.
Forward good period duration ($T_{FG}$): Roughly speaking, we define $T_{FG}$ as the time from the origin to the first transition of $\Psi^*$, conditioned on the output of $\Psi^*$ at the origin being T. Since $\Psi^*$ is obtained when we randomly observe the failure detector output way out in time in failure-free runs, $T_{FG}$ represents the time elapsed from a random time at which q trusts p to the time of the next S-transition.

We now formally define $T_{FG}$. If $P_A = 0$, then let $T_{FG} \equiv 0$, i.e., if the probability that q trusts p at a random time is 0, then $T_{FG}$ is always 0. If $P_A > 0$, then we define a random failure detector history $\Psi^*_T$ obtained by restricting $\Psi^*$ onto $B_T$. More precisely, we first define a restricted probability space $(\Omega_T, \mathcal{F}_T, P_T)$ such that (1) $\Omega_T = \{\Psi^* \in B_T\}$, (2) $\mathcal{F}_T = \{B : B \in \mathcal{F}, B \subseteq \Omega_T\}$, and (3) $P_T(B) = P(B)/P_A$ for all $B \in \mathcal{F}_T$. Then $\Psi^*_T : \Omega_T \to B_T$ is the random failure detector history such that $\Psi^*_T(\omega) = \Psi^*(\omega)$ for all $\omega \in \Omega_T$. Intuitively, $\Psi^*_T$ is the version of $\Psi^*$ conditioned on the output at the origin being T. Let $f_{FG} : B_T \to \overline{R}^+$ be the measurable mapping such that for all $\psi = \langle k, \{t_n\}\rangle \in B_T$, if $\psi$ has at least one transition then $f_{FG}(\psi) = t_0$, else $f_{FG}(\psi) = \infty$; i.e., $f_{FG}(\psi)$ is the time from the origin to the zeroth transition of $\psi$. Then $T_{FG} : \Omega_T \to \overline{R}^+$ is the random variable such that $T_{FG} = f_{FG} \circ \Psi^*_T$.
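For a single sampled history, the functionals $f_{MR}$, $f_M$ and $f_G$ underlying the first three accuracy metrics are straightforward to evaluate. A Python sketch (illustrative only, not from the dissertation; it handles finite histories with outputs 'S'/'T', typically event-shifted samples with a transition at the origin):

```python
import math

def metric_functionals(k, ts):
    """Evaluate f_MR = T_0 + T_1, f_M, and f_G on one finite history
    <k, {t_n}>, using the inter-transition times of Section 3.2.1."""
    def T(n):
        # n-th inter-transition time of a finite history
        m = len(ts)
        if n <= m - 2:
            return ts[n + 1] - ts[n]
        return math.inf if n == m - 1 or (m == 0 and n == 0) else 0.0
    t0, t1 = T(0), T(1)
    f_mr = t0 + t1                   # one trust period plus one suspicion period
    f_m = t0 if k == 'S' else t1     # length of the first suspicion period
    f_g = t0 if k == 'T' else t1     # length of the first trust period
    return f_mr, f_m, f_g
```

Applying this to many event-shifted samples of a failure-free run and averaging approximates the empirical distribution (3.3), and hence the distributions of $T_{MR}$, $T_M$ and $T_G$.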
We now give an example that is helpful for understanding the above definitions. It shows how these definitions are linked with the steady state distributions defined in Section 3.2.3, and how they match the intuition.

Example 1. Given a failure detector D, suppose we want to know the probability that its mistake recurrence time is at least $x$, i.e., $\Pr(T_{MR} \ge x)$, for some $x \in R^+$. Let $E \stackrel{\mathrm{def}}{=} \{\psi \in H : T_0(\psi) + T_1(\psi) \ge x\}$, i.e., the set of failure detector histories in which the length of the very first recurrence interval is at least $x$. Let $\Psi \stackrel{\mathrm{def}}{=} D(\infty)$ be the random failure detector history in failure-free runs, and let $\Psi^0$ be the event stationary version of $\Psi$. By the definition of $T_{MR}$, we have $\Pr(T_{MR} \ge x) = \Pr(f_{MR} \circ \Psi^0 \ge x) = \Pr(\Psi^0 \in E)$. From (3.3), we have
\[
\Pr(T_{MR} \ge x) = \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} \Pr(\Psi^{(j)} \in E). \tag{3.6}
\]
Note that $\Pr(\Psi^{(j)} \in E)$ is the probability that the length of the j-th recurrence interval is at least $x$. Thus $\Pr(T_{MR} \ge x)$ is obtained by averaging these probabilities over the first $n$ recurrence intervals, and then taking the limit as $n$ goes to infinity.

Equality (3.6) corresponds to what we would do if we wanted to obtain an estimate of $\Pr(T_{MR} \ge x)$ by experiments. We would run the failure detector a number of times such that each run contains a large number of recurrence intervals. We would then compute the ratio of the number of recurrence intervals that are at least $x$ time units long to the total number of recurrence intervals, and use this ratio as the estimate of $\Pr(T_{MR} \ge x)$. This ratio can be equivalently obtained by computing such ratios for the zeroth, first, second, ... recurrence intervals, and then averaging these ratios. This matches the intuitive idea behind equality (3.6).
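The experimental procedure just described can be sketched in a few lines of Python (illustrative only, not from the dissertation). Lacking a real detector, the sketch generates recurrence intervals from a toy model with exponential trust and suspicion periods, an assumption made purely for illustration; any source of observed intervals could be substituted:

```python
import random

def estimate_pr_tmr_at_least(x, n_intervals=20_000, seed=7,
                             mean_good=10.0, mean_bad=0.5):
    """Estimate Pr(T_MR >= x) as in the experiment behind (3.6): generate
    many recurrence intervals (toy model: one exponential trust period plus
    one exponential suspicion period each) and return the fraction whose
    length is at least x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_intervals):
        interval = (rng.expovariate(1.0 / mean_good)
                    + rng.expovariate(1.0 / mean_bad))
        if interval >= x:
            hits += 1
    return hits / n_intervals
```

With a fixed seed, the estimate is deterministic and, as expected, non-increasing in the threshold x.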
3.3.2 Relations between Accuracy Metrics

We now analyze the relations between the accuracy metrics defined in the previous section. The analysis is based on results in the theory of marked point processes, such as Birkhoff's Ergodic Theorem for marked point processes and the empirical inversion formulas.

Let $\Psi \stackrel{\mathrm{def}}{=} D(\infty)$ be the random failure detector history of some failure detector D under the failure-free pattern, and suppose that $\Psi$ has the event stationary version $\Psi^0$ and the time stationary version $\Psi^*$. Suppose that the underlying probability space for $\Psi$, $\Psi^0$ and $\Psi^*$ is $(\Omega, \mathcal{F}, P)$.
Lemma 3.5 $T_{MR} = T_M + T_G$.

Proof. This is immediate from the fact that $f_{MR} = f_M + f_G$, where $f_{MR}$, $f_M$ and $f_G$ are the measurable mappings used to define $T_{MR}$, $T_M$ and $T_G$, respectively.

Henceforth, we only consider the nondegenerate case in which $0 < E(T_{MR}) < \infty$. Intuitively, this means that the average time for a failure detector to make the next mistake is finite and nonzero.
Proposition 3.6 If $E(T_{MR}) < \infty$, then $p^\Psi_S = p^\Psi_T = 0$ and $p^\Psi_\infty = 1$.

Proof. Let $E \stackrel{\mathrm{def}}{=} \{\langle S, \emptyset\rangle, \langle S, \{0\}\rangle\}$. By Lemma 3.3 (3), $\Pr(\Psi^0 \in E) = p^\Psi_S$. For all $\omega \in \Omega$ such that $\Psi^0(\omega) \in E$, $T_0(\Psi^0(\omega)) = \infty$, and thus $\{\omega \in \Omega : \Psi^0(\omega) \in E\} \subseteq \{\omega \in \Omega : T_{MR}(\omega) = \infty\}$. Therefore $\Pr(T_{MR} = \infty) \ge \Pr(\Psi^0 \in E) = p^\Psi_S$. If $p^\Psi_S > 0$, then $E(T_{MR}) = \infty$, which contradicts the assumption that $E(T_{MR}) < \infty$. So $p^\Psi_S = 0$. Similarly we have $p^\Psi_T = 0$. By Proposition 3.2, we have $p^\Psi_\infty = 1$.
From this proposition and Lemma 3.3, we know that the probability that the stationary version $\Psi^0$ or $\Psi^*$ has a finite number of transitions is zero. Formally,

Corollary 3.7 Let $E \stackrel{\mathrm{def}}{=} H \setminus H^{(\infty)}$. If $E(T_{MR}) < \infty$, then $\Pr(\Psi^0 \in E) = \Pr(\Psi^* \in E) = 0$.

Henceforth, we treat $\Psi^0$ and $\Psi^*$ as mappings from $\Omega$ to $H^{(\infty)}$, since $\{\Psi^0 \in H \setminus H^{(\infty)}\}$ and $\{\Psi^* \in H \setminus H^{(\infty)}\}$ have measure zero. In this case, $\Psi^0$ and $\Psi^*$ are just simple random marked point processes, and so results from the theory of random marked point processes can be applied to $\Psi^0$ and $\Psi^*$ directly.
Let $\mathcal{I}$ be the invariant σ-field of $H^{(\infty)}$ (see Appendix A for the definition). Let $E_\mathcal{I}(X)$ denote the conditional expected value of $X$ given the σ-field $\mathcal{I}$ (see [Bil95], p. 445, for a definition of the conditional expected value given a σ-field).
Proposition 3.8 $E_\mathcal{I}(T_{MR}) = 2E_\mathcal{I}(T_0 \circ \Psi^0)$ a.s.

Proof. By definition, $E_\mathcal{I}(T_{MR}) = E_\mathcal{I}(f_{MR} \circ \Psi^0) = E_\mathcal{I}((T_0 + T_1) \circ \Psi^0)$. Since $E_\mathcal{I}((T_0 + T_1) \circ \Psi^0) = E_\mathcal{I}(T_0 \circ \Psi^0) + E_\mathcal{I}(T_1 \circ \Psi^0)$ a.s., it is enough to show that $E_\mathcal{I}(T_1 \circ \Psi^0) = E_\mathcal{I}(T_0 \circ \Psi^0)$ a.s. By (A.9) of Theorem A.5, we have
\[
E_\mathcal{I}(T_1 \circ \Psi^0) = \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} T_1 \circ \Psi^{(j)} = \lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^{n} T_0 \circ \Psi^{(j)}
= \lim_{n\to\infty} \left( \frac{n+1}{n} \cdot \frac{1}{n+1} \sum_{j=0}^{n} T_0 \circ \Psi^{(j)} - \frac{1}{n}\, T_0 \circ \Psi^{(0)} \right)
= \lim_{n\to\infty} \frac{1}{n+1} \sum_{j=0}^{n} T_0 \circ \Psi^{(j)} = E_\mathcal{I}(T_0 \circ \Psi^0) \quad \text{a.s.}
\]
We say that $\Psi$ is ergodic if $\Psi^0$ (or equivalently $\Psi^*$) is ergodic (see Appendix A for the definition). Informally, in this case we also say that the distribution of failure detector histories in failure-free runs is ergodic, or simply that the failure detector is ergodic.
Lemma 3.9
$$\lambda_M = E\!\left[\frac{1}{E_{\mathcal{I}}(T_{MR})}\right]. \tag{3.7}$$
If $\Psi$ is ergodic, then
$$\lambda_M = \frac{1}{E(T_{MR})}. \tag{3.8}$$
Proof. Let $\lambda \stackrel{\mathrm{def}}{=} E(N_1 \circ \Psi^*)$ be the arrival rate of $\Psi$. From (A.16) and (A.15), we have $\lambda = E(E_{\mathcal{I}}(N_1 \circ \Psi^*))$ and $E_{\mathcal{I}}(N_1 \circ \Psi^*) = \lim_{t\to\infty} \frac{N_t \circ \Psi^*}{t}$ a.s. Similarly, we have $\lambda_M = E(E_{\mathcal{I}}(N^S_1 \circ \Psi^*))$ and $E_{\mathcal{I}}(N^S_1 \circ \Psi^*) = \lim_{t\to\infty} \frac{N^S_t \circ \Psi^*}{t}$ a.s., where $N^S_t : H \to \mathbb{R}^+$ is a measurable mapping representing the number of S-transitions in the period $(0, t]$. Since in any period $(0, t]$, the numbers of S-transitions and T-transitions differ by at most one, we have for any $\psi \in H$, $2N^S_t(\psi) - 1 \le N_t(\psi) \le 2N^S_t(\psi) + 1$. Thus $\lim_{t\to\infty} \frac{N_t \circ \Psi^*}{t} = 2\lim_{t\to\infty} \frac{N^S_t \circ \Psi^*}{t}$, and so $\lambda = 2\lambda_M$. By (A.16) and (A.15), we know that $\lambda_M = \frac{1}{2} E(\{E_{\mathcal{I}}(T_0 \circ \Psi^0)\}^{-1})$. By Proposition 3.8, we have $\lambda_M = E(\{E_{\mathcal{I}}(T_{MR})\}^{-1})$. If $\Psi$ is ergodic, then by Proposition A.4, we know that $\lambda_M = \{E(T_{MR})\}^{-1}$.
Recall that $B_T$ is the set of failure detector histories with output T at time 0. Let $B_S$ be the set of failure detector histories with output S at time 0. Let $A_T \stackrel{\mathrm{def}}{=} \{\omega : \Psi^0(\omega) \in B_T\}$ and $A_S \stackrel{\mathrm{def}}{=} \{\omega : \Psi^0(\omega) \in B_S\}$. Let $X_T : \Omega \to \mathbb{R}^+$ be the random variable such that $X_T(\omega) = T_0(\Psi^0(\omega))$ for all $\omega \in A_T$, and $X_T(\omega) = 0$ for all $\omega \in A_S$. Let $X_S : \Omega \to \mathbb{R}^+$ be the random variable such that $X_S(\omega) = T_0(\Psi^0(\omega))$ for all $\omega \in A_S$, and $X_S(\omega) = 0$ for all $\omega \in A_T$.
Proposition 3.10 $E_{\mathcal{I}}(T_G) = 2E_{\mathcal{I}}(X_T)$ a.s., and $E_{\mathcal{I}}(T_M) = 2E_{\mathcal{I}}(X_S)$ a.s.
Proof. Define the random variable $X'_T : \Omega \to \mathbb{R}^+$ such that $X'_T(\omega) = 0$ for all $\omega \in A_T$, and $X'_T(\omega) = T_1(\Psi^0(\omega))$ for all $\omega \in A_S$. Then by definition $T_G = X_T + X'_T$. Thus to prove $E_{\mathcal{I}}(T_G) = 2E_{\mathcal{I}}(X_T)$ a.s., it is enough to show that $E_{\mathcal{I}}(X'_T) = E_{\mathcal{I}}(X_T)$ a.s.

Let $f : H^{(\infty)} \to \mathbb{R}^+$ be the measurable mapping such that for all $\psi = \langle k, \{t_n\}\rangle$, $f(\psi) = T_0(\psi)$ if $k = T$, and $f(\psi) = 0$ if $k = S$. Similarly, let $f' : H^{(\infty)} \to \mathbb{R}^+$ be the measurable mapping such that for all $\psi = \langle k, \{t_n\}\rangle$, $f'(\psi) = 0$ if $k = T$, and $f'(\psi) = T_1(\psi)$ if $k = S$. Thus $X_T = f \circ \Psi^0$ and $X'_T = f' \circ \Psi^0$. It is important to note that $f' \circ \Psi^{(j)} = f \circ \Psi^{(j+1)}$ for all $j \ge 0$. Using equality (A.9), we have
\begin{align*}
E_{\mathcal{I}}(X'_T) &= E_{\mathcal{I}}(f' \circ \Psi^0) = \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} f' \circ \Psi^{(j)} = \lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^{n} f \circ \Psi^{(j)} \\
&= \lim_{n\to\infty} \left( \frac{n+1}{n} \cdot \frac{1}{n+1} \sum_{j=0}^{n} f \circ \Psi^{(j)} - \frac{1}{n}\, f \circ \Psi^{(0)} \right) = E_{\mathcal{I}}(f \circ \Psi^0) = E_{\mathcal{I}}(X_T) \quad \text{a.s.}
\end{align*}
We thus have $E_{\mathcal{I}}(T_G) = 2E_{\mathcal{I}}(X_T)$ a.s. Similarly we can prove that $E_{\mathcal{I}}(T_M) = 2E_{\mathcal{I}}(X_S)$ a.s.
Lemma 3.11 If $0 < E_{\mathcal{I}}(T_{MR}) < \infty$ a.s., then
$$P_A = E\!\left[\frac{E_{\mathcal{I}}(T_G)}{E_{\mathcal{I}}(T_{MR})}\right]. \tag{3.9}$$
If $\Psi$ is ergodic and $0 < E(T_{MR}) < \infty$, then
$$P_A = \frac{E(T_G)}{E(T_{MR})}. \tag{3.10}$$
Proof. By definition, $P_A \stackrel{\mathrm{def}}{=} \Pr(\Psi^* \in B_T)$. Since $0 < E_{\mathcal{I}}(T_{MR}) < \infty$ a.s., by Proposition 3.8, we know that $0 < E_{\mathcal{I}}(T_0 \circ \Psi^0) < \infty$ a.s. Then by the empirical inversion formula (A.19) we have
$$\Pr(\Psi^* \in B_T) = E\!\left[\frac{E_{\mathcal{I}}\!\left[\int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds\right]}{E_{\mathcal{I}}(T_0 \circ \Psi^0)}\right]. \tag{3.11}$$
We claim that $X_T = \int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds$ a.s. In fact, from Proposition A.1, we know that with probability one $\Psi^0$ has a transition at time 0, i.e. $\Pr(t_0 \circ \Psi^0 = 0) = 1$. Then with probability one, during the entire period $(0, T_0(\Psi^0(\omega)))$, the output of $\Psi^0$ is the same as the output of $\Psi^0$ at the origin. In other words, with probability one, if $\Psi^0(\omega) \in B_T$, then $I_{B_T}(\Psi^0_s(\omega)) = 1$ for all $s \in (0, T_0(\Psi^0(\omega)))$; if $\Psi^0(\omega) \in B_S$, then $I_{B_T}(\Psi^0_s(\omega)) = 0$ for all $s \in (0, T_0(\Psi^0(\omega)))$. Therefore, with probability one, $\int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds = T_0 \circ \Psi^0$ if $\Psi^0(\omega) \in B_T$, and $\int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds = 0$ if $\Psi^0(\omega) \in B_S$. Thus $X_T = \int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds$ a.s.

By Proposition 3.10, we have $E_{\mathcal{I}}(T_G) = 2E_{\mathcal{I}}(X_T) = 2E_{\mathcal{I}}\!\left[\int_0^{T_0 \circ \Psi^0} I_{B_T}(\Psi^0_s)\, ds\right]$ a.s. By Proposition 3.8, we have $E_{\mathcal{I}}(T_{MR}) = 2E_{\mathcal{I}}(T_0 \circ \Psi^0)$ a.s. Therefore, from (3.11) we have
$$P_A = E\!\left[\frac{E_{\mathcal{I}}(T_G)}{E_{\mathcal{I}}(T_{MR})}\right].$$
If $\Psi$ is ergodic, then by Proposition A.4, we have
$$P_A = \frac{E(T_G)}{E(T_{MR})}.$$
By definition, if $P_A = 0$, then $T_{FG} = 0$. We now study the case $P_A > 0$.

Lemma 3.12 If $P_A > 0$ and $0 < E_{\mathcal{I}}(T_{MR}) < \infty$ a.s., then for all $x \in \mathbb{R}^+$,
$$\Pr(T_{FG} \le x) = \frac{1}{P_A}\, E\!\left[\frac{E_{\mathcal{I}}(\min(T_G, x))}{E_{\mathcal{I}}(T_{MR})}\right]. \tag{3.12}$$
Proof. Let $f : H^{(\infty)} \to \mathbb{R}^+$ be the measurable mapping such that for all $\psi \in H^{(\infty)}$, $f(\psi) = f_{FG}(\psi)$ if $\psi \in B_T$, and $f(\psi) = 0$ if $\psi \in B_S$. Let $Y \stackrel{\mathrm{def}}{=} f \circ \Psi^*$. Thus under the condition $\{\Psi^* \in B_T\}$, $Y = T_{FG}$, and under the condition $\{\Psi^* \in B_S\}$, $Y = 0$. For all $x \in \mathbb{R}^+$, we have
\begin{align*}
\Pr(Y \le x) &= \Pr(Y \le x \mid \{\Psi^* \in B_T\}) \Pr(\Psi^* \in B_T) + \Pr(Y \le x \mid \{\Psi^* \in B_S\}) \Pr(\Psi^* \in B_S) \\
&= \Pr(T_{FG} \le x)\, P_A + (1 - P_A).
\end{align*}
Thus if $P_A > 0$, we have for all $x \in \mathbb{R}^+$,
$$\Pr(T_{FG} \le x) = \frac{\Pr(Y \le x) - (1 - P_A)}{P_A}. \tag{3.13}$$
Let $E = \{\psi = \langle k, \{t_n\}\rangle \in H^{(\infty)} : k = S, \text{ or } k = T \text{ and } t_0 \le x\}$. Then $\{Y \le x\} = \{f \circ \Psi^* \le x\} = \{\Psi^* \in E\}$. Since $0 < E_{\mathcal{I}}(T_{MR}) < \infty$ a.s., by Proposition 3.8, we know that $0 < E_{\mathcal{I}}(T_0 \circ \Psi^0) < \infty$ a.s. Then by the empirical inversion formula (A.19) we have
$$\Pr(Y \le x) = \Pr(\Psi^* \in E) = E\!\left[\frac{E_{\mathcal{I}}\!\left[\int_0^{T_0 \circ \Psi^0} I_E(\Psi^0_s)\, ds\right]}{E_{\mathcal{I}}(T_0 \circ \Psi^0)}\right]. \tag{3.14}$$
Let $X' : \Omega \to \mathbb{R}^+$ be the random variable such that for all $\omega \in A_T$, $X'(\omega) = \min(x, T_0(\Psi^0(\omega)))$, and for all $\omega \in A_S$, $X'(\omega) = 0$. We claim that $X_S + X' = \int_0^{T_0 \circ \Psi^0} I_E(\Psi^0_s)\, ds$ a.s. In fact, from Proposition A.1, we know that with probability one $\Psi^0$ has a transition at time 0, i.e. $\Pr(t_0 \circ \Psi^0 = 0) = 1$. Thus, with probability one, if $\omega \in A_S$ then $I_E(\Psi^0_s(\omega)) = 1$ for all $s \in (0, T_0(\Psi^0(\omega)))$. So we have that with probability one, if $\omega \in A_S$ then $\int_0^{T_0(\Psi^0(\omega))} I_E(\Psi^0_s(\omega))\, ds = T_0(\Psi^0(\omega))$.

If $\omega \in A_T$, then $I_E(\Psi^0_s(\omega)) = 1$ iff $\Psi^0_s(\omega) \in E$, which means that starting from time $s$ at which the output is $T$, the time to the next transition in $\Psi^0(\omega)$ is at most $x$. So, with probability one, if $\omega \in A_T$, then $I_E(\Psi^0_s(\omega)) = 1$ iff $T_0(\Psi^0(\omega)) - s \le x$. There are two possible cases: (a) $T_0(\Psi^0(\omega)) \le x$, in which case $I_E(\Psi^0_s(\omega)) = 1$ for all $s \in (0, T_0(\Psi^0(\omega)))$, and so $\int_0^{T_0(\Psi^0(\omega))} I_E(\Psi^0_s(\omega))\, ds = T_0(\Psi^0(\omega))$; or (b) $T_0(\Psi^0(\omega)) > x$, in which case for all $s \in (0, T_0(\Psi^0(\omega)) - x)$, $I_E(\Psi^0_s(\omega)) = 0$, and for all $s \in [T_0(\Psi^0(\omega)) - x, T_0(\Psi^0(\omega)))$, $I_E(\Psi^0_s(\omega)) = 1$, and so $\int_0^{T_0(\Psi^0(\omega))} I_E(\Psi^0_s(\omega))\, ds = \int_{T_0(\Psi^0(\omega)) - x}^{T_0(\Psi^0(\omega))} 1\, ds = x$. Combining cases (a) and (b), we have with probability one, if $\omega \in A_T$, then $\int_0^{T_0(\Psi^0(\omega))} I_E(\Psi^0_s(\omega))\, ds = \min(x, T_0(\Psi^0(\omega)))$.

From the above separate cases for $\omega \in A_S$ and $\omega \in A_T$, we thus have $X_S + X' = \int_0^{T_0 \circ \Psi^0} I_E(\Psi^0_s)\, ds$ a.s.
We now show that $E_{\mathcal{I}}(\min(T_G, x)) = 2E_{\mathcal{I}}(X')$ a.s. The proof is similar to the proofs of Propositions 3.8 and 3.10. Define the random variable $X'' : \Omega \to \mathbb{R}^+$ such that $X''(\omega) = 0$ for all $\omega \in A_T$, and $X''(\omega) = \min(x, T_1(\Psi^0(\omega)))$ for all $\omega \in A_S$. Then by definition $\min(T_G, x) = X' + X''$. Thus it is enough to show that $E_{\mathcal{I}}(X'') = E_{\mathcal{I}}(X')$ a.s. Let $f'$ and $f''$ be the corresponding measurable mappings such that $X' = f' \circ \Psi^0$ and $X'' = f'' \circ \Psi^0$. Note that $f'' \circ \Psi^{(j)} = f' \circ \Psi^{(j+1)}$ for all $j \ge 0$. Using equality (A.9), we have
\begin{align*}
E_{\mathcal{I}}(X'') &= E_{\mathcal{I}}(f'' \circ \Psi^0) = \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} f'' \circ \Psi^{(j)} = \lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^{n} f' \circ \Psi^{(j)} \\
&= \lim_{n\to\infty} \left( \frac{n+1}{n} \cdot \frac{1}{n+1} \sum_{j=0}^{n} f' \circ \Psi^{(j)} - \frac{1}{n}\, f' \circ \Psi^{(0)} \right) = E_{\mathcal{I}}(f' \circ \Psi^0) = E_{\mathcal{I}}(X') \quad \text{a.s.}
\end{align*}
We thus have $E_{\mathcal{I}}(\min(T_G, x)) = 2E_{\mathcal{I}}(X')$ a.s.
Therefore, from (3.14) and Propositions 3.8 and 3.10, we have
\begin{align*}
\Pr(Y \le x) &= E\!\left[\frac{E_{\mathcal{I}}(X_S) + E_{\mathcal{I}}(X')}{E_{\mathcal{I}}(T_0 \circ \Psi^0)}\right] = E\!\left[\frac{E_{\mathcal{I}}(T_M) + E_{\mathcal{I}}(\min(T_G, x))}{E_{\mathcal{I}}(T_{MR})}\right] \\
&= 1 - P_A + E\!\left[\frac{E_{\mathcal{I}}(\min(T_G, x))}{E_{\mathcal{I}}(T_{MR})}\right].
\end{align*}
The last equality is due to Lemmata 3.5 and 3.11. Plugging the above result into (3.13), we then have (3.12).
Corollary 3.13 If $\Psi$ is ergodic and if $0 < E(T_{MR}) < \infty$ and $E(T_G) \ne 0$, then
$$\text{for all } x \in \mathbb{R}^+, \quad \Pr(T_{FG} \le x) = \frac{1}{E(T_G)} \int_0^x \Pr(T_G > y)\, dy, \tag{3.15}$$
$$E(T_{FG}^k) = \frac{E(T_G^{k+1})}{(k+1)\, E(T_G)}. \tag{3.16}$$
In particular,
$$E(T_{FG}) = \frac{E(T_G^2)}{2E(T_G)} = \frac{E(T_G)}{2}\left(1 + \frac{V(T_G)}{E(T_G)^2}\right). \tag{3.17}$$
Proof. By Lemma 3.11, if $\Psi$ is ergodic and $0 < E(T_{MR}) < \infty$, then $P_A = E(T_G)/E(T_{MR})$. If $E(T_G) \ne 0$, then $P_A > 0$. Then by Proposition A.4 and Lemma 3.12, we have
$$\Pr(T_{FG} \le x) = \frac{E(\min(T_G, x))}{E(T_G)}.$$
We now use the fact (see e.g. [Bil95] p. 275) that for any nonnegative random variable $X$,
$$E(X) = \int_0^\infty \Pr(X > t)\, dt. \tag{3.18}$$
We have $E(\min(T_G, x)) = \int_0^\infty \Pr(\min(T_G, x) > y)\, dy = \int_0^x \Pr(\min(T_G, x) > y)\, dy = \int_0^x \Pr(T_G > y)\, dy$. Thus
$$\Pr(T_{FG} \le x) = \frac{1}{E(T_G)} \int_0^x \Pr(T_G > y)\, dy.$$
To prove (3.16), we substitute $X$ in (3.18) with $X^k$ to obtain
$$E(X^k) = \int_0^\infty \Pr(X^k > t)\, dt = \int_0^\infty \Pr(X > t^{1/k})\, dt = \int_0^\infty \Pr(X > t)\, d(t^k) = \int_0^\infty k t^{k-1} \Pr(X > t)\, dt.$$
Then together with (3.15) we have
\begin{align*}
E(T_{FG}^k) &= \int_0^\infty k t^{k-1} \Pr(T_{FG} > t)\, dt = \int_0^\infty k t^{k-1} \left(1 - \frac{1}{E(T_G)} \int_0^t \Pr(T_G > y)\, dy\right) dt \\
&= \frac{1}{E(T_G)} \int_0^\infty k t^{k-1} \left(E(T_G) - \int_0^t \Pr(T_G > y)\, dy\right) dt \\
&= \frac{1}{E(T_G)} \int_0^\infty k t^{k-1} \left(\int_t^\infty \Pr(T_G > y)\, dy\right) dt \\
&= \frac{1}{E(T_G)} \int_0^\infty \Pr(T_G > y) \left(\int_0^y k t^{k-1}\, dt\right) dy \\
&= \frac{1}{E(T_G)} \int_0^\infty \Pr(T_G > y)\, y^k\, dy = \frac{E(T_G^{k+1})}{(k+1)\, E(T_G)}.
\end{align*}
(3.17) is obtained from (3.16) by setting $k = 1$ and using the fact that $E(X^2) = E(X)^2 + V(X)$.
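Equation (3.17) is the classical inspection-paradox formula: a random query instant is more likely to fall in a long good period than a short one, so the expected forward good time can exceed half of $E(T_G)$. The following Monte Carlo sketch checks (3.17) empirically; the uniform distribution of the good periods is our own illustrative assumption, not part of the analysis.

```python
import random
from bisect import bisect_left
from itertools import accumulate

random.seed(1)

# Good-period lengths T_G drawn i.i.d.; uniform(0, 2) is an assumed example.
periods = [random.uniform(0.0, 2.0) for _ in range(100_000)]
n = len(periods)
m1 = sum(periods) / n                      # estimate of E(T_G)
m2 = sum(t * t for t in periods) / n       # estimate of E(T_G^2)
predicted = m2 / (2 * m1)                  # E(T_FG) = E(T_G^2) / (2 E(T_G))

# Lay the periods end to end and pick uniformly random time instants; the
# remaining time of the surrounding period emulates T_FG for an ergodic
# detector (a length-biased pick, hence the inspection paradox).
ends = list(accumulate(periods))
total = ends[-1]

def forward_good_time():
    u = random.uniform(0.0, total)
    i = bisect_left(ends, u)               # period containing the instant u
    return ends[i] - u                     # time left until that period ends

empirical = sum(forward_good_time() for _ in range(50_000)) / 50_000
print(predicted, empirical)                # the two values nearly agree
```

For uniform(0, 2) good periods, $E(T_G) = 1$ and $E(T_G^2) = 4/3$, so both values are close to $2/3$ rather than the naive $1/2$.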
We now summarize all the results in the following theorem.
Theorem 3.14 For any failure detector D, the following results hold:

(1) $T_{MR} = T_M + T_G$.

(2) Suppose $0 < E_{\mathcal{I}}(T_{MR}) < \infty$ a.s. Then
$$\lambda_M = E\!\left[\frac{1}{E_{\mathcal{I}}(T_{MR})}\right], \tag{3.19}$$
$$P_A = E\!\left[\frac{E_{\mathcal{I}}(T_G)}{E_{\mathcal{I}}(T_{MR})}\right]. \tag{3.20}$$
If $P_A = 0$ then $T_{FG} = 0$; if $P_A > 0$ then
$$\Pr(T_{FG} \le x) = \frac{1}{P_A}\, E\!\left[\frac{E_{\mathcal{I}}(\min(T_G, x))}{E_{\mathcal{I}}(T_{MR})}\right]. \tag{3.21}$$

(3) Suppose $\Psi \stackrel{\mathrm{def}}{=} D(\infty)$ is ergodic and $0 < E(T_{MR}) < \infty$. Then
$$\lambda_M = \frac{1}{E(T_{MR})}, \tag{3.22}$$
$$P_A = \frac{E(T_G)}{E(T_{MR})}. \tag{3.23}$$
If $E(T_G) = 0$ then $T_{FG} = 0$; if $E(T_G) \ne 0$, then
$$\text{for all } x \in \mathbb{R}^+, \quad \Pr(T_{FG} \le x) = \frac{1}{E(T_G)} \int_0^x \Pr(T_G > y)\, dy, \tag{3.24}$$
$$E(T_{FG}^k) = \frac{E(T_G^{k+1})}{(k+1)\, E(T_G)}. \tag{3.25}$$
In particular,
$$E(T_{FG}) = \frac{E(T_G^2)}{2E(T_G)} = \frac{E(T_G)}{2}\left(1 + \frac{V(T_G)}{E(T_G)^2}\right). \tag{3.26}$$
Theorem 2.1 in Chapter 2 is just parts (1) and (3) of the above theorem.
Chapter 4
The Design and Analysis of a New Failure Detector Algorithm
4.1 Introduction
In Chapter 2, we proposed a set of specification metrics to measure the QoS provided by failure detectors. These metrics address the failure detector's speed (how fast it detects process crashes) and its accuracy (how well it avoids erroneous detections).

In this chapter, we design a new failure detector algorithm for distributed systems with probabilistic behaviors. We analyze the QoS of the new algorithm and derive closed formulas for its QoS metrics. We show that, among a large class of failure detector algorithms, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. Finally, we simulate both the new algorithm and a simple algorithm commonly used in practice, compare the simulation results, and show that the new algorithm provides better QoS than the simple algorithm.
We consider a simple system of two processes p and q connected through a communication link. Process p may fail by crashing, and the link between p and q may delay or drop messages. Message delays and message losses follow some probabilistic distributions. Process q has a failure detector that monitors p. As in Chapter 2, the failure detector at q outputs either "I suspect that p has crashed" or "I trust that p is up" ("suspect p" and "trust p" in short, respectively).
4.1.1 A Common Failure Detection Algorithm and its Drawbacks

We first consider the following simple failure detector algorithm commonly used in practice: at regular time intervals, process p sends heartbeat messages to q; when q receives a more recent heartbeat message, it trusts p and starts a timer with a fixed timeout value TO; if the timer expires before q receives a more recent heartbeat message from p, q starts suspecting p.
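As a concrete illustration, here is a minimal sketch of this simple detector in Python (the class and method names are ours, not part of the thesis); time is passed in explicitly so the logic stays deterministic:

```python
class SimpleDetector:
    """Simple heartbeat detector: trust on a fresher heartbeat, suspect
    once a fixed timeout TO elapses with no fresher heartbeat."""

    def __init__(self, timeout):
        self.timeout = timeout      # the fixed timeout value TO
        self.last_seq = 0           # largest heartbeat sequence number seen
        self.last_recv = None       # receipt time of that heartbeat

    def receive(self, seq, now):
        if seq > self.last_seq:     # a more recent heartbeat: trust p,
            self.last_seq = seq     # and (re)start the timer
            self.last_recv = now

    def output(self, now):
        if self.last_recv is None:
            return "S"              # nothing received yet: suspect
        # suspect iff the timer started at last_recv has expired
        return "S" if now >= self.last_recv + self.timeout else "T"
```

Note that the timer for the next heartbeat is implicitly started by the receipt of the previous one, which is exactly the dependency criticized below.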
This algorithm has two undesirable characteristics; one regards its accuracy and the other its detection time, as we now explain. Consider the i-th heartbeat message $m_i$. Intuitively, the probability of a premature timeout on $m_i$ should depend solely on $m_i$, and in particular on $m_i$'s delay. With the simple algorithm, however, the probability of a premature timeout on $m_i$ also depends on the heartbeat $m_{i-1}$ that precedes $m_i$! In fact, the timer for $m_i$ is started upon the receipt of $m_{i-1}$, and so if $m_{i-1}$ is "fast", the timer for $m_i$ starts early and this increases the probability of a premature timeout on $m_i$. This dependency on past heartbeats is undesirable.
To see the second problem, suppose p sends a heartbeat just before it crashes, and let d be the delay of this last heartbeat. In the simple algorithm, q would permanently suspect p only d + TO time units after p crashes. Thus, the worst-case detection time for this algorithm is the maximum message delay plus TO. This is impractical because in many systems the maximum message delay is orders of magnitude larger than the average message delay (i.e., they have large variations of message delays).

The source of the above problems is that even though the heartbeats are sent at regular intervals, the timers to "catch" them expire at irregular times, namely the receipt times of the heartbeats plus a fixed TO. The algorithm that we propose eliminates this problem. As a result, the probability of a premature timeout on heartbeat $m_i$ does not depend on the behavior of the heartbeats that precede $m_i$, and the detection time does not depend on the maximum message delay.
4.1.2 The New Algorithm and its QoS Analysis

In this chapter, we design a new failure detector algorithm that has good worst-case detection time and good accuracy.

In the new failure detector algorithm, process p sends heartbeat messages to q periodically, as in the simple algorithm. Suppose $m_1, m_2, m_3, \ldots$ are the heartbeat messages and $\eta$ is the intersending time between two consecutive messages. The new algorithm differs from the simple algorithm in the procedure that q uses to decide whether to suspect p or not. In the new algorithm, q has a sequence of time points $\tau_1, \tau_2, \tau_3, \ldots$, called freshness points. Each freshness point $\tau_i$ is set to $\sigma_i + \delta$, where $\sigma_i$ is the time when $m_i$ is sent and $\delta$ is a fixed parameter of the algorithm. That is, the freshness points are obtained by shifting the sending times of the heartbeat messages forward in time by a fixed $\delta$ time units. These freshness points are used to determine the failure detector output. Roughly speaking, at any time $t \in [\tau_i, \tau_{i+1})$, only messages $m_i, m_{i+1}, m_{i+2}, \ldots$ can affect the failure detector output, and we say that only these messages are still fresh (at time t), while messages $m_1, \ldots, m_{i-1}$ are stale (at time t). At any time t, process q trusts p if and only if some message that q received is still fresh at time t. A detailed description of the algorithm is given in Section 4.3.1.
We analyze the algorithm in terms of the QoS metrics proposed in Chapter 2. The analysis uses the theory of stochastic processes, and provides closed formulas for the QoS metrics of the new algorithm. We then show the following optimality result: among all failure detector algorithms that send messages at the same rate and satisfy the same upper bound on the worst-case detection time, our algorithm is optimal with respect to the query accuracy probability. This shows that the new algorithm guarantees good worst-case detection time while providing good accuracy. We then show that, given a set of QoS requirements by an application, we can use the closed formulas we derived to compute the parameters of the new algorithm to meet these requirements. We first do so assuming that one knows the probabilistic behavior of the system (i.e., the probability distributions of message delays and message losses). We then drop this assumption, and show how to configure the failure detector to meet the QoS requirements of an application even when the probabilistic behavior of the system is not known.
The first version of our algorithm (described above) assumes that p and q have synchronized clocks. This assumption is not unrealistic, even in large networks. For example, GPS clocks are becoming cheap, and they can provide clocks that are very closely synchronized (see e.g. [VR00]). When synchronized clocks are not available, we propose a modification to this algorithm that performs equally well in practice, as shown by our simulations. The basic idea is to use past heartbeat messages to obtain accurate estimates of the expected arrival times of future heartbeats, and then use these estimates to find the freshness points. This computation uses the same heartbeat messages used for failure detection, so it does not involve additional system cost.

The modified algorithm has another parameter, namely n, which is the number of messages used to estimate the expected arrival times of the heartbeat messages. As n varies from 1 to $\infty$, we obtain a spectrum of algorithms. An important observation is that one end point of this spectrum ($n = 1$) corresponds to the simple algorithm, and the other end point ($n = \infty$) corresponds to the new algorithm with known expected arrival times of all heartbeat messages. As n increases, the new algorithm moves away from the simple algorithm and gets closer to the new algorithm with known expected arrival times. This demonstrates that the problem of the simple algorithm is that it does not use enough of the information available (it only uses the most recently received heartbeat message); by using more of the information available (using more messages received), the new algorithm is able to provide better QoS than the simple algorithm.
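One natural way to carry out such an estimation, sketched here under our own assumptions rather than as the exact estimator of Section 4.4, is to keep the last n arrival offsets $A_i - i\eta$ (receipt time minus the known sending offset) and average them:

```python
from collections import deque

class ArrivalEstimator:
    """Estimate expected arrival times of future heartbeats from the
    receipt times of the last n heartbeats (assumed estimator: average
    the receipt times normalized by the known sending offsets i*eta)."""

    def __init__(self, n, eta):
        self.eta = eta
        self.samples = deque(maxlen=n)   # recent values of A_i - i*eta

    def record(self, i, recv_time):
        # heartbeat m_i is sent at i*eta on p's clock; the normalized
        # offset estimates (clock shift between p and q) + (mean delay)
        self.samples.append(recv_time - i * self.eta)

    def expected_arrival(self, j):
        # estimated expected receipt time of a future heartbeat m_j
        return j * self.eta + sum(self.samples) / len(self.samples)
```

With $\eta = 1$ and receipt times 1.3, 2.1, 3.2 for $m_1, m_2, m_3$, the estimated expected arrival time of $m_4$ is $4 + \mathrm{mean}(0.3, 0.1, 0.2) = 4.2$; the estimated freshness point is then this estimate shifted by the algorithm's slack parameter.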
Finally, we run simulations of both the new algorithm and the simple algorithm, and provide a detailed analysis of the simulation results. The conclusions we draw from the simulation results are: (a) the simulation results of the new algorithm are consistent with our mathematical analysis of the QoS metrics; (b) the modified new algorithm for systems with unsynchronized clocks provides essentially the same QoS as the algorithm with synchronized clocks; and (c) when the new algorithm and the simple algorithm send messages at the same rate and satisfy the same upper bound on the worst-case detection time, the new algorithm provides (in some cases orders of magnitude) better accuracy than the simple algorithm.
4.1.3 Related Work

Heartbeat-style failure detectors are commonly used in practice. To keep both good detection time and good accuracy, many implementations rely on special features of the operating system and communication system to try to deliver heartbeat messages as regularly as possible (see the discussion in Section 12.9 of [Pfi98]). This is not easy even for closely-connected computer clusters, and it is very hard in wide-area networks.

Some other failure detector algorithms and their analyses can be found in [vRMH98, GM98, RT99]. The gossip-style heartbeat protocol in [vRMH98] focuses on the scalability of the protocol, and it falls into the category of the simple algorithm as given in Section 4.1.1. In this protocol, nodes in the network randomly pick some other node to forward a heartbeat message, so heartbeat messages generated by a source node may in some cases reach a destination node directly, while in other cases they may be forwarded by many intermediate nodes before reaching the same destination node. Thus the protocol has a large variation of end-to-end message delays, and therefore it has the problem of the simple algorithm pointed out in Section 4.1.1. Algorithms presented in [GM98] are different from the one-way heartbeat algorithms we discuss in this chapter, and they are used in a more limited setting in which a single suspicion will terminate the protocol. The group membership failure detection algorithm presented in [RT99] detects member failures in a group: if some process detects a failure in the group (perhaps a false detection), then all processes report a group failure and the protocol terminates. The algorithm uses a heartbeat-style protocol, and its timeout mechanism is the same as that of the simple algorithm described in Section 4.1.1.

The probabilistic network model used in this chapter is similar to the ones used in [Cri89, Arv94] for probabilistic clock synchronization. The method of estimating the expected arrival times of heartbeat messages is close to the method of remote clock reading of [Arv94].
The rest of the chapter is organized as follows. In Section 4.2, we define the probabilistic network model. In Section 4.3, we present the new failure detector algorithm, analyze it in terms of the QoS metrics, show the optimality result, and show how to configure the new algorithm to satisfy given QoS requirements. In Section 4.4 we show how to configure the failure detector algorithm when the probabilistic behavior of the messages is not known, and how to modify the algorithm so that it works when the local clocks are not synchronized. We present the simulation results in Section 4.5, and conclude the chapter with some discussions in Section 4.6.
4.2 The Probabilistic Network Model

We assume that processes p and q are connected by a link that does not create or duplicate messages,¹ but may delay or drop messages. Note that the link here represents an end-to-end connection and does not necessarily correspond to a physical link.

We assume that the message loss and message delay behavior of any message sent through the link is probabilistic, and is characterized by the following two parameters: (a) the message loss probability $p_L$, which is the probability that a message is dropped by the link; and (b) the message delay time $D$, which is a random variable with range $(0, \infty)$ representing the delay from the time a message is sent to the time it is received, under the condition that the message is not dropped by the link. We assume that the expected value $E(D)$ and the variance $V(D)$ of $D$ are finite. Note that our model does not assume that the message delay time $D$ follows any particular distribution, and thus it is applicable to many practical systems.
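In code, this link model is just a two-parameter sampler. The sketch below assumes, for illustration only, exponentially distributed delays; the model itself places no such restriction on D:

```python
import random

def send_through_link(send_time, p_loss=0.01, mean_delay=0.02, rng=random):
    """Return the receipt time of a message sent at send_time, or None if
    the link drops it. The delay D is sampled from an exponential
    distribution purely as an example; the model allows any delay
    distribution with finite mean and variance."""
    if rng.random() < p_loss:
        return None                          # message dropped (prob. p_L)
    return send_time + rng.expovariate(1.0 / mean_delay)
```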
Processes p and q have access to their own local clocks. For simplicity, we assume that there is no clock drift, i.e., local clocks run at the same speed as real time (our results can be easily generalized to the case where local clocks have bounded drifts). In Section 4.3, we further assume that clocks are synchronized. We explain how to remove this assumption in Section 4.4.2.

For simplicity we assume that the probabilistic behavior of the network does not change over time. In Section 4.6 we briefly discuss some issues related to the change of network behavior, and explain how our algorithm can adapt to such changes.

¹Message duplication can be easily taken care of: whenever we refer to a message being received, we change it to the first copy of the message being received. With this modification, all definitions and analyses in this chapter go through, and in particular, our results remain correct without any change.

[Figure 4.1: Three scenarios of the failure detector output in one interval $[\tau_i, \tau_{i+1})$]
4.3 The New Failure Detector Algorithm and Its Analysis

4.3.1 The Algorithm

In the new algorithm, the task of process p is the same as in the simple algorithm: p periodically sends heartbeat messages $m_1, m_2, m_3, \ldots$ to q every $\eta$ time units, where $\eta$ is a parameter of the algorithm. Heartbeat message $m_i$ is tagged with its sequence number i. Let $\sigma_i$ be the sending time of message $m_i$.

The new algorithm differs from the simple algorithm in the task of process q. In the new algorithm, q has a sequence of time points $\tau_1 < \tau_2 < \tau_3 < \ldots$, such that $\tau_i$ is obtained by shifting the sending time $\sigma_i$ forward in time by $\delta$ time units (i.e. $\tau_i = \sigma_i + \delta$), where $\delta$ is a fixed parameter of the algorithm. Time points $\tau_i$'s, together
with the arrival times of the heartbeat messages, are used to determine the output of the failure detector at q, as we now explain. Consider the time period $[\tau_i, \tau_{i+1})$. At time $\tau_i$, the failure detector at q checks whether q has received some message $m_j$ with $j \ge i$. If so, the failure detector trusts p in the period $[\tau_i, \tau_{i+1})$ (Fig. 4.1 (a)). If not, the failure detector starts suspecting p at time $\tau_i$. During the period $[\tau_i, \tau_{i+1})$, if q receives some message $m_j$ with $j \ge i$, then the failure detector at q starts trusting p when the message is received, and keeps trusting p until time $\tau_{i+1}$ (Fig. 4.1 (b)). If by time $\tau_{i+1}$ q has not received any message $m_j$ with $j \ge i$, then the failure detector suspects p in the entire period $[\tau_i, \tau_{i+1})$ (Fig. 4.1 (c)). This procedure is repeated for every period.

Note that from time $\tau_i$ to $\tau_{i+1}$, only messages $m_j$ with $j \ge i$ can affect the output of the failure detector. For this reason, $\tau_i$ is called a freshness point: from time $\tau_i$ to $\tau_{i+1}$, messages $m_j$ with $j \ge i$ are still fresh (useful), and messages $m_j$ with $j < i$ are stale (not useful). The core property of the algorithm is that q trusts p at time t if and only if some message that q received is still fresh at time t.

The detailed algorithm, denoted NFD-S, is given in Fig. 4.2.²
Process p:
1  for some constant $\eta$, send to q heartbeat messages $m_1, m_2, m_3, \ldots$ at regular time points $\eta, 2\eta, 3\eta, \ldots$, respectively;

Process q:
2  Initialization:
3    for all $i \ge 1$, set $\tau_i = \sigma_i + \delta$;  {$\sigma_i = i\eta$ is the sending time of $m_i$}
4    output ← S;  {suspect p initially}
5  at every freshness point $\tau_i$:
6    if no message $m_j$ with $j \ge i$ has been received then
7      output ← S;  {suspect p if no fresh message is received}
8  upon receive message $m_j$ at time $t \in [\tau_i, \tau_{i+1})$:
9    if $j \ge i$ then output ← T;  {trust p when some fresh message is received}

Figure 4.2: The new failure detector algorithm NFD-S, with synchronized clocks, and with parameters $\eta$ and $\delta$

4.3.2 The Analysis

We now analyze the QoS metrics of the algorithm. For the analysis, we assume that the link from p to q satisfies the following message independence property: (a) the message loss and message delay behavior of any message sent by p is independent of whether or when p crashes later; and (b) there exists a known constant $\Delta$ such that the message loss and message delay behaviors of any two messages sent at least $\Delta$ time units apart are independent. We assume that the intersending time $\eta$ is chosen such that $\eta \ge \Delta$, so that all heartbeat messages have independent delay and loss behaviors. For simplicity, we assume that the actions in lines 5–7 and lines 8–9 are executed instantaneously without interruption.

We adopt the following convention about transitions of a failure detector's output:³ when an S-transition occurs at time t, the output at time t is S, and a symmetric convention is taken for T-transitions. With this convention, the output is right-continuous: namely, if the output at a time t is $X \in \{T, S\}$, then there exists $\epsilon > 0$ such that the output is also X in the period $(t, t + \epsilon)$.

Henceforth, let $\tau_0 \stackrel{\mathrm{def}}{=} 0$, and $\tau_i$, $i \ge 1$, are given as in line 3. The following core lemma states precisely our intuitive ideas about freshness points and fresh messages. All subsequent analyses are based on this lemma.

²This version of the algorithm is convenient for illustrating the main idea and for performing the analysis. One can easily derive an equivalent version that is more efficient in practice.

³This convention is already included in the formal model defined in Chapter 3.
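The freshness-point rule of Fig. 4.2 can also be rendered as a small offline function over the heartbeat receipt times (a sketch with names of our own choosing; it assumes synchronized clocks and $\sigma_i = i\eta$):

```python
import math

def nfd_s_output(t, recv_times, eta, delta):
    """Output of NFD-S at time t. recv_times[j-1] is the receipt time of
    heartbeat m_j (None if the message was lost). m_i is sent at i*eta,
    so freshness point tau_i = i*eta + delta. Returns 'T' or 'S'."""
    # find i such that t lies in [tau_i, tau_{i+1}), taking tau_0 = 0
    i = max(0, math.floor((t - delta) / eta))
    # q trusts p at time t iff some still-fresh message m_j (j >= i)
    # has been received by time t
    return "T" if any(
        r is not None and r <= t
        for j, r in enumerate(recv_times, start=1) if j >= i
    ) else "S"
```

For example, with $\eta = 1$, $\delta = 0.5$, and receipt times 1.2, lost, 3.1 for $m_1, m_2, m_3$, the output is S before $m_1$ arrives, T from 1.2 up to $\tau_2 = 2.5$, S on $[2.5, 3.1)$ because $m_1$ has gone stale and $m_2$ is lost, and T again once $m_3$ arrives.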
Lemma 4.1 For all $i \ge 0$ and all times $t \in [\tau_i, \tau_{i+1})$, q trusts p at time t if and only if q has received some message $m_j$ with $j \ge i$ by time t.

Proof. Fix an $i \ge 0$ and a time $t \in [\tau_i, \tau_{i+1})$. Suppose first that q has received some message $m_j$ with $j \ge i$ by time t. Let $t' \le t$ be the time when $m_j$ is received. Choose $i'$ such that $t' \in [\tau_{i'}, \tau_{i'+1})$. Thus $i' \le i \le j$. According to line 9, q trusts p at time $t'$. For every freshness point $\tau_{i''}$ in the period $(t', t]$, since $m_j$ is received at $t'$ and $i'' \le i \le j$, the output of the failure detector does not change to S, according to lines 5–7. Therefore, q trusts p at time t.

Suppose now that q has not received any message $m_j$ with $j \ge i$ by time t. Then at time $\tau_i$, q suspects p according to lines 5–7. During the period $(\tau_i, t]$, since no message $m_j$ with $j \ge i$ is received, the output of the failure detector does not change to T. So q suspects p at time t.
The following definitions are used in the analysis, and they are all with respect to failure-free runs.⁴ Note that even though i appears in these definitions, the actual values of i are irrelevant. This is made clear in Proposition 4.2.

Definition 4.1

(1) For any $i \ge 1$, let k be the smallest integer such that for all $j \ge i + k$, $m_j$ is sent at or after time $\tau_i$.

(2) For any $i \ge 1$, let $p_j(x)$ be the probability that q does not receive message $m_{i+j}$ by time $\tau_i + x$, for every $j \ge 0$ and every $x \ge 0$; let $p_0 = p_0(0)$.

(3) For any $i \ge 2$, let $q_0$ be the probability that q receives message $m_{i-1}$ before time $\tau_i$.

(4) For any $i \ge 1$, let $u(x)$ be the probability that q suspects p at time $\tau_i + x$, for every $x \in [0, \eta)$.

(5) For any $i \ge 2$, let $p_S$ be the probability that an S-transition occurs at time $\tau_i$.

⁴Recall that a failure-free run is a run in which p does not crash, as defined in Section 2.2.1.
Proposition 4.2

(1) $k = \lceil \delta/\eta \rceil$.

(2) For all $j \ge 0$ and for all $x \ge 0$, $p_j(x) = p_L + (1 - p_L) \Pr(D > \delta + x - j\eta)$.

(3) $q_0 = (1 - p_L) \Pr(D < \delta + \eta)$.

(4) For all $x \in [0, \eta)$, $u(x) = \prod_{j=0}^{k} p_j(x)$.

(5) $p_S = q_0 \cdot u(0)$.

Upon a first reading, readers may skip the rest of the analysis and go directly to Theorem 4.11.
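To make the proposition concrete, these quantities can be evaluated numerically once a delay distribution is fixed. The sketch below assumes, for illustration only, exponentially distributed delays (the analysis itself holds for any distribution of D):

```python
import math

def nfd_s_quantities(p_L, mean_delay, eta, delta):
    """Evaluate k, p_0, q_0, u(0) and p_S of Proposition 4.2 for NFD-S,
    assuming (for this example only) delay D ~ exponential(mean_delay)."""
    def tail(d):                 # Pr(D > d); equals 1 for d <= 0 since D > 0
        return math.exp(-max(d, 0.0) / mean_delay)

    def p_j(j, x):               # prob. that m_{i+j} is not received by tau_i + x
        return p_L + (1 - p_L) * tail(delta + x - j * eta)

    k = math.ceil(delta / eta)   # smallest k with k*eta >= delta
    p_0 = p_j(0, 0.0)
    q_0 = (1 - p_L) * (1 - tail(delta + eta))   # Pr(m_{i-1} arrives before tau_i)
    u_0 = math.prod(p_j(j, 0.0) for j in range(k + 1))  # prob. q suspects p at tau_i
    p_S = q_0 * u_0              # probability of an S-transition at tau_i
    return k, p_0, q_0, u_0, p_S
```

For instance, with $p_L = 0.01$, $E(D) = 0.02$, $\eta = 0.1$ and $\delta = 0.3$, we get $k = 3$ and an S-transition probability $p_S$ on the order of $10^{-6}$ per freshness point.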
We now analyze the accuracy metrics of the algorithm NFD-S, and to do so we assume that p does not crash.
Proposition 4.3 (1) An S-transition can only occur at time $\tau_i$ for some $i \ge 2$, and it occurs at $\tau_i$ if and only if message $m_{i-1}$ is received by q before time $\tau_i$ and no message $m_j$ with $j \ge i$ is received by q by time $\tau_i$; (2) Lemma 4.1 remains true if $j \ge i$ in the statement is replaced by $i \le j \le i + k$; (3) part (1) above remains true if $j \ge i$ in the statement is replaced by $i \le j < i + k$.

Proof. From the algorithm, it is clear that an S-transition can only occur at time $\tau_i$ with $i \ge 1$. An S-transition cannot occur at time $\tau_1$, because if so, q suspects p at time $\tau_1$, which implies from Lemma 4.1 that q does not receive $m_i$ by time $\tau_1$ for all $i \ge 1$. Then q must also suspect p during the period $[0, \tau_1)$, a contradiction.
An S-transition occurs at time $\tau_i$ if and only if (a) q suspects p at time $\tau_i$, and (b) for some $t' \in (\tau_{i-1}, \tau_i)$, q trusts p at time $t'$. Then by Lemma 4.1, (a) is true if and only if no message $m_j$ with $j \ge i$ is received by q by time $\tau_i$, while (b) is true if and only if some message $m_j$ with $j \ge i - 1$ is received by q by time $t' < \tau_i$. Combining (a) and (b), we know that an S-transition occurs at time $\tau_i$ if and only if message $m_{i-1}$ is received by q before time $\tau_i$ and no message $m_j$ with $j \ge i$ is received by q by time $\tau_i$.

(2) and (3) follow from the definition of k.
Note that part (1) of the above proposition guarantees that during any bounded time period, there are only a finite number of transitions of the failure detector output.
Proof of Proposition 4.2. (1) is immediate from the fact that $m_j$ is sent at time $\tau_i - \delta + (j - i)\eta$ for all $i \ge 1$.

(2) directly follows from the fact that $p_j(x)$ is the probability that either $m_{i+j}$ is lost, or $m_{i+j}$ is not lost but is delayed by more than $\sigma_i + \delta + x - (\sigma_i + j\eta) = \delta + x - j\eta$ time units.

(3) directly follows from the fact that $q_0$ is the probability that $m_{i-1}$ is not lost and is delayed less than $\delta + \eta$ time units.

(4) By Proposition 4.3 (2), $u(x)$ is the probability that q does not receive any message $m_j$ with $i \le j \le i + k$ by time $\tau_i + x$. Then by the definition of $p_j(x)$ and the message independence property, we have $u(x) = \prod_{j=0}^{k} p_j(x)$.

(5) By Proposition 4.3 (1), $p_S$ is the probability that (a) message $m_{i-1}$ is received by q before time $\tau_i$, and (b) no message $m_j$ with $j \ge i$ is received by q by time $\tau_i$. By the message independence property, (a) and (b) are independent, and by Lemma 4.1, (b) is also the event that q suspects p at time $\tau_i$. Thus by the definitions of $q_0$ and $u(x)$ we have $p_S = q_0 \cdot u(0)$.
Proposition 4.4 u(0) ≥ p_0^k, and for all x ∈ [0, η), u(0) ≥ u(x).

Proof. By Proposition 4.2, p_j(0) ≥ p_0(0) = p_0, p_k(0) = 1, and p_j(0) ≥ p_j(x) for x ∈ [0, η). So u(0) = ∏_{j=0}^{k} p_j(0) ≥ ∏_{j=0}^{k−1} p_0 = p_0^k, and u(0) = ∏_{j=0}^{k} p_j(0) ≥ ∏_{j=0}^{k} p_j(x) = u(x).
Lemma 4.5 (1) If p_0 = 0, then with probability one q trusts p forever after time τ_1; (2) if q_0 = 0, then with probability one q suspects p forever; (3) if p_0 > 0 and q_0 > 0, then with probability one the failure detector at q has an infinite number of transitions.

Proof. (1) By definition, p_0 = 0 means that for all i ≥ 1, the probability that q does not receive m_i by time τ_i is 0. Thus by Lemma 4.1, the probability that q keeps trusting p in the period [τ_i, τ_{i+1}) is 1. Therefore, with probability one q trusts p forever after time τ_1.

(2) By definition, q_0 = 0 means that for all i ≥ 2, the probability that q receives m_{i−1} before time τ_i is 0. For every j ≥ i, message m_j is sent after m_{i−1}, so the probability that q receives m_j before time τ_i is also 0. This implies that the probability that q receives some m_j with j ≥ i − 1 before time τ_i is 0. By Lemma 4.1, we have that for all i ≥ 0, the probability that q keeps suspecting p in the period [τ_i, τ_{i+1}) is 1. Thus with probability one q suspects p forever.
(3) Suppose p_0 > 0 and q_0 > 0. For all i ≥ 2, let A_i be the event that there is an S-transition at time τ_i. By Proposition 4.3 (3), A_i is also the event that message m_{i−1} is received before time τ_i but no message m_j with i ≤ j < i + k is received by time τ_i. Hence A_i only depends on the messages m_j with i − 1 ≤ j < i + k, which implies that {A_{i(k+1)}, i ≥ 2} are independent. By definition and Proposition 4.4, we have Pr(A_i) = q_0 · u(0) ≥ q_0 · p_0^k > 0. Therefore, with probability one, the failure detector at q has an infinite number of transitions.
The above lemma factors out the special case in which p_0 = 0 or q_0 = 0. We call this special case the degenerated case. From now on, we only consider the nondegenerated case, in which p_0 > 0 and q_0 > 0, and we only consider runs in which the output of the failure detector has an infinite number of transitions.
Lemma 4.6 P_A = 1 − (1/η) ∫_0^η u(x) dx.
Proof. For all i ≥ 1, let P_i be the probability that at a random time T ∈ [τ_i, τ_{i+1}), q is suspecting p. Note that T is uniformly distributed on [τ_i, τ_{i+1}) with density 1/(τ_{i+1} − τ_i) = 1/η. Thus

    P_i = (1/η) ∫_{τ_i}^{τ_{i+1}} u(x − τ_i) dx = (1/η) ∫_0^η u(x) dx.

Note that the value of P_i does not depend on i. Let this value be P. Thus we have that P_A, the probability that q trusts p at a random time, is 1 − P. This shows the lemma.
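Lemma 4.6 is easy to evaluate numerically once a delay distribution is fixed. The sketch below is an illustration, not part of the thesis: it assumes exponentially distributed delays with an arbitrary mean, computes u(x) from Proposition 4.2, and obtains P_A by numerical integration.

```python
import math

def u(x, eta, delta, p_loss, mean_delay):
    # u(x) = prod_{j=0}^{k} p_j(x) with k = ceil(delta/eta), where p_j(x) is
    # the probability that m_{i+j} is lost or delayed more than
    # delta + x - j*eta (Proposition 4.2).
    # Pr(D > t) = exp(-t/mean_delay) is an assumed delay model.
    k = math.ceil(delta / eta)
    pr_gt = lambda t: 1.0 if t <= 0 else math.exp(-t / mean_delay)
    prod = 1.0
    for j in range(k + 1):
        prod *= p_loss + (1 - p_loss) * pr_gt(delta + x - j * eta)
    return prod

def query_accuracy(eta, delta, p_loss, mean_delay, steps=10000):
    # P_A = 1 - (1/eta) * integral_0^eta u(x) dx (Lemma 4.6), midpoint rule.
    h = eta / steps
    integral = h * sum(u((m + 0.5) * h, eta, delta, p_loss, mean_delay)
                       for m in range(steps))
    return 1 - integral / eta

print(query_accuracy(eta=1.0, delta=2.0, p_loss=0.01, mean_delay=0.5))
```

Note that u is nonincreasing on [0, η) here, consistent with Proposition 4.4's u(0) ≥ u(x).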
We now analyze the average mistake recurrence time E(T_MR) of the failure detector. We will show that

Lemma 4.7 E(T_MR) = η/p_S.

If, at each time point τ_i with i ≥ 2, the test of whether an S-transition occurs were an independent Bernoulli trial, then the above result would be very easy to obtain: p_S is the probability of success in one Bernoulli trial, i.e., an S-transition occurs at τ_i, and η is the time between two Bernoulli trials, so η/p_S is the expected time between two successful Bernoulli trials, which is just the expected time between two S-transitions. Unfortunately, this is not the case, because the tests of whether S-transitions occur at the τ_i's are not independent. In fact, by Proposition 4.3, the event that an S-transition occurs at τ_i depends on the behavior of messages m_{i−1}, . . . , m_{i+k−1}. Thus two such events may depend on the behavior of common messages, and so they are not independent in general.
To deal with this, we use some results in renewal theory, a branch of the theory of stochastic processes. Besides proving Lemma 4.7, the analysis also reveals an important property of the failure detector output: each recurrence interval between two consecutive S-transitions is independent of the other recurrence intervals.

The analysis proceeds as follows. We first introduce the concept of a renewal process. A more formal account can be found in any standard textbook on stochastic processes (see for example Chapter 3 of [Ros83]). Let {(T_n, R_n), n = 1, 2, . . .} be a sequence of random variable pairs such that (1) a nonnegative T_n denotes the time between the (n − 1)-th and the n-th occurrences of some recurrent event A, i.e., event A occurs at times t_1 = T_1, t_2 = T_1 + T_2, t_3 = T_1 + T_2 + T_3, . . .; and (2) R_n can be interpreted as the reward associated with the n-th occurrence of event A. A delayed renewal reward process is such a sequence satisfying: (1) the pairs (T_n, R_n), n ≥ 1, are mutually independent; and (2) the pairs (T_n, R_n), n ≥ 2, are identically distributed. If {R_n} is omitted, then the process {T_n, n ≥ 1} is called a delayed renewal process. Such processes are well studied in the literature, and are known to have some nice properties.

Now consider the S-transitions of the failure detector output as the recurrent events. Let T_{MR,n} be the random variable representing the time that elapses from the (n − 1)-th S-transition to the n-th S-transition (as a convention, consider time 0 to be the time at which the 0-th S-transition occurs). Let T_{M,n} be the random variable representing the time that elapses from the (n − 1)-th S-transition to the n-th T-transition. Thus T_{M,n} ≤ T_{MR,n} for all n ≥ 1.

Lemma 4.8 {(T_{MR,n}, T_{M,n}), n = 1, 2, . . .} is a delayed renewal reward process.
We need the following technical result before proving this lemma.

Proposition 4.9 Let {A_i, i ≥ 1} be an event partition (i.e., the A_i are disjoint and cover all events). Two random variables X and Y are independent if for all A_i: (1) X is independent of A_i, that is, if Pr(A_i) > 0 then for all x ∈ [−∞, ∞], Pr(X ≤ x) = Pr(X ≤ x | A_i); and (2) X and Y are independent when conditioned on A_i, that is, if Pr(A_i) > 0 then for all x, y ∈ [−∞, ∞], Pr(X ≤ x, Y ≤ y | A_i) = Pr(X ≤ x | A_i) Pr(Y ≤ y | A_i).

Proof. For all x, y ∈ [−∞, ∞],

    Pr(X ≤ x, Y ≤ y) = Σ_{i≥1} Pr(X ≤ x, Y ≤ y | A_i) Pr(A_i)
                     = Σ_{Pr(A_i)>0} Pr(X ≤ x | A_i) Pr(Y ≤ y | A_i) Pr(A_i)
                     = Pr(X ≤ x) Σ_{Pr(A_i)>0} Pr(Y ≤ y | A_i) Pr(A_i)
                     = Pr(X ≤ x) Pr(Y ≤ y).

Thus X and Y are independent.

Note that we can replace X and Y in the above proposition with random vectors and the result still holds.
Proof of Lemma 4.8. For all n ≥ 1, by Proposition 4.3 (1), the n-th S-transition can only occur at time τ_i for some i ≥ 2. Let A^n_i be the event that the n-th S-transition occurs at time τ_i. Thus for each n ≥ 1, {A^n_i, i ≥ 2} is an event partition. Let B^n_i be the event consisting of all the runs in which the messages m_j with j < i behave in the same way as in some run in A^n_i. Let C_i be the event that no message m_j with j ≥ i is received by time τ_i. Since C_i and B^n_i are determined by completely different sets of messages, C_i is independent of B^n_i.

To complete the proof of the lemma, we now show the following five claims.

Claim 1. For all n ≥ 1 and for all i ≥ 2, A^n_i = B^n_i ∩ C_i.

Proof of Claim 1. By definition, A^n_i ⊆ B^n_i. By Proposition 4.3 (1), A^n_i implies that no message m_j with j ≥ i arrives by time τ_i, and thus A^n_i ⊆ C_i. So A^n_i ⊆ B^n_i ∩ C_i. For any run r_1 in B^n_i ∩ C_i, by the definition of B^n_i, there is a run r_2 in A^n_i such that the messages m_j with j < i behave exactly in the same way in both runs. Since r_1 ∈ C_i, we know from the definition of C_i that in r_1 no message m_j with j ≥ i is received by time τ_i. Since r_2 ∈ A^n_i, we know from the definition of A^n_i and Proposition 4.3 (1) that in r_2 no message m_j with j ≥ i is received by time τ_i. Thus in both runs r_1 and r_2, the failure detector outputs up to time τ_i are the same. Therefore, in r_1 the n-th S-transition occurs at τ_i just as in r_2, which means r_1 ∈ A^n_i. Thus Claim 1 holds.
Claim 2. For all n, n′ ≥ 1 and for all i, i′ ≥ 2, if Pr(A^n_i) > 0 and Pr(A^{n′}_{i′}) > 0, then for all x, y ∈ [−∞, ∞],

    Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y | A^n_i) = Pr(T_{MR,n′+1} ≤ x, T_{M,n′+1} ≤ y | A^{n′}_{i′}).    (4.1)

Proof of Claim 2. Suppose Pr(A^n_i) > 0 and Pr(A^{n′}_{i′}) > 0. Let t_T and t_S be two random variables representing the times at which the first T-transition and the first S-transition occur after time τ_i, respectively. Since A^n_i represents the event that the n-th S-transition occurs at time τ_i, we have Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y | A^n_i) = Pr(t_S − τ_i ≤ x, t_T − τ_i ≤ y | A^n_i). Let D_i be the event {t_S − τ_i ≤ x, t_T − τ_i ≤ y}. Equality (4.1) is thus equivalent to Pr(D_i | A^n_i) = Pr(D_{i′} | A^{n′}_{i′}).
By Lemma 4.1, the output of the failure detector after τ_i is completely determined by the messages m_j with j ≥ i. Thus we know that D_i is completely determined by the messages m_j with j ≥ i. Since C_i is completely determined by the messages m_j with j ≥ i while B^n_i is completely determined by the messages m_j with j < i, we have that both C_i and C_i ∩ D_i are independent of B^n_i. We claim that Pr(D_i | A^n_i) = Pr(D_i | C_i). Indeed,

    Pr(D_i | A^n_i) = Pr(D_i | B^n_i ∩ C_i) = Pr(D_i ∩ B^n_i ∩ C_i)/Pr(B^n_i ∩ C_i)
                   = [Pr(D_i ∩ B^n_i ∩ C_i)/Pr(B^n_i)] / [Pr(B^n_i ∩ C_i)/Pr(B^n_i)]
                   = Pr(D_i ∩ C_i | B^n_i)/Pr(C_i | B^n_i)
                   = Pr(D_i ∩ C_i)/Pr(C_i)
                   = Pr(D_i | C_i).

Similarly we have Pr(D_{i′} | A^{n′}_{i′}) = Pr(D_{i′} | C_{i′}). Thus we only need to show that Pr(D_i | C_i) = Pr(D_{i′} | C_{i′}).

Pr(D_i | C_i) is the probability that, given that no message m_j with j ≥ i is received by time τ_i, the first S-transition after τ_i occurs within x time units after τ_i and the first T-transition after τ_i occurs within y time units. Since the occurrences of these transitions are all determined by the messages m_j with j ≥ i, and messages are sent at regular intervals, it is easy to verify that this probability is the same for every i ≥ 2. Thus Pr(D_i | C_i) = Pr(D_{i′} | C_{i′}), and Claim 2 holds.
Claim 3. The pairs (T_{MR,n}, T_{M,n}), n ≥ 2, are identically distributed.

Proof of Claim 3. This is a direct consequence of Claim 2. In fact, for all n ≥ 2 and for all x, y ∈ [−∞, ∞], let p(x, y) = Pr(T_{MR,n} ≤ x, T_{M,n} ≤ y | A^{n−1}_i) for any i with Pr(A^{n−1}_i) > 0. This is well defined by Claim 2. Then

    Pr(T_{MR,n} ≤ x, T_{M,n} ≤ y) = Σ_{i≥2} Pr(T_{MR,n} ≤ x, T_{M,n} ≤ y | A^{n−1}_i) Pr(A^{n−1}_i)
                                  = Σ_{Pr(A^{n−1}_i)>0} p(x, y) Pr(A^{n−1}_i) = p(x, y).
Claim 4. For all n ≥ 1 and i ≥ 2, (T_{MR,n+1}, T_{M,n+1}) is independent of A^n_i.

Proof of Claim 4. This is another direct consequence of Claim 2. Suppose that we fix i and n and that Pr(A^n_i) > 0. Then we have for all x, y ∈ [−∞, ∞],

    Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y) = Σ_{j≥2} Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y | A^n_j) Pr(A^n_j)
                                      = Σ_{Pr(A^n_j)>0} Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y | A^n_i) Pr(A^n_j)
                                      = Pr(T_{MR,n+1} ≤ x, T_{M,n+1} ≤ y | A^n_i).

This shows that (T_{MR,n+1}, T_{M,n+1}) is independent of A^n_i.
Claim 5. For all n ≥ 1 and for all i ≥ 2, when conditioned on A^n_i, (T_{MR,n+1}, T_{M,n+1}) is independent of {(T_{MR,j}, T_{M,j}), j ≤ n}.

Proof of Claim 5. We already know that, when conditioned on A^n_i, (T_{MR,n+1}, T_{M,n+1}) is completely determined by the distribution of the messages m_j with j ≥ i. On the other hand, when conditioned on A^n_i, the occurrence of any S-transition or T-transition before τ_i is determined only by the messages m_j with j < i, because A^n_i implies that no message m_j with j ≥ i arrives at q by time τ_i. Since the occurrences of transitions before and after τ_i are determined by disjoint sets of messages, and messages are independent of each other, Claim 5 holds.
From Claims 4 and 5 and Proposition 4.9, we know that (T_{MR,n+1}, T_{M,n+1}) is independent of {(T_{MR,j}, T_{M,j}), j ≤ n}. Thus the pairs (T_{MR,n}, T_{M,n}), n ≥ 1, are mutually independent. From Claim 3, we know that the pairs (T_{MR,n}, T_{M,n}), n ≥ 2, are identically distributed. Therefore, {(T_{MR,n}, T_{M,n}), n = 1, 2, . . .} is a delayed renewal reward process.

It is immediate from the above lemma that for all n ≥ 2, T_MR = T_{MR,n}, T_M = T_{M,n} and T_G = T_{MR,n} − T_{M,n}. This provides more direct ways to analyze the distributions of these variables. Moreover, any delayed renewal reward process is ergodic (see for example Section 2.6 of [Sig95]), so Theorem 2.1 of Chapter 2 is applicable to our failure detector.
Proof of Lemma 4.7. For all i ≥ 2, let A_i be the event that an S-transition occurs at time τ_i. By definition and Proposition 4.4, we have that Pr(A_i) = p_S = q_0 · u(0) ≥ q_0 · p_0^k. Since in the nondegenerated case q_0 > 0 and p_0 > 0, we have Pr(A_i) > 0. By Proposition 4.3 (3), A_i is also the event that m_{i−1} is received before time τ_i but no message m_j with i ≤ j < i + k is received by time τ_i. This implies that A_i only depends on the messages m_j with i − 1 ≤ j < i + k, which in turn implies that for every j ∈ {2, . . . , k + 2}, the events A_{i(k+1)+j}, i ≥ 0, are independent.

For j ∈ {2, . . . , k + 2}, let B_j be the set of time points {τ_{i(k+1)+j} : i = 0, 1, . . .}. Obviously, the B_j form a partition of all the time points τ_i, i ≥ 2. Let N_j(t) be the random variable representing the number of S-transitions that occur at times in B_j by time t, and let N(t) be the random variable representing the total number of S-transitions by time t. Thus N(t) = Σ_{j=2}^{k+2} N_j(t).

Consider N_j(t) for some j ∈ {2, . . . , k + 2}. For t ≥ τ_j, the number of time points in B_j that are no greater than t is ⌊(t − τ_j)/((k + 1)η)⌋ + 1. From the above, we know that at each of these time points there is an independent probability p_S that an S-transition occurs. Therefore, the average number of S-transitions at these time points by time t ≥ τ_j is given by

    E(N_j(t)) = p_S (⌊(t − τ_j)/((k + 1)η)⌋ + 1).

Hence, we have for t ≥ τ_{k+2},

    E(N(t)) = Σ_{j=2}^{k+2} p_S (⌊(t − τ_j)/((k + 1)η)⌋ + 1).

By Lemma 4.8, {T_{MR,n}, n ≥ 1} is a delayed renewal process. Then by the Elementary Renewal Theorem (see for example [Ros83] p. 61),

    E(T_MR) = lim_{t→∞} t/E(N(t))
            = lim_{t→∞} t / Σ_{j=2}^{k+2} p_S (⌊(t − τ_j)/((k + 1)η)⌋ + 1)
            = 1 / Σ_{j=2}^{k+2} p_S lim_{t→∞} (⌊(t − τ_j)/((k + 1)η)⌋ + 1)/t
            = 1 / Σ_{j=2}^{k+2} (p_S/((k + 1)η))
            = η/p_S.
By Lemma 4.7, we know that 0 < E(T_MR) < ∞. We can therefore apply Theorem 2.1 of Chapter 2 to obtain results on the other metrics from our results on P_A and E(T_MR).

This completes the analysis of the accuracy metrics of the new failure detector. We now give the bound on the worst-case detection time T_D.
Lemma 4.10 T_D ≤ δ + η. Moreover, the inequality is tight when q_0 > 0, and T_D is always 0 when q_0 = 0.

Proof. Suppose that process p crashes at time t. Let m_i be the last heartbeat message sent by p before p crashes. By definition, m_i is sent at time σ_i, and σ_i ≤ t. Since no messages with sequence number greater than i are sent by p, q does not receive such messages. Thus by Lemma 4.1, for all t′ ∈ [τ_{i+1}, ∞), q suspects p at time t′. So the detection time is at most τ_{i+1} − t = σ_i + δ + η − t ≤ δ + η.

When q_0 > 0, with nonzero probability m_i is received before τ_{i+1} and thus q trusts p just before τ_{i+1}.⁵ In these cases, the detection time is σ_i + δ + η − t. Since the time t at which p crashes can be arbitrarily close to σ_i, the bound δ + η is tight. When q_0 = 0, similarly to Lemma 4.5 (2), we can see that in runs in which p crashes, q also suspects p forever. Therefore T_D is always 0.
All the above analytical results are summarized in Theorem 4.11.

Theorem 4.11 The failure detector NFD-S given in Fig. 4.2 has the following properties:

(1) T_D ≤ δ + η. Moreover, if q_0 > 0, then the inequality is tight, and if q_0 = 0, then T_D is always 0.

(2) If p_0 > 0 and q_0 > 0 (the nondegenerated case), then we have

    E(T_MR) = 1/λ_M = η/p_S,    (4.2)
    E(T_M) = (1 − P_A) · E(T_MR) = (∫_0^η u(x) dx)/p_S,    (4.3)
    P_A = 1 − (1/η) ∫_0^η u(x) dx,    (4.4)
    E(T_G) = P_A · E(T_MR) = (η − ∫_0^η u(x) dx)/p_S,    (4.5)
    E(T_FG) ≥ (1/2) P_A · E(T_MR) = (η − ∫_0^η u(x) dx)/(2p_S).    (4.6)

If p_0 = 0 or q_0 = 0 (the degenerated case), then we have: in failure-free runs, (a) if p_0 = 0, then with probability one q trusts p forever after time τ_1; (b) if q_0 = 0, then with probability one q suspects p forever.

⁵Even though q_0 is defined with respect to runs in which p does not crash, it also applies to runs in which p crashes, by part (a) of the message independence property.
From these closed formulas, we can derive many useful properties of the QoS of the new failure detector. For example, we can derive bounds on the accuracy metrics E(T_MR), E(T_M), P_A, E(T_G), and E(T_FG), as we later do in Theorem 4.17. From these bounds, it is easy to check that when δ increases or η decreases, P_A increases exponentially fast towards 1, and E(T_MR), E(T_G) and E(T_FG) increase exponentially fast towards ∞, while E(T_M) remains bounded by a relatively small value. The tradeoff is that (a) when δ increases, the detection time increases linearly, and (b) when η decreases, the network bandwidth used by the failure detector increases linearly. Therefore, with a small (linear) increase in the detection time or in the network cost, we can get a large (exponential) increase in the accuracy of the new failure detector.
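This linear-cost, exponential-benefit tradeoff is easy to see numerically from equality (4.2). The sketch below is illustrative only: it assumes exponentially distributed delays, and all parameter values are made up. It evaluates E(T_MR) = η/p_S for increasing δ:

```python
import math

def expected_mistake_recurrence(eta, delta, p_loss, mean_delay):
    # E(T_MR) = eta / p_S with p_S = q_0 * u(0)  (equality (4.2) and
    # Proposition 4.2).  Exponential delays are an assumed model:
    # Pr(D > t) = exp(-t/mean_delay).
    pr_gt = lambda t: 1.0 if t <= 0 else math.exp(-t / mean_delay)
    k = math.ceil(delta / eta)
    u0 = 1.0
    for j in range(k + 1):
        u0 *= p_loss + (1 - p_loss) * pr_gt(delta - j * eta)
    q0 = (1 - p_loss) * (1 - pr_gt(delta + eta))
    return eta / (q0 * u0)

for delta in (1.0, 2.0, 3.0, 4.0):
    print(delta, expected_mistake_recurrence(1.0, delta, 0.01, 0.5))
```

Each unit increase in δ (a linear increase in detection time) multiplies E(T_MR) by a large factor in this setting.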
In Sections 4.3.4, 4.4.1 and 4.4.3, we will show how these closed formulas are used to compute the failure detector parameters to satisfy given QoS requirements.
4.3.3 An Optimality Result

Besides the properties given in Theorem 4.11, the new algorithm has the following important optimality property: among all failure detectors that send messages at the same rate and satisfy the same upper bound on the worst-case detection time, the new algorithm provides the best query accuracy probability.

More precisely, let C be the class of failure detector algorithms A such that in every run of A process p sends messages to q every η time units and A satisfies T_D ≤ T^U_D for some constant T^U_D. Let A* be the instance of the new failure detector algorithm NFD-S with parameters η and δ = T^U_D − η (δ can be negative). By part (1) of Theorem 4.11, we know that A* ∈ C. We show that

Theorem 4.12 For any A ∈ C, let P_A be the query accuracy probability of A, and let P_{A*} be the query accuracy probability of A*. Then P_{A*} ≥ P_A.

The core idea behind the theorem is the following important property of algorithm A*: roughly speaking, if in some failure-free run r* of A* process q suspects p at time t, then for any A ∈ C, in any failure-free run r′ of A in which the message delay and loss behaviors are exactly the same as in run r*, q also suspects p at time t. With this property, it is easy to see that the probability that q trusts p at a random time in A* must be at least as high as the probability that q trusts p at a random time in any A ∈ C. We now give the more detailed proof.
A message delay pattern P_D is a sequence {d_1, d_2, d_3, . . .} with d_i ∈ (0, ∞] representing the delay of the i-th message sent by p; d_i = ∞ means that the i-th message is lost. The distribution of message delay patterns is governed by the message loss probability p_L and the message delay time D, and thus it is the same for all algorithms in C.

We first consider the subclass C′ of C such that for any algorithm A ∈ C′, in any run of A process p sends messages to q at times η, 2η, 3η, . . ., just as in A*. For any algorithm in C′, a message delay pattern completely determines the times and the order at which q receives messages in failure-free runs. For A*, this means that a message delay pattern uniquely determines a failure-free run of A*. For some other algorithm A ∈ C′, if A is nondeterministic, then A may have different failure-free runs with the same message delay pattern.
Lemma 4.13 Given any message delay pattern P_D, let r* be the failure-free run of A* with P_D, and let r be a failure-free run of some algorithm A ∈ C′ with P_D. Then for every time t ≥ T^U_D, if q suspects p at time t in run r*, then q suspects p at time t in run r.

Proof. Suppose that in run r* of A*, q suspects p at time t ≥ T^U_D. Note that T^U_D = η + δ = τ_1, so t ≥ τ_1. Suppose t ∈ [τ_i, τ_{i+1}) for some i ≥ 1. By Lemma 4.1, in run r* q does not receive any message m_j with j ≥ i by time t. Since in run r p sends messages at the same times as in run r*, and both runs have the same message delay pattern P_D, in run r, by time t, q does not receive any message sent by p at a time jη with j ≥ i.

Consider first the case t ∈ (τ_i, τ_{i+1}). Suppose for a contradiction that q trusts p at time t in run r. Let ε = t − τ_i, and let t′ = (i − 1)η + ε/2. Thus ε > 0. Consider another run r′ of A in which p crashes at time t′, and the messages sent before t′ (those sent at times jη with j < i) have the same loss and delay behaviors as in run r (this is possible by part (a) of the message independence property). In both runs r and r′, up to time t, q receives the same messages at the same times. If A is nondeterministic, we let A make the same nondeterministic choices up to time t in both runs. Thus q cannot distinguish run r′ from run r at time t, and so q trusts p at time t in run r′. The detection time in run r′, however, is at least t − t′ = (τ_i + ε) − ((i − 1)η + ε/2) = η + δ + ε/2 = T^U_D + ε/2 > T^U_D, contradicting the assumption that A satisfies T_D ≤ T^U_D.

Now suppose t = τ_i. Since the failure detector output is right continuous, there exists ε > 0 such that q suspects p in the period (t, t + ε) in run r*. Then by the above argument, q suspects p in the period (t, t + ε) in run r. By the right continuity again, we have that q suspects p at time t in run r.
Corollary 4.14 For any A ∈ C′, let P_A be the query accuracy probability of A, and let P_{A*} be the query accuracy probability of A*. Then P_{A*} ≥ P_A.

Proof (Sketch). We first fix a message delay pattern P_D. For the run r* of A* and any run r of A with message delay pattern P_D, Lemma 4.13 shows that for any time t ≥ T^U_D, if q suspects p in r* at time t, then q suspects p in r at time t. Thus, given a fixed message delay pattern P_D, at any random time t, the probability that q trusts p at time t in algorithm A* is at least as high as the probability that q trusts p at time t in algorithm A. So P_{A*} ≥ P_A given a fixed message delay pattern P_D. Summing (or integrating) both sides of the inequality over all message delay patterns according to their distribution, we have P_{A*} ≥ P_A.
The above corollary shows that the new algorithm A* has the best query accuracy probability in C′, the class of failure detector algorithms in which p sends messages at exactly the same times as in A*. We now generalize this result to the class C, where p still sends messages every η time units, but may do so at times different from those in A*.

A message sending pattern P_S is a sequence of time points {σ_1, σ_2, σ_3, . . .} at which p sends messages. The message sending pattern is determined by the algorithm. For any algorithm A ∈ C, its message sending pattern is of the form {s, s + η, s + 2η, . . .} for some s ∈ [0, ∞). Different runs of algorithm A may have different message sending patterns due to the possible nondeterminism of A. Let A*_s be the algorithm in which p sends heartbeat messages according to the sending pattern {s, s + η, s + 2η, . . .}, and q behaves in the same way as in A*. Thus A*_s is a shifted version of A*, and so the behavior of the failure detector output in A*_s is also a shifted version of that in A*. Since the behaviors of the two failure detectors differ only in some initial period, their steady-state behaviors are the same. Therefore the QoS metrics of A*_s and A* are the same; in particular, they have the same query accuracy probability.
Proof of Theorem 4.12 (Sketch). We first fix a message sending pattern P_S = {s, s + η, s + 2η, . . .}. For any algorithm A ∈ C, consider the runs of A with the sending pattern P_S. In these runs p sends messages at exactly the same times as in algorithm A*_s. By an argument similar to those of Lemma 4.13 and Corollary 4.14, we can show that the query accuracy probability of A*_s is at least as high as the query accuracy probability of A given the message sending pattern P_S. Since A*_s and A* have the same query accuracy probability, we have P_{A*} ≥ P_A given the message sending pattern P_S. Since P_S is arbitrary, we thus have P_{A*} ≥ P_A.
4.3.4 Configuring the Failure Detector to Satisfy QoS Requirements

Suppose we are given a set of failure detector QoS requirements (these QoS requirements could be given by an application). We now show how to compute the parameters η and δ of the failure detector algorithm so that these requirements are satisfied. We first assume that (a) the local clocks of processes are synchronized, and (b) one knows the probabilistic behavior of the messages, i.e., the message loss probability p_L and the distribution of message delays Pr(D ≤ x). In Section 4.4, we show how to remove these assumptions.

We assume that the QoS requirements are expressed using the primary metrics. More precisely, a set of QoS requirements is a tuple (T^U_D, T^L_MR, T^U_M), where T^U_D is an upper bound on the worst-case detection time, T^L_MR is a lower bound on the average mistake recurrence time, and T^U_M is an upper bound on the average mistake duration. In other words, the requirements are that:⁶

    T_D ≤ T^U_D,    E(T_MR) ≥ T^L_MR,    E(T_M) ≤ T^U_M.    (4.7)
In addition, we would like η to be as large as possible, to save network bandwidth. Using Theorem 4.11, this can be stated as a mathematical programming problem:

    maximize η
    subject to  δ + η ≤ T^U_D    (4.8)
                η/p_S ≥ T^L_MR    (4.9)
                (∫_0^η u(x) dx)/p_S ≤ T^U_M    (4.10)
                η ≤ ∆    (4.11)

where the values of u(x) and p_S are given by Proposition 4.2. Constraint (4.11) ensures that the heartbeat messages are independent, so that Theorem 4.11 can be applied. Computing the optimal solution of this problem, which means finding the largest η and some δ that satisfy constraints (4.8)–(4.11), seems to be hard. Instead, we give a simple procedure that computes η and δ such that they satisfy constraints (4.8)–(4.11), but η may not be the largest possible value. This is done by replacing constraint (4.10) with a simpler and stronger constraint to obtain a modified problem, and computing the optimal solution of this modified problem.

⁶Note that the bounds on the primary metrics E(T_MR) and E(T_M) also impose bounds on the derived metrics, according to Theorem 2.1 of Chapter 2. More precisely, we have λ_M ≤ 1/T^L_MR, P_A ≥ (T^L_MR − T^U_M)/T^L_MR, E(T_G) ≥ T^L_MR − T^U_M, and E(T_FG) ≥ (T^L_MR − T^U_M)/2.
The configuration procedure is as follows:

Step 1: Compute q_0 = (1 − p_L) Pr(D < T^U_D), and let η_max = q_0 T^U_M.

Step 2: Let

    f(η) = η / (q_0 ∏_{j=1}^{⌈T^U_D/η⌉−1} [p_L + (1 − p_L) Pr(D > T^U_D − jη)]).    (4.12)

Find the largest η ≤ η_max such that f(η) ≥ T^L_MR. It is easy to check that when η decreases, f(η) increases exponentially fast towards infinity, so some simple numerical method (such as binary search) can be used to calculate η.

Step 3: If η ≤ ∆, then set δ = T^U_D − η; otherwise, the procedure does not find appropriate η and δ.
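The three steps can be sketched as follows. This is an illustration, not the thesis's code: the exponential delay CDF and all numeric values are assumptions, and the check η ≤ ∆ of Step 3 is omitted.

```python
import math

def configure_nfd_s(t_d, t_mr, t_m, p_loss, delay_cdf, iters=40):
    # Inputs: QoS requirements (T^U_D, T^L_MR, T^U_M), loss probability p_L,
    # and the delay CDF Pr(D <= x).  Returns (eta, delta) or None.
    # Step 1: q_0 = (1 - p_L) Pr(D < T^U_D), eta_max = q_0 * T^U_M.
    q0 = (1 - p_loss) * delay_cdf(t_d)
    eta_max = q0 * t_m
    if eta_max <= 0:
        return None

    def f(eta):  # equation (4.12)
        prod = 1.0
        for j in range(1, math.ceil(t_d / eta)):
            prod *= p_loss + (1 - p_loss) * (1 - delay_cdf(t_d - j * eta))
        return eta / (q0 * prod)

    # Step 2: largest eta <= eta_max with f(eta) >= T^L_MR.  f grows very
    # fast as eta shrinks, so binary search suffices (as the thesis suggests;
    # we ignore the small non-monotonicity introduced by the ceiling).
    if f(eta_max) >= t_mr:
        eta = eta_max
    else:
        lo, hi = eta_max / 1e6, eta_max
        for _ in range(iters):
            mid = (lo + hi) / 2
            if f(mid) >= t_mr:
                lo = mid
            else:
                hi = mid
        eta = lo
    # Step 3: delta = T^U_D - eta (the check eta <= Delta is omitted here).
    return eta, t_d - eta

# Example (made-up numbers): exponential delays with mean 0.02 time units,
# 1% loss; require T_D <= 1, E(T_MR) >= 3600, E(T_M) <= 0.5.
cdf = lambda x: 0.0 if x <= 0 else 1 - math.exp(-x / 0.02)
print(configure_nfd_s(t_d=1.0, t_mr=3600.0, t_m=0.5, p_loss=0.01, delay_cdf=cdf))
```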
We now show that the parameters computed by the procedure are appropriate.

Proposition 4.15 If p_0 > 0 and q_0 > 0 (the nondegenerated case), then E(T_M) ≤ η/q_0.

Proof. By Proposition 4.4, we have u(0) ≥ u(x) for all x ∈ [0, η). Thus by equality (4.3), we have

    E(T_M) = (∫_0^η u(x) dx)/p_S ≤ (∫_0^η u(0) dx)/(q_0 u(0)) = η/q_0.
Theorem 4.16 Consider a system in which clocks are synchronized and the probabilistic behavior of messages is known. With the parameters η and δ computed by the above procedure, the failure detector algorithm NFD-S of Fig. 4.2 satisfies the QoS requirements given in (4.7).

Proof. Suppose that the procedure finds parameters η and δ. Then by Step 3 we have T^U_D = η + δ. By part (1) of Theorem 4.11, T_D ≤ T^U_D is satisfied. By Step 1 and Proposition 4.2, the value computed in Step 1 is exactly q_0 = (1 − p_L) Pr(D < η + δ) (note that q_0 > 0: otherwise η_max = 0 and the procedure could not have found η). Consider first the case p_0 > 0. By Proposition 4.15, E(T_M) ≤ η/q_0 ≤ η_max/q_0 = q_0 T^U_M/q_0 = T^U_M. So E(T_M) ≤ T^U_M is satisfied. Note that

    ∏_{j=1}^{⌈T^U_D/η⌉−1} [p_L + (1 − p_L) Pr(D > T^U_D − jη)]
        = ∏_{j=1}^{⌈(η+δ)/η⌉−1} [p_L + (1 − p_L) Pr(D > η + δ − jη)]
        = ∏_{j=0}^{⌈δ/η⌉−1} [p_L + (1 − p_L) Pr(D > δ − jη)]
        = ∏_{j=0}^{⌈δ/η⌉} [p_L + (1 − p_L) Pr(D > δ − jη)] = u(0),

where the last step holds because the factor for j = ⌈δ/η⌉ equals 1. Thus f(η) = η/(q_0 u(0)) = η/p_S = E(T_MR), by equality (4.2). By Step 2, f(η) ≥ T^L_MR, and so E(T_MR) ≥ T^L_MR is satisfied.

Consider now the case p_0 = 0. By Theorem 4.11, in failure-free runs the failure detector keeps trusting p after time τ_1, and so E(T_MR) = ∞ and E(T_M) = 0. Thus the requirements in (4.7) are also satisfied.
4.4 Dealing with Unknown System Behavior and Unsynchronized Clocks

So far, we assumed that (a) the local clocks of processes are synchronized, and (b) the probabilistic behavior of the messages (i.e., the probability of message loss and the distribution of message delays) is known. These assumptions are not unrealistic, but in some systems assumption (a) or (b) may not hold. To widen the applicability of our algorithm, we now show how to remove these assumptions.

4.4.1 Configuring the Failure Detector NFD-S When the Probabilistic Behavior of the Messages is Not Known

In Section 4.3.4, our procedure for computing η and δ to meet some QoS requirements assumed that one knows the probabilistic behavior of the messages (i.e., the probability p_L of message loss and the probability distribution Pr(D ≤ x) of the message delay). If this probabilistic behavior is not known, we can still compute η and δ as follows. We first assume that the message loss probability p_L, the expected value E(D) and the variance V(D) of the message delay D are known, and show how to compute η and δ from p_L, E(D) and V(D) alone. We then show how to estimate p_L, E(D) and V(D) using the heartbeat messages. Note that in this section we still assume that local clocks are synchronized.
With E(D) and V(D), we have an upper bound on Pr(D > x), given by the following One-Sided Inequality of probability theory (e.g., see [All90] p. 79): for any random variable D with a finite expected value and a finite variance,

    Pr(D > x) ≤ V(D)/(V(D) + (x − E(D))²),  for all x > E(D).    (4.13)
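As a quick sanity check of (4.13), one can compare the bound with an exact tail. The snippet below is illustrative: the exponential distribution and its mean are arbitrary choices, with E(D) = 0.1 and V(D) = 0.01.

```python
import math

mean, var = 0.1, 0.01          # E(D) and V(D) for D ~ Exp with mean 0.1
for x in (0.15, 0.2, 0.5, 1.0):
    actual = math.exp(-x / mean)               # exact Pr(D > x)
    bound = var / (var + (x - mean) ** 2)      # right-hand side of (4.13)
    assert actual <= bound
    print(f"x={x}: Pr(D>x)={actual:.5f} <= {bound:.5f}")
```

The bound is loose near E(D) and tightens as x grows, which is all the configuration procedure below needs.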
With the One-Sided Inequality, we derive the following bounds on the QoS metrics of algorithm NFD-S.

Theorem 4.17 Assume δ > E(D). For algorithm NFD-S, in the nondegenerated case of Theorem 4.11, we have

    E(T_MR) ≥ η/β,    (4.14)
    E(T_M) ≤ η/γ,    (4.15)
    P_A ≥ 1 − β,    (4.16)
    E(T_G) ≥ ((1 − β)/β) η,    (4.17)
    E(T_FG) ≥ ((1 − β)/(2β)) η,    (4.18)

where

    β = ∏_{j=0}^{k_0} [V(D) + p_L (δ − E(D) − jη)²] / [V(D) + (δ − E(D) − jη)²],    k_0 = ⌈(δ − E(D))/η⌉ − 1,

and

    γ = (1 − p_L)(δ − E(D) + η)² / [V(D) + (δ − E(D) + η)²].
Proof. Note that for all j such that 0 ≤ j ≤ k_0, we have δ − jη > E(D). Then by the One-Sided Inequality (4.13), for all j such that 0 ≤ j ≤ k_0,

    p_j(0) = p_L + (1 − p_L) Pr(D > δ − jη)
           ≤ p_L + (1 − p_L) V(D)/(V(D) + (δ − E(D) − jη)²)
           = [V(D) + p_L (δ − E(D) − jη)²]/[V(D) + (δ − E(D) − jη)²].

Thus ∏_{j=0}^{k_0} p_j(0) ≤ β.

By Proposition 4.2 (4) and (5) and Proposition 4.4, and the fact that k_0 ≤ k − 1, we have u(x) ≤ u(0) ≤ ∏_{j=0}^{k_0} p_j(0) ≤ β, and p_S ≤ u(0) ≤ β. Therefore, from equality (4.2), E(T_MR) = η/p_S ≥ η/β. Similarly, applying u(x) ≤ β and p_S ≤ β to equalities (4.4), (4.5) and (4.6), we obtain inequalities (4.16), (4.17) and (4.18), respectively.

To show inequality (4.15), first note that we can replace Pr(D > x) in the One-Sided Inequality (4.13) with Pr(D ≥ x), and the inequality remains true. In fact, for all ε ∈ (0, x − E(D)),

    Pr(D ≥ x) ≤ Pr(D > x − ε) ≤ V(D)/(V(D) + (x − ε − E(D))²).

Letting ε → 0, we obtain the result. Then from Proposition 4.2 (3) we have

    q_0 = (1 − p_L)(1 − Pr(D ≥ δ + η)) ≥ (1 − p_L)(1 − V(D)/(V(D) + (δ − E(D) + η)²)) = γ.

Therefore, by Proposition 4.15, E(T_M) ≤ η/q_0 ≤ η/γ.
Note that in Theorem 4.17 we assume that δ > E(D). This assumption is reasonable: if the parameter δ of NFD-S were set below E(D), there would be a false suspicion every time a heartbeat message takes more than the average message delay, and so a failure detector with such a δ would make very frequent mistakes and would not be useful.
Computing Failure Detector Parameters η and δ

The bounds given in Theorem 4.17 can be used to compute the parameters η and δ of the failure detector NFD-S, so that it satisfies the QoS requirements given in (4.7). The configuration procedure is given below. This procedure assumes that T_D^U > E(D), i.e., the required detection time is greater than the average message delay, which is a reasonable assumption.
Step 1: Compute γ′ = (1 − p_L)(T_D^U − E(D))² / [V(D) + (T_D^U − E(D))²] and let η_max = min(γ′·T_M^U, T_D^U − E(D)).

Step 2: Let

f(η) = η · ∏_{j=1}^{⌈(T_D^U − E(D))/η⌉ − 1} [V(D) + (T_D^U − E(D) − jη)²] / [V(D) + p_L(T_D^U − E(D) − jη)²].  (4.19)

Find the largest η ≤ η_max such that f(η) ≥ T_MR^L.

Step 3: If such an η exists, then set δ = T_D^U − η; otherwise, the procedure does not find appropriate η and δ.
Theorem 4.18 Consider a system in which clocks are synchronized, and the probabilistic behavior of messages is not known. With parameters η and δ computed by the above procedure, the failure detector algorithm NFD-S of Fig. 4.2 satisfies the QoS requirements given in (4.7), provided that T_D^U > E(D).

The proof of the theorem is straightforward.
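The three-step configuration procedure above is mechanical enough to sketch in code. The following Python sketch is not from the thesis: it approximates "find the largest η" by scanning downward from η_max with a fixed step (f is not monotone in general, so this is a heuristic), and all parameter names are illustrative.

```python
import math

def configure_nfd_s(p_L, ED, VD, T_D_U, T_MR_L, T_M_U, step=1e-4):
    """Sketch of the NFD-S configuration procedure (Steps 1-3).
    p_L: message loss probability, ED = E(D), VD = V(D).
    Returns (eta, delta), or None if no suitable eta is found."""
    assert T_D_U > ED, "procedure assumes T_D^U > E(D)"
    d = T_D_U - ED
    # Step 1: gamma' and the cap eta_max on the intersending time.
    gamma_p = (1 - p_L) * d**2 / (VD + d**2)
    eta_max = min(gamma_p * T_M_U, d)

    def f(eta):
        # f(eta) = eta * prod_{j=1}^{ceil(d/eta)-1}
        #          (V(D)+(d-j*eta)^2) / (V(D)+p_L*(d-j*eta)^2)
        prod = 1.0
        for j in range(1, math.ceil(d / eta)):
            x = d - j * eta
            prod *= (VD + x**2) / (VD + p_L * x**2)
        return eta * prod

    # Step 2: largest eta <= eta_max with f(eta) >= T_MR_L (downward scan).
    eta = eta_max
    while eta > 0:
        if f(eta) >= T_MR_L:
            return eta, T_D_U - eta   # Step 3: delta = T_D^U - eta
        eta -= step
    return None
```

With the numbers used later in the simulations (p_L = 0.01, E(D) = 0.02, V(D) = 4 × 10⁻⁴) and T_D^U = 2, the procedure returns an η slightly below 1, which is consistent with the thesis's choice of η = 1 heartbeats.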
Estimating p_L, E(D) and V(D)

The configuration procedure given above assumes that p_L, E(D) and V(D) are known. In practice, we can use heartbeat messages to compute close estimates of these quantities, as we now explain.
Estimating p_L is easy. For example, one can use the sequence numbers of the heartbeat messages to count the number of "missing" heartbeats, and then divide this count by the highest sequence number received so far.

Since local clocks are synchronized, E(D) and V(D) can also be easily estimated using the heartbeat messages of the algorithm. Suppose that when p sends a heartbeat m, p timestamps m with the sending time S, and when q receives m, q records the receipt time A. Thus the delay of message m is A − S. Therefore, by taking the average and the variance of A − S over the heartbeat messages, we obtain estimates of E(D) and V(D).
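These estimators can be sketched as follows. The record format, a (sequence number, send timestamp, receipt timestamp) triple per received heartbeat, is a hypothetical choice made for illustration.

```python
def estimate_link_params(records):
    """Estimate p_L, E(D) and V(D) from received heartbeats.
    records: list of (seq, S, A) triples -- sequence number, sending
    time S, receipt time A -- one per heartbeat that actually arrived."""
    seqs = {s for s, _, _ in records}
    highest = max(seqs)
    # p_L: count of missing sequence numbers over the highest seen so far.
    p_L = (highest - len(seqs)) / highest
    delays = [a - s for _, s, a in records]   # A - S per message
    n = len(delays)
    ED = sum(delays) / n                          # sample mean of A - S
    VD = sum((d - ED) ** 2 for d in delays) / n   # sample variance of A - S
    return p_L, ED, VD
```

Note that with synchronized clocks A − S is the actual delay, so the sample mean and variance estimate E(D) and V(D) directly.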
4.4.2 Dealing with Unsynchronized Clocks

The algorithm NFD-S in Fig. 4.2 assumes that the local clocks are synchronized, so that q can set the freshness points τ_i by shifting the sending times of the heartbeats by a constant. If the local clocks are not synchronized, q cannot set the τ_i's in this way. To circumvent this problem, we modify the algorithm so that q obtains the τ_i's by shifting the expected arrival times of the heartbeats, as we now explain.

We assume that local clocks do not drift with respect to real time, i.e., they accurately measure time intervals. Let σ_i denote the sending time of m_i with respect to q's local clock time. Then, the expected arrival time EA_i of m_i at q is EA_i = σ_i + E(D), where E(D) is the expected message delay. We will show shortly how q can accurately estimate the EA_i's by using past heartbeat messages.

Suppose for the moment that q knows the EA_i's. Then q can set the τ_i's by shifting the EA_i's forward in time by α time units, i.e., τ_i = EA_i + α (where α is a new failure detector parameter replacing δ). We denote the algorithm with this modification as
Process p: {using p's local clock time}
1  for some constant η, send to q heartbeat messages m_1, m_2, m_3, ... at regular time points η, 2η, 3η, ... respectively;

Process q: {using q's local clock time}
2  Initialization:
3    for all i ≥ 1, set τ_i = EA_i + α;  {EA_i is the expected arrival time of m_i}
4    output = S;  {suspect p initially}
5  at every freshness point τ_i:
6    if no message m_j with j ≥ i has been received then
7      output ← S;  {suspect p if no fresh message is received}
8  upon receive message m_j at time t ∈ [τ_i, τ_{i+1}):
9    if j ≥ i then output ← T;  {trust p when some fresh message is received}

Figure 4.3: The new failure detector algorithm NFD-U, with unsynchronized clocks and known expected arrival times, and with parameters η and α
NFD-U, and it is given in Fig. 4.3. Intuitively, EA_i is the time when m_i is expected to be received, and α is a slack added to EA_i to accommodate the possible extra delay of message m_i. Thus an appropriately set α gives a high probability that q receives m_i before the freshness point τ_i, so that there is no failure detector mistake in the period [τ_i, τ_{i+1}) (see Fig. 4.1 (a)). If α is large enough, it also allows subsequent messages m_{i+1}, m_{i+2}, ... to be received before time τ_i, so that there is no failure detector mistake in [τ_i, τ_{i+1}) even if m_i is lost. Of course α cannot be too large because it adds to the detection time.

Note that in algorithm NFD-U, τ_i = σ_i + E(D) + α. Therefore, it is easy to see that if we let δ = E(D) + α, and consider all times referred to in the analysis of the algorithm NFD-S to be with respect to q's local clock time, then the analysis of NFD-S also applies to the algorithm NFD-U. In particular, the only changes of the
results are: (a) In Proposition 4.2, we now have

k = ⌈(E(D) + α)/η⌉,  (4.20)
p_j(x) = p_L + (1 − p_L)·Pr(D > E(D) + α + x − jη),  (4.21)
q_0 = (1 − p_L)·Pr(D < E(D) + α + η).  (4.22)

(b) In part (1) of Theorem 4.11, the inequality is now

T_D ≤ E(D) + α + η.  (4.23)

Process p: {using p's local clock time}
1  for some constant η, send to q heartbeat messages m_1, m_2, m_3, ... at regular time points η, 2η, 3η, ... respectively;

Process q: {using q's local clock time}
2  Initialization:
3    τ_0 = 0;
4    ℓ = −1;  {ℓ keeps the largest sequence number in all messages q received so far}
5  upon τ_{ℓ+1} = the current time:  {if the current time reaches τ_{ℓ+1}, then all messages received are stale}
6    output ← S;  {suspect p since all messages received are stale at this time}
7  upon receive message m_j at time t:
8    if j > ℓ then  {received a message with a higher sequence number}
9      ℓ ← j;
10     compute ÊA_{ℓ+1};  {estimate the expected arrival time of m_{ℓ+1} using formula (4.24)}
11     τ_{ℓ+1} ← ÊA_{ℓ+1} + α;
12     if t < τ_{ℓ+1} then output ← T;  {trust p since m_ℓ is still fresh at time t}

Figure 4.4: The new failure detector algorithm NFD-E, with unsynchronized clocks and estimated expected arrival times, and with parameters η and α
Estimating the Expected Arrival Times

The expected arrival times can be estimated using heartbeat messages. The idea is to use the n most recently received heartbeat messages to estimate the expected arrival time of the next heartbeat message. To do so, we first modify the structure of the failure detector algorithm NFD-U in Fig. 4.3 to show exactly when q needs to estimate the expected arrival time of the next heartbeat.
The new version of the algorithm with estimated expected arrival times is given in Fig. 4.4 and is denoted by NFD-E. In NFD-E, process q uses a variable ℓ to keep the largest heartbeat sequence number received so far, and τ_{ℓ+1} refers to the "next" freshness point. Note that when q updates ℓ, it also changes τ_{ℓ+1}. If the local clock of q ever reaches time τ_{ℓ+1} (an event which might never happen), then at this time all the heartbeats received are stale, and so q starts suspecting p (lines 5–6). When q receives m_j, it checks whether this is a new heartbeat (j > ℓ) and in this case, (1) q updates ℓ, (2) q computes the estimate ÊA_{ℓ+1} of the expected arrival time of m_{ℓ+1} (the next heartbeat), (3) q sets the next freshness point τ_{ℓ+1} to ÊA_{ℓ+1} + α, and (4) q trusts p if the current time is less than τ_{ℓ+1} (lines 9–12).

Note that the algorithm NFD-E satisfies the same core property that q trusts p at time t if and only if some message that q received is still fresh at time t. Therefore, except for the fact that it needs to estimate the expected arrival times, algorithm NFD-E is equivalent to algorithm NFD-U.
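As an illustration, lines 2–12 of Fig. 4.4 can be rendered as an event-driven class. This is a hypothetical sketch, not the thesis's code: the freshness check of lines 5–6 is evaluated lazily inside output() instead of by a timer, and the estimation window of formula (4.24) is folded into on_receive().

```python
class NFDE:
    """Sketch of NFD-E (Fig. 4.4).  All times are on q's local clock."""
    def __init__(self, eta, alpha, n=32):
        self.eta, self.alpha, self.n = eta, alpha, n
        self.ell = -1            # largest sequence number received (line 4)
        self.tau_next = 0.0      # tau_{ell+1}, initially tau_0 = 0 (line 3)
        self.window = []         # n most recent (seq, arrival) pairs
        self.trusted = False

    def on_receive(self, j, t):  # lines 7-12
        self.window = (self.window + [(j, t)])[-self.n:]
        if j > self.ell:                      # line 8: fresher heartbeat
            self.ell = j                      # line 9
            m = len(self.window)
            # line 10: estimator (4.24) over the current window
            ea = (sum(a for _, a in self.window) / m
                  + sum((self.ell + 1 - s) * self.eta
                        for s, _ in self.window) / m)
            self.tau_next = ea + self.alpha   # line 11
            if t < self.tau_next:             # line 12: m_ell still fresh
                self.trusted = True

    def output(self, now):       # lines 5-6, evaluated on demand
        if now >= self.tau_next:              # all received messages stale
            self.trusted = False
        return "T" if self.trusted else "S"
```

For example, with η = 1 and α = 0.5, receiving m_1 at time 1.02 sets τ_2 = 1.02 + 1 + 0.5 = 2.52; q trusts p until 2.52 and suspects it afterwards unless a fresher heartbeat arrives.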
We now show how to estimate the expected arrival time of m_{ℓ+1} from the most recent n heartbeat messages that q received. Let m′_1, m′_2, ..., m′_n be these n messages. Note that m′_i is not m_i in general, and so the sequence number of m′_i is not necessarily i. Moreover, the sequence numbers of m′_1, m′_2, ..., m′_n may not be consecutive or monotonically increasing, because the heartbeat messages may be lost or received out of order. Let s_1, s_2, ..., s_n be the sequence numbers of m′_1, m′_2, ..., m′_n respectively. Let A_i be the actual arrival time of m′_i with respect to q's local clock time.
Let the expected arrival time of m′_i at q be EA′_i. Let ε_i = A_i − EA′_i. Thus ε_i is the deviation of the actual arrival time of m′_i from its expected arrival time at q. Let D_i be the actual delay of message m′_i. Then we have ε_i = A_i − EA′_i = D_i − E(D).

For the expected arrival time EA_{ℓ+1} of m_{ℓ+1}, we have that for every i = 1, 2, ..., n, EA_{ℓ+1} = EA′_i + (ℓ + 1 − s_i)η = A_i − ε_i + (ℓ + 1 − s_i)η. By summing over all i on both sides of the equality and then dividing both sides by n, we have

EA_{ℓ+1} = (1/n)·Σ_{i=1}^{n} A_i − (1/n)·Σ_{i=1}^{n} ε_i + (1/n)·Σ_{i=1}^{n} (ℓ + 1 − s_i)η.
By the choice of η, the D_i's are independent and identically distributed as D. Thus (1/n)·Σ_{i=1}^{n} ε_i = (1/n)·Σ_{i=1}^{n} D_i − E(D), which is close to zero when n is large. Therefore, we obtain the following estimator of EA_{ℓ+1}:

ÊA_{ℓ+1} = (1/n)·Σ_{i=1}^{n} A_i + (1/n)·Σ_{i=1}^{n} (ℓ + 1 − s_i)η.  (4.24)

This is the formula used in line 10 of the algorithm NFD-E in Fig. 4.4 to compute the estimate of the expected arrival time of m_{ℓ+1}.
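Estimator (4.24) is a short computation over the sliding window of the n most recent heartbeats; a sketch follows, where representing each window entry as an (s_i, A_i) pair is an assumption made for illustration.

```python
def estimate_next_arrival(window, ell, eta):
    """Estimator (4.24).  window: list of (s_i, A_i) pairs for the n most
    recently received heartbeats (sequence number, arrival time on q's
    clock); ell: largest sequence number received so far; eta: heartbeat
    intersending time."""
    n = len(window)
    avg_arrival = sum(a for _, a in window) / n
    # average forward shift (ell + 1 - s_i) * eta from each sample
    avg_shift = sum((ell + 1 - s) * eta for s, _ in window) / n
    return avg_arrival + avg_shift
```

For instance, with η = 1 and heartbeats 1..5 each arriving 0.02 after its sending time, the estimate for m_6 is 6.02, i.e., the next sending time plus the (unknown to q) average delay.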
How large should the value of n be to obtain a reasonably good estimate? Note that ÊA_{ℓ+1} − EA_{ℓ+1} = (1/n)·Σ_{i=1}^{n} ε_i = (1/n)·Σ_{i=1}^{n} D_i − E(D), where the D_i's are independent and identical to D. Thus the quality of the estimator ÊA_{ℓ+1} is the same as the quality of the estimator (1/n)·Σ_{i=1}^{n} D_i for estimating E(D). By the Sampling Theorem in statistics (see, e.g., [All90] p. 432), we know that (1/n)·Σ_{i=1}^{n} D_i is an unbiased estimator of E(D), and when n is large, (1/n)·Σ_{i=1}^{n} D_i has approximately the normal distribution with mean E(D) and standard deviation σ(D)/√n. When it is close to a normal distribution, about 95% of the estimated values are within 2σ(D)/√n of the true value. The actual n that makes the estimator close to a normal distribution depends on the distribution of D. A widely used rule of thumb is that n be at least 30 ([All90] p. 434).
For example, we simulate algorithm NFD-E with 32 messages for the estimation and D having an exponential distribution (Section 4.5.2). The simulation results show that NFD-E provides essentially the same QoS as NFD-S (the new algorithm with synchronized clocks), so the estimation does not compromise the QoS of the new failure detector.

With n as a parameter varying from 1 towards ∞, NFD-E is actually a spectrum of algorithms. The larger the value of n is, the better the estimates are. The algorithm NFD-U corresponds to one end point of the spectrum, when n = ∞.
The other end point of this spectrum, namely the algorithm with n = 1, is worth some further discussion. When n = 1, formula (4.24) becomes ÊA_{ℓ+1} = A_1 + (ℓ + 1 − s_1)η. According to the algorithm in Fig. 4.4, when ÊA_{ℓ+1} is computed at line 10, the most recent message q received is m_ℓ. Thus s_1 = ℓ, ÊA_{ℓ+1} = A_1 + η and τ_{ℓ+1} = A_1 + η + α. This means that whenever a new heartbeat message with a higher sequence number is received, q sets a new freshness point τ_{ℓ+1}, which is a fixed η + α time units away from the current time, such that if no new heartbeat message is received by time τ_{ℓ+1}, then q starts suspecting p. This is just the simple algorithm!
Therefore, when n varies from 1 towards ∞, the algorithm NFD-E spans a spectrum that includes the simple algorithm at one end (n = 1), and the new algorithm NFD-U, in which q knows the expected arrival times of the heartbeat messages, at the other end (n = ∞). When the number of heartbeat messages used in the estimation increases, the new algorithm moves away from the simple algorithm and gets closer to the algorithm NFD-U. This demonstrates that the problem of the simple algorithm is that it does not use enough of the available information (it only uses the most recently received heartbeat message); by using more of the available information (more received messages), the new algorithm is able to provide a better QoS than the simple algorithm.
4.4.3 Configuring the Failure Detector When Local Clocks are Not Synchronized and the Probabilistic Behavior of the Messages is Not Known

We now consider systems in which local clocks are not synchronized and the probabilistic behavior of the messages is not known. Since local clocks are not synchronized, we cannot use algorithm NFD-S. In this section, we show how to compute the parameters η and α of algorithm NFD-U to meet the QoS requirements in such systems. For algorithm NFD-E, when the number n of messages used to estimate the expected arrival times is large, the estimates are very accurate, and thus for practical purposes the computation of η and α for NFD-U also applies to NFD-E. The method used here is based on the one given in Section 4.4.1.
We first need to point out that in such systems, one cannot estimate E(D) using only one-way heartbeat messages. This is because in such settings one cannot distinguish a system with small message delays but a large clock skew from another system with large message delays but a small clock skew, as we now further explain. Suppose that in a system S with message delay time D, one obtains an estimate μ of E(D) using only one-way messages from p to q. Suppose that in system S, q's local clock is s time units ahead of p's local clock. Now consider another system S′ with message delay time D′ = D + c, where c is a constant. That is, each message in S′ is delayed by c time units longer than in S. Suppose that in S′, q's local clock is s − c time units ahead of p's local clock. Thus, in both systems S and S′, a message sent by p at the same p's local clock time is received at the same q's local clock time. Therefore, with unknown clock skews and only one-way messages from p to q, one cannot distinguish the two systems S and S′, and so in S′ the estimate of E(D′) obtained is also μ. But μ cannot be an estimate of both E(D) and E(D′), since E(D′) = E(D) + c for an arbitrary constant c. This shows that E(D) cannot be estimated when local clocks are not synchronized and only one-way messages from p to q are used.
Fortunately, we do not need to estimate E(D). Since the analysis of NFD-S applies to NFD-U if δ is replaced with α + E(D), with this replacement we obtain the following theorem from Theorem 4.17. From this theorem, it is clear that we only need p_L and V(D) to bound the QoS metrics of NFD-U.⁷

Theorem 4.19 Assume α > 0. For algorithm NFD-U, in the nondegenerate cases of Theorem 4.11, we have

E(T_MR) ≥ η/β,  (4.25)
E(T_M) ≤ η/γ,  (4.26)
P_A ≥ 1 − β,  (4.27)
E(T_G) ≥ ((1 − β)/β)·η,  (4.28)
E(T_FG) ≥ ((1 − β)/(2β))·η,  (4.29)

where

β = ∏_{j=0}^{k_0} [V(D) + p_L(α − jη)²] / [V(D) + (α − jη)²],   k_0 = ⌈α/η⌉ − 1,

and

γ = (1 − p_L)(α + η)² / [V(D) + (α + η)²].

⁷ This is one reason why it is convenient to set the freshness points with respect to the expected arrival times as opposed to some other reference points (e.g. the median arrival times).
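The bounds of Theorem 4.19 are straightforward to evaluate numerically; the following sketch does so (the dictionary keys are labels chosen here for illustration, not notation from the thesis).

```python
import math

def nfd_u_bounds(p_L, VD, alpha, eta):
    """QoS bounds of Theorem 4.19 for NFD-U (assumes alpha > 0).
    Only p_L and V(D) are needed; E(D) does not appear."""
    assert alpha > 0
    k0 = math.ceil(alpha / eta) - 1
    beta = 1.0
    for j in range(k0 + 1):
        x = alpha - j * eta
        beta *= (VD + p_L * x**2) / (VD + x**2)
    gamma = (1 - p_L) * (alpha + eta)**2 / (VD + (alpha + eta)**2)
    return {
        "E[T_MR] >=": eta / beta,                    # (4.25)
        "E[T_M] <=": eta / gamma,                    # (4.26)
        "P_A >=": 1 - beta,                          # (4.27)
        "E[T_G] >=": (1 - beta) / beta * eta,        # (4.28)
        "E[T_FG] >=": (1 - beta) / (2 * beta) * eta, # (4.29)
    }
```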
We first assume that p_L and V(D) are known, and show how to compute the parameters η and α of NFD-U using Theorem 4.17 to satisfy the QoS requirements. We then show how to estimate p_L and V(D).
We consider a set of QoS requirements of the form:

T_D ≤ T_D^U + E(D),   E(T_MR) ≥ T_MR^L,   E(T_M) ≤ T_M^U.  (4.30)

These requirements are identical to the ones in (4.7), except that the upper bound requirement on the detection time is not just T_D^U, but rather T_D^U plus the unknown average message delay E(D). This is justified as follows. First, it is not surprising that the detection time includes E(D): it is not reasonable to require a failure detector to detect a crash faster than the average delay of a heartbeat. Second, when local clocks are not synchronized and only one-way messages are used, an absolute bound T_D ≤ T_D^U cannot be enforced by any failure detector. The reason is the same as the reason why E(D) cannot be estimated in such settings: one cannot distinguish a system with small message delays but a large clock skew from another system with large message delays but a small clock skew.
The following is the configuration procedure for algorithm NFD-U, modified from the one in Section 4.4.1.

Step 1: Compute γ′ = (1 − p_L)(T_D^U)² / [V(D) + (T_D^U)²] and let η_max = min(γ′·T_M^U, T_D^U).

Step 2: Let

f(η) = η · ∏_{j=1}^{⌈T_D^U/η⌉ − 1} [V(D) + (T_D^U − jη)²] / [V(D) + p_L(T_D^U − jη)²].  (4.31)

Find the largest η ≤ η_max such that f(η) ≥ T_MR^L.

Step 3: If such an η exists, then set α = T_D^U − η; otherwise, the procedure does not find appropriate η and α.
Theorem 4.20 Consider a system with unsynchronized, drift-free clocks, where the probabilistic behavior of messages is not known. With parameters η and α computed by the above procedure, the failure detector algorithm NFD-U of Fig. 4.3 satisfies the QoS requirements given in (4.30).
Estimating p_L and V(D)

When local clocks are not synchronized, the message loss probability p_L and the variance V(D) of the message delay can still be estimated using the heartbeat messages, in exactly the same way as in Section 4.4.1. For p_L, this is because we use only the sequence numbers of the heartbeat messages to estimate p_L, and so the estimate is not affected by whether the clocks are synchronized or not. For V(D), we still use the variance of A − S of the heartbeat messages as the estimate of V(D), where A is the time (with respect to q's local clock) when q receives a message m, and S is the time (with respect to p's local clock) when p sends m. This estimation method still works because here A − S is the actual delay of m plus a constant, namely the skew between the clocks of p and q. Thus the variance of A − S is the same as the variance V(D) of the message delay.
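This shift-invariance argument can be checked numerically; the following sketch uses hypothetical numbers (a constant skew of 37 time units, exponential delays with E(D) = 0.02) purely for illustration.

```python
import random
import statistics

# With a constant clock skew, A - S is the delay plus a constant,
# so Var(A - S) equals Var(D) exactly; only the mean is shifted.
rng = random.Random(0)
skew = 37.0                                            # unknown constant offset
delays = [rng.expovariate(1 / 0.02) for _ in range(100_000)]  # E(D) = 0.02
a_minus_s = [d + skew for d in delays]                 # observed A - S values
print(statistics.pvariance(a_minus_s))                 # close to V(D) = 4e-4
```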
4.5 Simulation Results
We simulate both the new failure detector algorithm that we developed and the simple algorithm commonly used in practice. In particular, (a) we simulate the algorithm NFD-S (the one that sets the freshness points using the sending times of the heartbeat messages and synchronized clocks), and show that the simulation results are consistent with our QoS analysis of NFD-S in Section 4.3.2; (b) we simulate the algorithm NFD-E (the one that sets freshness points with respect to the expected arrival times of the heartbeat messages), show how the QoS of the algorithm changes as the number n of messages used for estimating the expected arrival times increases, and show that, with appropriately chosen n, NFD-E provides essentially the same QoS as NFD-S; and (c) we simulate the simple algorithm and compare it to the different versions of the new algorithms, and show that when all algorithms send messages at the same rate and satisfy the same upper bound on the worst-case detection time, the new algorithms provide much better accuracy than the simple algorithm.
The settings of the simulations are as follows. For the purpose of comparison, we fix the intersending time η of heartbeat messages in both the new algorithm and the simple algorithm to be 1. The message loss probability p_L is set to 0.01. The message delay time D follows the exponential distribution (i.e., Pr(D ≤ x) = 1 − e^{−x/E(D)} for all x ≥ 0). We choose the exponential distribution for two reasons: first, it has the characteristic that a large portion of messages have fairly short delays while a small portion of messages have large delays, which is also the characteristic of message delays in many practical systems [Cri89]; second, it has a simple analytical representation, which allows us to compare the simulation results with the analytical results given in Theorem 4.11. The average message delay time E(D) is set to 0.02, which is a small value compared to the intersending time η. This corresponds to the practical situation in which message delays are on the order of tens of milliseconds (typical for messages transmitted over the Internet), while heartbeat messages are sent every few seconds. Note that since D follows an exponential distribution, we have that the standard deviation σ(D) = E(D) = 0.02, and the variance V(D) = σ(D)² = 4 × 10⁻⁴.
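Under these settings, the behavior of NFD-S can be reproduced with a small Monte-Carlo sketch. This is a simplified re-implementation written for illustration, not the thesis's simulator; it counts S-transitions of the output over a long crash-free run to estimate the average mistake recurrence time. With δ = 1, for instance, mistakes are dominated by message losses and the result comes out near 1/p_L = 100.

```python
import random

def avg_mistake_recurrence(delta, eta=1.0, p_L=0.01, ED=0.02,
                           n=200_000, seed=1):
    """Monte-Carlo sketch of NFD-S: heartbeat m_i is sent at i*eta, lost
    with probability p_L, otherwise delayed by an Exp(E(D)) amount.
    q suspects p at freshness point tau_i = i*eta + delta iff no m_j with
    j >= i has arrived by tau_i.  Returns run length / number of mistakes."""
    rng = random.Random(seed)
    INF = float("inf")
    # arr[i] = arrival time of m_i at q (INF if the message is lost)
    arr = [INF] * (n + 2)
    for i in range(1, n + 1):
        if rng.random() >= p_L:
            arr[i] = i * eta + rng.expovariate(1.0 / ED)
    # e[i] = earliest arrival among m_i, m_{i+1}, ...
    e = [INF] * (n + 2)
    for i in range(n, 0, -1):
        e[i] = min(arr[i], e[i + 1])
    mistakes = 0
    for i in range(2, n + 1):
        tau_i = i * eta + delta
        # S-transition at tau_i: nothing fresh at tau_i, but q trusted p just
        # before, i.e., some m_j with j >= i-1 arrived before tau_i.
        if e[i] >= tau_i and e[i - 1] < tau_i:
            mistakes += 1
    return n * eta / mistakes if mistakes else INF
```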
We compare the accuracy of different algorithms when they all satisfy the same bound T_D^U on the worst-case detection time. To do so, we run simulations for each algorithm as follows: (a) We first configure the algorithm using the given bound T_D^U. (b) We then run simulations to verify that the configuration is indeed correct, i.e., that the given bound T_D^U is satisfied. More specifically, we simulate the algorithm in 10,000 runs in which process p crashes at some nondeterministic times, obtain the maximum detection time observed among all these runs, and check that this observed maximum detection time is close to but does not exceed the given bound T_D^U. (c) Finally, we obtain the average mistake recurrence time by simulating the algorithm in runs in which p does not crash, and then taking the average of the lengths of 500 mistake recurrence intervals. We found that the average mistake recurrence time is representative for the purpose of comparing the accuracy of the algorithms we simulate, and thus we omit the simulation results on other accuracy metrics here. We vary the bound T_D^U from 1 to 3.5, i.e., from exactly one intersending time of heartbeat messages to three and a half times the intersending time, and show how the simulation results vary accordingly.
[Plot: x-axis: required bound T_D^U on the worst-case detection time (1 to 3.5); y-axis: maximum detection time observed in the simulations; curves: reference line, NFD-S]

Figure 4.5: The maximum detection times observed in the simulations of NFD-S (shown by +)
4.5.1 Simulation Results of NFD-S

It is easy to configure the parameters of NFD-S to meet the given upper bound T_D^U on the worst-case detection time. In fact, since the intersending time η is fixed (to 1), only the parameter δ is configurable, and by Theorem 4.11 (1), we set δ = T_D^U − η = T_D^U − 1.
Figure 4.5 shows the simulation results that check the correctness of our configurations of NFD-S. The reference line represents the situation in which a failure detector is perfectly configured: the maximum detection time is equal to the desired bound T_D^U. Figure 4.5 shows that all the maximum detection times observed in the simulations of NFD-S are very close to the reference line. Therefore NFD-S is correctly configured.

[Plot: x-axis: required bound T_D^U on the worst-case detection time; y-axis (log scale, 10¹ to 10⁷): average mistake recurrence time obtained from the simulations; curves: analytical, NFD-S]

Figure 4.6: The average mistake recurrence times obtained from the simulations of NFD-S (shown by +), with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —).
Figure 4.6 shows the simulation results on the average mistake recurrence times of algorithm NFD-S, together with the plot of the analytical formula for E(T_MR) that we derived in Section 4.3.2 (formula (4.2) of Theorem 4.11). The immediate conclusion from Fig. 4.6 is that the simulation results of algorithm NFD-S match the analytical formula for E(T_MR), i.e., formula (4.2) of Theorem 4.11.
Furthermore, note that the y-axis is in log scale, which means that when T_D^U increases linearly, the overall tendency of the average mistake recurrence time is to increase exponentially fast. This increase, however, is not continuous: as T_D^U increases, the average mistake recurrence time alternates between rapidly increasing periods and flat (nonincreasing) periods, just like a step function. We now explain why the curve resembles the curve of a step function.

We separate the curve into the following periods according to the value of T_D^U, and explain them one by one.
1. When T_D^U = 1, the parameter δ is set to T_D^U − η = 0. Recall that the freshness point τ_i is set to σ_i + δ, where σ_i is the sending time of m_i. So, in this case the freshness point τ_i is the same as the sending time σ_i. But it is impossible for message m_i to arrive before time τ_i, so q suspects p at every freshness point τ_i. During the interval (τ_i, τ_{i+1}), q is likely to receive the message m_i (recall that the average message delay is only 0.02 and the message loss probability is only 0.01), and thus q trusts p again. Therefore, when T_D^U = 1, the average mistake recurrence time is close to 1.

2. As T_D^U increases from 1 to around 1.16, δ = T_D^U − η increases from 0 to 0.16 and the freshness points τ_i are shifted forward in time accordingly. In this period, the probability that message m_i arrives after time τ_i decreases very fast, from 1 to e^{−8} ≈ 0.0003. Thus a small increase in δ significantly reduces the probability that m_i arrives late, and therefore significantly increases the time between consecutive mistakes.

3. When T_D^U = 1.16, δ = T_D^U − η = 0.16, i.e., the τ_i's are shifted forward in time 0.16 time units from the σ_i's. This shift distance is 8 times the average message delay time, and thus if m_i is not lost, there is a very high probability that m_i is indeed received before τ_i (in fact, the probability is 1 − e^{−8} ≈ 0.9997). Since the message loss probability is 0.01, we know that at this point the main cause of a failure detector mistake is that a message is lost. Since on average one out of every 100 messages is lost, the average mistake recurrence time is close to 100, as shown in Fig. 4.6.

4. From T_D^U = 1.16 to T_D^U = 2.0, δ increases from 0.16 to 1. In this period, a message is very unlikely to be delayed by more than δ time units, while a single message loss is enough to cause a failure detector mistake. Therefore, an increase in δ does not help much to gain a better mistake recurrence time, and the curve is almost flat.
The case is similar when T_D^U increases from 2 to 3: (a) From 2 to around 2.16, a failure detector mistake is mainly caused by the loss of a message m_i followed by the delay of message m_{i+1}. Thus an increase in δ increases the probability that message m_{i+1} is received before time τ_i, so that the failure detector does not make a mistake even if m_i is lost. Therefore in this period the average mistake recurrence time increases sharply. (b) From 2.16 to 3, a failure detector mistake is mainly due to the loss of two consecutive messages, and thus an increase in δ does not help much to gain better accuracy, and the average mistake recurrence time stays at about 10⁴. Other periods can be explained similarly.
In summary, the flat portions of the curve correspond to the failure detector configurations in which failure detector mistakes are mainly due to consecutive message losses, while the ascending portions of the curve correspond to the configurations in which failure detector mistakes are mainly due to a sequence of consecutive message losses followed by the delay of the last message before the suspicion.
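This explanation can be checked against the exponential delay model: the probability that a heartbeat is delayed by more than the slack δ is e^{−δ/E(D)}, so once δ reaches a few multiples of E(D), late arrivals become far rarer than losses (p_L = 0.01) and losses dominate the mistake rate. A quick numeric sketch:

```python
import math

def late_prob(delta, ED=0.02):
    """Pr(D > delta) for an exponentially distributed delay with mean ED."""
    return math.exp(-delta / ED)

# delta = 0.16 (T_D^U = 1.16): a late heartbeat is already much rarer than
# a lost one, so losses dominate and the curve flattens near 1/p_L = 100.
print(late_prob(0.16))   # e^-8, about 3.35e-4
print(late_prob(0.02))   # e^-1, about 0.368
```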
In Fig. 4.6, we show only the average mistake recurrence times obtained from the simulations. To further show that these simulation results are reliable, i.e., that they are not just by chance very close to the theoretical analysis, we show their corresponding confidence intervals in Fig. 4.7. In this figure, we show the 99% confidence intervals for the expected values of the mistake recurrence times of NFD-S, together with the plot of the analytical formula for E(T_MR) of NFD-S. The confidence intervals are computed using standard techniques (see, e.g., [All90] p. 445). The figure illustrates that all the confidence intervals are very small and surround the theoretical results. This shows that the simulation results are accurate and were not obtained by chance. The confidence intervals of the simulation results of the other algorithms show similar properties, and thus we do not include them in the thesis.
4.5.2 Simulation Results of NFD-E

Algorithm NFD-E has a parameter n (the number of messages used for estimating the expected arrival times of the heartbeats) that also affects the QoS. To show this, we first run simulations in which the parameter α is fixed and n takes the values 1, 4, 8, 12, 16, 20, 24, 28, 32 respectively, and see how the maximum detection times and the average mistake recurrence times change accordingly.
[Plot: x-axis: required bound T_D^U on the worst-case detection time; y-axis (log scale, 10⁰ to 10⁷): average mistake recurrence time obtained from the simulations]

Figure 4.7: The 99% confidence intervals for the expected values of mistake recurrence times of NFD-S, with the plot of the analytical formula for E(T_MR) of NFD-S.
Figure 4.8 shows the simulation results for α = 1.90. From the figure, we see that when n increases, the average mistake recurrence times show no obvious change, while the maximum detection time observed decreases from above 3.00 (when n = 1) to about 2.93 (when n = 32). Note that according to the analytical results on algorithm NFD-U (the one that knows all the expected arrival times), we have T_D ≤ E(D) + α + η. Thus with α = 1.90, we have T_D ≤ 2.92 for algorithm NFD-U. So from n = 1 to n = 32, the maximum detection time observed changes from more than 0.08 (4 times E(D)) above the bound 2.92, to within 0.01 (half of E(D)) above the bound 2.92. This suggests that when n = 32, the algorithm NFD-E is very close to the algorithm NFD-U. Simulations with other α values show similar results, and so we choose n = 32 for the algorithm NFD-E.

Since NFD-E with n = 32 is very close to NFD-U, we use the bound T_D ≤ E(D) + α + η of the algorithm NFD-U to compute the parameter α for the algorithm NFD-E. In particular, we set α = T_D^U − E(D) − η = T_D^U − 1.02.
Figure 4.9 shows the simulation results that check the correctness of our configurations of NFD-E. Since all simulation results are very close to the reference line, the algorithm NFD-E is correctly configured.

Figure 4.10 shows the simulation results on the average mistake recurrence times of algorithm NFD-E, together with the plot of the analytical formula for E(T_MR) that we derived for algorithm NFD-S in Section 4.3.2 (formula (4.2) of Theorem 4.11). From this figure, we see that with appropriately chosen n, the accuracy of algorithms NFD-S and NFD-E is essentially the same, when both algorithms send heartbeat messages at the same rate and satisfy the same upper bound on the worst-case detection time.
[Plot: x-axis: number of messages used for estimating expected arrival times; y-axis: maximum detection time observed in the simulations (2.92 to 3.01)]
(a) The maximum detection time observed decreases when n increases.

[Plot: x-axis: number of messages used for estimating expected arrival times; y-axis (log scale, 10⁰ to 10⁵): average mistake recurrence time obtained from the simulations]
(b) The average mistake recurrence time stays about the same when n increases.

Figure 4.8: The change of the QoS of NFD-E when n increases. Parameter α = 1.90.
[Plot: x-axis: required bound T_D^U on the worst-case detection time; y-axis: maximum detection time observed in the simulations; curves: reference line, NFD-E]

Figure 4.9: The maximum detection times observed in the simulations of NFD-E (shown by ×)
[Plot: x-axis: required bound T_D^U on the worst-case detection time; y-axis (log scale, 10¹ to 10⁷): average mistake recurrence time obtained from the simulations; curves: analytical, NFD-E]

Figure 4.10: The average mistake recurrence times obtained from the simulations of NFD-E (shown by ×), with the plot of the analytical formula for E(T_MR) of NFD-S (shown by —).
4.5.3 Simulation Results of the Simple Algorithm

As discussed in the Introduction of this chapter, the worst-case detection time of the simple algorithm is the maximum message delay time plus the timeout value TO. This means that for many practical systems that have no upper bound on the message delay time, as well as for our simulation setting, the worst-case detection time of the simple algorithm is unbounded. Thus in these situations the simple algorithm as it stands is not suitable to satisfy QoS requirements that require an upper bound on the worst-case detection time.

In this section, we apply a straightforward modification to the simple algorithm so that it is able to provide an upper bound on the worst-case detection time. Since the unbounded worst-case detection time of the simple algorithm is caused by the messages with very large delays, we modify the algorithm such that these messages are discarded. More precisely, the modified algorithm has another parameter, the cutoff time c, such that all messages delayed by more than c time units are discarded. We call messages delayed by at most c time units fast messages, and messages delayed by more than c time units slow messages. We assume that the simple algorithm is able to distinguish slow messages from fast messages.^8
With this modification, it is easy to see that the simple algorithm now has a worst-case detection time of c + TO. Given a bound T^U_D on the worst-case detection time, there is a tradeoff in setting the cutoff time c and the timeout value TO: the larger the cutoff time c, the smaller the number of messages being discarded, but the shorter the timeout value TO, and vice versa. In the simulations, we choose two

^8 This is not easy when local clocks are not synchronized. A fail-aware datagram service [FC97] may be used for this purpose.
[Plot omitted. Figure 4.11: The maximum detection times observed in the simulations of SFD-L and SFD-S, plotted with a reference line against the required bound T^U_D on the worst-case detection time.]
cutoff times c = 0.16 and c = 0.08, i.e., cutoff times of 8 and 4 times the mean message delay time, respectively. The timeout value TO is then set to T^U_D − c. The algorithm with c = 0.16 is denoted by SFD-L, and the one with c = 0.08 is denoted by SFD-S.

Figure 4.11 shows the simulation results of the observed maximum detection times of SFD-L and SFD-S. Since all simulation results are very close to the reference line at which the maximum detection time observed equals T^U_D, algorithms SFD-L and SFD-S are correctly configured.
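The timeout-with-cutoff logic described above can be sketched as a small trace-driven simulation. This is an illustrative reconstruction, not the thesis's pseudocode; the function name, the offline replay structure, and the initial-timer convention are assumptions:

```python
def simple_fd_with_cutoff(send_times, delays, c, TO, horizon):
    """Sketch of the modified simple algorithm: heartbeats delayed by more
    than the cutoff c are discarded (treated as lost); after each accepted
    heartbeat the timer is reset to expire TO time units later.
    delays[i] is the delay of the i-th heartbeat, or None if it was lost.
    Returns the intervals during which the monitored process is suspected."""
    # Arrival times of the "fast" messages only (delay <= c).
    arrivals = sorted(s + d for s, d in zip(send_times, delays)
                      if d is not None and d <= c)
    suspected = []
    deadline = send_times[0] + c + TO   # first timer (assumed convention)
    for a in arrivals:
        if a > deadline:                # timer expired before this arrival
            suspected.append((deadline, a))
        deadline = a + TO               # reset timer on every accepted message
    if deadline < horizon:
        suspected.append((deadline, horizon))
    return suspected
```

Replaying a heartbeat trace through such a function yields the suspicion intervals from which detection times and mistake recurrence can be measured.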
[Plot omitted. Figure 4.12: The average mistake recurrence times obtained from the simulations of SFD-L and SFD-S, with the plot of the analytical formula for E(T_MR) of NFD-S, plotted against the required bound T^U_D on the worst-case detection time.]
Figure 4.12 shows the simulation results on the average mistake recurrence times of SFD-L and SFD-S, together with the plot of the analytical formula for E(T_MR) of the new algorithm NFD-S (formula (4.2) of Theorem 4.11), which is the same plot as given in Fig. 4.6 and 4.10. From Fig. 4.6 and 4.10 we know that this plot also closely represents the simulation results of the two versions of the new algorithm, NFD-S and NFD-E.

We have the following observations from Fig. 4.12.
1. The curves of SFD-L and SFD-S resemble the curves of some step functions. The reason is similar to the one that we give for algorithm NFD-S.

2. The flat portions of SFD-L are very close to those of NFD-S, but the flat portions of SFD-S are much lower than those of the other two curves, and the gap is orders of magnitude wide and growing.
The reason is as follows. We already explained that the flat portions of a curve correspond to the cases in which failure detector mistakes are mainly due to message losses. More precisely, the first flat portion of each curve corresponds to the cases in which a mistake is mainly due to a single message loss; the second flat portion corresponds to the cases in which a mistake is mainly due to two consecutive message losses, and so on.

For the modified simple algorithm, slow messages are equivalent to lost messages since they are discarded by the algorithm. In SFD-L with cutoff time c = 0.16, the probability that a message is slow is very small compared to the probability that a message is really lost (in fact, it is e^{−c/E(D)} = 3.4 × 10^{−4}, compared to the message loss probability 0.01). In SFD-S, however, the cutoff time is c = 0.08 and the probability that a message is slow is 0.018, which is significant. For this algorithm, the combined message loss probability is p_L + 0.018 = 0.028. Under this message loss probability, a single message loss occurs about every 35 messages, and the event of two consecutive message losses occurs about every 1200 messages. This explains why the vertical position of the first flat portion of SFD-S is between 30 and 40 and the vertical position of the second flat portion is between 1000 and 2000. Since the difference in the probability of consecutive message losses between the algorithm SFD-S and the other two algorithms is increasing, the gap between SFD-S and the other two algorithms is increasing accordingly.
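These figures can be reproduced directly. The sketch below assumes, as in the simulation setting of this chapter, exponentially distributed message delays with mean E(D) = 0.02 and message loss probability p_L = 0.01:

```python
import math

E_D, p_L = 0.02, 0.01      # mean message delay and loss probability (simulation setting)

results = {}
for c in (0.16, 0.08):     # cutoff times of SFD-L and SFD-S
    p_slow = math.exp(-c / E_D)     # P(delay > c) for exponential delays
    p_combined = p_L + p_slow       # slow messages count as lost
    results[c] = (p_slow, p_combined)
    print(f"c={c}: p_slow={p_slow:.2g}, combined loss={p_combined:.3f}, "
          f"one loss every {1 / p_combined:.0f} msgs, "
          f"two consecutive every {1 / p_combined ** 2:.0f} msgs")
```

For c = 0.16 the slow-message probability is about 3.4 × 10^{−4}, negligible next to p_L; for c = 0.08 the combined loss probability is about 0.028, giving one loss roughly every 35 messages and two consecutive losses roughly every 1200.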
3. Regarding the ascending portions of the curves, as T^U_D increases, an ascending portion of NFD-S always starts first, followed by an ascending portion of SFD-S, and finally by an ascending portion of SFD-L. In these ascending portions, under the same value T^U_D, the average mistake recurrence time of the new algorithm could be orders of magnitude better than those of the simple algorithms.
This can be explained by the following example. Consider the point when T^U_D = 1.08. For the new algorithm NFD-S, δ = T^U_D − η = 0.08, which means that the freshness points are shifted forward in time by 4 times the mean message delay. Under this setting, a message (if not lost) is very likely to be received before the corresponding freshness point and thus avoid a failure detector mistake (the exact probability is 0.982). For the simple algorithm SFD-S with c = 0.08, we have TO = T^U_D − c = 1. This means that after a message is received, the timer will expire one time unit later. Since on average the next message will arrive one time unit later than the receipt of the previous message, TO = 1 means that about half of the messages will arrive after the timer expires and thus cause failure detector mistakes. Thus the accuracy of SFD-S is not good compared with NFD-S. For the simple algorithm SFD-L with c = 0.16, we have TO = T^U_D − c = 0.92. Under this setting, the timeout is too short, such that almost no message can arrive before the timer expires. Thus the accuracy of SFD-L is even worse at this point. Similar explanations can be applied to other points of T^U_D in the period from 1 to around 1.2, from 2 to around 2.2, etc.
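The arithmetic in this example can be checked in a few lines; exponential delays with mean E(D) = 0.02 and heartbeat interval η = 1 are the simulation-setting assumptions:

```python
import math

E_D, eta = 0.02, 1.0       # mean delay and heartbeat interval (simulation setting)
TUD = 1.08                 # required bound on the worst-case detection time

# NFD-S: freshness points are shifted by delta = TUD - eta beyond expected arrivals.
delta = TUD - eta                          # = 0.08, i.e., 4 times the mean delay
p_on_time = 1 - math.exp(-delta / E_D)     # P(delay < delta), about 0.982
print(f"NFD-S: delta = {delta:.2f}, "
      f"P(message beats its freshness point) = {p_on_time:.3f}")

# SFD-S and SFD-L: the timer is reset to TO = TUD - c after each accepted message.
timeouts = {c: TUD - c for c in (0.08, 0.16)}
print(f"timeouts: {timeouts}")
```

The computed values match the text: P ≈ 0.982 for NFD-S, TO = 1 for SFD-S, and TO = 0.92 for SFD-L.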
Therefore, in general, under the same requirement of T^U_D, the configuration of the new algorithm always gives a lower probability of a failure detector mistake, caused either by message delay or by message loss, than the configuration of the simple algorithm. For the simple algorithm, the larger the cutoff time, the smaller the timeout value, and thus the higher the probability of a failure detector mistake caused by message delay. On the other hand, if the cutoff time gets smaller, more messages are discarded (which effectively increases the probability that a message is lost), and this increases the probability of a failure detector mistake caused by message losses.

From the above observations, it is not hard to see that when the cutoff time of the modified simple algorithm increases, its curve is shifted further to the right; when the cutoff time decreases, its curve is pressed further down towards the x-axis. In all cases, the curve of the simple algorithm is always under the curve of the new algorithm.
To summarize, the simulation results show that, when both algorithms send heartbeat messages at the same rate and satisfy the same upper bound on the worst-case detection time, the accuracy of the new algorithm (with or without synchronized clocks) always dominates the accuracy of the simple algorithm, and in some cases it is orders of magnitude better.
4.6 Concluding Remarks
On the Adaptiveness of the New Failure Detector

In this chapter, our network model assumes that the probabilistic behavior of the network does not change over time. In practice, the network behavior may change over time gradually. For example, during working days, a corporate network typically experiences heavier traffic, which means longer message delays and more message losses, while during nights and weekends, the network traffic is usually much lighter. However, for a short period of time, e.g., one hour, the change of network behavior is relatively small, and our model is a good approximation for such relatively short periods.

For the gradual changes of the network behavior over a long time period, our new failure detector algorithm has the ability to adapt to the changes and behave accordingly. This is because we can configure the failure detector so that it only uses recent heartbeat messages to estimate the relevant system parameters such as p_L, E(D) and V(D), and the expected arrival times of the heartbeats if necessary. Therefore, the algorithm can automatically adapt to the recent behavior of the network, and thus the QoS of the failure detector can be guaranteed even if the network behavior changes gradually over time.
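A minimal sketch of such a windowed estimator follows. The window size, the use of heartbeat sequence numbers to count losses, and the class interface are illustrative assumptions rather than the thesis's algorithm:

```python
from collections import deque

class WindowedEstimator:
    """Estimate p_L, E(D) and V(D) from the last `window` heartbeats.
    Heartbeats carry sequence numbers, so gaps reveal losses."""
    def __init__(self, window=100):
        self.delays = deque(maxlen=window)   # delays of received heartbeats
        self.lost = deque(maxlen=window)     # 0/1 loss indicator per expected message
        self.last_seq = None

    def on_heartbeat(self, seq, delay):
        if self.last_seq is not None:
            for _ in range(seq - self.last_seq - 1):
                self.lost.append(1)          # missing sequence numbers were lost
        self.lost.append(0)
        self.last_seq = seq
        self.delays.append(delay)

    def estimates(self):
        n = len(self.delays)
        mean = sum(self.delays) / n
        var = sum((d - mean) ** 2 for d in self.delays) / n
        p_loss = sum(self.lost) / len(self.lost)
        return p_loss, mean, var
```

Because the deques keep only the most recent observations, the estimates track the network's recent behavior rather than its long-term history.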
On the QoS Requirements

In Sections 4.3.4, 4.4.1 and 4.4.3, we consider some simple QoS requirements that take the form of bounds on some QoS metrics. Applications may also have requirements in other forms. For example, an application may specify some objective function in terms of the QoS metrics, and require that the failure detector be configured such that the objective function is maximized. To deal with such more general QoS requirements, a decision-theoretic approach may be used in the configuration of the failure detector. Decision theory [Res87] provides mathematical tools for making decisions, and there has been some research that applies decision theory in certain areas of computer science such as networking, distributed computing, and database systems (e.g., [MHW96, BBS98, BS98, CH98, CHS99]). A study on the QoS of failure detectors using decision theory is an interesting research direction, but it is beyond the scope of this thesis.
On n-Process Systems

In this thesis, we focus on two-process systems: a failure detector at a process q monitors a process p. Many practical systems consist of more than two processes, and failure detection is required between every pair of processes. Our work on two-process systems can be used as a basis for the study of n-process systems. For example, in an n-process system, one may be interested in the time elapsed from the time when a process p crashes to the time when all other processes detect the crash of p. For this purpose, we can use our QoS metric, the detection time, of the failure detector on every process q that monitors p, and then take the maximum of all these detection times to obtain the value we want. Of course, n-process systems present more complicated cases than two-process systems, and more careful and creative study is necessary. We hope that this thesis can provide some helpful directions for the study of failure detection in more complicated distributed systems.
Appendix A

Theory of Marked Point Processes

Most of the notations, terminologies, and results concerning the theory of marked point processes are taken from [Sig95].
Marked Point Processes

Let R_+ and Z_+ denote the sets of nonnegative real numbers and integers, respectively. Let K denote a complete separable metric space called the mark space.

A simple marked point process (mpp) on the nonnegative time line R_+ is a sequence

    ψ = {(t_n, k_n) : n ∈ Z_+, t_n ∈ R_+, k_n ∈ K},    (A.1)

such that 0 ≤ t_0 < t_1 < t_2 < ··· and lim_{n→+∞} t_n = +∞. We call t_n an event time and k_n a mark associated with event time t_n. By simple we mean that the event times are all different. Let M = M_K denote the collection of all simple mpp's with mark space K. The set of all Borel measurable subsets of M is denoted as B(M).
We sometimes use the following interevent time sequence representation that is equivalent to (A.1):

    {t_0, {(T_n, k_n) : n ∈ Z_+, T_n ≝ t_{n+1} − t_n}},    (A.2)

where T_n denotes the n-th interevent time.

Note that t_n, T_n, and k_n are actually measurable mappings from M to R_+ or K.
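A finite prefix of a simple mpp, together with its interevent representation (A.2), can be sketched concretely; the list-of-pairs encoding is an illustration, not Sigman's formalism:

```python
# A finite prefix of a simple mpp: a list of (event_time, mark) pairs
# with strictly increasing event times.
mpp = [(0.4, "a"), (1.1, "b"), (2.7, "a"), (3.0, "c")]

def is_simple(psi):
    """Check the defining property: 0 <= t_0 < t_1 < t_2 < ..."""
    times = [t for t, _ in psi]
    return all(t >= 0 for t in times) and all(
        t1 < t2 for t1, t2 in zip(times, times[1:]))

def interevent_times(psi):
    """The equivalent representation (A.2): T_n = t_{n+1} - t_n."""
    times = [t for t, _ in psi]
    return [t2 - t1 for t1, t2 in zip(times, times[1:])]
```

For the sample prefix above, the interevent times are 0.7, 1.6 and 0.3.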
Shift Mappings

A shift mapping θ_s : M → M is a mapping that shifts an mpp ψ to the left by s time units. More precisely, if s = 0, then θ_s is the identity mapping; if s > 0, then for ψ = {(t_n, k_n)}, suppose t_{i−1} < s ≤ t_i for some i ∈ Z_+ (denote t_{−1} = 0 for convenience). We then have

    θ_s ψ ≝ {(t_{i+n} − s, k_{i+n}) : n ≥ 0}.    (A.3)

That is, θ_s ψ is the mpp obtained from ψ by shifting the origin to s, re-labeling event times at and after s as t_0, t_1, . . ., and ignoring the events before time s. Let θ_(j) ≝ θ_{t_j} denote shifting by the event time t_j, j ≥ 0. We then let

    ψ_s ≝ θ_s ψ  and  ψ_(j) ≝ θ_(j) ψ.    (A.4)

Let θ_t^{−1} E ≝ {ψ ∈ M : θ_t ψ ∈ E}, and θ_(j)^{−1} E ≝ {ψ ∈ M : θ_(j) ψ ∈ E}.
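On a list-of-(t, k) encoding of an mpp prefix, the shift mappings of (A.3) and (A.4) can be sketched as follows (the encoding itself is an illustrative assumption):

```python
def shift(psi, s):
    """theta_s: drop events before time s and re-origin at s, as in (A.3).
    psi is a list of (event_time, mark) pairs with increasing times."""
    return [(t - s, k) for t, k in psi if t >= s]

def shift_to_event(psi, j):
    """theta_(j): shift by the j-th event time t_j, as in (A.4)."""
    t_j = psi[j][0]
    return shift(psi, t_j)
```

Note that shifting to an event time places that event exactly at the origin, matching Proposition A.1's description of event stationary processes.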
Random Marked Point Processes

Let (Ω, F, P) be a probability space. A random marked point process (rmpp) Ψ is a measurable mapping Ψ : Ω → M. Ψ has the distribution P(Ψ ∈ E) ≝ P({ω ∈ Ω : Ψ(ω) ∈ E}) defined for all E ∈ B(M). Ψ_s is the rmpp obtained from Ψ by shifting the origin to time s, that is, for all ω ∈ Ω, Ψ_s(ω) = Ψ(ω)_s. Similarly, Ψ_(j) is the rmpp obtained from Ψ by shifting the origin to the time of the j-th event, that is, for all ω ∈ Ω, Ψ_(j)(ω) = Ψ(ω)_(j).
Stationary Versions

A rmpp Ψ is event stationary if P(Ψ_(j) ∈ E) = P(Ψ ∈ E) for all j ∈ Z_+ and all E ∈ B(M). Ψ is time stationary if P(Ψ_s ∈ E) = P(Ψ ∈ E) for all s ∈ R_+ and all E ∈ B(M).

The event stationary version Ψ⁰ and the time stationary version Ψ* of rmpp Ψ are two rmpp's defined by the following distributions (assuming they exist):

    Pr(Ψ⁰ ∈ E) ≝ lim_{n→∞} (1/n) Σ_{j=0}^{n−1} P(Ψ_(j) ∈ E), for all E ∈ B(M),    (A.5)

and

    Pr(Ψ* ∈ E) ≝ lim_{t→∞} (1/t) ∫_0^t P(Ψ_s ∈ E) ds, for all E ∈ B(M).    (A.6)

As shown in [Sig95], Ψ⁰ is event stationary and Ψ* is time stationary.

Proposition A.1 Any event stationary Ψ has, with probability one, the event time t_0 at the origin, i.e., Pr(t_0 ∘ Ψ = 0) = 1. Any time stationary Ψ has, with probability one, no event at the origin, i.e., Pr(t_0 ∘ Ψ = 0) = 0.
Invariant σ-Field

The invariant σ-field of M with respect to the flow of shifts, {θ_t : t ≥ 0}, is denoted by I and defined by I ≝ {E ∈ B(M) : θ_t^{−1} E = E, ∀t ≥ 0}. The invariant σ-field of M with respect to the event shifts, {θ_(j) : j ≥ 0}, is denoted by I_e and defined by I_e ≝ {E ∈ B(M) : θ_(j)^{−1} E = E, ∀j ≥ 0}.

Proposition A.2 I_e = I.

Because of the above proposition, we use I to denote the one and only invariant σ-field of M.
If Ψ is defined on a probability space (Ω, F, P), then the invariant σ-field on M can be lifted onto F by taking the inverse image: I^Ψ ≝ Ψ^{−1} I = {Ψ^{−1} E : E ∈ I}, where Ψ^{−1} E ≝ {ω ∈ Ω : Ψ(ω) ∈ E}. We omit the superscript Ψ whenever the context is clear. For example, for the notation of conditional expected value, we use E_I(f ∘ Ψ) instead of E_{I^Ψ}(f ∘ Ψ) (f is a measurable mapping from M to R_+).
Ergodicity

An event stationary Ψ⁰ is called ergodic if for any two events E_1, E_2 ∈ B(M),

    lim_{n→∞} (1/n) Σ_{j=0}^{n−1} Pr(Ψ⁰ ∈ E_1, Ψ⁰_(j) ∈ E_2) = Pr(Ψ⁰ ∈ E_1) Pr(Ψ⁰ ∈ E_2).    (A.7)

Similarly, a time stationary Ψ* is called ergodic if for any two events E_1, E_2 ∈ B(M),

    lim_{t→∞} (1/t) ∫_0^t Pr(Ψ* ∈ E_1, Ψ*_s ∈ E_2) ds = Pr(Ψ* ∈ E_1) Pr(Ψ* ∈ E_2).    (A.8)

As suggested by Sigman [Sig95], ergodicity should be regarded as a condition describing a kind of loss of memory as the event (or time) parameter tends to ∞. "For Ψ⁰ this means that if you start with Ψ⁰ and then randomly observe it way out at an event, then what you observe is an independent copy of Ψ⁰ itself" ([Sig95], p. 38). The same holds for Ψ* when you randomly observe it way out in time. The following proposition shows that ergodicity can be equivalently defined by using the invariant σ-field I.
Proposition A.3 Ψ⁰ is ergodic if and only if the invariant σ-field I is 0-1 with respect to Ψ⁰, i.e., iff for all E ∈ I, Pr(Ψ⁰ ∈ E) ∈ {0, 1}. Ψ* is ergodic if and only if the invariant σ-field I is 0-1 with respect to Ψ*, i.e., iff for all E ∈ I, Pr(Ψ* ∈ E) ∈ {0, 1}.

Proposition A.4 For any measurable f : M → R_+, if Ψ⁰ is ergodic, then E_I(f ∘ Ψ⁰) = E(f ∘ Ψ⁰) a.s., and if Ψ* is ergodic, then E_I(f ∘ Ψ*) = E(f ∘ Ψ*) a.s.^1
The following is the version of the important Birkhoff's Ergodic Theorem for random marked point processes. Henceforth, we assume that Ψ, Ψ⁰ and Ψ* use the same underlying probability space (Ω, F, P) (one can always construct some common space supporting all of them).

Theorem A.5
(1) If Ψ has the event stationary version Ψ⁰, then for any measurable mapping f : M → R_+,

    lim_{n→∞} (1/n) Σ_{j=0}^{n−1} f ∘ Ψ_(j) = E_I(f ∘ Ψ⁰) a.s.    (A.9)

In particular, if Ψ⁰ is ergodic, then

    lim_{n→∞} (1/n) Σ_{j=0}^{n−1} f ∘ Ψ_(j) = E(f ∘ Ψ⁰) a.s.    (A.10)

^1 The notation a.s. stands for "almost surely", which means that the equation is true with probability one.
(2) If Ψ has the time stationary version Ψ*, then for any measurable mapping f : M → R_+ such that ∫_0^t f ∘ Ψ_s ds < ∞, ∀t ≥ 0, a.s.,

    lim_{t→∞} (1/t) ∫_0^t f ∘ Ψ_s ds = E_I(f ∘ Ψ*) a.s.    (A.11)

In particular, if Ψ* is ergodic, then

    lim_{t→∞} (1/t) ∫_0^t f ∘ Ψ_s ds = E(f ∘ Ψ*) a.s.    (A.12)

Note that f ∘ Ψ⁰ and f ∘ Ψ* are measurable mappings from the underlying sample space Ω to R_+, and so they are random variables. So are lim_{n→∞} (1/n) Σ_{j=0}^{n−1} f ∘ Ψ_(j) and lim_{t→∞} (1/t) ∫_0^t f ∘ Ψ_s ds. Similar mathematical expressions are used in the following theorems.
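As an informal numerical illustration of (A.10), consider a renewal process with i.i.d. exponential interevent times and take f to be the first interevent time of the observed process, so that f ∘ Ψ_(j) is simply the j-th interevent time; the event average should then converge to E(f ∘ Ψ⁰), the mean interevent time. The simulation parameters below are arbitrary assumptions:

```python
import random

random.seed(1)
mean_T = 0.5                 # assumed mean interevent time E(T_0) of the renewal process
n = 200_000

# One long run: for this renewal process, f(Psi_(j)) with f = "first interevent
# time" is just the j-th interevent time T_j, so the event average in (A.10)
# is the sample mean of the T_j's.
samples = [random.expovariate(1 / mean_T) for _ in range(n)]
event_average = sum(samples) / n     # left-hand side of (A.10)

print(f"event average = {event_average:.3f}, E(f o Psi^0) = {mean_T}")
```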
From the above theorem, we can have the following characterization of the ergodicities of Ψ⁰ and Ψ*. Let I_E be the indicator function for some event E ∈ B(M), i.e., for all ψ ∈ M, I_E(ψ) = 1 if ψ ∈ E, and I_E(ψ) = 0 if ψ ∉ E.

Proposition A.6 Suppose that Ψ has event stationary version Ψ⁰ and time stationary version Ψ*. Ψ⁰ is ergodic if and only if for all E ∈ B(M),

    Pr(Ψ⁰ ∈ E) = lim_{n→∞} (1/n) Σ_{j=0}^{n−1} I_E ∘ Ψ_(j) a.s.    (A.13)

Ψ* is ergodic if and only if for all E ∈ B(M),

    Pr(Ψ* ∈ E) = lim_{t→∞} (1/t) ∫_0^t I_E ∘ Ψ_s ds a.s.    (A.14)
Proof. Suppose Ψ⁰ is ergodic. Then (A.13) is obtained by substituting f in (A.10) with I_E. Now suppose (A.13) holds. Then for any E ∈ I, we claim that I_E ∘ Ψ_(j) = I_E ∘ Ψ. In fact, for all ω ∈ Ω, where Ω is the underlying sample space for Ψ, I_E ∘ Ψ_(j)(ω) = 1 iff Ψ(ω)_(j) ∈ E iff Ψ(ω) ∈ θ_(j)^{−1} E iff Ψ(ω) ∈ E iff I_E ∘ Ψ(ω) = 1. Thus from (A.13) we have Pr(Ψ⁰ ∈ E) = I_E ∘ Ψ a.s., which implies that Pr(Ψ⁰ ∈ E) ∈ {0, 1}. By Proposition A.3, we know that Ψ⁰ is ergodic. The proof for Ψ* is similar.
Proposition A.6 suggests that the event stationary version Ψ⁰ of some rmpp Ψ is ergodic if and only if the distribution of Ψ⁰, i.e., Pr(Ψ⁰ ∈ E), can be obtained (with probability one) from any single run Ψ(ω) of Ψ as follows: observe Ψ(ω) at every event time t_j (to obtain Ψ(ω)_(j)), check whether the event E is true when observed at t_j (i.e., whether I_E(Ψ(ω)_(j)) = 1 or not), and then use the ratio of the number of event times t_j's at which E is true over the total number of event times as Pr(Ψ⁰ ∈ E). The distribution of Ψ* can be obtained in a similar way.

The following lemma shows that the ergodicities of Ψ⁰ and Ψ* are equivalent.

Lemma A.7 Suppose that Ψ has event stationary version Ψ⁰ and time stationary version Ψ*. Then Ψ⁰ is ergodic if and only if Ψ* is ergodic.
Arrival Rates

Let N_t : M → R_+ be the measurable mapping such that for all ψ ∈ M, N_t(ψ) is the number of event times of ψ in the period (0, t]. Suppose rmpp Ψ has the event stationary version Ψ⁰ and the time stationary version Ψ*. Let λ ≝ E(N_1 ∘ Ψ*); λ is called the arrival rate or intensity of Ψ. Intuitively, λ is the average number of event times or arrivals in a unit period in the time stationary version Ψ*. Let λ_I ≝ E_I(N_1 ∘ Ψ*); λ_I is called the conditional arrival rate or conditional intensity of Ψ.

Recall that T_n : M → R_+ is the measurable mapping such that T_n(ψ) ≝ t_{n+1}(ψ) − t_n(ψ) represents the n-th interevent time of mpp ψ.
Lemma A.8 For the conditional arrival rate λ_I, we have

    λ_I = lim_{t→∞} N_t ∘ Ψ / t = { lim_{n→∞} (1/n) Σ_{j=0}^{n−1} T_j ∘ Ψ }^{−1} = { E_I(T_0 ∘ Ψ⁰) }^{−1} a.s.    (A.15)

For the arrival rate λ, we have

    λ = E(λ_I).    (A.16)

Moreover, if Ψ⁰ is ergodic (and so is Ψ*), then we have

    λ_I = λ = { E(T_0 ∘ Ψ⁰) }^{−1} a.s.    (A.17)

Equalities (A.15) mean that the conditional arrival rate λ_I is a random variable, and it can be obtained either from the long run number of events per unit time (lim_{t→∞} N_t ∘ Ψ / t), or from the reciprocal of the long run average interevent time ({lim_{n→∞} (1/n) Σ_{j=0}^{n−1} T_j ∘ Ψ}^{−1}). Equality (A.16) shows that the arrival rate is the expected value of the random variable λ_I. Equalities (A.17) mean that, if the stationary versions of Ψ are ergodic, then the conditional arrival rate λ_I is almost surely the constant λ, which is also the reciprocal of the expected value of the very first interevent time of Ψ⁰.
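Equalities (A.15) and (A.17) can be checked on a simulated ergodic renewal run: the events-per-unit-time estimate and the reciprocal of the average interevent time coincide and approach λ. The exponential interevent times and the parameter values are assumptions of this sketch:

```python
import random

random.seed(7)
lam = 4.0                                   # assumed arrival rate of the simulated process
T = [random.expovariate(lam) for _ in range(100_000)]   # i.i.d. interevent times

horizon = sum(T)                            # time of the last simulated event
rate_from_counts = len(T) / horizon         # lim N_t o Psi / t
rate_from_gaps = 1 / (sum(T) / len(T))      # {long run average interevent time}^{-1}

print(f"N_t/t = {rate_from_counts:.2f}, 1/avg(T_j) = {rate_from_gaps:.2f}, "
      f"lambda = {lam}")
```

On a finite run the two estimators are algebraically identical, which is exactly the content of the first two equalities in (A.15).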
Empirical Inversion Formulas

The empirical inversion formulas are used to connect the event stationary version Ψ⁰ with the time stationary version Ψ*. Roughly speaking, the results show that (a) a random marked point process Ψ has the event stationary version Ψ⁰ if and only if it has the time stationary version Ψ*; (b) Ψ⁰ is the event stationary version of both Ψ and Ψ*, and Ψ* is the time stationary version of both Ψ and Ψ⁰; (c) the distributions of Ψ⁰ and Ψ* are related by some inversion formulas. We now state these results formally.

Theorem A.9 Ψ has the event stationary version Ψ⁰ and 0 < E_I(T_0 ∘ Ψ⁰) < ∞ a.s., if and only if Ψ has the time stationary version Ψ* and 0 < E_I(N_1 ∘ Ψ*) < ∞ a.s. In this case, we have E_I(N_1 ∘ Ψ*) = {E_I(T_0 ∘ Ψ⁰)}^{−1} = λ_I, Ψ⁰ is also the event stationary version of Ψ*, and Ψ* is also the time stationary version of Ψ⁰.
Theorem A.10 If Ψ has the event stationary version Ψ⁰ and 0 < E_I(T_0 ∘ Ψ⁰) < ∞ a.s. (or equivalently Ψ has the time stationary version Ψ* and 0 < E_I(N_1 ∘ Ψ*) < ∞ a.s.), then for all E ∈ B(M), we have the following empirical inversion formulas:

    Pr(Ψ⁰ ∈ E) = E( E_I[ Σ_{j=0}^{N_1∘Ψ* − 1} I_E ∘ Ψ*_(j) ] / E_I(N_1 ∘ Ψ*) ),    (A.18)

    Pr(Ψ* ∈ E) = E( E_I[ ∫_0^{T_0∘Ψ⁰} I_E ∘ Ψ⁰_s ds ] / E_I(T_0 ∘ Ψ⁰) ).    (A.19)

For all measurable f : M → R_+, we have the following conditional empirical inversion formulas:

    E_I(f ∘ Ψ⁰) = E_I[ Σ_{j=0}^{N_1∘Ψ* − 1} f ∘ Ψ*_(j) ] / E_I(N_1 ∘ Ψ*) a.s.,    (A.20)

and if in addition ∫_0^t f ∘ Ψ⁰_s ds < ∞, ∀t ≥ 0, a.s., then

    E_I(f ∘ Ψ*) = E_I[ ∫_0^{T_0∘Ψ⁰} f ∘ Ψ⁰_s ds ] / E_I(T_0 ∘ Ψ⁰) a.s.    (A.21)

The following corollary states the empirical inversion formulas under the ergodicity condition.
Corollary A.11 If Ψ has the event stationary version Ψ⁰, Ψ⁰ is ergodic, and 0 < E(T_0 ∘ Ψ⁰) < ∞ (or equivalently Ψ has the time stationary version Ψ*, Ψ* is ergodic, and 0 < E(N_1 ∘ Ψ*) < ∞), then for all E ∈ B(M), we have the following ergodic empirical inversion formulas:

    Pr(Ψ⁰ ∈ E) = E[ Σ_{j=0}^{N_1∘Ψ* − 1} I_E ∘ Ψ*_(j) ] / E(N_1 ∘ Ψ*),    (A.22)

    Pr(Ψ* ∈ E) = E[ ∫_0^{T_0∘Ψ⁰} I_E ∘ Ψ⁰_s ds ] / E(T_0 ∘ Ψ⁰).    (A.23)

For all measurable f : M → R_+, we have

    E(f ∘ Ψ⁰) = E[ Σ_{j=0}^{N_1∘Ψ* − 1} f ∘ Ψ*_(j) ] / E(N_1 ∘ Ψ*),    (A.24)

and if in addition ∫_0^t f ∘ Ψ⁰_s ds < ∞, ∀t ≥ 0, a.s., then

    E(f ∘ Ψ*) = E[ ∫_0^{T_0∘Ψ⁰} f ∘ Ψ⁰_s ds ] / E(T_0 ∘ Ψ⁰).    (A.25)
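As a concrete use of (A.25), take f = t_0, the time of the first event: for the event stationary version (which has t_0 = 0 and first interevent time T_0), t_0 ∘ Ψ⁰_s = T_0 − s for 0 < s ≤ T_0, so the numerator integral equals T_0²/2 and (A.25) gives E(t_0 ∘ Ψ*) = E(T_0²) / (2 E(T_0)), the mean forward recurrence time (the inspection paradox). The sketch below checks this for Uniform(0, 1) interevent times, where the formula gives (1/3) / (2 · 1/2) = 1/3; the sampling scheme and parameters are illustrative assumptions:

```python
import bisect
import random

random.seed(3)

# Renewal process with Uniform(0, 1) interevent times:
# E(T_0) = 1/2 and E(T_0^2) = 1/3.
events, now = [], 0.0
while now < 50_000:
    now += random.random()
    events.append(now)
horizon = events[-1]

# Time average of f = t_0 (forward recurrence time), estimated by sampling
# uniformly random instants s and measuring the wait until the next event.
n_samples = 200_000
total = 0.0
for _ in range(n_samples):
    s = random.uniform(0, horizon - 1e-6)
    nxt = events[bisect.bisect_right(events, s)]   # first event time after s
    total += nxt - s
time_average = total / n_samples

# Right-hand side of (A.25) with f = t_0: E(T_0^2) / (2 E(T_0)) = 1/3.
predicted = (1 / 3) / (2 * (1 / 2))

print(f"simulated E(t_0 o Psi*) = {time_average:.3f}, formula = {predicted:.3f}")
```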
Bibliography

[ACT] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. On quiescent reliable communication. SIAM Journal on Computing. To appear. Part of the paper appeared in Proceedings of the 11th International Workshop on Distributed Algorithms, September 1997, 126-140.

[ACT99] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Using the heartbeat failure detector for quiescent reliable communication and consensus in partitionable networks. Theoretical Computer Science, 220(1):3-30, June 1999.

[ACT00] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Failure detection and consensus in the crash-recovery model. Distributed Computing, 2000. To appear. An extended abstract appeared in Proceedings of the 12th International Symposium on Distributed Computing, September 1998, 231-245.

[ADKM92] Yair Amir, Danny Dolev, Shlomo Kramer, and Dalia Malki. Transis: a communication sub-system for high availability. In Proceedings of the 22nd Annual International Symposium on Fault-Tolerant Computing, pages 76-84, Boston, July 1992.

[All90] Arnold O. Allen. Probability, Statistics, and Queueing Theory with Computer Science Applications. Academic Press, 2nd edition, 1990.

[Arv94] K. Arvind. Probabilistic clock synchronization in distributed systems. IEEE Transactions on Parallel and Distributed Systems, 5(5):475-487, May 1994.

[BBS98] Sandeep Bajaj, Lee Breslau, and Scott Shenker. Uniform versus priority dropping for layered video. In Proceedings of SIGCOMM '98, the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 131-143, August 1998.

[BDGB94] Özalp Babaoğlu, Renzo Davoli, Luigi-Alberto Giachini, and Mary Gray Baker. Relacs: a communications infrastructure for constructing reliable applications in large-scale distributed systems, 1994. BROADCAST Project deliverable report, Department of Computing Science, University of Newcastle upon Tyne, UK.

[Bil95] Patrick Billingsley. Probability and Measure. John Wiley & Sons, 3rd edition, 1995.

[Bra89] R. Braden, editor. Requirements for Internet Hosts - Communication Layers. RFC 1122, October 1989.

[BS98] Lee Breslau and Scott Shenker. Best-effort versus reservations: A simple comparative analysis. In Proceedings of SIGCOMM '98, the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 3-16, August 1998.

[BvR93] Kenneth P. Birman and Robbert van Renesse, editors. Reliable Distributed Computing with the Isis Toolkit. IEEE Computer Society Press, 1993.

[CH98] Francis C. Chu and Joseph Y. Halpern. A decision-theoretic approach to reliable message delivery. In Proceedings of the 12th International Symposium on Distributed Computing, Lecture Notes in Computer Science, pages 89-103. Springer-Verlag, September 1998.

[CHS99] Francis C. Chu, Joseph Y. Halpern, and Praveen Seshadri. Least expected cost query optimization: An exercise in utility. In Proceedings of the 18th ACM Symposium on Principles of Database Systems, pages 138-147, May 1999.

[CHT96] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685-722, July 1996. An extended abstract appeared in Proceedings of the 11th ACM Symposium on Principles of Distributed Computing, August 1992, 147-158.

[Cri89] Flaviu Cristian. Probabilistic clock synchronization. Distributed Computing, 3:146-158, 1989.

[CT96] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225-267, March 1996. A preliminary version appeared in Proceedings of the 10th ACM Symposium on Principles of Distributed Computing, August 1991, 325-340.

[DFKM96] Danny Dolev, Roy Friedman, Idit Keidar, and Dahlia Malkhi. Failure detectors in omission failure environments. Technical Report 96-1608, Department of Computer Science, Cornell University, Ithaca, New York, September 1996.

[FC96] Christof Fetzer and Flaviu Cristian. Fail-aware failure detectors. In Proceedings of the 15th Symposium on Reliable Distributed Systems, pages 200-209, October 1996.

[FC97] Christof Fetzer and Flaviu Cristian. A fail-aware datagram service. In 2nd Annual Workshop on Fault-Tolerant Parallel and Distributed Systems, April 1997.

[GLS95] Rachid Guerraoui, Michael Larrea, and André Schiper. Non blocking atomic commitment with an unreliable failure detector. In Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, pages 13-15, 1995.

[GM98] Mohamed G. Gouda and Tommy M. McGuire. Accelerated heartbeat protocols. In Proceedings of the 18th International Conference on Distributed Computing Systems, May 1998.

[Hay98] Mark Garland Hayden. The Ensemble System. Ph.D. dissertation, Department of Computer Science, Cornell University, January 1998.

[MHW96] Armin R. Mikler, Vasant Honavar, and Johnny S. K. Wong. Analysis of utility-theoretic heuristics for intelligent adaptive network routing. In Proceedings of the 13th National Conference on Artificial Intelligence, volume 1, pages 96-101, 1996.

[MMSA+96] Louise E. Moser, P. M. Melliar-Smith, Deborah A. Agarwal, Ravi K. Budhia, and Colleen A. Lingley-Papadopoulos. Totem: A fault-tolerant multicast group communication system. Communications of the ACM, 39(4):54-63, April 1996.

[Pfi98] Gregory F. Pfister. In Search of Clusters. Prentice-Hall, Inc., 2nd edition, 1998.

[Res87] Michael D. Resnik. Choices: An Introduction to Decision Theory. University of Minnesota Press, 1987.

[Ros83] Sheldon M. Ross. Stochastic Processes. John Wiley & Sons, 1983.

[RT99] Michel Raynal and Frédéric Tronel. Group membership failure detection: a simple protocol and its probabilistic analysis. Distributed Systems Engineering Journal, 6(3):95-102, 1999.

[Sig95] Karl Sigman. Stationary Marked Point Processes, an Intuitive Approach. Chapman & Hall, 1995.

[VR00] Paulo Veríssimo and Michel Raynal. Time, clocks and temporal order. In Sacha Krakowiak and Santosh K. Shrivastava, editors, Recent Advances in Distributed Systems, chapter 1. Springer-Verlag, 2000. To appear.

[vRBM96] Robbert van Renesse, Kenneth P. Birman, and Silvano Maffeis. Horus: a flexible group communication system. Communications of the ACM, 39(4):76-83, April 1996.

[vRMH98] Robbert van Renesse, Yaron Minsky, and Mark Hayden. A gossip-style failure detection service. In Proceedings of Middleware '98, September 1998.
... Classical implementations of failure detectors require the periodic transmission of heartbeat messages by each monitored process to all others [14,16,35]. This strategy is efficient when the distributed system runs on a single physical network based on broadcast, as it requires only a single message to send each heartbeat to all processes. ...
... The model itself can be extended in several ways, for example, to allow the recovery of processes and network partitions. Furthermore, the implementation and empirical evaluation of the proposed detectors based on the quality of service metrics [35] and the adoption of machine learning [37] are also planned as future work. Also planned as applications of the proposed model and algorithms for failure detection in large-scale asynchronous systems, cloud computing [38], and the Internet of Things [39]. ...
Article
Full-text available
Reliable systems require effective monitoring techniques for fault identification. System-level diagnosis was originally proposed in the 1960s as a test-based approach to monitor and identify faulty components of a general system. Over the last decades, several diagnosis models and strategies have been proposed, based on different fault models, and applied to the most diverse types of computer systems. In the 1990s, unreliable failure detectors emerged as an abstraction to enable consensus in asynchronous systems subject to crash faults. Since then, failure detectors have become the de facto standard for monitoring distributed systems. The purpose of the present work is to fill a conceptual gap by presenting a distributed diagnosis model that is consistent with unreliable failure detectors. Properties are proven for the number of tests/monitoring messages required, latency for event detection, as well as completeness and accuracy. Three different failure detectors compliant with the proposed model are presented, including vRing and vCube, which provide scalable alternatives to the traditional all-monitor-all strategy adopted by most existing failure detectors.
... Aplicações podem ter restrições temporais, e um detector que possui um atraso muito grande na detecção de falhas pode não ser suficiente. Por este motivo, [Chen et al. 2002] propõe métricas para a qualidade de serviço (quality of service), ou simplesmente QoS, de detectores de falhas. De maneira geral, as métricas de QoS para um detector de falhas buscam descrever a velocidade (speed) e a exatidão (accuracy) da detecção. ...
... In other words, the metrics define how fast the detector detects a failure and how well it avoids mistakes. Also in [Chen et al. 2002], the authors propose a failure detector, called NFD-E, which can be configured according to the QoS parameters required by the application at hand. This detector targets the probabilistic system model proposed by the authors. ...
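The NFD-E-style detector described above computes, for each expected heartbeat, a "freshness point": the estimated arrival time of the next heartbeat plus a safety margin α that trades detection speed against accuracy. A minimal illustrative sketch (the class name, window size, and averaging scheme are simplifications for illustration, not the authors' exact algorithm):

```python
class ChenEstimator:
    """Sketch of an NFD-E-style timeout: the next freshness point is the
    estimated arrival time of the next heartbeat plus a safety margin alpha."""

    def __init__(self, interval, alpha, window=100):
        self.interval = interval   # heartbeat sending period (eta)
        self.alpha = alpha         # safety margin: speed vs. accuracy trade-off
        self.window = window       # number of recent heartbeats kept
        self.arrivals = []         # (sequence number, arrival time) pairs

    def record(self, seq, arrival_time):
        """Record the arrival of heartbeat number `seq`."""
        self.arrivals.append((seq, arrival_time))
        if len(self.arrivals) > self.window:
            self.arrivals.pop(0)

    def freshness_point(self):
        """Estimated arrival time of the next heartbeat, plus alpha."""
        if not self.arrivals:
            return None
        # Average the per-heartbeat offset A_i - i*interval, then project
        # it forward to the next expected sequence number.
        base = sum(a - s * self.interval for s, a in self.arrivals) / len(self.arrivals)
        next_seq = self.arrivals[-1][0] + 1
        return base + next_seq * self.interval + self.alpha

    def suspects(self, now):
        """Suspect the monitored process once the freshness point passes."""
        tau = self.freshness_point()
        return tau is not None and now > tau
```

A larger α lowers the mistake rate at the cost of a longer worst-case detection time, which is exactly the QoS trade-off the metrics above quantify.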
Conference Paper
Failure detectors are abstractions that, depending on the properties they provide, allow consensus to be solved in asynchronous distributed systems. This work presents a failure detection service based on epidemic (gossip) dissemination. The service was implemented for the JXTA platform. To allow evaluation with a larger number of processes, a simulator was also implemented. Experimental results are presented for processor and memory usage, detection time, and the detector's mistake rate, as well as its use in leader election. The experimental and simulation results indicate that the service scales with the number of processes and show that the epidemic dissemination strategy has significant advantages in groups with a large number of processes.
Conference Paper
A failure detector is an essential component in the construction of reliable distributed systems, and its design depends strongly on the distributed system model, which has demanded solutions to handle node movement in mobile ad hoc networks (MANETs). This work presents an unreliable asynchronous gossip-based failure detector that distinguishes faulty nodes from mobile ones by maintaining information about the received signal strength at the system's nodes, mapped into a small history of regions. The evaluations show improvements in the detector's quality of service when compared with the traditional gossip algorithm.
Chapter
Partially synchronous models are often assumed for designing distributed protocols because they capture realistic timing assumptions, such as the asynchronous and synchronous periods that the system can experience. In some of these models, protocols need to estimate network delays. Some protocols fix the global message delay bound for all executions, which leads to sub-optimal solutions in terms of latency, because this bound must be chosen conservatively. And other protocols employ delay estimation mechanisms that only give an upper bound on the delay without quantifying the estimation error. The performance of these protocols depends on how close their estimations are in relation to the actual network delay. For instance, some Byzantine consensus protocols use timeouts based on this estimation. We formalize this problem as the Global Delay Bound Estimation (GDBE) and address it by introducing a distributed oracle that enriches partially synchronous models. This oracle produces estimates of the channel delays that allow processes to derive an efficient global bounded estimate. Oracles and global bounded estimates provide a framework that facilitates the design of protocols for partially synchronous models and the analysis of their time complexity. We formalize the properties of the oracle and the proposed framework and show that it can be implemented in the presence of crash failures. In contrast, we prove that GDBE cannot be solved in the Byzantine failure model, and show how to circumvent this impossibility using an extra assumption. Finally, we show how to use our framework to implement a view synchronizer, thus obtaining an efficient solution for Byzantine consensus. Keywords: Oracle, Global delay, Timeout, Consensus, Crash failure, Byzantine failure, Channel delay, Synchronizer, Fixed delay, Partial synchrony, One-way delay.
Article
Failure detection is one of the basic functions in building a reliable disaster recovery backup system. Addressing the failure detection problem in application-level disaster recovery backup, this paper analyzes the remote disaster recovery center architecture and the failure detection hierarchy, and predicts the arrival time of cross-domain heartbeat information through a back-propagation neural network based on particle swarm optimization (PSO-BP). When the predicted timeout is actually reached, active Auxiliary Detection (AD) is used to improve the correctness of failure detection; finally, the effectiveness of the PSO-BP-AD method is verified through simulation.
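The core idea of predicting the next heartbeat's arrival time from the history of past arrivals can be illustrated with a much simpler stand-in for the paper's PSO-BP neural predictor: an exponentially weighted moving average over recent inter-arrival gaps (the function name and the smoothing factor `beta` are illustrative assumptions, not from the paper):

```python
def ewma_arrival_predictor(arrivals, beta=0.2):
    """Predict the next heartbeat arrival time from past arrival times.

    Uses an exponentially weighted moving average of the inter-arrival
    gaps -- a deliberately simple stand-in for the PSO-BP predictor
    described in the paper. `arrivals` is a list of timestamps in
    increasing order.
    """
    if len(arrivals) < 2:
        raise ValueError("need at least two arrivals to estimate a gap")
    gap = arrivals[1] - arrivals[0]
    # Smooth over the remaining gaps, weighting recent ones more heavily.
    for prev, cur in zip(arrivals[1:], arrivals[2:]):
        gap = (1 - beta) * gap + beta * (cur - prev)
    return arrivals[-1] + gap
```

A learned predictor such as PSO-BP can track non-stationary delay patterns that a fixed smoothing factor cannot, which is the motivation for the paper's approach.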
Conference Paper
Unreliable failure detectors (FDs) are used as a basic building block in the specification and implementation of fault tolerance in asynchronous distributed systems. A typical example of a large-scale asynchronous distributed system is the Internet. In this context, traditional FDs present problems, since they were designed for controlled networks (LANs). One problem to be addressed is message explosion: in large-scale systems, where the number of processes and the delays are unpredictable, message explosion can compromise the performance of the failure detection service and the scalability of the application. In this sense, this article addresses the message explosion problem by proposing a generic and practical approach that reuses application messages in place of control messages in FDs.
Article
Distributed systems that span large geographic distances or manage large numbers of objects are already commonplace. In such systems, programming applications with even modest reliability requirements to run correctly and efficiently is a difficult task due to asynchrony and the possibility of complex failure scenarios. We describe the architecture of the RELACS communication subsystem that constitutes the microkernel of a layered approach to reliable computing in large-scale distributed systems. RELACS is designed to be highly portable and implements a very small number of abstractions and primitives that should be sufficient for building a variety of interesting higher-level paradigms.
Article
Summary. We argue that the tools of decision theory should be taken more seriously in the specification and analysis of systems. We illustrate this by considering a simple problem involving reliable communication, showing how considerations of utility and probability can be used to decide when it is worth sending heartbeat messages and, if they are sent, how often they should be sent.
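The decision-theoretic view sketched above can be made concrete with a toy cost model (illustrative only, not the authors' formulation): sending heartbeats at period p costs msg_cost per message, while a crash (occurring at rate failure_rate per unit time) goes undetected for about p/2 on average, costing delay_cost per unit of detection delay. The best period minimizes the expected cost per unit time:

```python
def best_heartbeat_period(periods, msg_cost, failure_rate, delay_cost):
    """Pick the heartbeat period minimizing expected cost per unit time.

    Toy decision-theoretic model (an assumption for illustration, not the
    cited paper's exact utilities): messaging cost is msg_cost / p, and the
    expected detection-delay cost is failure_rate * delay_cost * (p / 2),
    since a crash waits about half a period before the next missed beat.
    """
    def cost(p):
        return msg_cost / p + failure_rate * delay_cost * (p / 2)
    return min(periods, key=cost)
```

In this model the two terms pull in opposite directions, so the optimum sits near p* = sqrt(2 * msg_cost / (failure_rate * delay_cost)); whether sending heartbeats is worth it at all depends on how these utilities compare, which is precisely the paper's point.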
Article
Summary. We study the problems of failure detection and consensus in asynchronous systems in which processes may crash and recover, and links may lose messages. We first propose new failure detectors that are particularly suitable to the crash-recovery model. We next determine under what conditions stable storage is necessary to solve consensus in this model. Using the new failure detectors, we give two consensus algorithms that match these conditions: one requires stable storage and the other does not. Both algorithms tolerate link failures and are particularly efficient in the runs that are most likely in practice – those with no failures or failure detector mistakes. In such runs, consensus is achieved within 3δ time and with 4n messages, where δ is the maximum message delay and n is the number of processes in the system.
Book
Probability. Measure. Integration. Random Variables and Expected Values. Convergence of Distributions. Derivatives and Conditional Probability. Stochastic Processes. Appendix. Notes on the Problems. Bibliography. List of Symbols. Index.