Data-driven Mixed-Integer Linear Programming-based
Optimisation for Efficient Failure Detection in Large-scale
Distributed Systems⋆,⋆⋆
Btissam Er-Rahmadi (a, c, 1), Tiejun Ma (a, b)
(a) Centre for Risk Research, Department of Decision Analytic and Risk, Southampton Business School, University of Southampton,
Building 2, University Road, SO17 1BJ, UK.
(b) The Artificial Intelligence Applications Institute, School of Informatics, Informatics Forum, The University of Edinburgh,
Crichton Street, EH8 9AB, UK.
(c) Present Address: Edinburgh Research Centre, Huawei Technologies R&D, 2, Semple Street, EH3 8BL, UK.
ARTICLE INFO
Keywords:
Nonlinear Programming
Mixed Integer Linear Programming
Distributed Systems
Failure Detection
Heartbeats
ABSTRACT
Failure detectors (FDs) are fundamental building blocks for distributed systems. An
FD detects whether a process has crashed based on the reception of heartbeat
messages sent by this process over a communication channel. A key challenge of FDs
is to tune their parameters to achieve optimal performance that satisfies the desired
system requirements. This is challenging due to the complexities of large-scale
networks. Existing FDs ignore such optimisation and adopt ad-hoc parameters. In this
paper, we propose a new Mixed Integer Linear Programming (MILP) optimisation-based
FD algorithm. We obtain the MILP formulation via piecewise linearisation
relaxations. The MILP yields optimal FD parameters that meet the optimal
trade-off between its performance metrics requirements, network conditions and
system parameters. The MILP maximises our FD’s accuracy under bounded failure
detection time while considering network and system conditions as constraints. The
MILP’s solution represents optimised FD parameters that maximise the FD’s expected
performance. To adapt to real-time network changes, our proposed MILP-based FD
fits the probability distribution of heartbeats’ inter-arrivals. To address our FD’s
scalability challenge in large-scale systems, where the MILP model needs to compute
approximate optimal solutions quickly, we also propose a heuristic algorithm. To test
our proposed approach, we adopt Amazon Cloud as a realistic testing environment
and develop a simulator for robustness tests. Our results show consistent
improvement of overall FD performance and scalability. To the best of our knowledge, this
is the first attempt to combine MILP-based optimisation modelling with FDs to
achieve performance guarantees.
Declarations of interest: none.
⋆⋆ This piece of research was partially sponsored by Huawei Ltd.
Corresponding author
btissam.errahmadi@gmail.com (B. Er-Rahmadi); tiejun.ma@soton.ac.uk (T. Ma)
ORCID (s): 0000-0003-0526-661X (B. Er-Rahmadi); 0000-0001-5545-6978 (T. Ma)
1This work has been achieved while B. Er-Rahmadi was a Research Fellow at the University of Southampton, UK; currently
she is affiliated to Edinburgh Research Centre, Huawei Technologies R&D, 2, Semple Street, EH3 8BL, UK.
Er-Rahmadi and Ma: Preprint submitted to Elsevier Page 1 of 28
Optimised Failure Detection Algorithms
1. Introduction
1.1. Failure Detection in Distributed Systems.
A distributed system is a set of autonomous computing processes, whose overall computing operations
appear to a user as a single coherent system (Tanenbaum and Steen 2007). In general, these processes may
represent hardware devices or software processes. To achieve the single system perception, these processes
need to collaborate with each other. It is unavoidable that some processes of the distributed system may
stop working (e.g., crash failure) at a random time. Failure detectors (FDs) are needed to identify such
failures (Guerraoui et al. 2009). An FD is a distributed algorithm that is implemented in at least one of the
distributed system processes. The FD would generally receive liveness messages (e.g., heartbeat messages)
from monitored processes to make decisions on whether the latter are still alive or not. Liveness decisions
are based on the receipt of liveness messages within a specific timeout.
It is, however, challenging to detect these failures in distributed networks (Laprie 1992). Generally, the
processes communicate via a network (e.g., a data centre, a cloud system, or a mobile network).
Consequently, a vital issue is the accuracy with which an FD will detect that a particular process has crashed.
In addition, network delays generated by message transmissions over communication channels can vary
and are not upper-bounded. Therefore, FDs cannot wait indefinitely to detect whether any other process has
crashed. Fischer et al. (1985) show the impossibility of distinguishing between a crashed process and a very
slow one in a purely asynchronous system (known as the Fischer-Lynch-Paterson impossibility result).
Chandra and Toueg (1996) introduce the concept of unreliable FDs to detect the crash behaviour of a
process. With such FDs, every live process sends liveness heartbeat messages to an FD at regular
intervals. This guarantees that, if expected heartbeat messages are missing within a specific timeout, the FD will
suspect that the corresponding process has crashed. Chandra and Toueg (1996) also provide an abstracted
classification of FDs based on their eventual behaviour to solve a set of membership problems such as
consensus problems. Chandra and Toueg’s work kick-started the examination of the quality of service (QoS)
(e.g., speed and accuracy) of FDs. One limitation of such FDs is that, if a heartbeat is missed because of
network delays or packet losses, its sender process will certainly be suspected. Inappropriate FD parameters
(heartbeat interval and timeout) may lead to erroneous FD decisions.
1.2. Challenges and Problem Description
Researchers (e.g., Bertier et al. 2002; Hayashibara et al. 2004; Satzger et al. 2007, 2011) have proposed
some FDs with adaptive decision parameters to cope with dynamic system environments. However, they can
only partially achieve this goal. Such FDs tend to be adaptive to network conditions without considering FDs’
QoS requirements. Moreover, the FD parameters’ adaptation may benefit one failure detection performance
measure but degrade others (e.g., Chen et al. 2002; Hayashibara et al. 2004; Satzger et al. 2011). For example,
FDs with fast failure detection speed may result in lower accuracy, or vice versa. None of these previous
works has considered the optimal trade-off between an FD’s QoS performance metrics and its parameters.
The FD parameters are set in an ad-hoc style, which may only achieve sub-optimal performance in given
distributed network environments.
In this paper, we address the research challenge of the FD algorithm QoS performance optimisation.
We consider FDs that use the heartbeat mechanism. In this mechanism, a monitored process periodically
sends heartbeat messages to the FD. The latter continuously checks the liveness of this monitored process
based on either a timeout or a threshold (depending on the FD algorithm). In the first option, the FD sets
a timeout by which it should have received the heartbeat messages. In the second option, the FD computes
a suspicion measure from heartbeats’ arrival times and compares it to a threshold. Consequently, the FD
detects failures if expected heartbeats are not received within the timeout or the suspicion measure is higher
than the threshold. Overall, heartbeat-FDs have two parameters: the heartbeat interval ($\delta$) for liveness update
and the timeout ($\Delta$) for FD decisions (suspect or trust), which is equivalent to a threshold ($\varepsilon$) in other FDs.
The QoS of such an FD is generally measured by failure detection time and query accuracy probability.
As heartbeat messages are generally transmitted via networks, the choice of FD parameters ($\delta$ and $\Delta$/$\varepsilon$)
impacts its performance measures. This is because a heartbeat message may arrive later than expected, or
even be dropped, which implies potential false decisions made by the FD. Therefore, FD parameters should
be set in a way that considers fluctuating network delays and packet losses.
However, using a short heartbeat interval means that heartbeats are sent more frequently, which provides
more frequent liveness updates. This might improve the failure detection time but, at the same
time, implies a growing consumption of the network bandwidth. If a longer heartbeat interval is selected,
it may increase the chances of false detection (i.e. decision mistakes). Furthermore, a short FD timeout (or
a small threshold) may enhance the failure detection speed but may result in less accurate decisions. These
issues show the challenge posed by the trade-offs involved in selecting an FD’s parameters; this needs to be
addressed in order to optimise its QoS. More specifically, this paper addresses the challenges of optimising
$\delta$ and $\Delta$/$\varepsilon$. Setting optimal FD parameters will achieve higher accuracy and reduce the failure detection time
while meeting constraints of real-time network conditions and system characteristics.
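
To make the timeout-based variant of the heartbeat mechanism concrete, the following is a minimal sketch, not the algorithm proposed in this paper. The class name `TimeoutDetector`, its method names and the numeric values are hypothetical and chosen only for illustration; the monitored process is assumed to send heartbeats roughly every $\delta$ = 1 s.

```python
from dataclasses import dataclass


@dataclass
class TimeoutDetector:
    """Timeout-based heartbeat failure detector for one monitored process (illustrative)."""
    timeout: float              # timeout Delta, in seconds, counted from the last heartbeat
    last_arrival: float = 0.0   # arrival time of the most recent heartbeat

    def on_heartbeat(self, arrival_time: float) -> None:
        # Keep only the latest arrival; late or reordered heartbeats never move it backwards.
        self.last_arrival = max(self.last_arrival, arrival_time)

    def suspects(self, now: float) -> bool:
        # Suspect the process when no heartbeat has arrived within the timeout.
        return now - self.last_arrival > self.timeout


fd = TimeoutDetector(timeout=2.5)
for t in (1.0, 2.1, 3.0):          # heartbeats sent about every delta = 1 s (assumed values)
    fd.on_heartbeat(t)

trusted_at_4 = fd.suspects(now=4.0)    # 1.0 s of silence, below the timeout
suspected_at_6 = fd.suspects(now=6.0)  # 3.0 s of silence, above the timeout
print(trusted_at_4, suspected_at_6)
```

A shorter `timeout` would flag the silence sooner but would also suspect a live process whenever a heartbeat is delayed, which is the trade-off discussed above.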
1.3. Our Approach
To overcome the research challenge of the FD algorithm’s optimal query accuracy probability and
detection speed trade-off, we propose a new threshold-based FD algorithm. We call our FD the Self-Organised
Network-Aware Failure Detector (SONAFD). SONAFD learns the probability distribution of heartbeats’
inter-arrival times. It also models the failure detection as a nonlinear optimisation problem (NLP). This
NLP maximises SONAFD’s QoS in terms of query accuracy probability and upper-bounds its detection time.
The NLP decision variables are SONAFD’s heartbeat interval $\delta$ and suspicion threshold timeout $\Delta$, which is
used to compute its suspicion threshold $\varepsilon$. SONAFD’s NLP satisfies system and network constraints. This
NLP is then converted to an MILP by applying piecewise linearisation (PWL) functions (Geißler et al. 2015;
Rebennack and Krasko 2019; Vielma et al. 2010b). Finally, SONAFD uses a greedy heuristic algorithm
that we propose for solving the optimisation problem in large-scale systems. To the best of our knowledge,
our proposed SONAFD is the first attempt to solve failure detection performance issues by adopting Mixed
Integer Linear Programming (MILP)-based optimisation models while also taking into consideration the
scalability of distributed systems.
The rest of the paper is structured as follows. Section 2 discusses related literature. Section 3 describes
the considered system model, while Section 4 presents the design details of our proposed SONAFD. Section 5
presents both Amazon Cloud testbed and simulation evaluation results as robustness tests to validate our
proposed approach using CPLEX and our proposed heuristic algorithm, and Section 6 concludes the paper.
2. Related work
2.1. Failure Detectors and their QoS
Chen et al. (2002) propose quantitative QoS metrics to measure an FD’s failure detection speed, probabilistic
accuracy, and mistake rate. Since then, the key works on FDs’ design and performance evaluation have adopted
the QoS metrics of Chen et al. (2002) to evaluate and compare FDs’ performance (e.g., Hayashibara et al. 2004;
Liu et al. 2017; Satzger et al. 2011). In this paper, we use the same QoS metrics but extend their definitions
to a more realistic failure assumption, crash-recovery instead of crash-stop (see Section 3).
To improve the QoS of an FD, Chen et al. (2002) and Bertier et al. (2002) proposed
binary decision (i.e. trust or suspect) FDs that estimate the new heartbeats’ arrival times based on observed
communication delays. However, such efforts achieve limited performance improvements, for three
reasons. First, such FDs do not always provide up-to-date or correct lists of suspected
processes. This is due to the volatility of the network behaviour such as delays and losses. Second, the
binary outputs of these FDs (i.e. trust or suspect) are not capable of satisfying the QoS requirements at the
application level. Third, the network dynamics make it difficult for these FDs to adjust their parameters
optimally. For example, network delays may vary again while the FD is adapting itself to a previous change. Sotoma and
Madeira (2001) implement an adaptive FD that uses the average value of heartbeats’ inter-arrivals in its
timeout estimation. Similar works such as Xiong et al. (2012) and Turchetti et al. (2016) also try to enhance
the QoS of FDs by considering the actual feedback of previously achieved QoS. However, none of these FDs
was able to find and set up its QoS-aware optimal parameters.
Unlike binary decision FDs, accrual FDs (e.g., Hayashibara et al. 2004) output a probabilistic estimate
that a process has failed. Such a design allows applications to decouple failure interpretation from failure
monitoring. The Phi detector proposed by Hayashibara et al. (2004) is the first accrual/threshold-based
FD. The Phi detector outputs a positive value on a continuous scale, called Phi (i.e. $\varphi$). Phi reflects a
confidence level on the probabilistic likelihood that the monitored process has crashed. The Phi value
accrues over time and tends toward infinity if the monitored process crashes. Such a confidence level is
compared regularly to a suspicion threshold set by system management/application layers according to their
QoS requirements. The Phi detector has been used in a number of real systems, such as OpenDayLight by Akka
(2018), documented in Akka (2021), and the decentralised storage system Cassandra (Lakshman and Malik
2010). The Phi detector assumes that the heartbeats’ inter-arrivals follow the normal distribution. Although
the Phi detector has achieved efficient and stable failure detection, its assumption on the heartbeats’
inter-arrivals distribution limits its range of applications.
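
As an illustration of how an accrual value can be computed under the normal assumption, the sketch below uses the commonly cited form of the Phi value, $\varphi(t) = -\log_{10}\big(1 - F(t - t_{last})\big)$, where $F$ is the normal CDF estimated from a window of recent inter-arrivals. The function name `phi`, the sample window and the query times are assumptions made for this example only; this is not the detector proposed in this paper.

```python
import math
from statistics import NormalDist, mean, stdev


def phi(now, last_arrival, inter_arrivals):
    """Accrual suspicion value under a normal model of heartbeat inter-arrivals."""
    # Estimate the normal parameters from the window of recent inter-arrival times.
    dist = NormalDist(mean(inter_arrivals), stdev(inter_arrivals))
    # Probability that the next heartbeat still arrives later than the elapsed time.
    p_later = 1.0 - dist.cdf(now - last_arrival)
    # Clamp to avoid log10(0) when the elapsed time is far in the tail.
    return -math.log10(max(p_later, 1e-300))


window = [1.0, 1.1, 0.9, 1.05, 0.95]     # hypothetical inter-arrival times in seconds
low = phi(now=11.0, last_arrival=10.0, inter_arrivals=window)   # elapsed = mean
high = phi(now=11.3, last_arrival=10.0, inter_arrivals=window)  # elapsed well past the mean
print(round(low, 3), round(high, 3))
```

An application would compare the returned value with its own threshold, so different applications can react at different confidence levels to the same monitoring data.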
The aim of the work proposed by Satzger et al. (2007, 2011) is to generalise the Phi detector. Instead of
making an assumption about the distribution of the heartbeats’ inter-arrival times, it deduces the cumulative distribution of
sampled heartbeats in a sliding window based on a histogram density estimation. A suspicion probability
is then obtained using this cumulative distribution. As an adaptive accrual FD, it increases flexibility and
decreases computation costs compared to the Phi detector. In our paper, we adopt a similar idea to the Phi
detector and Satzger’s approach in terms of exploiting heartbeats’ inter-arrivals’ data and design a suspicion
threshold-based FD. This FD capitalises on the benefits of our proposed MILP model to compute its optimal
parameters according to the QoS requirements and network conditions.
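
The distribution-free idea described above can be sketched with an empirical CDF over a sliding window: the suspicion is the fraction of recent inter-arrivals that are no longer than the time elapsed since the last heartbeat. This is a simplified illustration in the spirit of that approach, not a reproduction of it; the class name `EmpiricalSuspicion`, the window size and the timestamps are hypothetical.

```python
from bisect import bisect_right
from collections import deque


class EmpiricalSuspicion:
    """Suspicion level from the empirical CDF of recent heartbeat inter-arrivals."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # sliding window of inter-arrival times
        self.last_arrival = None

    def on_heartbeat(self, t):
        if self.last_arrival is not None:
            self.window.append(t - self.last_arrival)
        self.last_arrival = t

    def suspicion(self, now):
        # With no history there is nothing to compare against, so report no suspicion.
        if not self.window:
            return 0.0
        ordered = sorted(self.window)
        elapsed = now - self.last_arrival
        # Fraction of observed inter-arrivals that are <= the elapsed time.
        return bisect_right(ordered, elapsed) / len(ordered)


detector = EmpiricalSuspicion(window_size=100)
for t in (0.0, 1.0, 2.25, 3.0, 4.0):     # inter-arrivals: 1.0, 1.25, 0.75, 1.0
    detector.on_heartbeat(t)

levels = [detector.suspicion(now) for now in (4.9, 5.1, 5.5)]
print(levels)
```

The suspicion grows from 0 to 1 as the silence lengthens, and comparing it with a threshold gives the trust or suspect decision.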
Other failure detectors of a similar fashion have been proposed. Most of these FDs assume that the heartbeats’
inter-arrivals follow either 1) an exponential distribution, as suggested by Xiong and Defago (2007), or 2) a Weibull
distribution, as proposed by Liu et al. (2017). The latter FD is an abstracted concept and presents challenges
that may prevent its deployment in real-world systems (e.g., onerous processing). Also, it may not be ideal
to have a fixed assumption on the heartbeats’ inter-arrivals probability in dynamically changing networks.
Hence, such probability distributions should be adaptive for specific applications.
2.2. Communications Network Modelling
Most previous works on failure detection make use of the arrival time of recently received heartbeat mes-
sages in a specific time duration. Chen et al. (2002) estimate the network behaviour (i.e. delays and losses)
based on the collected information of successfully received heartbeats. The estimation involves computing
the expected value and variance of heartbeats’ delays without any assumption on the delays distribution in
relatively stable traffic conditions. If the network traffic is bursty, it projects a combination between short-
term and long-term heartbeats without a concrete implementation of the idea. Bertier et al. (2002) consider
only the last received heartbeat message and do not save heartbeats’ history. Sotoma and Madeira (2001) do
not consider real-world network conditions; they only simulate network delays as a set of predefined values
alternatively used as fixed delay intervals. Xiong et al. (2012) take into account the average of heartbeats’
inter-arrivals saved in the history window and include network communication delays. Turchetti et al. (2016)
follow a similar approach to Chen et al. (2002). Hayashibara et al. (2004) consider statistics of successfully
received heartbeats to obtain the mean and variance of heartbeats’ inter-arrivals. The authors assume that
these inter-arrival times follow a normal distribution. Xiong and Defago (2007) and Liu et al. (2017) are
similar to Hayashibara et al. (2004) but assume different heartbeat inter-arrival time distributions (the
exponential distribution and the Weibull distribution, respectively). As a generalisation of the Phi FD, Satzger
et al. (2007, 2011) approximate the cumulative distribution function (CDF) of heartbeats’ inter-arrivals via
the cumulative frequencies of the most recently received heartbeat messages. The CDF of the inter-arrivals’
histogram is used directly to obtain suspicion values as there is no assumption on the distribution of heartbeat
arrivals. Although previous work tried to use available information on heartbeats’ inter-arrivals, none has
attempted a flexible setting based on the impact of network conditions on inter-arrival times.
All these mentioned FDs are based on the exploitation of the network conditions only. Our proposed
approach combines the use of 1) probability distribution fitting applied to heartbeats’ inter-arrivals and 2)
adopting an MILP model for the estimation of optimal FD parameters to adapt to the network conditions. This
has not been explored in previous work. Our proposed approach has shown the effectiveness of exploiting
network conditions (as input parameters) and optimising FD parameters together. The trade-off between
the QoS of FDs and costs of network bandwidth is formulated as a decision-making action in uncertain
environments, particularly in online learning systems (Abdel-Aziz et al. 2020; Ferreira et al. 2018; Gupta
et al. 2006; Tan et al. 2009). Previous literature related to networked systems (e.g., Hussin et al. 2015;
Madhushani and Leonard 2021; Xu et al. 2018) has studied such a trade-off extensively to enhance the
performance of the system. In this paper, we apply such an approach to the failure detection challenge.
2.3. Optimisation Modelling
Our proposed FD optimisation model is inspired by integer programming, which allows us to
model our FD’s QoS as decision variables and constraints. Similar problems and approaches in a multitude
of applications have been explored in previous literature (e.g., Du et al. 2012, 2019; Gullhav et al. 2017;
Kaewpuang et al. 2013; Li et al. 2012). In particular, nonlinear programming (Luenberger and Ye 2016)
is suitable for the network and QoS-driven failure detection problem. This is because it can handle the general
(i.e. nonlinear) formulation of the optimised performance metric under different heartbeats’ inter-arrivals’
distributions. Going deeper into our optimisation modelling, the MILP (Conforti et al. 2014) allows us to
tackle the nonlinearity of constraints by using PWL relaxations (Geißler et al. 2012; Rebennack and Krasko
2019; Vielma et al. 2010a). This facilitates a simpler and more efficient model implementation.
Therefore, we propose to model the distributed algorithm as an optimisation problem (i.e. MILP) that
aims at enhancing an FD’s decision-making accuracy while respecting its QoS and resource constraints.
Such optimisation approaches have been successfully adopted for resource management (Buyya et al. 2010;
Kaewpuang et al. 2013), communications design (Li et al. 2012), load balancing (Gullhav et al. 2017; Liu
and Righter 1998), repair policy (Sleptchenko and Johnson 2014), energy footprints management (Shen and
Wang 2014), and decision analytics (Heilig and V 2014).
In summary, the previously proposed FDs are designed for specific conditions. As network systems are
evolving rapidly, the need for autonomous and self-adaptable FDs becomes more pressing. None of these
existing FDs can optimise its parameters according to the changing network conditions, resource constraints
and QoS requirements to maximise its failure detection accuracy and speed. Thus, we aim to achieve such
an FD design with our proposed MILP optimisation-based FD.
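
The PWL relaxation mentioned in Section 2.3 replaces a nonlinear term by linear interpolation between breakpoints. The sketch below only illustrates that approximation step on an assumed nonlinear term (an exponential CDF with rate 1) with assumed, evenly spaced breakpoints; it is not the formulation used in this paper. In an MILP, the interpolation weight `w` would become a continuous variable and the choice of segment would be encoded with binary variables or SOS2 constraints.

```python
import math
from bisect import bisect_right


def pwl(breakpoints, values, x):
    """Piecewise-linear interpolation of tabulated values at x."""
    # Index of the segment [breakpoints[k], breakpoints[k + 1]] that brackets x.
    k = min(max(bisect_right(breakpoints, x) - 1, 0), len(breakpoints) - 2)
    x0, x1 = breakpoints[k], breakpoints[k + 1]
    w = (x - x0) / (x1 - x0)  # convex-combination weight on the right breakpoint
    return (1 - w) * values[k] + w * values[k + 1]


def f(t):
    # Assumed nonlinear term: CDF of an exponential distribution with rate 1.
    return 1.0 - math.exp(-t)


xs = [0.5 * k for k in range(9)]   # 9 breakpoints on [0, 4], spacing h = 0.5
ys = [f(x) for x in xs]

# Worst absolute approximation error on a fine grid over [0, 4].
worst = max(abs(pwl(xs, ys, 0.01 * i) - f(0.01 * i)) for i in range(401))
print(worst)
```

For a twice-differentiable function, the interpolation error on each segment is at most $h^2/8 \cdot \max|f''|$, so adding breakpoints tightens the relaxation at the cost of more binary variables.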
3. System Model
3.1. Distributed Network Model
We consider that a distributed network consists of a finite set of processes: $\Sigma_N = \{p_1, p_2, p_3, \ldots, p_N\}$
with $N \geq 2$. These processes can communicate their liveness by sending heartbeat messages. They may fail
under the crash-recovery assumption. Under this assumption, a process may fail (e.g., human error, energy outage,
etc.) and recover (e.g., after intervention of an administrator) many times during an observation period (as
studied by Aguilera et al. (2000) and Ma et al. (2010)). In contrast, the crash-stop assumption states that when
a process fails, it fails permanently and never recovers (discussed in Chen et al. 2002). We adopt the crash-recovery
assumption because it aligns with more realistic failure models.
We consider that the $N$th process implements the FD and monitors all remaining processes (i.e. $N-1$
monitored processes) in a star topology. This means that monitored processes communicate only with the FD
and not with each other. Throughout the paper, FD and $p_N$ are used interchangeably to mean the monitoring
entity of FD. We assume that any monitored process $p_i$, $i \in [1 \,..\, N-1]$, and the process implementing FD,
i.e. $p_N$, are connected by two unidirectional quasi-reliable communication channels (Barolli et al. 2018).
These are defined as unreliable network channels. Such a channel ensures that no message can be created,
changed, or copied. However, messages may be lost in the channel. The notion of ‘channel’ in this paper
does not necessarily correspond to a physical channel and represents an end-to-end connection.
We assume that the time clock of the distributed network is synchronised, i.e. there is no (or negligible)
clock drift. In real-world systems, many synchronisation methods/protocols are available, yielding a
negligible clock drift (e.g., $10^{-6}$). Such an assumption holds in real-world implementations (Coulouris et al.
2001, 2005; Marouani and Dagenais 2008).
For simplicity, and without loss of generality, we restrict our description to only two processes; these
are $p_i$ (where $i$ can take any value in $[1 \,..\, N-1]$) and $p_N$. This description can be easily extended
to the full $N$-process system. $p_i$ sends liveness messages, i.e. heartbeats, to $p_N$ at a regular interval $\delta_i$. If $p_N$ does
not receive a heartbeat message from $p_i$ within a determined timeout, it starts suspecting $p_i$ until it receives a
new heartbeat message. This is how a timeout-based FD would make a decision. However, our proposed
FD algorithm is threshold-based. It regularly computes an instantaneous suspicion level (noted as $\varepsilon_{s_i}(t)$)
at instant $t$ based on previously received heartbeats from $p_i$. Our FD compares $\varepsilon_{s_i}(t)$ to its corresponding
threshold $\varepsilon_i$: $p_i$ might be suspected or trusted by $p_N$. $\varepsilon_i$ is generally set according to application requirements.
$\varepsilon_i$ can be mapped to a specific timeout $\Delta_{\varepsilon_i}$. This timeout corresponds to the average of considered timeouts
upon new heartbeats’ receptions. We will detail the equation relating $\varepsilon_i$ and $\Delta_{\varepsilon_i}$ in Section 4.1. The parameter
$\varepsilon_i$ allows our FD to make a monitoring decision on $p_i$. Thus, the configuration parameters that characterise
our FD are a heartbeat interval $\delta_i$ and a suspicion threshold $\varepsilon_i$.
3.2. QoS Metrics of FDs
To evaluate the QoS of an FD, a set of probabilistic performance metrics were first proposed by Chen
et al. (2002) and extended in Ma et al. (2010). Such metrics have been widely adopted and used in previous
works (Hayashibara et al. 2004; Liu et al. 2017; Satzger et al. 2011; Xiong and Defago 2007). More
specifically, the detection time $T_{D_i}$ represents the speed at which a failure is detected. The query accuracy
probability $P_{A_i}$ depicts the accuracy of an FD decision. The mistake rate ($\lambda_{M_i}$) illustrates the frequency of
the FD’s false decisions. Figure 1 illustrates interactions between $p_i$ and $p_N$ under crash-recovery. More
precisely, under crash-recovery, the notion of “mistake” may include multiple states. $p_N$ makes a mistake
if it suspects $p_i$ while the latter is functional, or trusts it while it has crashed. The FD’s mistake is not only
associated with transitions but also with the mutual comparison between FD output (state or transition) and
$p_i$ states/transitions. In fact, in line with Ma et al. (2010), we consider the following states of $p_i$: Recovery
when $p_i$ is functional and Crash when it is faulty. Hence, the transition from Recovery to Crash is called the
C-Transition, and the transition from Crash to Recovery is called the R-Transition. Figure 1 also shows the
main FD QoS metrics, including the mistakes that happen in crash-recovery systems. A mistake can be:
- Mistake type 1: the FD starts suspecting $p_i$ (i.e. S-Transition) while it is functional (in Recovery state).
- Mistake type 2: the FD is trusting $p_i$ while it has just crashed (i.e. C-Transition).
- Mistake type 3: the FD starts trusting $p_i$ (i.e. T-Transition) while it is faulty (it is in Crash state).
- Mistake type 4: the FD is suspecting (i.e. Suspect) $p_i$ while it has just recovered (i.e. R-Transition).
[Figure 1: Definition of mistakes and mainly considered QoS metrics for an FD algorithm. The timeline shows the up/down state of $p_i$ and the trust/suspect output of $p_N$ (i.e. FD) over an observation period $T$, with the S-, C-, R- and T-Transitions, the four mistake types (type 1: suspect while healthy; type 2: trust while crash; type 3: trust while faulty; type 4: suspect while recovery), the detection time $T_{D_i}$, the periods $T_{G_i}^{1}, \ldots, T_{G_i}^{5}$, $P_{A_i} = \sum_k T_{G_i}^{k} / T$ and $\lambda_{M_i} = 4 / T$.]
Therefore, we adopt the following set of QoS metrics to evaluate an FD’s performance:
- Detection Time ($T_{D_i}$): this random variable represents the time period between the time $p_i$ crashes
and the time $p_N$ starts suspecting $p_i$. This metric reflects the speed at which an FD detects faults of $p_i$. The
shorter $T_{D_i}$ is, the faster the failure detector is.
- Mistake Rate ($\lambda_{M_i}$): this random variable represents the average number of mistakes (whatever the
mistake type is) that $p_N$ makes in a time unit with respect to the state of $p_i$.
- Query Accuracy Probability ($P_{A_i}$): it measures the probability that, when queried at a random time, $p_N$
indicates correctly the state of $p_i$. Practically, $P_{A_i}$ is computed as the ratio of the sum of time periods
(i.e. $T_{G_i}^{k}$ in Figure 1) during which $p_N$ specifies correctly the state of $p_i$ to the observation period.
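
The definition of $P_{A_i}$ as a ratio of correctly reported time to the observation period can be checked on a small trace. The sketch below is only an illustration under assumed inputs: the state of $p_i$ and the output of $p_N$ are given as hypothetical lists of change points, and the helper names `value_at` and `query_accuracy` do not come from this paper.

```python
def value_at(changes, t):
    """Value at time t of a step function given as sorted (time, value) change points."""
    current = changes[0][1]
    for time, value in changes:
        if time <= t:
            current = value
        else:
            break
    return current


def query_accuracy(process, detector, horizon):
    """Fraction of [0, horizon] during which the FD output matches the process state.

    process:  (time, is_up) change points of the monitored process p_i
    detector: (time, trusts) change points of the FD output at p_N
    """
    points = sorted({0.0, horizon, *(t for t, _ in process), *(t for t, _ in detector)})
    good = 0.0
    for start, end in zip(points, points[1:]):
        # Both signals are constant on [start, end), so sampling at `start` is enough.
        if value_at(process, start) == value_at(detector, start):
            good += end - start
    return good / horizon


# Hypothetical trace: p_i crashes at t = 10 and recovers at t = 20;
# the FD starts suspecting at t = 12 and trusts again at t = 23.
process = [(0.0, True), (10.0, False), (20.0, True)]
detector = [(0.0, True), (12.0, False), (23.0, True)]

p_a = query_accuracy(process, detector, horizon=30.0)
detection_time = 12.0 - 10.0
print(p_a, detection_time)
```

In this trace the FD is correct for 10 + 8 + 7 = 25 of the 30 time units, and the two mismatched periods correspond to the delays after the crash and after the recovery.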
In this paper, we focus on optimising the trade-off between two main metrics, $P_A$ and $T_D$, in our
MILP modelling. FD measures the expected $T_D$ and $P_A$ based on the $T_{D_i}$ and $P_{A_i}$ values over all monitored processes,
respectively. We also evaluate $\lambda_M$ (i.e. the expected $\lambda_{M_i}$ from all monitored processes) in our evaluation
results. Therefore, $P_A$ and $T_D$ are incorporated into the objective function and/or constraints (introduced in
Section 4.2.2). This is to ensure that the solution of the MILP will satisfy the FD’s QoS requirements.
4. Self-optimised Network-Aware Failure Detector: SONAFD
An FD should be aware of the continuously changing communication network and able to adapt itself
to provide efficient failure detection. An FD should have the following desired capabilities: 1) identifying
network messaging characteristics and behaviour (e.g., probability distribution of heartbeats’ inter-arrivals);
2) adapting the FD’s decision threshold regularly to provide fast and accurate decisions about monitored
processes’ failures/recoveries; and 3) guaranteeing failure detection QoS requirements with existing resources’
constraints and providing optimal parameters of the FD to maximise its QoS.
To achieve these desired features, an adaptive FD which fulfils different QoS needs is required. While
existing FDs were proposed to allow updating heartbeat timeouts (Bertier et al. 2002; Chen et al. 2002;
Hayashibara et al. 2004; Satzger et al. 2007, 2011), they are unable to optimally incorporate QoS
requirements into their parameters’ settings and consider network resource constraints. Therefore, we introduce the
Self-Organised Network-Aware Failure Detector (SONAFD). We design the SONAFD to address existing
FDs’ performance issues and achieve the features we discussed above. SONAFD is a combination of an
adaptive FD algorithm that we call the Network-Aware Failure Detector (NAFD) and an MILP model.
NAFD learns the probability distribution of heartbeats’ inter-arrivals and adapts its suspicion threshold
$\varepsilon_i$ to dynamic network conditions. NAFD contains three main parts. First, NAFD collects a sample of
recent heartbeats’ inter-arrivals. It then performs probability distribution fitting on the collected sample
using the Kolmogorov-Smirnov (K-S) test (Kolmogorov 1933; Smirnov 1948). The aim is to determine
the representative probability distribution of heartbeats’ inter-arrivals in real-world network environments.
The choice of probability distributions is independent of NAFD as we can fit any probability distribution
on a collected sample of heartbeat messages’ inter-arrivals. Second, NAFD applies the fitted probability
distribution to a short window of heartbeats’ inter-arrivals to infer the probabilistic likelihood
that $p_i$ has crashed. This probabilistic likelihood is continuously compared to $\varepsilon_i$ to make failure detection decisions
(trust or suspect). Third, NAFD updates its $\varepsilon_i$ regularly using collected network packet delay and loss
data (simultaneously collected with heartbeats’ inter-arrivals) to adapt to changing network conditions.
We also propose an NLP model for NAFD that aims at maximising its query accuracy probability while
embedding the required upper-bound of its failure detection time. The NLP will guarantee the required sys-
tem constraints (e.g., bandwidth) and consider network conditions through the fitted probability distribution.
To reduce the computational complexity, we convert the proposed NLP to an MILP model. The optimal solutions of the MILP model are the values of $\delta_i$ and $\Delta_{\varepsilon_i}$ that NAFD uses in its settings. In addition, we propose a greedy heuristic algorithm to compute the solutions of the proposed MILP model efficiently and enhance its scalability.
In summary, NAFD represents the failure detection functions of how heartbeats’ inter-arrivals should
be exploited to optimally detect failure or recovery. SONAFD adopts our proposed MILP to automate the
optimal parameters’ settings and meets the expected QoS of NAFD and system requirements.
4.1. NAFD
For NAFD, we aim to achieve three features. The first feature is an FD that adopts a customised heartbeats' inter-arrivals probability distribution. Such a distribution depends on the distributed system environment in which NAFD is implemented. This distribution may be updated regularly to adapt to network environment changes. The second feature is the heartbeat message function and threshold-based accrual FD algorithm. The third feature is the self-adapting suspicion threshold FD. Table 1 contains the main notations of NAFD. When NAFD starts, $p_i$ sends regular heartbeat messages every $\delta_i$ time interval (Algorithm 1(a)-line 3). NAFD uses its adapted threshold to detect potential failures. The features of NAFD are as follows. Please refer to the NAFD pseudo-codes for $p_i$ in Algorithm 1(a) and for the FD process in Algorithm 1(b) (the detailed procedures' pseudo-codes are available in Appendix G).
First, the heartbeat messages' inter-arrivals probability distribution fitting (Algorithm 1(b)-line 3, with further details in Algorithm B.1 in Appendix B): NAFD is designed to take into account the changing behaviour of heartbeats' inter-arrivals. NAFD/SONAFD continuously collects heartbeats' inter-arrivals. It performs probability distribution fitting in the background by regularly computing the goodness-of-fit test distance. NAFD/SONAFD adopts the probability distribution for which this distance is lower than the critical value of the goodness-of-fit test, and is consistently the lowest for the last twenty-four hours. We use the K-S test (Appendix A) as a goodness-of-fit test to fit the collected data. The main advantage of adopting the K-S criterion is that the probability distribution of the K-S test statistic itself does not depend on the underlying cumulative distribution function being tested. It is also an exact test that does not depend on sample sizes (Guthrie 2020). Let $\pi_i$ and $\Pi_i$ be the best fitted distribution and the set of its characterising parameters (estimated from the fitted sample), respectively. For $p_i$, $\pi_i$ and $\Pi_i$ are the outputs of the probability distribution fitting procedure. The details of probability distribution fitting are given in Appendix B.
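The fitting step can be illustrated with a minimal sketch, assuming an exponential candidate distribution whose mean is estimated from the sample. The function names are hypothetical and do not come from Algorithm B.1. The critical value $1.36/\sqrt{n}$ is the usual large-sample approximation at the 5% level for a fully specified distribution; it is conservative when the parameter is estimated from the same data.

```python
import math

def ks_distance_exponential(sample):
    """Return (D, mean): the K-S distance between the empirical CDF of
    `sample` and an exponential CDF whose mean is estimated from it."""
    xs = sorted(sample)
    n = len(xs)
    mean = sum(xs) / n
    distance = 0.0
    for k, x in enumerate(xs, start=1):
        cdf = 1.0 - math.exp(-x / mean)
        # The empirical CDF jumps from (k-1)/n to k/n at x: check both sides.
        distance = max(distance, k / n - cdf, cdf - (k - 1) / n)
    return distance, mean

def ks_critical_value(n, c_alpha=1.36):
    """Large-sample approximation of the 5% critical value (assumption)."""
    return c_alpha / math.sqrt(n)

def exponential_fits(sample):
    """Accept the exponential candidate if D is below the critical value."""
    distance, _ = ks_distance_exponential(sample)
    return distance < ks_critical_value(len(sample))
```

In a setting with several candidate distributions, the same distance would be computed for each candidate and the smallest one below the critical value retained.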
Table 1
General notations related to NAFD.

| Notation | Description |
|---|---|
| $N$ | Number of processes in the system. |
| $\delta_i$ | An FD parameter representing the heartbeats' period of $p_i$ (ms). |
| $\varepsilon_i$ | An FD parameter representing the threshold of NAFD/SONAFD for $p_i$. |
| $\Delta_{\varepsilon_i}$ | The mean equivalent timeout associated with threshold $\varepsilon_i$ (ms). |
| Heartbeat history | Heartbeat history of $p_i$. |
| $W_i = W$ | Sampling window size of $p_i$: maximum size of its heartbeat history. |
| $\mu_i$ | Mean of heartbeat inter-arrivals in the heartbeat history of $p_i$ (ms). |
| $\sigma_i$ | Standard deviation of heartbeat inter-arrivals in the heartbeat history of $p_i$ (ms). |
| $G_i(t)$ | The probability that a given heartbeat message will arrive more than $t$ time units later than the previous heartbeat for $p_i$, assuming that heartbeat inter-arrivals follow the distribution $\pi_i$, characterised by the parameter set $\Pi_i$. |
| Network history | Network history of $p_i$, containing delay and packet loss information of its corresponding communication channel. |
| $D_i = \mathbb{E}(D_i^j)_j$ | Packet delay mean for $p_i$, computed over its whole network history (ms). |
| $S_i$ | Packet delay standard deviation for $p_i$, computed over its whole network history (ms). |
| $\tau_i$ | Packet loss rate for $p_i$, computed over its whole network history. |
| $P_{CL}$ | Required percentage of confidence coverage to estimate the number of consecutively lost packets at $\tau_i$ (e.g., 99%). |
| $K_i$ | Estimated number of consecutive lost packets at $P_{CL}$ confidence coverage. |
| $T_{mon}$ | Time interval used by NAFD/SONAFD to check the state of monitored processes (ms). |
| $T_{adapt}$ | Time interval used by NAFD to trigger $\varepsilon_i$ adaptation (ms). |

Second, the accrual failure detection and threshold: NAFD adopts the fitted distribution in predicting the arrival time of the expected next heartbeat message and identifying the probabilistic likelihood that $p_i$ has crashed. There are two operations:
- Heartbeats' inter-arrivals' sampling (Algorithm 1(b)-line 9, with details in Algorithm G.2-lines 2-3): NAFD keeps track of an observation window of recent heartbeats' inter-arrivals (i.e. the heartbeat history of $p_i$). NAFD saves the last $W_i$ heartbeats' inter-arrivals to this sampling window. If the window holds more than $W_i$ inter-arrivals, the oldest ones are dropped. NAFD uses the window to estimate the parameters of $\pi_i$ (e.g., mean and variance) stored in $\Pi_i$. We further discuss the size $W_i$ of the observation window in Section 4.4, to examine its impact on the overall FD performance.
- Calculating NAFD's suspicion level (Algorithm 1(b)-line 6, with details in Algorithm G.3-lines 4-5-6): NAFD uses the recent heartbeats' inter-arrivals sample in the heartbeat history to estimate the probability that $p_i$ has crashed. Such a calculation is performed based on the adopted probability distribution $\pi_i$ and its parameters estimated from the heartbeat history, i.e. $\Pi_i$. NAFD continuously computes and converts this probability of crash to a positive real number $\varepsilon_{s_i}$. NAFD then compares $\varepsilon_{s_i}$ to $\varepsilon_i$ to make a failure detection decision. Given that $T_{last_i}$ is the arrival time of the last received heartbeat from $p_i$, $\varepsilon_{s_i}$ is computed at the current instant $t_{now}$ as follows:

$$\varepsilon_{s_i}(t_{now}) \stackrel{def}{=} -\log_{10}\big(G_i(t_{now} - T_{last_i})\big), \quad \text{where } G_i(t) = 1 - CDF(t, \pi_i, \Pi_i), \tag{1}$$

where $G_i(t)$ represents the probability that a given heartbeat will arrive more than $t$ time units later than the previous heartbeat. Given $\Delta t = t_{now} - T_{last_i}$, $G_i(\Delta t)$ corresponds to the probability that the expected (next) heartbeat will arrive at least $\Delta t$ time units after the last received heartbeat. The value of $\varepsilon_{s_i}$ increases if the time difference $t_{now} - T_{last_i}$ increases, and vice versa. $\varepsilon_{s_i}$ is compared to the threshold $\varepsilon_i$: if $\varepsilon_{s_i}$ is lower than or equal to $\varepsilon_i$, $p_N$ trusts $p_i$; otherwise, $p_N$ suspects it.
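The two operations above can be sketched as follows, assuming the fitted distribution is exponential so that $G_i(t) = e^{-t/\mu_i}$ and equation (1) reduces to $\varepsilon_{s_i} = \Delta t / (\mu_i \ln 10)$. The class and function names are hypothetical; this is an illustration, not the procedure of Appendix G.

```python
import math
from collections import deque

class HeartbeatHistory:
    """Sliding window holding the last `window_size` inter-arrivals."""

    def __init__(self, window_size=1000):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.inter_arrivals = deque(maxlen=window_size)
        self.t_last = None  # arrival time of the last received heartbeat

    def record(self, t_arrival):
        if self.t_last is not None:
            self.inter_arrivals.append(t_arrival - self.t_last)
        self.t_last = t_arrival

    def mean(self):
        # Requires at least one recorded inter-arrival.
        return sum(self.inter_arrivals) / len(self.inter_arrivals)

def suspicion_level(history, t_now):
    """Equation (1) under the exponential assumption.
    -log10(exp(-dt/mu)) is evaluated in closed form as dt / (mu * ln 10),
    which avoids taking the logarithm of an underflowed probability."""
    dt = t_now - history.t_last
    return dt / (history.mean() * math.log(10))

def decide(history, t_now, threshold):
    """Trust while the suspicion level does not exceed the threshold."""
    return "trust" if suspicion_level(history, t_now) <= threshold else "suspect"
```

With a mean inter-arrival of 10 ms and a threshold of 1, the monitored process stays trusted until roughly 23 ms have elapsed since the last heartbeat ($10 \times \ln 10$).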
Third, NAFD's decision suspicion threshold $\varepsilon_i$ adaptation (Algorithm 1(b)-line 7, further detailed in Algorithm G.4-lines 6 to 10): To adapt to changing network conditions (high network delay or large bursts of packet loss), NAFD adjusts its threshold autonomously. This is to maintain the required QoS in terms of accuracy and speed. In addition to the heartbeats' inter-arrivals saved in the heartbeat history, NAFD also saves a history of network packet delays and losses, referred to as the network history. The size of the network/channel history is different from $W_i$: it corresponds to one hour of collected delay and packet loss information. By referring to equation (1), $\varepsilon_{s_i}$ represents the degree to which the elapsed time since the last heartbeat arrival is lower or higher than a given timeout: we denote this timeout as $\Delta_{\varepsilon_i}$. $\Delta_{\varepsilon_i}$ verifies (2):

$$\varepsilon_i = -\log_{10}\big(G_i(\Delta_{\varepsilon_i})\big), \text{ such that (see Appendix D): } \Delta_{\varepsilon_i} = \begin{cases} D_i + 3S_i + K_i\delta_i & \text{if } D_i \geq S_i \\ P_{99th_i} + K_i\delta_i & \text{otherwise,} \end{cases} \tag{2}$$

where $D_i$ is the average heartbeat message delay, $S_i$ is the standard deviation of heartbeat messages' delay, $P_{99th_i}$ is the 99th percentile of heartbeat messages' delay, $\tau_i$ is the heartbeat message loss rate and $K_i$ is the estimated number of successively lost packets at a specific packet loss rate $\tau_i$. $K_i$ is obtained as described in Appendix E. Finally, $\varepsilon_i$ is computed as:

$$\varepsilon_i = \begin{cases} -\log_{10}\big(G_i(D_i + 3S_i + K_i\delta_i)\big) & \text{if } D_i \geq S_i \\ -\log_{10}\big(G_i(P_{99th_i} + K_i\delta_i)\big) & \text{otherwise.} \end{cases} \tag{3}$$
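The threshold adaptation of equations (2) and (3) can be sketched as below. Several points are assumptions made for illustration only: the case condition is read as $D_i \geq S_i$, the standard deviation is the population one, the 99th percentile uses the nearest-rank method, $K_i$ is passed in rather than computed (Appendix E is not reproduced here), and $G_i$ is taken as exponential with mean $\mu_i$. The function names are hypothetical.

```python
import math

def percentile_99(delays):
    """Nearest-rank 99th percentile (one of several common definitions)."""
    xs = sorted(delays)
    rank = (99 * len(xs) + 99) // 100  # integer ceil(0.99 * n)
    return xs[rank - 1]

def equivalent_timeout(delays, k_lost, delta):
    """Timeout of equation (2): base delay allowance plus K_i * delta_i."""
    n = len(delays)
    mean = sum(delays) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in delays) / n)
    if mean >= std:
        base = mean + 3 * std          # low dispersion: mean + 3 sigma
    else:
        base = percentile_99(delays)   # heavy dispersion: 99th percentile
    return base + k_lost * delta

def threshold_from_timeout(timeout, mu):
    """Equation (3) with exponential G_i: -log10(exp(-timeout / mu))."""
    return timeout / (mu * math.log(10))
```

For delays of 4, 6, 4 and 6 ms, $K_i = 2$ and $\delta_i = 10$ ms, the timeout is $5 + 3 \times 1 + 20 = 28$ ms.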
Algorithm 1(a) NAFD (in $p_i$, $\forall i \in [1\,..\,N-1]$):
1: $\delta_i = 10$ ms
2: procedure SENDHEARTBEAT()
3:     while $mod(t_{now}, \delta_i) = 0$ do Send heartbeat

Algorithm 1(b) NAFD (in $p_N$):
1: $\forall i \in [1\,..\,N-1]$: $\varepsilon_i = 1$; $T_{mon} = 10$ ms; $T_{adapt} = 1200000$ ms
2: for the heartbeat history of each $p_i$, $i \in [1\,..\,N-1]$ do
3:     $\pi_i \leftarrow$ FITHBINTERARRIVALS(heartbeat history of $p_i$)
4: procedure MAIN
5:     if $t_{now} = t_{start}$ then INITIALISE()
6:     else if $mod(t_{now}, T_{mon}) = 0$ then DETECTFAILURE()
7:     else if $mod(t_{now}, T_{adapt}) = 0$ then ADAPTTHRESHOLDS()
8:     while heartbeat do
9:         PROCESSRECEIVEDHEARTBEAT($t_{now}$)
4.2. SONAFD
To optimise NAFD, we consider two QoS metrics: the failure detection time $T_D$ and the query accuracy probability $P_A$, which represent the delay of detecting a failure and the accuracy of failure decisions, respectively (see details in Section 3.2). Note that there is a trade-off between failure detection time and accuracy: the higher $\varepsilon_i$ is, the worse $T_D$ is (i.e. slower failure detection), with a potentially better $P_A$. In addition, we use the mistake rate $\lambda_M$ to model relationships between the considered performance metrics ($T_D$ and $P_A$) and SONAFD parameters ($\delta_i$ and $\varepsilon_i$) (Section 4.2.1). We also use $\lambda_M$ to evaluate the performance of SONAFD in Section 5.
We propose an NLP model to design SONAFD with optimal parameters and guarantee the required QoS. To reduce the computational complexity, we transform the NLP into an MILP model, which we refer to as $\mathbb{M}$ (see details in Section 4.2.2). Table 2 contains notations related to the NLP and $\mathbb{M}$. The objective of both models is to find optimal values of SONAFD's parameters, the heartbeats' intervals $\tilde{\delta}_i$ and the suspicion thresholds' timeouts $\tilde{\Delta}_{\varepsilon_i}$, that guarantee maximal $P_A$ and bounded $T_D$: $\delta_i$ and $\Delta_{\varepsilon_i}$ are the decision variables.
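The optimised threshold $\tilde{\varepsilon}_i$ is then obtained from the optimised timeout $\tilde{\Delta}_{\varepsilon_i}$ via equation (4), where $\tilde{G}_i$ represents $G_i$ whose characterising parameter (i.e. mean) is replaced by the optimised heartbeat interval $\tilde{\delta}_i$:

$$\tilde{\varepsilon}_i \stackrel{def}{=} -\log_{10}\big(\tilde{G}_i(\tilde{\Delta}_{\varepsilon_i})\big). \tag{4}$$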
The pseudo-code of SONAFD is presented in Algorithm 2(a) for $p_i$ and Algorithm 2(b) for $p_N$. At each $p_i$, the update of $\delta_i$ with the optimised interval $\tilde{\delta}_i$ makes SONAFD different from NAFD (i.e. Algorithm 2(a)-line 4). At $p_N$, SONAFD uses all NAFD's procedures and additionally uses the function OPTIMISEPARAMETERS (i.e. Algorithm 2(b)-line 7, further detailed in Algorithm G.5). This procedure differentiates SONAFD from NAFD by adding $\mathbb{M}$ to NAFD and computing optimal parameters as a solution. SONAFD runs OPTIMISEPARAMETERS at regular intervals (i.e. $T_{opt}$). The procedure starts by retrieving heartbeat, network, and system information in Step 1 (Algorithm G.5-lines 2-6). $\mathbb{M}$ is solved in Step 2 (Algorithm G.5-line 7) using the IBM ILOG CPLEX Optimizer as a black-box MILP solver, which is based on an implementation of the Branch and Cut algorithm (IBM 2017). In Step 3, the output of solving $\mathbb{M}$ is retrieved. If $\mathbb{M}$ is feasible, all processes are set to use the optimised parameters once it is solved: i.e. $\tilde{\delta}_i$ (Algorithm 2(a)-line 4 and Algorithm G.5-line 10) and $\tilde{\varepsilon}_i$ (Algorithm G.5-line 12). If $\mathbb{M}$ is infeasible, SONAFD proceeds to $\varepsilon_i$ adaptation instead (Algorithm G.5-line 14), as detailed in Algorithm G.4. Performance and scalability evaluation results are presented in Section 5.
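The control flow of the three steps can be sketched as follows. This is only an illustration of the feasible/infeasible branching: `solve_milp` and `adapt_threshold` are hypothetical callbacks standing in for the solver call and for the NAFD threshold adaptation, and the threshold is derived from the timeout as in equation (4) under an assumed exponential distribution.

```python
import math

def optimise_parameters(solve_milp, adapt_threshold, current_delta):
    """Return (delta, epsilon, source) for one monitored process.

    solve_milp():      hypothetical callback, returns (delta_opt, timeout_opt)
                       or None when the model is infeasible.
    adapt_threshold(): hypothetical callback, returns the adapted epsilon.
    """
    solution = solve_milp()
    if solution is not None:
        delta_opt, timeout_opt = solution
        # Equation (4) with an exponential G whose mean is delta_opt:
        # -log10(exp(-timeout / delta)) = timeout / (delta * ln 10).
        epsilon_opt = timeout_opt / (delta_opt * math.log(10))
        return delta_opt, epsilon_opt, "milp"
    # Infeasible model: keep the heartbeat period, fall back to adaptation.
    return current_delta, adapt_threshold(), "adapted"
```

Keeping the solver behind a callback mirrors the black-box use of the solver described above; any MILP back end could be plugged in without changing the fallback logic.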
Algorithm 2(a) SONAFD (in $p_i$, $\forall i \in [1\,..\,N-1]$):
1: $\delta_i = 10$ ms
2: procedure SENDHEARTBEAT()
3:     while $mod(t_{now}, \delta_i) = 0$ do
4:         if (heartbeat interval update ($\tilde{\delta}_i$)) then $\delta_i = \tilde{\delta}_i$
5:         Send heartbeat

Algorithm 2(b) SONAFD (in $p_N$):
1: $\varepsilon_i = 1$; $T_{mon} = 10$ ms; $T_{opt} = 1200000$ ms
2: for the heartbeat history of each $p_i$, $i \in [1\,..\,N-1]$ do
3:     $\pi_i \leftarrow$ FITHBINTERARRIVALS(heartbeat history of $p_i$)
4: procedure MAIN
5:     if $t_{now} = t_{start}$ then INITIALISE()
6:     else if $mod(t_{now}, T_{mon}) = 0$ then DETECTFAILURE()
7:     else if $mod(t_{now}, T_{opt}) = 0$ then OPTIMISEPARAMETERS()
8:     while heartbeat do
9:         PROCESSRECEIVEDHEARTBEAT($t_{now}$)
4.2.1. Relationships between FD’s QoS metrics and SONAFD parameters
In this section, we provide a detailed analysis of how the heartbeats' period $\delta_i$ and the suspicion threshold $\varepsilon_i$ (and its corresponding timeout $\Delta_{\varepsilon_i}$) impact the considered QoS metrics of the FD. This is important for modelling the constraints of the SONAFD MILP problem (constraints (19) and (23)). Once $\mathbb{M}$ is solved, the optimised $\tilde{\delta}_i$ and $\tilde{\varepsilon}_i$ (i.e. $\tilde{\Delta}_{\varepsilon_i}$) become the new parameter settings of SONAFD.
Failure Detection time: to have rigorous QoS considerations, we adopt the worst-case scenario for the estimation of $T_{D_i}$ by considering the longest failure detection duration. In such a scenario, a crash happens immediately after $p_i$ successfully sends a heartbeat message to $p_N$. Therefore, $T_{last_i}$ of SONAFD is updated with the arrival time of this heartbeat message, provided that it is not lost. Let $t_{D_i}$ be the instant at which SONAFD detects this failure. SONAFD detects such a failure when the current $\varepsilon_{s_i}$ at $t_{D_i}$ exceeds $\varepsilon_i$, i.e. $\varepsilon_{s_i}(t_{D_i}) > \varepsilon_i$: $p_N$ suspects $p_i$ just after the instant at which $\varepsilon_{s_i}(t_{now}) = \varepsilon_i$. We use this inequality to find the boundary of the detection time $T_{D_i}$: when $t_{now} = t_{D_i}$, it implies $\varepsilon_{s_i}(t_{D_i}) > \varepsilon_i$. Replacing $\varepsilon_{s_i}$ by its definition from equation (1), and $\varepsilon_i$ by its definition according to equation (2), with $\tilde{G}_i$ denoting $G_i$ whose mean is replaced by the heartbeat period as in equation (4), we have:

$$-\log_{10}\big(G_i(t_{D_i} - T_{last_i})\big) > -\log_{10}\big(\tilde{G}_i(\Delta_{\varepsilon_i})\big) \;\Rightarrow\; G_i(t_{D_i} - T_{last_i}) < \tilde{G}_i(\Delta_{\varepsilon_i}). \tag{5}$$
Table 2
Notations related to the design of $\mathbb{M}$.

| Notation | Description |
|---|---|
| $\delta_i$ | A decision variable representing the heartbeats' period of process $i$. |
| $\Delta_{\varepsilon_i}$ | A decision variable representing the mean equivalent timeout associated with threshold $\varepsilon_i$. |
| $B$ | Network bandwidth (bits/ms). |
| $M$ | Message/packet size (bits). |
| $O$ | Allowed overhead percentage of transmitted heartbeats. |
| $T_{D_i}$ | Detection time of process $i$ (ms). |
| $T_D^{TS}$ | Tolerated detection time of the FD, or detection time threshold (ms). |
| $P_{A_i}$ | Query accuracy probability of process $i$. |
| $P_{A_i}^{Req}$ | The minimum required query accuracy probability for process $i$ (e.g., 95%). |
| $T_S$ | The time precision used as the discretisation step in the discretisation of the NLP (ms). |
| $F_{UB}$ | A multiplication factor ($> 1$) to set up an upper bound on $\Delta_{\varepsilon_i}$ in the discretisation of the NLP. |
| $T_{opt}$ | Time interval used by SONAFD to trigger its MILP initialisation and solving (ms). |
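For the probability distribution of heartbeats' inter-arrivals in the Amazon Elastic Compute Cloud (EC2), the exponential distribution$^2$ represents a good trade-off for fitting the different monitored processes tested (i.e. virtual machines), with parameter $\Pi_i = \{\mu_i\}$. The distribution choice is discussed in Section 4.3. This means that $G_i(t) = e^{-t/\mu_i}$, where $\mu_i$ is the average of heartbeats' inter-arrivals of process $i$, and that $\tilde{G}_i(t) = e^{-t/\delta_i}$ once the mean is replaced by the heartbeat period. Inequality (5) then gives:

$$e^{-\frac{t_{D_i} - T_{last_i}}{\mu_i}} < e^{-\frac{\Delta_{\varepsilon_i}}{\delta_i}} \;\Rightarrow\; -\frac{t_{D_i} - T_{last_i}}{\mu_i} < -\frac{\Delta_{\varepsilon_i}}{\delta_i} \;\Rightarrow\; t_{D_i} - T_{last_i} > \frac{\mu_i}{\delta_i}\,\Delta_{\varepsilon_i}. \tag{6}$$

As we are considering the worst-case scenario of failure occurrence, let $t_{F_i}$ be the instant at which $p_i$ crashes, which corresponds to the sending time of its last heartbeat before the failure. Then $T_{last_i} = t_{F_i} + D_{F_i}$, where $D_{F_i}$ is the delay of the last heartbeat message sent before the failure. The detection time $T_{D_i}$ is

$$T_{D_i} = t_{D_i} - t_{F_i} > \frac{\mu_i}{\delta_i}\,\Delta_{\varepsilon_i} + D_{F_i}. \tag{7}$$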
Let $m_i^a$ and $m_i^b$ be two heartbeats of $p_i$ with sequence numbers $a$ and $b$, respectively. $m_i^a$ and $m_i^b$ can be either successive heartbeats (i.e. $b = a + 1$) or not (when heartbeats are lost in the network). Let $T_i^a$, $T_i^b$, $A_i^a$, $A_i^b$, $D_i^a$ and $D_i^b$ be the transmit times, the arrival times and the delays of $m_i^a$ and $m_i^b$, respectively. Let $\beta_i^{ab}$, $\gamma_i^{ab}$ and $J_i^{ab}$ be the inter-arrival time, the inter-transmit time and the jitter (packet delay variation) between $m_i^a$ and $m_i^b$, respectively. The jitter represents the delay difference between two successive heartbeats. Then:

$$\begin{aligned}
& A_i^a = T_i^a + D_i^a; \quad A_i^b = T_i^b + D_i^b; \quad J_i^{ab} = D_i^b - D_i^a; \quad \beta_i^{ab} = A_i^b - A_i^a; \quad \gamma_i^{ab} = T_i^b - T_i^a \\
& \Rightarrow \beta_i^{ab} = A_i^b - A_i^a = T_i^b + D_i^b - (T_i^a + D_i^a) = (T_i^b - T_i^a) + D_i^b - D_i^a = \gamma_i^{ab} + J_i^{ab} \\
& \Rightarrow \beta_i^{ab} = \delta_i + J_i^{ab}.
\end{aligned} \tag{8}$$
$^2$ Such a distribution can change within different network environments and conditions. Our proposed K-S fitting process can adapt to such distribution changes and identify alternative best-fitted probability distributions.
Since $p_i$ sends heartbeats at regular intervals, $\gamma_i^{ab} = \delta_i$. Consequently, we have $\mu_i = \mathbb{E}_{ab}(\beta_i^{ab}) = \mathbb{E}_{ab}(\delta_i + J_i^{ab}) \Rightarrow \mu_i = \delta_i + \mathbb{E}_{ab}(J_i^{ab}) = \delta_i + J_i$, where $J_i$ is the average jitter collected for $p_i$ heartbeats:

$$\mu_i = \delta_i + J_i \;\Rightarrow\; T_{D_i} > \frac{\delta_i + J_i}{\delta_i} \times \Delta_{\varepsilon_i} + D_{F_i}. \tag{9}$$

Generally, the average jitter $J_i$ is small compared to $\delta_i$ and can be neglected in equation (9): $J_i = o(\delta_i) \Rightarrow \frac{\delta_i + J_i}{\delta_i} \approx 1$.
The average detection time (in the worst case) for $p_i$ is therefore estimated as $T_{D_i} > \Delta_{\varepsilon_i} + D_{F_i}$. Since $D_{F_i}$ is the delay of the last heartbeat sent before the failure, it depends on the current network delay. Like any heartbeat sent over the network, it undergoes a random network delay, which can be estimated by the average network delay. We replace $D_{F_i}$ by $D_i$ to have a more general formula:

$$T_{D_i} > \Delta_{\varepsilon_i} + D_i. \tag{10}$$

Equation (10) describes the direct relation between the SONAFD timeout $\Delta_{\varepsilon_i}$ and its worst-case $T_{D_i}$. Since the timeout is related to the heartbeat interval (see constraints (20) and (21) in Section 4.2.2), a faster detection (i.e. a smaller $T_{D_i}$) corresponds to a shorter $\delta_i$ (high heartbeat frequency) and a shorter $\Delta_{\varepsilon_i}$ (i.e. a smaller $\varepsilon_i$).
Mistake Rate: the mistake rate $\lambda_{M_i}$ corresponds to the frequency at which mistakes occur, which is the inverse of the average mistake recurrence time $\mathbb{E}(T_{MR_i})$. $\mathbb{E}(T_{MR_i})$ is the average time between two consecutive mistakes. Their relationship is given in equation (11):

$$\lambda_{M_i} = \frac{1}{\mathbb{E}(T_{MR_i})}. \tag{11}$$

Following Chen et al. (2002), the average mistake recurrence time is expressed as $\mathbb{E}(T_{MR_i}) = \frac{\delta_i}{P_{s_i}}$, where $P_{s_i}$ is the probability that $p_N$ suspects $p_i$ (i.e. that an S-Transition occurs). For SONAFD, $P_{s_i}$ is equivalent to the probability that a heartbeat will arrive more than $\Delta_{\varepsilon_i}$ time units later than the previous heartbeat. $P_{s_i}$ is given by $P_{s_i} = G_i(\Delta_{\varepsilon_i})$. Therefore, the mistake rate is defined in the following equation (12):

$$\lambda_{M_i} = \frac{G_i(\Delta_{\varepsilon_i})}{\delta_i}. \tag{12}$$
Query Accuracy Probability: let $\mathbb{E}(T_{M_i})$ be the average mistake duration. The query accuracy probability $P_{A_i}$ is defined by equation (13):

$$P_{A_i} = 1 - \frac{\mathbb{E}(T_{M_i})}{\mathbb{E}(T_{MR_i})}. \tag{13}$$

According to Chen et al. (2002), the average mistake duration is expressed as $\mathbb{E}(T_{M_i}) = \frac{\delta_i}{q_{0_i}}$, where $q_{0_i}$ is the probability that, for any $k \geq 2$, $p_N$ receives heartbeat $m_i^{k-1}$ before time $t + \Delta_{to_i}^{k}$ ($\Delta_{to_i}^{k}$ is equivalent to $\Delta_{\varepsilon_i}$, which is the timeout corresponding to message $k$). This means that $q_{0_i}$ is equivalent to the probability that a heartbeat will arrive less than $\Delta_{to_i}^{k}$ time units later than the previous heartbeat, i.e. $q_{0_i} = 1 - G_i(\Delta_{to_i}^{k}) = 1 - G_i(\Delta_{\varepsilon_i})$. Therefore, $\mathbb{E}(T_{M_i})$ is as follows:

$$\mathbb{E}(T_{M_i}) = \frac{\delta_i}{1 - G_i(\Delta_{\varepsilon_i})}. \tag{14}$$

By combining equations (12) and (14), $P_{A_i}$ is expressed in equation (15) and used in constraint (19):

$$P_{A_i} = 1 - \frac{G_i(\Delta_{\varepsilon_i})}{1 - G_i(\Delta_{\varepsilon_i})}. \tag{15}$$
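Equations (12) and (15) can be evaluated numerically with a small sketch, assuming the exponential form $G_i(t) = e^{-t/\mu}$ used for EC2. The function names are hypothetical, and the numerical values in the usage note are illustrative rather than taken from the evaluation.

```python
import math

def tail_probability(t, mean):
    """G(t) for an exponential inter-arrival distribution (assumption)."""
    return math.exp(-t / mean)

def mistake_rate(timeout, delta, mean):
    """Equation (12): lambda_M = G(timeout) / delta, in mistakes per ms."""
    return tail_probability(timeout, mean) / delta

def query_accuracy(timeout, mean):
    """Equation (15): P_A = 1 - G / (1 - G)."""
    g = tail_probability(timeout, mean)
    return 1.0 - g / (1.0 - g)
```

With a mean of 10 ms, a timeout of $10 \ln 21 \approx 30.4$ ms gives $G = 1/21$ and hence $P_A = 0.95$. Lengthening the timeout raises $P_A$ and lowers the mistake rate, at the cost of a longer worst-case detection time per equation (10).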
4.2.2. MILP Optimisation Modelling
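The decision variables of the NLP are $\delta_i$ and $\Delta_{\varepsilon_i}$, for each process $i$, $\forall i \in [1\,..\,N-1]$. These variables represent the respective failure detection parameters (regarding each monitored process $i$) that should be set in SONAFD to obtain better QoS, and hence should be strictly positive real numbers. We model the optimisation of SONAFD's parameters as an optimisation problem whose objective is to maximise the expected query accuracy probability $\mathbb{E}(P_A)$. The NLP is formulated as follows:

$$Max\; \big(\mathbb{E}(P_A)\big) \tag{16}$$

Subject to (S.t.)

$$\forall i \in [1\,..\,N-1] \quad T_{D_i} \leq T_D^{TS} \tag{17}$$

$$\mathbb{E}(P_A) = \mathbb{E}(P_{A_i})_{i \in [1\,..\,N-1]} = \frac{1}{N-1} \sum_{i=1}^{N-1} P_{A_i} \tag{18}$$

$$\forall i \in [1\,..\,N-1] \quad P_{A_i} = 1 - \frac{G_i(\Delta_{\varepsilon_i})}{1 - G_i(\Delta_{\varepsilon_i})} \tag{19}$$

$$\forall i \in [1\,..\,N-1] \quad \Delta_{\varepsilon_i} - K_i\delta_i \geq D_i + 3S_i \tag{20}$$

$$\forall i \in [1\,..\,N-1] \quad \Delta_{\varepsilon_i} + \delta_i \ln\left(\frac{1 - P_{A_i}^{Req}}{2 - P_{A_i}^{Req}}\right) \geq 0 \tag{21}$$

$$\sum_{i=1}^{N-1} \frac{1}{\delta_i} \leq O \times \frac{B}{M} \tag{22}$$

$$\forall i \in [1\,..\,N-1] \quad \Delta_{\varepsilon_i} \leq T_{D_i} - D_i \tag{23}$$

$$\forall i \in [1\,..\,N-1] \quad P_{A_i} \geq 0 \tag{24}$$

$$\forall i \in [1\,..\,N-1] \quad \delta_i > 0 \tag{25}$$

$$\forall i \in [1\,..\,N-1] \quad \Delta_{\varepsilon_i} > 0. \tag{26}$$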
Constraint (17) permits us to upper-bound each process's $T_{D_i}$ by $T_D^{TS}$ (e.g., $T_D^{TS} = 3000$ ms). This allows us to model SONAFD's failure detection speed as a constraint. By setting $T_{D_i} \leq T_D^{TS}$ for each process $i$, it is guaranteed that its new $\delta_i$ and $\Delta_{\varepsilon_i}$ will not generate a $T_{D_i}$ higher than $T_D^{TS}$. Since $T_D = \mathbb{E}(T_{D_i})_{i \in [1\,..\,N-1]}$, it is guaranteed that $T_D$ will not exceed the selected threshold $T_D^{TS}$. A lower $T_D^{TS}$ would enable a faster failure detection. However, a particularly small $T_D^{TS}$ may constrain the model too much and lead to infeasibility.
Constraint (18) allows us to compute $\mathbb{E}(P_A)$ as the expected value of $P_{A_i}$ obtained for each monitored process $p_i$, $\forall i \in [1\,..\,N-1]$.
Constraint (19) defines $P_{A_i}$ for each $p_i$, $\forall i \in [1\,..\,N-1]$. $P_{A_i}$ is obtained from equation (15) applied to process $i$: details are provided in Section 4.2.1.
Constraint (20) states that, for each $p_i$, $\forall i \in [1\,..\,N-1]$, the timeout $\Delta_{\varepsilon_i}$ associated with threshold $\varepsilon_i$ should be greater than the expected transmission time of the heartbeat message, so that SONAFD tolerates network delays adequately (see Section 4.1 for NAFD's suspicion threshold adjustment). $\Delta_{\varepsilon_i}$ is adopted to tolerate network delays and losses. The lower bound of $\Delta_{\varepsilon_i}$ is set to the same value as in NAFD, obtained from equation (D.1) (see Section 4.1 for more details). In summary, for each $p_i$, $\forall i \in [1\,..\,N-1]$, its optimised $\Delta_{\varepsilon_i}$ should take into account its average delay and three times its delay standard deviation (to cover high delay values). This constraint is obtained from Chebyshev's inequality applied to delays as random variables, for any given probability distribution: at least 88.8889%$^3$ of delays are within $3 \times S_i$ of the average $D_i$, which guarantees that this fraction of random delays is covered when building the decision timeout $\Delta_{\varepsilon_i}$.
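The two coverage figures associated with a $3 \times S_i$ margin can be checked with a short sketch. The Chebyshev bound $1 - 1/k^2$ holds for any distribution with finite variance, while the normal figure follows from the error function; the function names are hypothetical.

```python
import math

def chebyshev_coverage(k):
    """Distribution-free lower bound on P(|X - mean| < k * std)."""
    return 1.0 - 1.0 / (k * k)

def normal_coverage(k):
    """Exact P(|X - mean| < k * std) when X is normally distributed."""
    return math.erf(k / math.sqrt(2.0))
```

For $k = 3$, the first function returns $8/9 \approx 0.8889$ and the second approximately $0.9973$.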
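Constraint (21) defines the minimum ratio between $\Delta_{\varepsilon_i}$ and $\delta_i$ needed to satisfy a minimum $P_{A_i}^{Req}$. This inequality is obtained from constraint (19) by requiring $P_{A_i}$ to be at least a numerical value $P_{A_i}^{Req}$ (e.g., $P_{A_i}^{Req} = 95\%$). Constraint (21) is obtained by solving for $\Delta_{\varepsilon_i}$, which gives the last inequality in (27):

$$\begin{aligned}
P_{A_i} \geq P_{A_i}^{Req} &\Rightarrow 1 - \frac{G_i(\Delta_{\varepsilon_i})}{1 - G_i(\Delta_{\varepsilon_i})} \geq P_{A_i}^{Req} \Rightarrow 1 - \frac{e^{-\Delta_{\varepsilon_i}/\delta_i}}{1 - e^{-\Delta_{\varepsilon_i}/\delta_i}} \geq P_{A_i}^{Req} \\
&\Rightarrow \frac{\Delta_{\varepsilon_i}}{\delta_i} \geq -\ln\left(\frac{1 - P_{A_i}^{Req}}{2 - P_{A_i}^{Req}}\right) \Rightarrow \Delta_{\varepsilon_i} + \delta_i \ln\left(\frac{1 - P_{A_i}^{Req}}{2 - P_{A_i}^{Req}}\right) \geq 0.
\end{aligned} \tag{27}$$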
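As a quick numerical illustration of constraint (21), consider $P_{A_i}^{Req} = 0.95$ (the example value quoted above) together with an assumed heartbeat period $\delta_i = 10$ ms, which is chosen here for illustration only:

```latex
\frac{\Delta_{\varepsilon_i}}{\delta_i} \;\geq\; -\ln\!\left(\frac{1 - 0.95}{2 - 0.95}\right) = \ln 21 \approx 3.04
\quad\Rightarrow\quad
\Delta_{\varepsilon_i} \;\gtrsim\; 30.4 \text{ ms} \quad \text{for } \delta_i = 10 \text{ ms}.
```

Under the same assumption, raising the requirement to $P_{A_i}^{Req} = 0.99$ gives a ratio of $\ln 101 \approx 4.62$, so a stricter accuracy requirement translates into a timeout spanning more heartbeat periods.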
Constraint (22) represents the limit of heartbeats' messages overhead in a given network. If each $p_i$, $\forall i \in [1\,..\,N-1]$, sends heartbeat messages to $p_N$ at the time interval $\delta_i$, then $p_N$ would receive $nb = \sum_{i=1}^{N-1} \frac{1}{\delta_i}$ messages per time unit if there are no message losses. Assuming that these messages have the same size $M$, if $p_N$ has a bandwidth $B$, it would be possible to transmit $\frac{B}{M}$ messages of size $M$ per time unit. In a real-world network environment, the whole bandwidth cannot be dedicated to heartbeat messages only. Therefore, only a reasonably small percentage of bandwidth usage should be allowed for heartbeat messages. Such a requirement can be ensured by applying a percentage $O$ to the total number of possible transmissions $\frac{B}{M}$, which means that $nb \leq O \times \frac{B}{M}$. To simplify the implementation, we assume that all monitored processes have the same $\delta_i = \delta$, which is commonly used in practice. The left side of constraint (22) becomes $\sum_{i=1}^{N-1} \frac{1}{\delta_i} = \sum_{i=1}^{N-1} \frac{1}{\delta} = \frac{N-1}{\delta} = \frac{N-1}{\delta_i}$. Constraint (22) becomes

$$\forall i \in [1\,..\,N-1] \quad \frac{O\,B}{M\,(N-1)}\,\delta_i \geq 1. \tag{28}$$
Constraint (23) represents $T_{D_i}$ computed for each process $i$, $\forall i \in [1\,..\,N-1]$, as the sum of the average delay $D_i$ and the timeout $\Delta_{\varepsilon_i}$. $T_{D_i}$ considers the worst-case scenario, in which process $i$ crashes just after sending a heartbeat message (see Section 4.2.1 for details).
Constraint (24) ensures that $P_{A_i}$ is a non-negative variable.
$^3$ If delays are known to follow a normal distribution, the interval of three standard deviations around the mean covers 99.73% of the possible values of such a random variable.
Constraints (25) and (26) ensure that SONAFD is functional: constraint (25) imposes a strictly positive $\delta_i$ so that all monitored processes send regular heartbeat messages. Constraint (26) allows for tolerating waiting time for heartbeat transmission by setting a strictly positive $\Delta_{\varepsilon_i}$$^4$.
Due to the objective function (16) and constraints (19) and (22), our proposed model is a nonlinear problem. The (indirectly) nonlinear objective function and nonlinear constraints are crucial for representing the application properly and mathematically (Rebennack 2016b). We adopt piecewise linearisation to convert our proposed NLP model to a Mixed Integer Linear Problem. An MILP is much easier to solve, and piecewise linearisation is frequently used in various applications to approximate nonlinearities with linear functions (Geißler et al. 2011, 2015; Rebennack and Krasko 2019; Vielma et al. 2010a). There is a mature body of work that provides proofs of, and methods for, applying piecewise linear functions to simplify nonlinear problems to linear ones and control their solving (Burlacu et al. 2020; Geißler et al. 2012, 2015; Rebennack and Krasko 2019; Vielma et al. 2010b). Benders' decomposition (Rebennack 2016a; Steeger and Rebennack 2017) could be adopted for an advanced level of simplification but is outside the scope of this paper. The details of the transformation of the NLP to an MILP (i.e. $\mathbb{M}$) are presented in Appendix F. Constraint (19) is finally replaced by

$$\forall i \in [1\,..\,N-1] \quad 2 \times \sum_{j=1}^{m_i} X_{i,j}\, G_i(\Delta_{min_i} + jT_S) - \sum_{j=1}^{m_i} Z_{i,j}\, G_i(\Delta_{min_i} + jT_S) + P_{A_i} = 1 \tag{29}$$

$$\forall i \in [1\,..\,N-1] \quad \sum_{j=1}^{m_i} X_{i,j} = 1 \tag{30}$$

$$\forall i \in [1\,..\,N-1],\ \forall j \in [1\,..\,m_i] \quad Z_{i,j} - X_{i,j} \leq 0 \tag{31}$$

$$\forall i \in [1\,..\,N-1],\ \forall j \in [1\,..\,m_i] \quad Z_{i,j} \geq 0 \tag{32}$$

$$\forall i \in [1\,..\,N-1],\ \forall j \in [1\,..\,m_i] \quad Z_{i,j} - P_{A_i} \leq 0 \tag{33}$$

$$\forall i \in [1\,..\,N-1],\ \forall j \in [1\,..\,m_i] \quad P_{A_i} + X_{i,j} - Z_{i,j} \leq 1 \tag{34}$$

$$\forall i \in [1\,..\,N-1] \quad \Delta_{\varepsilon_i} - \sum_{j=1}^{m_i} X_{i,j}\,(\Delta_{min_i} + jT_S) = 0, \tag{35}$$

where $m_i$ is the number of discrete values as obtained from Appendix F-equation (F.3), $X_{i,j}$ is a binary variable defined in Appendix F-equation (F.5), $Z_{i,j}$ is a real variable as in Appendix F-equation (F.11) and $\Delta_{min_i} = D_i + 3S_i + K_i\delta_i$. The discretised $G_i(t)$, i.e. $G_i(\Delta_{min_i} + jT_S)$, is also defined in Appendix F-equation (F.6). Constraints (31) to (34) force $Z_{i,j}$ to equal the product $X_{i,j} \times P_{A_i}$, so that (29) is the linear counterpart of (19) evaluated at the selected grid point.
Finally, the formulation of $\mathbb{M}$ that we implement in CPLEX has the objective function (16), subject to constraints (17), (18), (20), (21), (23), (24), (25), (26), (28), (29), (30), (31), (32), (33), (34) and (35).
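The discretisation behind constraints (29) to (35) can be illustrated for a single process with a fixed heartbeat period by enumerating the grid $\Delta_{min_i} + jT_S$ directly. This brute-force sketch is not the branch-and-cut solve performed by CPLEX, nor the greedy heuristic; it only shows what selecting one grid point (the role of the one-hot $X_{i,j}$) means. It assumes an exponential $G_i$ with mean $\delta_i$ and bounds the timeout by $T_D^{TS} - D_i$, as implied by constraints (17) and (23). All names are hypothetical.

```python
import math

def best_timeout_on_grid(delta, d_mean, d_std, k_lost, t_ts, t_step, p_req):
    """Return (timeout, P_A) for the best feasible grid point, or None."""
    timeout_min = d_mean + 3 * d_std + k_lost * delta  # Delta_min, constraint (20)
    timeout_max = t_ts - d_mean                        # from (17) and (23)
    best = None
    j = 1
    while timeout_min + j * t_step <= timeout_max:
        timeout = timeout_min + j * t_step             # grid point j
        g = math.exp(-timeout / delta)                 # discretised G value
        p_a = 1.0 - g / (1.0 - g)                      # equation (15)
        if p_a >= p_req and (best is None or p_a > best[1]):
            best = (timeout, p_a)
        j += 1
    return best
```

Because $P_{A_i}$ increases with the timeout, the enumeration ends up selecting the largest grid point allowed by the detection time bound; when the bound leaves no grid point, the function returns `None`, which corresponds to the infeasible case handled by threshold adaptation.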
4.3. Heartbeats’ inter-arrivals’ Distribution
NAFD uses the heartbeats' inter-arrivals' distributions to compute the suspicion levels $\varepsilon_{s_i}$ and to adapt its suspicion thresholds $\varepsilon_i$ for each of its monitored processes (as in lines 6-7 in Algorithm 1(b)). SONAFD additionally uses these distributions in the modelling of $\mathbb{M}$ and to adapt its suspicion thresholds when $\mathbb{M}$ is infeasible. Constraint (19), which defines $P_{A_i}$ for each monitored process, is expressed as a function of the heartbeats' inter-arrivals CDF. This CDF needs to be identified, and updated while SONAFD is running. For this purpose, we collect heartbeats' inter-arrivals and perform a probability distribution fitting on these data (see Appendix C), particularly the first time that NAFD/SONAFD is deployed in a real-world system. As we consider that the monitored processes are connected in a star topology to the monitoring process, complex network topology considerations are eliminated. For example, if some processes shared the same physical infrastructure, it would be necessary to consider their joint heartbeats' inter-arrivals' probability distribution (Bisnik and Abouzeid 2009), which is out of the scope of this paper.
$^4$ The CPLEX solver adopted in this paper (i.e. in our implementation) does not allow us to insert strict inequalities (i.e. ">"), and hence replaces them by non-strict ones (i.e. "≥"). For our proposed approach, we retain the solver transformation since constraints (22) and (20) guarantee that $\delta_i$ and $\Delta_{\varepsilon_i}$ are non-null, respectively, with strictly positive system parameters (message size, delay, etc.).
We also consider that heartbeats' inter-arrivals follow an exponential distribution, which implies that heartbeats' arrivals are assumed to form a Poisson process (Ross 1996). Packets' arrivals are often modelled as Poisson point processes (PPP). Early traffic models were motivated by telephony, where calls are assumed to be independent and identically distributed, and hence their holding times are exponential. Poisson traffic models have been widely adopted in the network analysis and performance evaluation of different applications (Chun Chung Chan and Hanly 2001; Karagiannis et al. 2004; Kheirkhah et al. 2019; Kirichek et al. 2016; Sun et al. 2019; Yu et al. 2006). PPP are characterised by the following important analytical properties: 1) the numbers of arrivals in disjoint intervals are statistically independent, i.e. the process is memoryless; 2) the superposition of independent Poisson processes with specific rates results in a new Poisson process whose rate is the sum of those rates; 3) the number of arrivals has a mean and a variance that are both equal to its parameter, and the exponential inter-arrival times have a unit coefficient of variation; and 4) under specific conditions, a multiplexing of independent traffic streams is approximately a Poisson process according to the Palm-Khintchine theorem (Heyman and Sobel 2004).
Although Poisson packet arrival models were widely adopted, their suitability is contested, as many researchers believe they do not sufficiently mimic real-world traffic data in modern networks: e.g., LAN, MAN, WAN (Paxson and Floyd 1994). These networks experience batch and/or correlated packet arrivals and traffic burstiness, which are important factors to consider in the modelling of their traffic (Becchi 2008). The Poisson model is considered unable to capture these elements, specifically the aspects related to traffic burstiness (Karagiannis et al. 2004). This has oriented the research community towards long-range dependence (LRD), self-similarity (i.e. fractional) and bursty (i.e. Markov, Renewal, Autoregressive) models.
Despite the proliferation of such models, there has been a switch back to the reconsideration of PPP for internet traffic. The former analysis of network traffic suffers from a lack of accuracy and robustness (Karagiannis et al. 2004). In fact, many research works have shown that the assumption of a Poisson distribution, or a derived version of it, is in accordance with real internet packet arrivals (Cao et al. 2003; Karagiannis et al. 2004; Sukhov et al. 2016; Yu et al. 2006). More specifically, at the edges of the internet (i.e. closer to end users' devices), links generally have low speeds and their capacity cannot be upgraded easily, which leads to continuous bursty traffic; however, the internet core benefits from high-speed links and a considerable number of connections, which leads to the absence of burstiness: the packet arrivals are hence close to independent and Poisson.
In summary, characterising modern network traffic is highly complex, as it is constantly evolving and extremely dynamic; identifying its features cannot be solved once and for all. Nevertheless, the Poisson assumption represents an attractive way to solve the failure detection performance problem: it has the advantage of analytic simplicity and can be fairly valid in many distributed systems. It also helps us to keep the complexity of our MILP manageable when solving the performance-aware failure detection problem. In this paper, we focus on providing a framework solution for QoS-aware failure detection. Hence, we concentrate on a Proof-of-Concept of SONAFD with a generalised network topology and a realistic probability distribution of heartbeats’ inter-arrivals. Poisson packet arrivals represent a valuable trade-off in designing our FD and proving its efficiency. Furthermore, to tolerate possible errors in the distribution assumptions, we applied Chebyshev’s inequality to our constraints by adding three times the standard deviation; the coverage of random values is then at least 88.89% and reaches 99.73% if the distribution is normal (as suggested by the central limit theorem). We believe that such an approach achieves our goal of adopting a reasonably good fit of probability distributions, balancing the complexity of the model and computation costs, while being able to tolerate potential distribution and parameter estimation errors. Our evaluation results (see
Er-Rahmadi and Ma: Preprint submitted to Elsevier Page 17 of 28
Optimised Failure Detection Algorithms
Section 5.2) have shown that our adopted approach is satisfactory.
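The safety margin discussed above can be checked numerically. The following is a minimal sketch, assuming a multiplier of k = 3 standard deviations (the value consistent with the 88.89% lower bound); the function names are illustrative and are not part of NAFD/SONAFD.

```python
import math


def chebyshev_coverage(k: float) -> float:
    """Distribution-free lower bound on P(|X - mean| < k * std), for k > 1."""
    return 1.0 - 1.0 / (k * k)


def normal_coverage(k: float) -> float:
    """Exact P(|X - mean| < k * std) when X is normally distributed."""
    return math.erf(k / math.sqrt(2.0))


def padded_bound(mean: float, std: float, k: float = 3.0) -> float:
    """Safety margin: the mean plus k standard deviations."""
    return mean + k * std


if __name__ == "__main__":
    k = 3.0  # assumed multiplier
    print(f"Chebyshev lower bound: {chebyshev_coverage(k):.4f}")
    print(f"Normal coverage:       {normal_coverage(k):.4f}")
```

The Chebyshev value holds for any distribution with finite variance, which is why it can absorb a wrong distribution assumption; the normal value only applies if the inter-arrivals are actually close to normal.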
4.4. The sizes of heartbeats’ history and probability distribution fitting sampling
We recall that $W = W_i$ represents the size of the sample of heartbeats’ inter-arrivals, measured as a number of inter-arrivals. An FD retains the last $W_i$ inter-arrivals to check the state of $p_i$. This means that $p_N$ computes the suspicion level $\varepsilon_{s_i}$ based on these $W_i$ inter-arrivals. Let $F$ represent the sample size of heartbeats’ inter-arrivals used to perform the probability distribution fitting, measured as a time duration. In our experiments, $W = 1000$, similar to related works such as Chen et al. (2002), Hayashibara et al. (2004) and Satzger et al. (2007). $F$ equals a one-day time window. Based on numerous sets of NAFD/SONAFD experiments in the cloud environment, we observed that one day is a good setting, as it covers one daily cycle.
The size of $F$ clearly has a meaningful impact on the quality of the probability distribution fitting, and hence on the performance of NAFD/SONAFD. In general, the larger $F$ is, the more accurate the probability distribution fitting is. However, network conditions may change faster than $F$ elapses and, consequently, the best fitted probability distribution might become obsolete. It is therefore necessary to choose $F$ carefully so that it provides a probability distribution fitting that is both satisfactorily accurate and up to date. Meanwhile, the main focus of the SONAFD design is to make it driven by performance requirements under an optimal policy, even if this requires making assumptions on, or fixing, specific parameters like $F$. The assumed value of $F$ may not suit all systems, but the overall performance optimisation objectives are still satisfactorily met.
To compensate for this assumption, the parameters of the fitted probability distribution are recomputed each time a new heartbeat arrives (based on the most recent $W$ inter-arrivals). This keeps them up to date with recent heartbeat arrivals, and hence enables a fast response to changes in network conditions. In addition, we have incorporated Chebyshev’s inequality into the design of NAFD/SONAFD, in order to tolerate distribution assumption errors in its detection accuracy/speed. We believe that all these considerations help to build a fairly solid solution, which strikes a balance between developing the SONAFD MILP model and opening up future work directions.
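The per-heartbeat update described above can be sketched as follows. This is an illustrative sketch only: the class and method names are hypothetical, and the mean and standard deviation stand in for whichever parameters the fitted distribution actually requires.

```python
from collections import deque
from statistics import fmean, pstdev
from typing import Optional, Tuple


class InterArrivalWindow:
    """Keeps the last `window_size` heartbeat inter-arrivals (W in the text)."""

    def __init__(self, window_size: int = 1000) -> None:
        # A bounded deque silently drops the oldest gap once it is full.
        self._gaps = deque(maxlen=window_size)
        self._last_arrival: Optional[float] = None

    def on_heartbeat(self, arrival_time: float) -> Optional[Tuple[float, float]]:
        """Record a heartbeat and return the refreshed (mean, std) estimates."""
        if self._last_arrival is not None:
            self._gaps.append(arrival_time - self._last_arrival)
        self._last_arrival = arrival_time
        return self.parameters()

    def parameters(self) -> Optional[Tuple[float, float]]:
        """Mean and population standard deviation of the retained gaps."""
        if not self._gaps:
            return None  # at least two heartbeats are needed for one gap
        return fmean(self._gaps), pstdev(self._gaps)


if __name__ == "__main__":
    window = InterArrivalWindow(window_size=3)
    for t in (0.0, 1.0, 2.0, 3.0, 5.0):
        print(t, window.on_heartbeat(t))
```

Recomputing from a bounded window costs O(W) per heartbeat; an incremental running mean and variance would be cheaper, at the price of slightly more bookkeeping.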
In future work, $F$ may be set dynamically, for instance with a method similar to the dynamic pricing of Cheung et al. (2017), and be driven by changing conditions (Cai and Hames 2010; Fields et al. 2021; Kutzner et al. 2017). For example, Cheung et al. (2017) use a dynamic pricing model to minimise worst-case regret with unknown demand functions through learning phases. These phases have different sample sizes, which makes the overall pricing model more flexible towards demand functions. An improved SONAFD may consider a similar approach: e.g., setting the sizes of dynamic samples based on the patterns of network conditions.
4.5. Heuristic Algorithm
If the size of $\mathbb{M}$ grows (e.g., $N$ increases), it becomes computationally expensive or even infeasible to solve $\mathbb{M}$. Therefore, to address this computational challenge and improve the scalability of SONAFD, we also design a greedy heuristic algorithm, which specifically aims to solve the objective function and constraints in $\mathbb{M}$. Figure 2 introduces the main steps of the heuristic. The main idea is to bound the decision variables ($\delta_i$ and $\Delta_{\varepsilon_i}$) using constraints from $\mathbb{M}$. The heuristic starts at the lower bounds of both $\delta_i$ and $\Delta_{\varepsilon_i}$, then evaluates the SONAFD $P_A$ and compares it to $P_A^{target}$. The algorithm conducts small/unitary increments of $\Delta_{\varepsilon_i}$ for each $\delta_i$, until $P_A$ reaches $P_A^{target}$. If $P_A^{target}$ is not reached when $\Delta_{\varepsilon_i} = \Delta_{\varepsilon_i}^{upper}$, then $\delta_i$ is incremented and $\Delta_{\varepsilon_i}$ is reset to its lower bound. The algorithm stops when $P_A^{target}$ is reached or when all possible values of $\delta_i$ and $\Delta_{\varepsilon_i}$ have been evaluated. The evaluation of our heuristic algorithm is presented in Section 5.4.
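The search described above can be sketched as two nested loops. This is a minimal sketch under stated assumptions: the `accuracy` callback stands in for the SONAFD evaluation of the accuracy probability, which is not reproduced here, and the bounds, the step and the toy accuracy function in the usage part are hypothetical values chosen only to make the example runnable.

```python
import math
from typing import Callable, Optional, Tuple


def greedy_search(
    delta_lo: int,
    delta_hi: int,
    eps_lo: int,
    eps_hi: int,
    step: int,
    accuracy: Callable[[int, int], float],
    target: float,
) -> Optional[Tuple[int, int]]:
    """Return the first (delta, eps) pair whose accuracy reaches `target`.

    For each delta, eps is swept from its lower to its upper bound; when the
    upper bound is hit without success, delta is incremented and eps restarts
    from its lower bound. None means that no pair within the bounds works.
    """
    delta = delta_lo
    while delta <= delta_hi:
        eps = eps_lo
        while eps <= eps_hi:
            if accuracy(delta, eps) >= target:
                return delta, eps  # stop as early as the target is met
            eps += step
        delta += step
    return None


def toy_accuracy(delta: int, eps: int) -> float:
    """Hypothetical accuracy model: grows with the total waiting time."""
    return 1.0 - math.exp(-(delta + eps) / 100.0)


if __name__ == "__main__":
    print(greedy_search(100, 200, 0, 500, 10, toy_accuracy, target=0.95))
    print(greedy_search(100, 200, 0, 500, 10, toy_accuracy, target=0.9999))
```

Because the search stops at the first pair that meets the target, it favours small parameter values; it does not attempt to maximise accuracy beyond the target.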
5. NAFD and SONAFD Performance Evaluation
We aim to validate the QoS enhancement achieved by the proposed SONAFD and its $\mathbb{M}$. Most previous works evaluate such FD algorithms either in a simple two-process environment (Bertier et al. 2002; Chen et al. 2002), or over a wide area network (Hayashibara et al. 2004; Satzger et al. 2011), or simply in a local network (Liu et al. 2017). To achieve rigorous and robust evaluation results, we evaluate our approaches in a real-world application environment (Amazon EC2) subject to its own real-time constraints. The real-world constraints of interest are packet transmission delays and losses in the cloud network. This allows our proposed approach to narrow the gap between theoretical design and practical application and hence demonstrate better failure detection performance in real-world systems.
Figure 2: Flowchart of the greedy heuristic algorithm (initialise parameters, compute the bounds of the decision variables, then two nested while loops ending in either "Get Solutions" or "No solution").
To tackle the challenge of not being able to alter network conditions in a real-world network environment, we also design and implement a simulator based on CloudSim, which was originally developed by Melbourne University (2018). This allows us to control network conditions and test the robustness of our approaches when network conditions change. To mimic real-world failures, we have also designed and implemented a failure controller, which can randomly fail/stop a machine/process following a given probability distribution (e.g., normal distribution). Together, these test environments allow us to rigorously test the performance of our proposed FD algorithm and optimisation approach.
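A failure controller of the kind described above can be sketched as a schedule generator. This is an illustrative sketch, not the authors' implementation: the function name, the use of normally distributed gaps and downtimes, and all numeric values are assumptions.

```python
import random
from typing import List, Tuple


def failure_schedule(
    horizon_s: float,
    mean_gap_s: float,
    std_gap_s: float,
    mean_down_s: float,
    std_down_s: float,
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Draw (fail_time, recover_time) pairs for one process.

    Gaps between failures and downtimes are sampled from normal
    distributions and clamped to at least one second so that the
    schedule always moves forward in time.
    """
    rng = random.Random(seed)  # seeded for reproducible experiments
    schedule: List[Tuple[float, float]] = []
    now = 0.0
    while True:
        gap = max(1.0, rng.gauss(mean_gap_s, std_gap_s))
        down = max(1.0, rng.gauss(mean_down_s, std_down_s))
        fail_time = now + gap
        recover_time = fail_time + down
        if recover_time > horizon_s:
            break  # keep every failure fully inside the test duration
        schedule.append((fail_time, recover_time))
        now = recover_time
    return schedule


if __name__ == "__main__":
    # Hypothetical setting: roughly one failure per hour, about 900 s each.
    for fail, recover in failure_schedule(86400.0, 3600.0, 600.0, 900.0, 120.0):
        print(f"fail at {fail:8.1f} s, recover at {recover:8.1f} s")
```

Swapping `rng.gauss` for `rng.expovariate` would give exponentially distributed gaps instead; the rest of the loop would stay the same.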
5.1. Experiment Setup
To benchmark the QoS enhancement provided by NAFD and SONAFD, we compare them with one of the most representative failure detectors: the Phi detector introduced by Hayashibara et al. (2004). It has been used in many real systems such as Akka and OpenDayLight (Akka 2018) and has been benchmarked in a number of studies (Liu et al. 2017; Satzger et al. 2007, 2011; Tomsic et al. 2015; Xiong et al. 2012). Moreover, numerous research papers (see Chan et al. 2015; Fang et al. 2016; Yang and Wang 2016) highlight the flexibility of the Phi detector compared to other FDs. In our performance evaluation, we consider a modified version of the Phi detector. We adjust the Phi detector threshold as regularly as NAFD/SONAFD, and use the same average timeout as NAFD to compute its adjusted threshold. We call it Adaptive Phi. The idea is to achieve a fair comparison between Phi and NAFD/SONAFD in terms of adapting the decision parameter to network conditions. We implement these three failure detectors in the Amazon on-demand cloud service (EC2). It allows us to manage a large number of distributed cloud virtual machines (VMs) and configure them with custom parameters. Amazon EC2 provides one of the most widely used cloud computing services. Its network environments are dynamic and depend on realistic connections between VMs.
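For readers unfamiliar with the baseline, the suspicion level of the Phi accrual detector of Hayashibara et al. (2004) can be sketched as below, using a normal model of the inter-arrivals. The mapping from a timeout to a threshold in `threshold_from_timeout` is only one plausible reading of the Adaptive Phi variant and should be treated as an assumption, as should all function names.

```python
import math


def phi(elapsed: float, mean: float, std: float) -> float:
    """Suspicion level after `elapsed` time units without a heartbeat.

    phi = -log10(P_later), where P_later is the probability, under a normal
    model of inter-arrivals, that the next heartbeat arrives even later.
    """
    # Upper tail of the normal distribution, computed with erfc for accuracy.
    p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2.0)))
    p_later = max(p_later, 1e-300)  # guard against log10(0) far in the tail
    return -math.log10(p_later)


def threshold_from_timeout(timeout: float, mean: float, std: float) -> float:
    """Assumed adaptation rule: the phi value reached exactly at the timeout."""
    return phi(timeout, mean, std)


def is_suspected(elapsed: float, mean: float, std: float, threshold: float) -> bool:
    """A process is suspected once its suspicion level exceeds the threshold."""
    return phi(elapsed, mean, std) > threshold


if __name__ == "__main__":
    mean, std = 1.0, 0.1  # hypothetical inter-arrival statistics (seconds)
    threshold = threshold_from_timeout(1.3, mean, std)
    for elapsed in (1.0, 1.2, 1.4):
        print(elapsed, round(phi(elapsed, mean, std), 3),
              is_suspected(elapsed, mean, std, threshold))
```

Under this reading, two detectors that share a timeout can still behave differently, because the threshold depends on how well the assumed distribution matches the observed inter-arrivals.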
We consider the following scenario: a monitor cloud VM is responsible for running the FDs and the optimisation model, and for monitoring and recording performance measures. The monitored VMs (50 instances, see footnote 5) periodically send heartbeat messages to notify the FD VM of their liveness. Monitored VMs are connected to the monitor VM in a star topology. The hardware of each VM used in our experiments is as follows: 1) one 3.3 GHz CPU; 2) 1 GB memory; and 3) 8 GB storage. The monitor VM is based in the Asia Pacific region (Singapore). The monitored VMs are distributed over three different regions: Tokyo, London and North Virginia. This ensures a wide geographical distribution of the deployed VMs and provides a variety of network communication conditions. Consequently, FD performance metrics are evaluated on a regular basis under diversified network conditions. This is a global network setting that has not been explored by previous work (Liu et al. 2017; Satzger et al. 2011; Sleptchenko and Johnson 2014).
5.2. Amazon EC2 Evaluation Results
We have run our evaluation tests using various experimental settings. We set SONAFD with a network bandwidth overhead constraint ($O = 0.1\%$) and a failure detection time threshold ($T_D^{TS} = 3$ s). These settings follow a survey we conducted with a number of technical staff members within one of the largest telecom/network service companies. This threshold represents a trade-off value for different distributed systems and can be adjusted (Bosilca et al. 2016; gigaspaces 2019). In the first settings of the evaluation, we use an average failure rate of 1 failure/hour and an average failure duration of 900 seconds, in order to inject enough failures into the evaluation environment so that more QoS data are collected. Figure 3 presents the evaluation results of the failure detection QoS metrics, summarised as follows:
1. SONAFD and NAFD provide better $P_A$ and $\lambda_M$ than the Adaptive Phi FD, even though NAFD and Adaptive Phi use the same timeout to adjust their thresholds. This is because the Adaptive Phi FD applies a different function to compute its suspicion threshold from the timeout. This function depends on the assumed probability distribution of heartbeats’ inter-arrivals (i.e. normal). This suggests that the normal distribution is less representative of existing cloud network conditions.
Footnote 5: Running a large number of VMs on Amazon is costly. For experiments with more than 50 VMs, we rely on simulations and on the analytical results associated with the heuristic algorithm; we expect similar performance results for NAFD/SONAFD.
Figure 3: Amazon EC2 results for a simultaneous running of three FDs (solver: CPLEX). Panels: (a) $T_D$, detection time (s) versus test duration (2h to 24h); (b) $\lambda_M$, mistake rate per second versus test duration; (c) $P_A$, query accuracy probability versus test duration; (d) $P_A$ versus $T_D$. Each panel compares Adaptive Phi, NAFD and SONAFD.
2. Adaptive Phi FD has the second shortest $T_D$. However, even though its suspicion threshold is adapted regularly, the Phi FD struggles to find the trade-off between performance metrics that satisfies the required $P_A$ and $T_D$.
3. SONAFD further enhances the accuracy metrics ($P_A$ and $\lambda_M$) over NAFD, at the cost of an increase in $T_D$. SONAFD was designed to maximise $P_A$ by optimising not just one but both FD parameters: $\delta_i$ and $\varepsilon_i$. Thus, SONAFD acts on both FD parameters to improve performance.
4. To enhance $P_A$, SONAFD increases its $T_D$, as SONAFD waits longer before making a suspect decision. Although this increase is inherent to its design in order to meet all the required QoS, SONAFD provides the best trade-off between these two conflicting metrics.
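The three metrics compared in the list above can be computed from a detector's log roughly as follows. This sketch follows the usual QoS definitions of Chen et al. (2002); the log format (a list of wrong-suspicion intervals) and the function names are assumptions made for illustration.

```python
from typing import List, Tuple


def qos_metrics(
    mistakes: List[Tuple[float, float]], observation_s: float
) -> Tuple[float, float]:
    """Return (mistake rate per second, query accuracy probability).

    `mistakes` holds (start, end) intervals, in seconds, during which a live
    process was wrongly suspected; `observation_s` is the total time the
    process was alive and monitored.
    """
    wrong_time = sum(end - start for start, end in mistakes)
    mistake_rate = len(mistakes) / observation_s
    query_accuracy = 1.0 - wrong_time / observation_s
    return mistake_rate, query_accuracy


def detection_time(crash_time: float, suspected_time: float) -> float:
    """Delay between a crash and the moment the FD starts suspecting it."""
    return suspected_time - crash_time


if __name__ == "__main__":
    rate, accuracy = qos_metrics([(10.0, 12.0), (100.0, 101.0)], 1000.0)
    print(f"mistake rate = {rate:.4f}/s, query accuracy = {accuracy:.4f}")
    print(f"detection time = {detection_time(500.0, 501.4):.1f} s")
```

The sketch makes the conflict in the list visible: waiting longer before suspecting shortens or removes the wrong-suspicion intervals, but it adds directly to the detection time.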
5.3. Simulations Validation
Although Amazon EC2 is a real-world test environment, its main limitation is that we cannot control network conditions to evaluate changing network behaviours. In addition, Amazon EC2 is a commercial service, and using it with a large number of machines across the globe comes with major financial costs. Therefore, to achieve a robust performance evaluation that covers changes in both network conditions and failure behaviours with a large number of machines, we also designed and implemented a simulation tool specific to failure detection. We adopted CloudSim as the simulation platform (Melbourne University 2018) as it offers a scenario similar to that in Amazon EC2. CloudSim provides an extensible simulation framework that enables modelling, simulation and testing of new Cloud Computing infrastructures and application services. We developed a specific FD simulator that includes heartbeat messaging, failure controllers and failure detection algorithms for VMs.
The simulation setting is similar to that of Amazon EC2. We consider a scenario with 51 VMs. One of these VMs represents the FD and the remaining 50 VMs are monitored processes. Thirty different parameter configurations are simulated: these settings sweep different failure rates, failure durations, packet delays, packet losses, etc. All scenarios provided results similar to the real environment test in EC2 and demonstrated the robustness of our proposed SONAFD. Due to space limitations, in this paper we only present one of the simulated configurations, as shown in Figure 4. Figures 4a, 4b and 4c show $T_D$, $\lambda_M$ and $P_A$, respectively, for the three FDs. SONAFD performs the best in terms of $P_A$ and $\lambda_M$, followed by NAFD and then the Adaptive Phi FD. For $T_D$, however, the Adaptive Phi performs the best. These observations are consistent with the performance of SONAFD and NAFD in the real-world test environment (Amazon EC2) presented in Section 5.2.
5.4. SONAFD Scalability Evaluation
Figure 4: Simulation results for three FDs. Panels: (a) $T_D$, detection time (s); (b) $\lambda_M$, mistake rate per second; (c) $P_A$, query accuracy probability; each versus simulation time (1h to 24h) for Adaptive Phi, NAFD and SONAFD.
Large-scale distributed systems, like cloud computing, can easily scale up to hundreds or thousands of processes, as they are designed to be dynamic for rapid and flexible delivery of services. Maintaining QoS and scalability together for an FD is therefore a challenging task as the system size increases. SONAFD is designed to ensure efficient failure detection for large-scale systems. However, the time needed to solve the MILP model within SONAFD’s design increases exponentially as the system size increases. Moreover, for the continuous constraint (19), we adopted discretisation and linearisation methods (as detailed in the online Appendix F) for solving the model (Gendron and Gouveia 2016; Kunnumkal and Talluri 2015). Such approaches add decision variables to the implemented $\mathbb{M}$: $X_{i,j}$ and $Z_{i,j}$ such that $i \in [1..N-1]$, $j \in [1..m_i]$, where $m_i$ is the number of discrete values for $p_i$ (obtained from equation (F.3) in Appendix F). We recall that $T_S$ is the time precision that represents the discretisation step of constraint (19) and is an input of $\mathbb{M}$. The value of $T_S$ determines the number of these additional variables $X_{i,j}$ and $Z_{i,j}$ and, as a consequence, degrades the scalability of our proposed $\mathbb{M}$. $T_S$ is inversely proportional to the number of modelled discrete values $m_i$, $i \in [1..N-1]$ (see equation (F.3) in Appendix F). Therefore, the higher $T_S$ is, the lower $m_i$ is, and the lower the number of modelled variables is. Consequently, it is important to assess this impact by testing different values of $T_S$.
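The effect of the discretisation step on model size can be illustrated with a small count. Equation (F.3) is not reproduced in this section, so the sketch below assumes that the number of discrete values is the discretised time span divided by the step, rounded up; this form, the 3000 ms span and the function names are hypothetical and only preserve the stated inverse proportionality.

```python
import math
from typing import Sequence


def discrete_values(span_ms: float, ts_ms: float) -> int:
    """Assumed stand-in for equation (F.3): grid points over one time span."""
    return math.ceil(span_ms / ts_ms)


def extra_variables(spans_ms: Sequence[float], ts_ms: float) -> int:
    """Count the added variables: one X and one Z per process and grid point."""
    return 2 * sum(discrete_values(span, ts_ms) for span in spans_ms)


if __name__ == "__main__":
    spans = [3000.0] * 50  # hypothetical: 50 monitored processes, 3000 ms each
    for ts in (1, 2, 5, 10, 20, 50, 100):
        print(f"step = {ts:3d} ms -> {extra_variables(spans, ts):7d} extra variables")
```

Under these assumptions, moving from a 100 ms step to a 1 ms step multiplies the number of added variables by 100, which is consistent with the building phase dominating the execution time for small steps.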
Figure 5: Execution time (seconds) of solving $\mathbb{M}$ in the CPLEX solver. Panels: (a) execution time; (b) execution time for $T_S = 10$ ms, split into building time and solving time, versus the number of monitored processes ($N$, 100 to 500).
The overall time to get an output for $\mathbb{M}$ from the solver (optimised solutions or no solutions) can be divided into two parts: 1) the building time, which refers to the time spent by the solver in creating variables and setting up the constraints, and 2) the model-solving time, which corresponds to the time spent by the solver searching for the model solutions. To avoid any confusion, we refer to the overall solving time of $\mathbb{M}$ as the execution time. Throughout this section, "$\mathbb{M}$ solved in CPLEX" is referred to as the CPLEX-solved $\mathbb{M}$.
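Table 3
A set of parameters for $\mathbb{M}$ scalability evaluation (please refer to Tables 1 and 2 for notations and units).
| $B$ | $M$ | $T_D^{TS}$ | $O$ | $F_{UB}$ | $P_{CL}$ | $P_{A_i}^{Req}$ | $\mu_i$ | $D_i$ | $S_i$ | $\tau_i$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 49100 | 416 | 3000 | 1% | 2 | 99% | 0.95 | 11.5±1 | 150±100 | 50±30 | 5%±5% |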
We run the CPLEX-solved $\mathbb{M}$ with different input parameters (see Table 3). The intuition is to vary these parameters as widely as possible. This allows us to evaluate numerous combinations and achieve a robust evaluation of the SONAFD scalability. A set of parameters consists of fixed values of $B$, $M$, $T_D^{TS}$, $O$, $F_{UB}$, $P_{CL}$ and $P_{A_i}^{Req}$, and random values of $\mu_i$, $D_i$, $S_i$ and $\tau_i$. For a given set of parameters, we collect the execution times of the CPLEX-solved $\mathbb{M}$ for different values of the discretisation step, $T_S \in \{1, 2, 5, 10, 20, 50, 100\}$, and different sizes of the system, $N \in [2..1001]$. Such configurations represent real-world systems, where the fixed parameters are often given by application-specific requirements. The parameters with random values represent real-world dynamic network behaviour. $T_S$ could also be application-specific, but we choose a set of different values to be tested as it has a strong impact on the feasibility and execution time of the model. We generate uniform random values of $\mu_i$, $D_i$, $S_i$ and $\tau_i$ as shown in Table 3. Due to space limitations, we only present the execution time of one set of parameters (i.e. Table 3 and Figure 5). We plot the execution time only when the model is feasible and solved. We also tested other parameter sets and the results are similar to Figure 5.
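One way to draw such a parameter set is sketched below. It assumes that an entry written as a ± b in Table 3 denotes a uniform draw on [a - b, a + b]; the dictionary keys and the function name are illustrative, and units follow Tables 1 and 2 of the paper, which are not reproduced here.

```python
import random
from typing import Dict, List, Optional

# Fixed values taken from Table 3 (percentages written as fractions).
FIXED: Dict[str, float] = {
    "B": 49100, "M": 416, "T_D_TS": 3000, "O": 0.01,
    "F_UB": 2, "P_CL": 0.99, "P_A_req": 0.95,
}

# Assumed reading of "a ± b": uniform on [a - b, a + b].
RANDOM_RANGES: Dict[str, tuple] = {
    "mu": (10.5, 12.5),
    "D": (50.0, 250.0),
    "S": (20.0, 80.0),
    "tau": (0.0, 0.10),
}


def draw_parameter_set(n_monitored: int, seed: Optional[int] = None) -> dict:
    """Return the fixed values plus one random draw per monitored process."""
    rng = random.Random(seed)  # a seed makes a parameter set reproducible
    per_process: List[Dict[str, float]] = [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOM_RANGES.items()}
        for _ in range(n_monitored)
    ]
    return {"fixed": dict(FIXED), "per_process": per_process}


if __name__ == "__main__":
    params = draw_parameter_set(n_monitored=3, seed=42)
    for index, values in enumerate(params["per_process"], start=1):
        print(index, {name: round(value, 3) for name, value in values.items()})
```

Reusing the same seed across the different discretisation steps keeps the random network parameters identical, so that differences in execution time can be attributed to the step and the system size alone.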
Figure 5 illustrates the undesired impacts of a large system size $N$ and a small discretisation step $T_S$ on execution times, particularly on the model building phase. Figure 5a shows a three-dimensional evaluation of the overall execution time of the CPLEX-solved $\mathbb{M}$ (z-axis) versus $N$ (x-axis) and $T_S$ (y-axis). It only shows execution times when the model has solutions: CPLEX cannot solve $\mathbb{M}$ for $N > 500$ when $T_S = 1$ ms, $N > 500$ when $T_S = 10$ ms, and $N > 540$ when $T_S = 100$ ms. Figure 5a also shows that the execution time becomes higher when the system size gets larger (i.e. $N$ gets larger) and $T_S$ is shorter (i.e. more discrete values). Shortening $T_S$ dramatically increases the execution time. Figure 5b shows how the building time and solving time of the CPLEX-solved $\mathbb{M}$ contribute to its execution time, for $T_S = 10$ ms and for different values of $N$. The building time represents the dominant part of the execution time. This is due to the high number of variables that are modelled in $\mathbb{M}$; more time is therefore spent in setting up the model and its constraints. The actual solving time, however, is hardly impacted by $T_S$ and increases noticeably when the system size $N$ gets larger.
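To tackle this scalability issue, we proposed a heuristic algorithm, as discussed in Section 4.5. Figure 6 compares the execution times of $\mathbb{M}$ using the CPLEX solver (i.e. the CPLEX-solved $\mathbb{M}$) with those of the proposed heuristic. Figures 6a, 6b and 6c show that the heuristic can shorten the execution time and tackle the scalability issue with a large number of processes. Figure 6a presents the execution times of the CPLEX-solved $\mathbb{M}$ (for different $T_S$) and of the heuristic as a function of $N$. Figure 6b presents $P_A$, computed as an expectation over $i \in [1..N-1]$ of an expression in $G_i$ evaluated at $\delta_i$ and $\Delta_{\varepsilon_i}$, where $\delta_i$ and $\Delta_{\varepsilon_i}$ are the solutions obtained by the CPLEX-solved $\mathbb{M}$ (for different $T_S$) and by the heuristic, respectively, versus $N$. Figure 6c depicts the minimal bound of $T_D$ (constraint (23)), $T_D = \mathbb{E}(\Delta_{\varepsilon_i} + D_i)_{i \in [1..N-1]}$, where $\Delta_{\varepsilon_i}$ is taken from the CPLEX solutions (for different $T_S$) and the heuristic solutions, respectively, versus $N$.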
Figure 6: Comparison between the CPLEX solutions and the heuristic algorithm ($P_A^{target} = 95\%$) in terms of execution time of $\mathbb{M}$ and QoS of SONAFD. Panels: (a) total execution time (s), (b) $P_A$ and (c) $T_D$ (ms), each versus the number of monitored processes ($N$), for the MILP with $T_S$ = 1, 2, 5, 10, 20, 50 and 100 ms and for the heuristic.
In summary, the heuristic 1) is much faster than the CPLEX solver, which makes computing the MILP solutions more efficient; 2) keeps providing feasible solutions when the system scales in size, whereas the CPLEX solver cannot efficiently find a solution within the time window in which the SONAFD parameters must be updated; and 3) provides feasible solutions of similar quality to the CPLEX solver in terms of $P_A$ and of better quality in terms of $T_D$. By design, the heuristic stops searching for $\delta_i$ and $\Delta_{\varepsilon_i}$ as soon as $P_A \geq P_A^{target}$. This guarantees the minimum required $P_A \geq P_A^{target}$ of SONAFD. On the other hand, it yields smaller values of $T_D$, as the optimal values of $\delta_i$ and $\Delta_{\varepsilon_i}$ are associated with higher $P_A$ and $T_D$. It is worth recalling that $\mathbb{M}$ contains constraint (21), which ensures optimal values of $\delta_i$ and $\Delta_{\varepsilon_i}$ for a guaranteed $P_A = P_{A_i}^{Req}$. This comes at the cost of increasing $T_D$ (see constraint (23)). However, the values of $\delta_i$ and $\Delta_{\varepsilon_i}$ found by the heuristic are just good enough to provide the minimal required $P_A^{target}$. For a fair comparison between the CPLEX solution and the heuristic solution, $\mathbb{M}$ is solved with $P_{A_i}^{Req} = P_A^{target}$ ($P_{A_i}^{Req}$ in the CPLEX-solved $\mathbb{M}$ and $P_A^{target}$ in the heuristic).
To sum up, the heuristic represents a satisfying alternative to address the noted scalability issue of $\mathbb{M}$ when solved with the CPLEX solver. It provides substantial savings in the execution time needed to obtain $\mathbb{M}$ solutions, while supplying comparable failure detection QoS.
6. Conclusion
In this paper, we proposed a novel MILP-based failure detector (SONAFD). SONAFD is capable of guaranteeing the required QoS. This is achieved by translating QoS requirements into constraints of an MILP model, and adjusting its parameters accordingly to search for the best possible trade-offs and solutions. The objective of SONAFD is to maximise its accuracy probability while respecting constraints on its failure detection time. To tackle scalability and computational efficiency challenges when the system size becomes larger, we proposed a heuristic algorithm specific to SONAFD. This algorithm provides fast approximate solutions. Our results show that the proposed heuristic achieves a good approximate solution compared with the numerical solution obtained from CPLEX, and that it scales to large systems of thousands of nodes with much shorter execution times.
Our experiments are based on a real, worldwide distributed system (Amazon EC2), and show superior performance of our proposed solution, with better query accuracy and satisfactory detection speed. Our results highlight that SONAFD can enhance failure detection accuracy and speed while guaranteeing the given QoS requirements and system constraints by exploring the trade-off between these requirements. We also evaluated our proposed solution via simulations in order to reproduce changing network conditions.
The results are consistent with Amazon EC2 experiments. These results demonstrate consistent and robust
improvements with our approach.
To the best of our knowledge, our MILP-based SONAFD is the first attempt to combine an adaptive failure detection algorithm with data-driven operations research approaches. No previous work has considered scalability, data-driven performance optimisation with constraints and FD algorithm design together. Our tests are based on real-world Amazon global data centres as well as extensive simulations with a simulator that we developed to test our proposed solution more comprehensively than previous literature. Both Amazon and simulation results demonstrate the stability and robustness of our proposed approach. As networked application systems become larger and more sophisticated (e.g., Cloud, Internet of Things (IoT), 5G networks, Blockchain, etc.), such FDs are fundamentally important for achieving QoS guarantees, scalability, real-time monitoring and fault tolerance simultaneously.
Acknowledgements
The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated
support services at the University of Southampton, UK, in the completion of this work.
The authors would like to thank Prof. Tolga Bektas, Professor of Logistics Management at the University
of Liverpool, UK, for assisting this research by providing valuable advice in the optimisation modelling.
Finally, the authors would like to thank the anonymous reviewers for their valuable feedback.
References
Abdel-Aziz, M.K., Samarakoon, S., Bennis, M., Saad, W., 2020. Ultra-Reliable and Low-Latency Vehicular Communication: An Active Learning
Approach. IEEE Communications Letters 24, 367–370. doi:10.1109/LCOMM.2019.2956929.
Aguilera, M.K., Chen, W., Toueg, S., 2000. Failure detection and consensus in the crash-recovery model. Distrib. Comput. 13, 99–125. doi:10.1007/s004460050070.
Akka, 2018. Akka | Akka. URL: https://akka.io/.
Akka, 2021. Phi Accrual Failure Detector Akka Documentation. URL: https://doc.akka.io/docs/akka/current/typed/failure-detector.html.
Barolli, L., Leu, F.Y., Enokido, T., Chen, H.C., 2018. Advances on Broadband and Wireless Computing, Communication and Applications:
Proceedings of the 13th International Conference on Broadband and Wireless Computing, Communication and Applications (BWCCA-2018).
Springer.
Becchi, M., 2008. From Poisson Processes to Self-Similarity: a Survey of Network Traffic Models. Technical Report. Washington University in St.
Louis. URL: https://www.cse.wustl.edu/~jain/cse567-06/ftp/traffic_models1.pdf.
Bertier, M., Marin, O., Sens, P., 2002. Implementation and performance evaluation of an adaptable failure detector, in: Proceedings International
Conference on Dependable Systems and Networks, pp. 354–363. doi:10.1109/DSN.2002.1028920.
Bisnik, N., Abouzeid, A.A., 2009. Queuing network models for delay analysis of multihop wireless ad hoc networks. Ad Hoc Networks 7, 79–97.
doi:10.1016/j.adhoc.2007.12.001.
Bosilca, G., Bouteiller, A., Guermouche, A., Herault, T., Robert, Y., Sens, P., Dongarra, J., 2016. Failure Detection and Propagation in HPC systems,
in: SC ’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 312–322.
doi:10.1109/SC.2016.26.
Burlacu, R., Geißler, B., Schewe, L., 2020. Solving Mixed-Integer Nonlinear Programs using Adaptively Refined Mixed-Integer Linear Programs.
Optimization Methods and Software 35, 37–64. doi:10.1080/10556788.2018.1556661.
Buyya, R., Ranjan, R., Calheiros, R.N., 2010. InterCloud: Utility-Oriented Federation of Cloud Computing Environments for Scaling of Appli-
cation Services, in: Hsu, C.H., Yang, L.T., Park, J.H., Yeo, S.S. (Eds.), Algorithms and Architectures for Parallel Processing, Springer Berlin
Heidelberg. pp. 13–31.
Cai, Y., Hames, D., 2010. Minimum Sample Size Determination for Generalized Extreme Value Distribution. Communications in Statistics - Simulation and Computation 40, 87–98. doi:10.1080/03610918.2010.530368.
Cao, J., Cleveland, W.S., Lin, D., Sun, D.X., 2003. Internet Traffic Tends Toward Poisson and Independent as the Load Increases, in: Denison, D.D., Hansen, M.H., Holmes, C.C., Mallick, B., Yu, B. (Eds.), Nonlinear Estimation and Classification. Springer, New York, NY. Lecture Notes in Statistics, pp. 83–109. doi:10.1007/978-0-387-21579-2_6.
Chan, Y.C., Wang, K., Hsu, Y.H., 2015. Fast Controller Failover for Multi-domain Software-Defined Networks, in: 2015 European Conference on
Networks and Communications (EuCNC), pp. 370–374. doi:10.1109/EuCNC.2015.7194101.
Chandra, T.D., Toueg, S., 1996. Unreliable failure detectors for reliable distributed systems. Journal of the ACM 43, 225–267. doi:10.1145/226643.226647.
Chen, W., Toueg, S., Aguilera, M.K., 2002. On the quality of service of failure detectors. IEEE Transactions on Computers 51, 561–580. doi:10.1109/TC.2002.1004595.
Cheung, W.C., Simchi-Levi, D., Wang, H., 2017. Technical Note—Dynamic Pricing and Demand Learning with Limited Price Experimentation.
Operations Research 65, 1722–1731. doi:10.1287/opre.2017.1629.
Chan, C.C., Hanly, S.V., 2001. Calculating the outage probability in a CDMA network with spatial Poisson traffic. IEEE Transactions on Vehicular Technology 50, 183–204. doi:10.1109/25.917918.
Conforti, M., Cornuéjols, G., Zambelli, G., 2014. Integer Programming. volume 271 of Graduate Texts in Mathematics. Springer International Publishing, Cham. doi:10.1007/978-3-319-11008-0.
Coulouris, G., Dollimore, J., Kindberg, T., Blair, G., 2001. Time and global state. Distributed Systems Concepts and Design , 385–400.
Coulouris, G.F., Dollimore, J., Kindberg, T., 2005. Distributed Systems: Concepts and Design. Pearson Education.
Du, A.Y., Das, S., Ramesh, R., 2012. Efficient Risk Hedging by Dynamic Forward Pricing: A Study in Cloud Computing. INFORMS Journal on
Computing 25, 625–642. doi:10.1287/ijoc.1120.0526.
Du, D.Z., Pardalos, P.M., Zhang, Z., 2019. Nonlinear Combinatorial Optimization. Springer International Publishing.
Fang, K.C., Wang, K., Wang, J.H., 2016. A fast and load-aware controller failover mechanism for software-defined networks, in: 2016 10th International Symposium on Communication Systems, Networks and Digital Signal Processing (CSNDSP), pp. 1–6. doi:10.1109/CSNDSP.2016.7573944.
Ferreira, P.V.R., Paffenroth, R., Wyglinski, A.M., Hackett, T.M., Bilén, S.G., Reinhart, R.C., Mortensen, D.J., 2018. Multiobjective Reinforcement
Learning for Cognitive Satellite Communications Using Deep Neural Network Ensembles. IEEE Journal on Selected Areas in Communications
36, 1030–1041. doi:10.1109/JSAC.2018.2832820.
Fields, E., Osorio, C., Zhou, T., 2021. A Data-Driven Method for Reconstructing a Distribution from a Truncated Sample with an Application to Inferring Car-Sharing Demand. Transportation Science. doi:10.1287/trsc.2020.1028.
Fischer, M.J., Lynch, N.A., Paterson, M.S., 1985. Impossibility of distributed consensus with one faulty process. Journal of the ACM 32, 374–382.
doi:10.1145/3149.214121.
Geißler, B., Kolb, O., Lang, J., Leugering, G., Martin, A., Morsi, A., 2011. Mixed integer linear models for the optimization of dynamical transport
networks. Mathematical Methods of Operations Research 73, 339–362. doi:10.1007/s00186- 011-0354- 5.
Geißler, B., Martin, A., Morsi, A., Schewe, L., 2012. Using Piecewise Linear Functions for Solving MINLPs. Mixed Integer Nonlinear Programming
, 287–314doi:10.1007/978-1- 4614-1927- 3. publisher: Springer Science+Business Media, New York.
Geißler, B., Morsi, A., Schewe, L., Schmidt, M., 2015. Solving power-constrained gas transportation problems using an MIP-based alternating
direction method. Computers & Chemical Engineering 82, 303–317. doi:10.1016/j.compchemeng.2015.07.005.
Gendron, B., Gouveia, L., 2016. Reformulations by Discretization for Piecewise Linear Integer Multicommodity Network Flow Problems. Trans-
portation Science 51, 629–649. doi:10.1287/trsc.2015.0634.
gigaspaces, 2019. Failure Detection. URL:
https://docs.gigaspaces.com/latest/admin/troubleshooting-failure-detection.html.
Guerraoui, R., Herlihy, M., Kuznetsov, P., Lynch, N., Newport, C., 2009. On the weakest failure detector ever. Distributed Computing 21, 353–366.
doi:10.1007/s00446-009-0079-3.
Gullhav, A.N., Cordeau, J.F., Hvattum, L.M., Nygreen, B., 2017. Adaptive large neighborhood search heuristics for multi-tier service deployment
problems in clouds. European Journal of Operational Research 259, 829–846. doi:10.1016/j.ejor.2016.11.003.
Gupta, A.K., Smith, K.G., Shalley, C.E., 2006. The Interplay between Exploration and Exploitation. The Academy of Management Journal 49,
693–706. doi:10.2307/20159793. publisher: Academy of Management.
Guthrie, W.F., 2020. NIST/SEMATECH e-Handbook of Statistical Methods (NIST Handbook 151). URL:
https://www.itl.nist.gov/div898/handbook/, doi:10.18434/M32189. type: dataset.
Hayashibara, N., Defago, X., Yared, R., Katayama, T., 2004. The φ accrual failure detector, in: Proceedings of the 23rd IEEE International
Symposium on Reliable Distributed Systems, 2004., pp. 66–78. doi:10.1109/RELDIS.2004.1353004.
Heilig, L., Voß, S., 2014. Decision Analytics for Cloud Computing: A Classification and Literature Review, in: Bridging Data and Decisions.
INFORMS. INFORMS TutORials in Operations Research, pp. 1–26. doi:10.1287/educ.2014.0124.
Heyman, D.P., Sobel, M.J., 2004. Stochastic Models in Operations Research: Stochastic Processes and Operating Characteristics. Courier
Corporation.
Hussin, M., Asilah Wati Abdul Hamid, N., Kasmiran, K.A., 2015. Improving reliability in resource management through adaptive reinforcement
learning for distributed systems. Journal of Parallel and Distributed Computing 75, 93–100. doi:10.1016/j.jpdc.2014.10.001.
IBM, 2017. CPLEX User’s Manual Version 12 Release 8. URL:
https://www.ibm.com/docs/en/SSSA5P_12.8.0/ilog.odms.studio.help/pdf/usrcplex.pdf.
Kaewpuang, R., Niyato, D., Wang, P., Hossain, E., 2013. A Framework for Cooperative Resource Management in Mobile Cloud Computing. IEEE
Journal on Selected Areas in Communications 31, 2685–2700. doi:10.1109/JSAC.2013.131209.
Karagiannis, T., Molle, M., Faloutsos, M., 2004. Long-range dependence: ten years of Internet traffic modeling. IEEE Internet Computing 8, 57–64.
doi:10.1109/MIC.2004.46.
Kheirkhah, M., Wakeman, I., Parisis, G., 2019. Multipath transport and packet spraying for efficient data delivery in data centres. Computer
Networks 162, 106852. doi:10.1016/j.comnet.2019.07.008.
Kirichek, R., Golubeva, M., Kulik, V., Koucheryavy, A., 2016. The home network traffic models investigation, in: 2016 18th International
Conference on Advanced Communication Technology (ICACT), pp. 97–100. doi:10.1109/ICACT.2016.7423288.
Kolmogorov, A., 1933. Sulla determinazione empirica di una legge di distribuzione. Giorn Dell’inst Ital Degli Att 4, 89–91.
Kunnumkal, S., Talluri, K., 2015. On a Piecewise-Linear Approximation for Network Revenue Management. Mathematics of Operations Research
41, 72–91. doi:10.1287/moor.2015.0716.
Kutzner, F.L., Read, D., Stewart, N., Brown, G., 2017. Choosing the Devil You Don’t Know: Evidence for Limited Sensitivity to Sample Size–Based
Uncertainty When It Offers an Advantage. Management Science 63, 1519–1528. doi:10.1287/mnsc.2015.2394.
Lakshman, A., Malik, P., 2010. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review 44, 35.
doi:10.1145/1773912.1773922.
Laprie, J.C., 1992. Dependability: Basic concepts and terminology, in: Dependability: Basic Concepts and Terminology. Springer, pp. 3–245.
Li, Y.M., Tan, Y., De, P., 2012. Self-Organized Formation and Evolution of Peer-to-Peer Networks. INFORMS Journal on Computing 25, 502–516.
doi:10.1287/ijoc.1120.0517.
Liu, J., Wu, Z., Wu, J., Dong, J., Zhao, Y., Wen, D., 2017. A Weibull distribution accrual failure detector for cloud computing. PLOS ONE 12,
e0173666. doi:10.1371/journal.pone.0173666.
Liu, Z., Righter, R., 1998. Optimal Load Balancing on Distributed Homogeneous Unreliable Processors. Operations Research.
Luenberger, D.G., Ye, Y., 2016. Linear and Nonlinear Programming. volume 228 of International Series in Operations Research & Management
Science. Springer International Publishing, Cham. doi:10.1007/978-3-319-18842-3.
Ma, T., Hillston, J., Anderson, S., 2010. On the Quality of Service of Crash-Recovery Failure Detectors. IEEE Transactions on Dependable and
Secure Computing 7, 271–283. doi:10.1109/TDSC.2009.35.
Madhushani, U., Leonard, N.E., 2021. Heterogeneous Explore-Exploit Strategies on Multi-Star Networks. IEEE Control Systems Letters 5,
1603–1608. doi:10.1109/LCSYS.2020.3042459.
Marouani, H., Dagenais, M.R., 2008. Internal Clock Drift Estimation in Computer Clusters. Journal of Computer Systems, Networks, and
Communications 2008, 1–7. doi:10.1155/2008/583162.
Melbourne University, 2018. CloudSim: A Framework For Modeling And Simulation Of Cloud Computing Infrastructures And Services. URL:
https://github.com/Cloudslab/cloudsim. original-date: 2015-03-18.
Paxson, V., Floyd, S., 1994. Wide-area traffic: the failure of Poisson modeling. ACM SIGCOMM Computer Communication Review 24, 257–268.
doi:10.1145/190809.190338.
Rebennack, S., 2016a. Combining sampling-based and scenario-based nested Benders decomposition methods: application to stochastic dual
dynamic programming. Mathematical Programming 156, 343–389. doi:10.1007/s10107-015-0884-3.
Rebennack, S., 2016b. Computing tight bounds via piecewise linear functions through the example of circle cutting problems. Mathematical
Methods of Operations Research 84, 3–57. doi:10.1007/s00186-016-0546-0.
Rebennack, S., Krasko, V., 2019. Piecewise Linear Function Fitting via Mixed-Integer Linear Programming. INFORMS Journal on Computing
doi:10.1287/ijoc.2019.0890.
Ross, S.M., 1996. Stochastic Processes. Wiley.
Satzger, B., Pietzowski, A., Trumler, W., Ungerer, T., 2007. A New Adaptive Accrual Failure Detector for Dependable Distributed Systems, in:
Proceedings of the 2007 ACM Symposium on Applied Computing, ACM, New York, NY, USA. pp. 551–555. doi:10.1145/1244002.1244129.
Satzger, B., Pietzowski, A., Ungerer, T., 2011. Autonomous and scalable failure detection in distributed systems. International Journal of
Autonomous and Adaptive Communications Systems 4, 61. doi:10.1504/IJAACS.2011.037749.
Shen, S., Wang, J., 2014. Stochastic Modeling and Approaches for Managing Energy Footprints in Cloud Computing Service. Service Science 6,
15–33. doi:10.1287/serv.2013.0061.
Sleptchenko, A., Johnson, M.E., 2014. Maintaining Secure and Reliable Distributed Control Systems. INFORMS Journal on Computing 27,
103–117. doi:10.1287/ijoc.2014.0613.
Smirnov, N., 1948. Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 279–281.
doi:10.1214/aoms/1177730256. publisher: Institute of Mathematical Statistics.
Sotoma, I., Madeira, E.R.M., 2001. ADAPTATION - Algorithms to Adaptive Fault Monitoring and their implementation on CORBA, in:
Proceedings 3rd International Symposium on Distributed Objects and Applications, pp. 219–228. doi:10.1109/DOA.2001.954087.
Steeger, G., Rebennack, S., 2017. Dynamic convexification within nested Benders decomposition using Lagrangian relaxation: An application to
the strategic bidding problem. European Journal of Operational Research 257, 669–686. doi:10.1016/j.ejor.2016.08.006.
Sukhov, A.M., Astrakhantseva, M.A., Pervitsky, A.K., Boldyrev, S.S., Bukatov, A.A., 2016. Generating a function for network delay. Journal of
High Speed Networks 22, 321–333. doi:10.3233/JHS-160552. publisher: IOS Press.
Sun, J., Liu, F., Ahmed, M., Li, Y., 2019. Efficient Virtual Network Function Placement for Poisson Arrived Traffic, in: ICC 2019 - 2019 IEEE
International Conference on Communications (ICC), pp. 1–7. doi:10.1109/ICC.2019.8761609. ISSN: 1938-1883.
Tan, K.C., Chiam, S.C., Mamun, A.A., Goh, C.K., 2009. Balancing exploration and exploitation with adaptive variation for evolutionary
multi-objective optimization. European Journal of Operational Research 197, 701–713. doi:10.1016/j.ejor.2008.07.025.
Tanenbaum, A.S., Steen, M.v., 2007. Distributed Systems: Principles and Paradigms. Pearson Prentice Hall.
Tomsic, A., Sens, P., Garcia, J., Arantes, L., Sopena, J., 2015. 2w-FD: A Failure Detector Algorithm with QoS, in: 2015 IEEE International Parallel
and Distributed Processing Symposium, pp. 885–893. doi:10.1109/IPDPS.2015.74.
Turchetti, R.C., Duarte, E.P., Arantes, L., Sens, P., 2016. A QoS-configurable failure detection service for internet applications. Journal of Internet
Services and Applications 7, 9. doi:10.1186/s13174-016-0051-y.
Vielma, J.P., Ahmed, S., Nemhauser, G., 2010a. Mixed-Integer Models for Nonseparable Piecewise-Linear Optimization: Unifying Framework and
Extensions. Operations Research 58, 303–315. doi:10.1287/opre.1090.0721. publisher: INFORMS.
Vielma, J.P., Ahmed, S., Nemhauser, G., 2010b. A Note on “A Superior Representation Method for Piecewise Linear Functions”. INFORMS
Journal on Computing 22, 493–497. doi:10.1287/ijoc.1100.0379. publisher: INFORMS.
Xiong, N., Defago, X., 2007. ED FD: Improving the Phi Accrual Failure Detector. Research Report. School of Information Science, Japan Advanced
Institute of Science and Technology. URL: https://dspace.jaist.ac.jp/dspace/bitstream/10119/4799/1/IS-RR-2007-007.pdf.
Xiong, N., Vasilakos, A.V., Wu, J., Yang, Y.R., Rindos, A., Zhou, Y., Song, W., Pan, Y., 2012. A Self-tuning Failure Detection Scheme for Cloud
Computing Service, in: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 668–679.
doi:10.1109/IPDPS.2012.126.
Xu, Z., Tang, J., Meng, J., Zhang, W., Wang, Y., Liu, C.H., Yang, D., 2018. Experience-driven Networking: A Deep Reinforcement Learning based
Approach, in: IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pp. 1871–1879.
doi:10.1109/INFOCOM.2018.8485853.
Yang, T.W., Wang, K., 2016. Failure detection service with low mistake rates for SDN controllers, in: 2016 18th Asia-Pacific Network Operations
and Management Symposium (APNOMS), pp. 1–6. doi:10.1109/APNOMS.2016.7737210.
Yu, H., Zheng, D., Zhao, B.Y., Zheng, W., 2006. Understanding user behavior in large-scale video-on-demand systems. ACM SIGOPS Operating
Systems Review 40, 333–344. doi:10.1145/1218063.1217968.
Future spacecraft communication subsystems will potentially benefit from software-defined radios controlled by artificial intelligence algorithms. In this paper, we propose a novel radio resource allocation algorithm leveraging multi-objective reinforcement learning and artificial neural network ensembles able to manage available resources and conflicting mission-based goals. The uncertainty in the performance of thousands of possible radio parameter combinations, and the dynamic behavior of the radio channel over time producing a continuous multi-dimensional state–action space, requires a fixed-size memory continuous state–action mapping instead of the traditional discrete mapping. In addition, actions need to be decoupled from states in order to allow for online learning, performance monitoring, and resource allocation prediction. The proposed approach leverages the authors’ previous research on constraining decisions predicted to have poor performance through “virtual environment exploration”. The simulation results show the performance for different communication mission profiles and accuracy benchmarks are provided for future research reference. The proposed approach constitutes part of the core cognitive engine proof-of-concept delivered to NASA John H. Glenn Research Center’s SCaN Testbed radios on-board the International Space Station. IEEE