
Data-driven Mixed-Integer Linear Programming-based Optimisation for Efficient Failure Detection in Large-scale Distributed Systems

February 2022
European Journal of Operational Research 303(2)


DOI:10.1016/j.ejor.2022.02.006

Authors:

Huawei Technologies Research and Development


Data-driven Mixed-Integer Linear Programming-based Optimisation for Efficient Failure Detection in Large-scale Distributed Systems⋆,⋆⋆

Btissam Er-Rahmadi (a,c,1), Tiejun Ma (a,b,∗)

(a) Centre for Risk Research, Department of Decision Analytic and Risk, Southampton Business School, University of Southampton, Building 2, University Road, SO17 1BJ, UK.

(b) The Artificial Intelligence Applications Institute, School of Informatics, Informatics Forum, The University of Edinburgh, Crichton Street, EH8 9AB, UK.

(c) Present Address: Edinburgh Research Centre, Huawei Technologies R&D, 2, Semple Street, EH3 8BL, UK.

ARTICLE INFO

Keywords:

Nonlinear Programming

Mixed Integer Linear Programming

Distributed Systems

Failure Detection

Heartbeats

ABSTRACT

Failure detectors (FDs) are fundamental building blocks for distributed systems. An FD detects whether a process has crashed or not based on the reception of heartbeats’ messages sent by this process over a communication channel. A key challenge of FDs is to tune their parameters to achieve optimal performance which satisfies the desired system requirements. This is challenging due to the complexities of large-scale networks. Existing FDs ignore such optimisation and adopt ad-hoc parameters. In this paper, we propose a new Mixed Integer Linear Programming (MILP) optimisation-based FD algorithm. We obtain the MILP formulation via piecewise linearisation relaxations. The MILP involves obtaining optimal FD parameters that meet the optimal trade-off between its performance metrics requirements, network conditions and system parameters. The MILP maximises our FD’s accuracy under bounded failure detection time while considering network and system conditions as constraints. The MILP’s solution represents optimised FD parameters that maximise FD’s expected performance. To adapt to real-time network changes, our proposed MILP-based FD fits the probability distribution of heartbeats’ inter-arrivals. To address our FD scalability challenge in large-scale systems where the MILP model needs to compute approximate optimal solutions quickly, we also propose a heuristic algorithm. To test our proposed approach, we adopt Amazon Cloud as a realistic testing environment and develop a simulator for robustness tests. Our results show consistent improvement of overall FD performance and scalability. To the best of our knowledge, this is the first attempt to combine the MILP-based optimisation modelling with FD to achieve performance guarantees.

⋆Declarations of interest: none.

⋆⋆ This piece of research was partially sponsored by Huawei Ltd.

∗Corresponding author

btissam.errahmadi@gmail.com (B. Er-Rahmadi); tiejun.ma@soton.ac.uk (T. Ma)

ORCID (s): 0000-0003-0526-661X (B. Er-Rahmadi); 0000-0001-5545-6978 (T. Ma)

(1) This work was carried out while B. Er-Rahmadi was a Research Fellow at the University of Southampton, UK; she is currently affiliated with Edinburgh Research Centre, Huawei Technologies R&D, 2, Semple Street, EH3 8BL, UK.

Er-Rahmadi and Ma: Preprint submitted to Elsevier Page 1 of 28

Optimised Failure Detection Algorithms

1. Introduction

1.1. Failure Detection in Distributed Systems

A distributed system is a set of autonomous computing processes whose overall computing operations appear to a user as a single coherent system (Tanenbaum and Steen 2007). In general, these processes may represent hardware devices or software processes. To achieve the single-system perception, these processes need to collaborate with each other. Some processes of the distributed system will inevitably stop working (e.g., through a crash failure) at a random time. Failure detectors (FDs) are needed to identify such failures (Guerraoui et al. 2009). An FD is a distributed algorithm that is implemented in at least one of the distributed system processes. The FD generally receives liveness messages (e.g., heartbeat messages) from monitored processes to decide whether the latter are still alive. Liveness decisions are based on the receipt of liveness messages within a specific timeout.

It is, however, challenging to detect these failures in distributed networks (Laprie 1992). Generally, the processes communicate via a network (e.g., a data centre, a cloud system, or a mobile network). Consequently, a vital issue is the accuracy with which an FD detects that a particular process has crashed. In addition, network delays generated by message transmissions over communication channels vary and are not upper-bounded, yet FDs cannot wait indefinitely to decide whether another process has crashed. Fischer et al. (1985) show the impossibility of distinguishing between a crashed process and a very slow one in a purely asynchronous system (known as the Fischer-Lynch-Paterson impossibility result).

Chandra and Toueg (1996) introduce the concept of unreliable FDs to detect the crash behaviour of a process. Under such FDs, every live process sends liveness heartbeat messages to an FD at regular intervals. If expected heartbeat messages are missing within a specific timeout, the FD suspects that the corresponding process has crashed. Chandra and Toueg (1996) also provide an abstract classification of FDs based on their eventual behaviour to solve a set of membership problems such as consensus problems. Chandra and Toueg’s work kick-started the examination of the quality of service (QoS) (e.g., speed and accuracy) of FDs. One limitation of such FDs is that, if a heartbeat is missed because of network delays or packet losses, its sender process will be suspected even if it is still alive. Inappropriate FD parameters (heartbeat interval and timeout) may therefore lead to erroneous FD decisions.

1.2. Challenges and Problem Description

Researchers (e.g., Bertier et al. 2002; Hayashibara et al. 2004; Satzger et al. 2007, 2011) have proposed FDs with adaptive decision parameters to cope with dynamic system environments. However, they only partially achieve this goal. Such FDs tend to adapt to network conditions without considering the FDs’ QoS requirements. Moreover, adapting the FD parameters may benefit one failure detection performance measure but degrade others (e.g., Chen et al. 2002; Hayashibara et al. 2004; Satzger et al. 2011). For example, FDs with fast failure detection may have lower accuracy, and vice versa. None of these previous works has considered the optimal trade-off between an FD’s QoS performance metrics and its parameters. The FD parameters are set in an ad-hoc style, which may only achieve sub-optimal performance in given distributed network environments.

In this paper, we address the research challenge of optimising the QoS performance of FD algorithms. We consider FDs that use the heartbeat mechanism. In this mechanism, a monitored process periodically sends heartbeat messages to the FD. The latter continuously checks the liveness of this monitored process based on either a timeout or a threshold (depending on the FD algorithm). In the first option, the FD sets a timeout by which it should have received the heartbeat messages. In the second option, the FD computes a suspicion measure from heartbeats’ arrival times and compares it to a threshold. Consequently, the FD detects failures if expected heartbeats are not received within the timeout or the suspicion measure is higher than the threshold. Overall, heartbeat FDs have two parameters: the heartbeat interval ($\delta$) for liveness updates and the timeout ($\Delta$) for FD decisions (suspect or trust), which is equivalent to a threshold ($\varepsilon$) in other FDs. The QoS of such an FD is generally measured by failure detection time and query accuracy probability.
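The two decision rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the suspicion measure used here (elapsed time divided by the mean observed inter-arrival time) is a toy stand-in, not the measure used by any FD discussed in this paper.

```python
from statistics import mean
from typing import Sequence


def timeout_fd_suspects(last_arrival: float, now: float, timeout: float) -> bool:
    """Timeout-based rule: suspect once no heartbeat arrived within `timeout`."""
    return (now - last_arrival) > timeout


def elapsed_ratio_suspicion(arrivals: Sequence[float], now: float) -> float:
    """Toy suspicion measure (an assumption of this sketch): time since the
    last heartbeat, in units of the mean observed inter-arrival time."""
    inter_arrivals = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return (now - arrivals[-1]) / mean(inter_arrivals)


def threshold_fd_suspects(suspicion: float, threshold: float) -> bool:
    """Threshold-based rule: suspect once the suspicion measure exceeds the threshold."""
    return suspicion > threshold


# Heartbeats sent every second; the last one arrived at t = 3.0.
arrivals = [0.0, 1.0, 2.0, 3.0]
suspected_by_timeout = timeout_fd_suspects(arrivals[-1], now=5.0, timeout=1.5)
suspected_by_threshold = threshold_fd_suspects(
    elapsed_ratio_suspicion(arrivals, now=5.0), threshold=1.8
)
```

In both variants the decision depends on two quantities: how often heartbeats are sent and how long (or how suspicious) the FD lets a silence become before it suspects the sender.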

As heartbeat messages are generally transmitted via networks, the choice of FD parameters ($\delta$ and $\Delta$/$\varepsilon$) impacts its performance measures. This is because a heartbeat message may arrive later than expected, or even be dropped, which implies potential false decisions made by the FD. Therefore, FD parameters should be set in a way that considers fluctuating network delays and packet losses.

Selecting these parameters involves trade-offs. A short heartbeat interval means that heartbeats are sent more frequently, which provides more frequent liveness updates. This may shorten the failure detection time but, at the same time, consumes more network bandwidth. A longer heartbeat interval may increase the chances of false detection (i.e. decision mistakes). Furthermore, a short FD timeout (or a small threshold) may enhance the failure detection speed but may result in less accurate decisions. These trade-offs in selecting an FD’s parameters need to be addressed in order to optimise its QoS. More specifically, this paper addresses the challenge of optimising $\delta$ and $\Delta$/$\varepsilon$. Setting optimal FD parameters will achieve higher accuracy and reduce the failure detection time while meeting constraints of real-time network conditions and system characteristics.

1.3. Our Approach

To overcome the research challenge of the optimal trade-off between an FD algorithm’s query accuracy probability and detection speed, we propose a new threshold-based FD algorithm. We call our FD the Self-Organised Network-Aware Failure Detector (SONAFD). SONAFD learns the probability distribution of heartbeats’ inter-arrival times. It also models failure detection as a nonlinear optimisation problem (NLP). This NLP maximises SONAFD’s QoS in terms of query accuracy probability and upper-bounds its detection time. The NLP decision variables are SONAFD’s heartbeat interval $\delta$ and suspicion threshold timeout $\Delta$, which is used to compute its suspicion threshold $\varepsilon$. SONAFD’s NLP satisfies system and network constraints. This NLP is then converted to an MILP by applying piecewise linearisation (PWL) functions (Geißler et al. 2015; Rebennack and Krasko 2019; Vielma et al. 2010b). Finally, SONAFD uses a greedy heuristic algorithm that we propose for solving the optimisation problem in large-scale systems. To the best of our knowledge, our proposed SONAFD is the first attempt to solve failure detection performance issues by adopting Mixed Integer Linear Programming (MILP)-based optimisation models while taking into consideration the scalability of distributed systems.

The rest of the paper is structured as follows. Section 2 discusses related literature. Section 3 describes the considered system model, while Section 4 presents the design details of our proposed SONAFD. Section 5 validates our proposed approach, solved with CPLEX and with our proposed heuristic algorithm, using an Amazon Cloud testbed and simulation-based robustness tests. Section 6 concludes the paper.

2. Related work

2.1. Failure Detectors and their QoS

Chen et al. (2002) propose quantitative QoS metrics to measure FDs’ failure detection speed, probabilistic accuracy, and mistake rate. Since then, the key works on FDs’ design and performance evaluation have adopted the QoS metrics of Chen et al. (2002) to evaluate and compare FDs’ performance (e.g., Hayashibara et al. 2004; Liu et al. 2017; Satzger et al. 2011). In this paper, we use the same QoS metrics but extend their definitions to a more realistic failure assumption, crash-recovery instead of crash-stop (see Section 3).

To improve the QoS of an FD, Chen et al. (2002) and Bertier et al. (2002) proposed binary-decision (i.e. trust or suspect) FDs that estimate the arrival times of new heartbeats based on observed communication delays. However, such efforts achieve limited performance improvements, for three reasons. First, such FDs do not always provide up-to-date or correct lists of suspected processes. This is due to the volatility of network behaviour, such as delays and losses. Second, the binary outputs of these FDs (i.e. trust or suspect) are not capable of satisfying the QoS requirements at the application level. Third, the network dynamics make it difficult for these FDs to adjust their parameters optimally. For example, network delays may change again while the FD is still adapting to a previous change. Sotoma and Madeira (2001) implement an adaptive FD that uses the average value of heartbeats’ inter-arrivals in its timeout estimation. Similar works such as Xiong et al. (2012) and Turchetti et al. (2016) also try to enhance the QoS of FDs by considering the actual feedback of previously achieved QoS. However, none of these FDs is able to find and set its QoS-aware optimal parameters.

Unlike binary-decision FDs, accrual FDs (e.g., Hayashibara et al. 2004) output a probabilistic estimate that a process has failed. Such a design allows applications to decouple failure interpretation from failure monitoring. The Phi detector proposed by Hayashibara et al. (2004) is the first accrual/threshold-based FD. The Phi detector outputs a positive value on a continuous scale, called Phi (i.e. $\varphi$). Phi reflects a confidence level on the probabilistic likelihood that the monitored process has crashed. This value accrues over time and tends toward infinity if the monitored process crashes. Such a confidence level is compared regularly to a suspicion threshold set by system management/application layers according to their QoS requirements. The Phi detector has been used in a number of real systems, such as OpenDayLight through Akka (Akka 2018, documented in Akka 2021) and the decentralised storage system Cassandra (Lakshman and Malik 2010). The Phi detector assumes that the heartbeats’ inter-arrivals follow the normal distribution. Although the Phi detector has achieved efficient and stable failure detection, its assumption on the heartbeats’ inter-arrivals distribution limits its range of applications.
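As an illustration of how an accrual value of this kind can be computed, the sketch below follows the commonly cited form of the Phi detector, $\varphi(t) = -\log_{10}\big(1 - F(t - t_{last})\big)$, where $F$ is the normal CDF fitted to recent inter-arrival times. The function name, the sample values and the numerical guard are assumptions of this sketch, and production implementations differ in their details.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence


def phi_suspicion(inter_arrivals: Sequence[float], elapsed: float) -> float:
    """Accrual suspicion level under a normal inter-arrival assumption.

    `elapsed` is the time since the last heartbeat was received.
    """
    fitted = NormalDist(mean(inter_arrivals), stdev(inter_arrivals))
    # Probability that the next heartbeat arrives later than `elapsed`.
    p_later = 1.0 - fitted.cdf(elapsed)
    # Guard against log10(0) when `elapsed` lies far in the tail.
    p_later = max(p_later, 1e-300)
    return -math.log10(p_later)


# Hypothetical window of inter-arrival times (seconds), centred on 1.0.
window = [1.0, 1.1, 0.9, 1.0, 1.0]
phi_at_mean = phi_suspicion(window, elapsed=1.0)
phi_late = phi_suspicion(window, elapsed=1.3)
```

The value grows as the silence lengthens, so an application that tolerates more false suspicions can pick a low threshold while a conservative one picks a high threshold, without changing the monitoring code.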

Satzger et al. (2007, 2011) aim to generalise the Phi detector. Instead of assuming a particular distribution for the heartbeats’ inter-arrival times, their FD deduces the cumulative distribution of sampled heartbeats in a sliding window based on a histogram density estimation. A suspicion probability is then obtained using this cumulative distribution. As an adaptive accrual FD, it increases flexibility and decreases computation costs compared to the Phi detector. In our paper, we adopt a similar idea to the Phi detector and Satzger’s approach in terms of exploiting heartbeats’ inter-arrival data and design a suspicion threshold-based FD. This FD capitalises on the benefits of our proposed MILP model to compute its optimal parameters according to the QoS requirements and network conditions.
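A distribution-free suspicion value of the kind described above can be sketched with an empirical CDF over a sliding window. The class below is a simplified illustration rather than the algorithm of Satzger et al.: the class name, the window size and the use of raw samples instead of histogram bins are assumptions of this sketch.

```python
from collections import deque


class EmpiricalSuspicion:
    """Suspicion from the empirical CDF of the most recent inter-arrival times."""

    def __init__(self, window_size: int) -> None:
        self.window = deque(maxlen=window_size)  # oldest samples drop out automatically
        self.last_arrival = None

    def heartbeat(self, arrival_time: float) -> None:
        """Record a heartbeat and the inter-arrival time it closes."""
        if self.last_arrival is not None:
            self.window.append(arrival_time - self.last_arrival)
        self.last_arrival = arrival_time

    def suspicion(self, now: float) -> float:
        """Fraction of recorded inter-arrivals no longer than the current silence."""
        if not self.window:
            return 0.0
        elapsed = now - self.last_arrival
        return sum(1 for gap in self.window if gap <= elapsed) / len(self.window)


detector = EmpiricalSuspicion(window_size=4)
for t in [0.0, 1.0, 2.0, 4.0, 5.0]:  # inter-arrivals: 1, 1, 2, 1
    detector.heartbeat(t)
```

With these samples, a silence of one second is already as long as three of the four recorded gaps, so `detector.suspicion(6.0)` evaluates to 0.75, and any silence longer than two seconds yields 1.0.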

Other failure detectors in a similar fashion have been proposed. Most of these FDs assume that the heartbeats’ inter-arrivals follow either 1) an exponential distribution, as suggested by Xiong and Defago (2007), or 2) a Weibull distribution, as proposed by Liu et al. (2017). The latter FD is an abstract concept and presents challenges that may prevent its deployment in real-world systems (e.g., onerous processing). Also, a fixed assumption on the probability distribution of heartbeats’ inter-arrivals may not be ideal in dynamically changing networks. Hence, such probability distributions should be adaptive for specific applications.

2.2. Communications Network Modelling

Most previous works on failure detection make use of the arrival times of heartbeat messages received within a recent time window. Chen et al. (2002) estimate the network behaviour (i.e. delays and losses) based on the collected information of successfully received heartbeats. The estimation involves computing the expected value and variance of heartbeats’ delays without any assumption on the delays’ distribution in relatively stable traffic conditions. For bursty network traffic, they suggest combining short-term and long-term heartbeat observations, without a concrete implementation of the idea. Bertier et al. (2002) consider only the last received heartbeat message and do not save heartbeats’ history. Sotoma and Madeira (2001) do not consider real-world network conditions; they only simulate network delays as a set of predefined values used alternately as fixed delay intervals. Xiong et al. (2012) take into account the average of heartbeats’ inter-arrivals saved in the history window and include network communication delays. Turchetti et al. (2016) follow a similar approach to Chen et al. (2002). Hayashibara et al. (2004) consider statistics of successfully received heartbeats to obtain the mean and variance of heartbeats’ inter-arrivals. The authors assume that these inter-arrival times follow a normal distribution. Xiong and Defago (2007) and Liu et al. (2017) are similar to Hayashibara et al. (2004) but assume different heartbeat inter-arrival time distributions (the exponential distribution and the Weibull distribution, respectively). As a generalisation of the Phi FD, Satzger et al. (2007, 2011) approximate the cumulative distribution function (CDF) of heartbeats’ inter-arrivals via the cumulative frequencies of the most recently received heartbeat messages. The CDF of the inter-arrivals’ histogram is used directly to obtain suspicion values, as there is no assumption on the distribution of heartbeat arrivals. Although previous work has tried to use available information on heartbeats’ inter-arrivals, none has attempted a flexible setting based on the impact of network conditions on inter-arrival times.

All the FDs mentioned above exploit network conditions only. Our proposed approach combines 1) probability distribution fitting applied to heartbeats’ inter-arrivals and 2) an MILP model for estimating the optimal FD parameters that adapt to the network conditions. This has not been explored in previous work. Our proposed approach shows the effectiveness of exploiting network conditions (as input parameters) and optimising FD parameters together. The trade-off between the QoS of FDs and the cost of network bandwidth is formulated as a decision-making action in uncertain environments, particularly in online learning systems (Abdel-Aziz et al. 2020; Ferreira et al. 2018; Gupta et al. 2006; Tan et al. 2009). Previous literature on networked systems (e.g., Hussin et al. 2015; Madhushani and Leonard 2021; Xu et al. 2018) has studied such a trade-off extensively to enhance system performance. In this paper, we apply such an approach to the failure detection challenge.

2.3. Optimisation Modelling

Our proposed FD optimisation model is inspired by integer programming, which allows us to model our FD’s QoS as decision variables and constraints. Similar problems and approaches in a multitude of applications have been explored in previous literature (e.g., Du et al. 2012, 2019; Gullhav et al. 2017; Kaewpuang et al. 2013; Li et al. 2012). In particular, nonlinear programming (Luenberger and Ye 2016) is suitable for the network- and QoS-driven failure detection problem, because it can handle the general (i.e. nonlinear) formulation of the optimised performance metric under different distributions of heartbeats’ inter-arrivals. Going deeper into our optimisation modelling, MILP (Conforti et al. 2014) allows us to tackle the nonlinearity of constraints by using PWL relaxations (Geißler et al. 2012; Rebennack and Krasko 2019; Vielma et al. 2010a). This facilitates a simpler and more efficient model implementation.

Therefore, we propose to model the distributed algorithm as an optimisation problem (i.e. MILP) that aims at enhancing an FD’s decision-making accuracy while respecting its QoS and resource constraints. Such optimisation approaches have been successfully adopted for resource management (Buyya et al. 2010; Kaewpuang et al. 2013), communications design (Li et al. 2012), load balancing (Gullhav et al. 2017; Liu and Righter 1998), repair policy (Sleptchenko and Johnson 2014), energy footprints management (Shen and Wang 2014), and decision analytics (Heilig and Voß 2014).

In summary, previously proposed FDs are designed for specific conditions. As network systems are evolving rapidly, the need for autonomous and self-adaptable FDs becomes more pressing. None of the existing FDs can optimise its parameters according to changing network conditions, resource constraints and QoS requirements to maximise its failure detection accuracy and speed. Thus, we aim to achieve such an FD design with our proposed MILP optimisation-based FD.

3. System Model

3.1. Distributed Network Model

We consider a distributed network that consists of a finite set of processes $\Sigma_N = \{p_1, p_2, p_3, \ldots, p_N\}$ with $N \geq 2$. These processes can communicate their liveness by sending heartbeat messages. They may fail under the crash-recovery assumption. Under this assumption, a process may fail (e.g., because of human error or an energy outage) and recover (e.g., after intervention of an administrator) many times during an observation period (as studied by Aguilera et al. (2000) and Ma et al. (2010)). In contrast, the crash-stop assumption states that when a process fails, it fails permanently and never recovers (discussed in Chen et al. 2002). We adopt the crash-recovery assumption because it aligns with more realistic failure models.

We consider that the $N$-th process implements the FD and monitors all remaining processes (i.e. $N-1$ monitored processes) in a star topology. This means that monitored processes communicate only with the FD and not with each other. Throughout the paper, FD and $p_N$ are used interchangeably to mean the monitoring entity. We assume that any monitored process $p_i$, $\forall i \in [1\,..\,N-1]$, and the process implementing the FD, i.e. $p_N$, are connected by two unidirectional quasi-reliable communication channels (Barolli et al. 2018). These are defined as unreliable network channels: such a channel ensures that no message can be created, changed, or copied, but messages may be lost in the channel. The notion of ‘channel’ in this paper does not necessarily correspond to a physical channel; it represents an end-to-end connection.

We assume that the clocks of the distributed network are synchronised, i.e. there is no (or negligible) clock drift. In real-world systems, many synchronisation methods/protocols are available, yielding a negligible clock drift (e.g., $10^{-6}$). Such an assumption holds in real-world implementations (Coulouris et al. 2001, 2005; Marouani and Dagenais 2008).

For simplicity, and without loss of generality, we restrict our description to two processes: $p_i$ (where $i$ can take any value in $[1\,..\,N-1]$) and $p_N$. This description extends easily to the $N$-process system. $p_i$ sends liveness messages (i.e. heartbeats) to $p_N$ at a regular interval $\delta_i$. If $p_N$ does not receive a heartbeat message from $p_i$ within a determined timeout, it starts suspecting $p_i$ until it receives a new heartbeat message. This is how a timeout-based FD would make a decision. However, our proposed FD algorithm is threshold-based. It regularly computes an instantaneous suspicion level (noted as $\varepsilon_{s_i}(t)$) at instant $t$ based on previously received heartbeats from $p_i$. Our FD compares $\varepsilon_{s_i}(t)$ to its corresponding threshold $\varepsilon_i$ to decide whether $p_i$ is suspected or trusted by $p_N$. $\varepsilon_i$ is generally set according to application requirements. $\varepsilon_i$ can be mapped to a specific timeout $\Delta_{\varepsilon_i}$. This timeout corresponds to the average of considered timeouts upon new heartbeats’ receptions. We detail the equation relating $\varepsilon_i$ and $\Delta_{\varepsilon_i}$ in Section 4.1. The parameter $\varepsilon_i$ allows our FD to make a monitoring decision on $p_i$. Thus, the configuration parameters that characterise our FD are a heartbeat interval $\delta_i$ and a suspicion threshold $\varepsilon_i$.

3.2. QoS Metrics of FDs

To evaluate the QoS of an FD, a set of probabilistic performance metrics was first proposed by Chen et al. (2002) and extended in Ma et al. (2010). Such metrics have been widely adopted in previous works (Hayashibara et al. 2004; Liu et al. 2017; Satzger et al. 2011; Xiong and Defago 2007). More specifically, the detection time $T_{D_i}$ represents the speed at which a failure is detected. The query accuracy probability $P_{A_i}$ depicts the accuracy of an FD decision. The mistake rate $\lambda_{M_i}$ captures the frequency of the FD’s false decisions. Figure 1 illustrates interactions between $p_i$ and $p_N$ under crash-recovery. Under crash-recovery, the notion of “mistake” may include multiple states. $p_N$ makes a mistake if it suspects $p_i$ while the latter is functional, or trusts it while it has crashed. The FD’s mistake is not only associated with transitions but also with the mutual comparison between the FD output (state or transition) and the states/transitions of $p_i$. In line with Ma et al. (2010), we consider the following states of $p_i$: Recovery when $p_i$ is functional and Crash when it is faulty. Hence, the transition from Recovery to Crash is called the C-Transition, and the transition from Crash to Recovery is called the R-Transition. Figure 1 also shows the main FD QoS metrics, including the mistakes that happen in crash-recovery systems. A mistake can be:

- Mistake type 1: the FD starts suspecting $p_i$ (i.e. S-Transition) while it is functional (in Recovery state).

- Mistake type 2: the FD is trusting $p_i$ while it has just crashed (i.e. C-Transition).

- Mistake type 3: the FD starts trusting $p_i$ (i.e. T-Transition) while it is faulty (in Crash state).

- Mistake type 4: the FD is suspecting $p_i$ while it has just recovered (i.e. R-Transition).

[Figure 1: timelines of $p_i$ (up/down) and of the FD output at $p_N$ (trust/suspect) over the observation period $T$, marking the S-, T-, C- and R-Transitions, the four mistake types (1: suspect while healthy; 2: trust while crashing; 3: trust while faulty; 4: suspect while recovering), the detection time $T_{D_i}$, the periods $T^1_{G_i}, \ldots, T^5_{G_i}$, and the example values $P_{A_i} = \sum_k T^k_{G_i} / T$ and $\lambda_{M_i} = 4/T$.]

Figure 1: Definition of mistakes and mainly considered QoS metrics for an FD algorithm.

Therefore, we adopt the following set of QoS metrics to evaluate an FD’s performance:

- Detection Time ($T_{D_i}$): this random variable represents the time elapsed from the instant $p_i$ crashes to the instant $p_N$ starts suspecting $p_i$. This metric reflects the speed at which an FD detects faults of $p_i$. The shorter $T_{D_i}$ is, the faster the failure detector is.

- Mistake Rate ($\lambda_{M_i}$): this random variable represents the average number of mistakes (whatever the mistake type is) that $p_N$ makes per time unit with respect to the state of $p_i$.

- Query Accuracy Probability ($P_{A_i}$): it measures the probability that, when queried at a random time, $p_N$ correctly indicates the state of $p_i$. In practice, $P_{A_i}$ is computed as the ratio of the sum of the time periods (i.e. $T^k_{G_i}$ in Figure 1) during which $p_N$ correctly specifies the state of $p_i$ to the observation period.
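The three metrics can be computed from two piecewise-constant timelines, one for the true state of the monitored process and one for the FD output. The sketch below is illustrative only: the function names and the example timelines are hypothetical, and counting each maximal period of disagreement as exactly one mistake is a simplifying assumption of this sketch.

```python
def state_at(change_points, t):
    """Value of a piecewise-constant signal at time t.

    `change_points` is a time-sorted list of (time, value) pairs starting at 0.0.
    """
    value = change_points[0][1]
    for time, new_value in change_points:
        if time > t:
            break
        value = new_value
    return value


def accuracy_and_mistake_rate(process_up, fd_trusts, horizon):
    """Return (query accuracy probability, mistake rate) over [0, horizon]."""
    cuts = {0.0, float(horizon)}
    cuts.update(t for t, _ in process_up if 0.0 < t < horizon)
    cuts.update(t for t, _ in fd_trusts if 0.0 < t < horizon)
    times = sorted(cuts)
    good_time, mistakes, previously_correct = 0.0, 0, True
    for start, end in zip(times, times[1:]):
        correct = state_at(process_up, start) == state_at(fd_trusts, start)
        if correct:
            good_time += end - start      # FD output matches the true state
        elif previously_correct:
            mistakes += 1                 # a new period of disagreement begins
        previously_correct = correct
    return good_time / horizon, mistakes / horizon


def detection_times(process_up, fd_trusts):
    """Delay between each crash and the first instant the FD suspects."""
    delays = []
    for crash_time, is_up in process_up:
        if is_up:
            continue
        if not state_at(fd_trusts, crash_time):
            delays.append(0.0)            # already suspected when the crash occurs
            continue
        later = [t for t, trusts in fd_trusts if not trusts and t >= crash_time]
        if later:
            delays.append(min(later) - crash_time)
    return delays


# Hypothetical 10-second observation: the process crashes at 4 and recovers at 7.
process_up = [(0.0, True), (4.0, False), (7.0, True)]
fd_trusts = [(0.0, True), (2.0, False), (3.0, True), (5.0, False), (8.0, True)]
p_a, mistake_rate = accuracy_and_mistake_rate(process_up, fd_trusts, horizon=10.0)
t_d = detection_times(process_up, fd_trusts)
```

In this example the FD is correct during 7 of the 10 seconds, disagrees with the true state in three separate periods (a wrong suspicion, a late detection and a late recovery), and takes one second to detect the crash.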

In this paper, we focus on optimising the trade-off between two main metrics, $P_A$ and $T_D$, in our MILP modelling. The FD measures the expected $T_D$ and $P_A$ based on the $T_{D_i}$ and $P_{A_i}$ values over all monitored processes, respectively. We also report $\lambda_M$ (i.e. the expected $\lambda_{M_i}$ over all monitored processes) in our evaluation results. $P_A$ and $T_D$ are incorporated into the objective function and/or constraints (introduced in Section 4.2.2). This ensures that the solution of the MILP satisfies the FD’s QoS requirements.

4. Self-optimised Network-Aware Failure Detector: SONAFD

An FD should be aware of the continuously changing communication network and adapt itself to provide efficient failure detection. It should have the following capabilities: 1) identifying network messaging characteristics and behaviour (e.g., the probability distribution of heartbeats’ inter-arrivals); 2) adapting its decision threshold regularly to provide fast and accurate decisions about monitored processes’ failures/recoveries; and 3) guaranteeing failure detection QoS requirements under existing resource constraints and providing optimal FD parameters that maximise its QoS.

To achieve these desired features, an adaptive FD which fulfils different QoS needs is required. While existing FDs allow heartbeat timeouts to be updated (Bertier et al. 2002; Chen et al. 2002; Hayashibara et al. 2004; Satzger et al. 2007, 2011), they are unable to optimally incorporate QoS requirements into their parameter settings or to consider network resource constraints. Therefore, we introduce the Self-Organised Network-Aware Failure Detector (SONAFD). We design SONAFD to address existing FDs’ performance issues and achieve the features discussed above. SONAFD is a combination of an adaptive FD algorithm that we call the Network-Aware Failure Detector (NAFD) and an MILP model.

NAFD learns the probability distribution of heartbeats’ inter-arrivals and adapts its suspicion threshold $\varepsilon_i$ to dynamic network conditions. NAFD contains three main parts. First, NAFD collects a sample of recent heartbeats’ inter-arrivals. It then performs probability distribution fitting on the collected sample using the Kolmogorov-Smirnov (K-S) test (Kolmogorov 1933; Smirnov 1948). The aim is to determine the representative probability distribution of heartbeats’ inter-arrivals in real-world network environments. The choice of candidate probability distributions is independent of NAFD, as any probability distribution can be fitted to a collected sample of heartbeat inter-arrivals. Second, NAFD applies the fitted probability distribution to a short window of heartbeats’ inter-arrivals to infer the probabilistic likelihood that $p_i$ has crashed. This likelihood is continuously compared to $\varepsilon_i$ to make failure detection decisions (trust or suspect). Third, NAFD updates $\varepsilon_i$ regularly using collected data on network packet delays and losses (collected simultaneously with heartbeats’ inter-arrivals) to adapt to changing network conditions.

We also propose an NLP model for NAFD that maximises its query accuracy probability while enforcing the required upper bound on its failure detection time. The NLP guarantees the required system constraints (e.g., bandwidth) and accounts for network conditions through the fitted probability distribution. To reduce the computational complexity, the NLP is converted into an MILP model. The optimal solutions of the MILP model are the values of $\delta_i$ and $\Delta_{\varepsilon_i}$ that NAFD uses in its settings. In addition, we propose a greedy heuristic algorithm to compute the solutions of the MILP model efficiently and enhance its scalability.

In summary, NAFD represents the failure detection functions of how heartbeats’ inter-arrivals should

be exploited to optimally detect failure or recovery. SONAFD adopts our proposed MILP to automate the

optimal parameters’ settings and meets the expected QoS of NAFD and system requirements.

4.1. NAFD

For NAFD, we aim to achieve three features. The first feature is an FD that adopts a customised heartbeats’ inter-arrivals’ probability distribution. Such a distribution depends on the distributed system environment in which NAFD is implemented, and may be updated regularly to adapt to network environment changes. The second feature is the heartbeat message function and threshold-based accrual FD algorithm. The third feature is the self-adapting suspicion threshold. Table 1 contains the main notations of NAFD. When NAFD starts, $p_i$ sends a heartbeat message every $\delta_i$ time interval (Algorithm 1(a)-line 3), and NAFD uses its adapted threshold to detect potential failures. The features of NAFD are detailed below. The NAFD pseudo-code is given in Algorithm 1(a) for $p_i$ and in Algorithm 1(b) for the FD process (the detailed procedures’ pseudo-codes are available in Appendix G).

First, the heartbeat messages’ inter-arrivals’ probability distribution fitting (Algorithm 1(b)-line 3, with further details in Algorithm B.1 in Appendix B): NAFD is designed to take into account the changing behaviour of heartbeats’ inter-arrivals. NAFD/SONAFD continuously collects heartbeats’ inter-arrivals and performs probability distribution fitting in the background by regularly computing the goodness-of-fit test distance. NAFD/SONAFD adopts the probability distribution for which this distance is lower than the critical value of the goodness-of-fit test and has consistently been the lowest over the last twenty-four hours. We use the K-S test (Appendix A) as the goodness-of-fit test. The main advantage of the K-S criterion is that the distribution of the K-S test statistic itself does not depend on the underlying cumulative distribution function being tested. It is also an exact test, i.e. its validity does not rely on an adequately large sample size (Guthrie 2020). Let $\pi_i^*$ and $\Pi_i$ be the best fitted distribution and the set of its characterising parameters (estimated from the fitted sample), respectively. For $p_i$, $\pi_i^*$ and $\Pi_i$ are the outputs of the probability distribution fitting procedure. The details of probability distribution fitting are given in Appendix B.
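As an illustration of the fitting step, the following Python sketch computes the K-S distance between a sample of inter-arrivals and an exponential distribution whose mean is estimated from the same sample. The function names, the 5% significance level and the large-sample critical value of about $1.358/\sqrt{n}$ are assumptions made for this example and are not taken from Algorithm B.1. Estimating the mean from the tested sample makes this critical value conservative.

```python
import math


def ks_distance_exponential(samples):
    """Return (D, mean): the K-S distance between the empirical CDF of
    `samples` and an exponential CDF parameterised by the sample mean."""
    xs = sorted(samples)
    n = len(xs)
    mean = sum(xs) / n
    d = 0.0
    for k, x in enumerate(xs, start=1):
        cdf = 1.0 - math.exp(-x / mean)
        # The empirical CDF jumps from (k - 1) / n to k / n at x.
        d = max(d, k / n - cdf, cdf - (k - 1) / n)
    return d, mean


def ks_critical_value(n, alpha=0.05):
    """Large-sample approximation of the K-S critical value (assumption)."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)
```

A candidate distribution would be retained in this sketch when the returned distance is below `ks_critical_value(n)`.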

Table 1
General notations related to NAFD.

| Notation | Description |
|---|---|
| $N$ | Number of processes in the system. |
| $\delta_i$ | An FD parameter representing the heartbeats period of $p_i$ (ms). |
| $\varepsilon_i$ | An FD parameter representing the threshold of NAFD/SONAFD for $p_i$. |
| $\Delta_{\varepsilon_i}$ | The mean equivalent timeout associated with threshold $\varepsilon_i$ (ms). |
| 𝑖 | Heartbeat history of $p_i$. |
| $W_i = W$ | Sampling window size of $p_i$: maximum size of the heartbeat history. |
| $\mu_i$ | Mean of heartbeat inter-arrivals in the heartbeat history of $p_i$ (ms). |
| $\sigma_i$ | Standard deviation of heartbeat inter-arrivals in the heartbeat history of $p_i$ (ms). |
| $G_i(t)$ | The probability that a given heartbeat message will arrive more than $t$ time units later than the previous heartbeat for $p_i$, assuming that heartbeat inter-arrivals follow the distribution $\pi_i^*$, characterised by parameters set $\Pi_i$. |
| 𝑖 | Network history of $p_i$ containing delays and packet losses information of its corresponding communication channel. |
| $D_i = \mathbb{E}(D_i^j)$, $j$ in the network history | Packet delay mean for $p_i$ computed for the whole network history (ms). |
| $S_i$ | Packet delay standard deviation for $p_i$ computed for the whole network history (ms). |
| $\tau_i$ | Packet loss rate for $p_i$ computed for the whole network history. |
| $P_{CL}$ | Required percentage of confidence coverage to estimate the number of consecutively lost packets at $\tau_i$ (e.g., 99%). |
| $K_i$ | Estimated number of consecutive lost packets at $P_{CL}$ confidence coverage. |
| $T_{mon}$ | Time interval used by NAFD/SONAFD to check the state of monitored processes (ms). |
| $T_{adapt}$ | Time interval used by NAFD to trigger $\varepsilon_i$ adaptation (ms). |

Second, the accrual failure detection and threshold: NAFD adopts the fitted distribution to predict the arrival time of the next expected heartbeat message and to identify the probabilistic likelihood that $p_i$ has crashed. There are two operations:

- Heartbeats’ inter-arrivals’ sampling (Algorithm 1(b)-line 9, with details in Algorithm G.2-lines 2-3): NAFD keeps track of an observation window of recent heartbeats’ inter-arrivals (i.e. the heartbeat history of $p_i$). NAFD saves the last $W_i$ heartbeats’ inter-arrivals in this sampling window; if the window holds more than $W_i$ entries, the oldest ones are dropped. NAFD uses the window to estimate the parameters of $\pi_i^*$ (e.g., mean and variance), which are stored in $\Pi_i$. We further discuss the size $W_i$ of the observation window and its impact on the overall FD performance in Section 4.4.

- Calculating NAFD’s suspicion level (Algorithm 1(b)-line 6, with details in Algorithm G.3-lines 4-6): NAFD uses the recent heartbeats’ inter-arrivals held in the heartbeat history to estimate the probability that $p_i$ has crashed. This calculation is based on the adopted probability distribution $\pi_i^*$ and its parameters $\Pi_i$ estimated from the heartbeat history. NAFD continuously computes this ‘probability’ of crash and converts it into a positive real number $\varepsilon_{s_i}$, which it then compares to $\varepsilon_i$ to make a failure detection decision. Given that $T_{last_i}$ is the arrival time of the last heartbeat received from $p_i$, $\varepsilon_{s_i}$ is computed at the current instant $t_{now}$ as follows:

$$\varepsilon_{s_i}(t_{now}) \overset{\text{def}}{=} -\log_{10}\big(G_i(t_{now} - T_{last_i})\big), \quad \text{where } G_i(t) = 1 - CDF(t, \pi_i^*, \Pi_i), \tag{1}$$

where $G_i(t)$ represents the probability that a given heartbeat will arrive more than $t$ time units later than the previous heartbeat. Given $\Delta t = t_{now} - T_{last_i}$, $G_i(\Delta t)$ is the probability that the next expected heartbeat arrives more than $\Delta t$ time units after the last received heartbeat. The value of $\varepsilon_{s_i}$ increases as the time difference $t_{now} - T_{last_i}$ increases. $\varepsilon_{s_i}$ is compared to the threshold $\varepsilon_i$: if $\varepsilon_{s_i}$ is lower than or equal to $\varepsilon_i$, $p_N$ trusts $p_i$; otherwise, $p_N$ suspects it.
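The two operations above can be sketched as follows, assuming exponentially distributed inter-arrivals so that $G_i(t) = e^{-t/\mu_i}$. The class and function names are hypothetical, and the sketch is not the implementation of Algorithms G.2 and G.3.

```python
import math
from collections import deque


class HeartbeatHistory:
    """Sliding window holding the last `window_size` inter-arrivals (ms)."""

    def __init__(self, window_size):
        # A bounded deque drops the oldest entry once the window is full.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None

    def record(self, arrival_ms):
        if self.last_arrival is not None:
            self.intervals.append(arrival_ms - self.last_arrival)
        self.last_arrival = arrival_ms

    def mean(self):
        return sum(self.intervals) / len(self.intervals)


def suspicion_level(t_now, history):
    """Equation (1) with an exponential G_i (an assumption of this sketch)."""
    elapsed = t_now - history.last_arrival
    survival = math.exp(-elapsed / history.mean())
    # Guard against floating-point underflow for very long silences.
    return math.inf if survival == 0.0 else -math.log10(survival)


def is_trusted(t_now, history, threshold):
    """Trust the process while the suspicion level does not exceed the threshold."""
    return suspicion_level(t_now, history) <= threshold
```

With an exponential $G_i$, the suspicion level grows linearly with the elapsed time, so a threshold of 1 corresponds to a silence of $\mu_i \ln 10$.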

Third, adaptation of NAFD’s suspicion threshold $\varepsilon_i$ (Algorithm 1(b)-line 7, further detailed in Algorithm G.4-lines 6 to 10): to adapt to changing network conditions (high network delay or large bursts of packet loss), NAFD adjusts its threshold autonomously in order to maintain the required QoS in terms of accuracy and speed. In addition to the heartbeats’ inter-arrivals saved in the heartbeat history, NAFD also saves a history of network packets’ delays and losses (the network history of $p_i$). The size of this network/channel history is different from $W_i$: it corresponds to one hour of collected delay and packet loss information. Referring to equation (1), $\varepsilon_{s_i}$ expresses the degree to which the time elapsed since the last heartbeat arrival is lower or higher than a given timeout, which we denote $\Delta_{\varepsilon_i}$. $\Delta_{\varepsilon_i}$ satisfies (2):

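$$\varepsilon_i = -\log_{10}\big(G_i(\Delta_{\varepsilon_i})\big), \text{ such that (see Appendix D): } \Delta_{\varepsilon_i} = \begin{cases} D_i + 3 \cdot S_i + K_i \cdot \delta_i & \text{if } D_i \ge S_i \\ P_{99th_i} + K_i \cdot \delta_i & \text{otherwise,} \end{cases} \tag{2}$$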

where $D_i$ is the average heartbeat messages’ delay, $S_i$ is the standard deviation of heartbeat messages’ delay, $P_{99th_i}$ is the 99th percentile of heartbeat messages’ delay, $\tau_i$ is the heartbeat messages’ loss rate and $K_i$ is the estimated number of successively lost packets at the packet loss rate $\tau_i$. $K_i$ is obtained as described in Appendix E. Finally, $\varepsilon_i$ is computed as:

𝜀𝑖=−𝑙𝑜𝑔10 (𝐺𝑖(𝐷𝑖+ 3 ⋅𝑆𝑖+𝐾𝑖⋅𝛿𝑖)) if 𝐷𝑖≥𝑆𝑖

−𝑙𝑜𝑔10 (𝐺𝑖(𝑃99𝑡ℎ𝑖+𝐾𝑖⋅𝛿𝑖)) otherwise. (3)
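A minimal sketch of the threshold adaptation in equations (2) and (3) is given below, assuming an exponential $G_i$ so that $-\log_{10}(G_i(t)) = t / (\mu_i \ln 10)$. The function and parameter names are hypothetical, and $K_i$ is taken as an input because its estimation (Appendix E) is not reproduced here.

```python
import math


def equivalent_timeout(delay_mean, delay_std, delay_p99, k_lost, delta):
    """Timeout of equation (2), all durations in ms."""
    if delay_mean >= delay_std:
        base = delay_mean + 3.0 * delay_std
    else:
        base = delay_p99
    # Tolerate k_lost consecutive lost heartbeats on top of the delay margin.
    return base + k_lost * delta


def adapted_threshold(timeout, mean_interarrival):
    """Equation (3) for an exponential G_i (an assumption of this sketch)."""
    return timeout / (mean_interarrival * math.log(10.0))
```

For example, with $D_i = 20$ ms, $S_i = 5$ ms, $K_i = 2$ and $\delta_i = 10$ ms, the timeout is $20 + 15 + 20 = 55$ ms.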

Algorithm 1(a) NAFD (in $p_i$, $\forall i \in [1..N-1]$):

1: $\delta_i = 10\,ms$
2: procedure SENDHEARTBEAT()
3:     while $mod(t_{now}, \delta_i) = 0$ do Send heartbeat

Algorithm 1(b) NAFD (in $p_N$):

1: $\forall i \in [1..N-1]$: $\varepsilon_i = 1$; $T_{mon} = 10\,ms$; $T_{adapt} = 1200000\,ms$
2: for $i$, $i \in [1..N-1]$ do
3:     $\pi_i^* \leftarrow$ FITHBINTERARRIVALS(𝑖)
4: procedure MAIN
5:     if $t_{now} = t_{start}$ then INITIALISE()
6:     else if $mod(t_{now}, T_{mon}) = 0$ then DETECTFAILURE()
7:     else if $mod(t_{now}, T_{adapt}) = 0$ then ADAPTTHRESHOLDS()
8:     while $heartbeat$ do
9:         PROCESSRECEIVEDHEARTBEAT($t_{now}$)

4.2. SONAFD

To optimise NAFD, we consider two QoS metrics: the failure detection time $T_D$ and the query accuracy probability $P_A$, which represent the delay of detecting a failure and the accuracy of failure decisions, respectively (see details in Section 3.2). Note that there is a trade-off between failure detection time and accuracy: the higher $\varepsilon_i$ is, the worse $T_D$ is (i.e. slower failure detection), with a potentially better $P_A$. In addition, we use the mistake rate $\lambda_M$ to model the relationships between the considered performance metrics ($T_D$ and $P_A$) and SONAFD’s parameters ($\delta_i$ and $\varepsilon_i$) (Section 4.2.1). We also use $\lambda_M$ to evaluate the performance of SONAFD in Section 5.

We propose an NLP model to design SONAFD with optimal parameters and guarantee the required QoS. To reduce the computational complexity, we transform the NLP into an MILP model; we refer to them as $\mathbb{N}$ and $\mathbb{M}$, respectively (see details in Section 4.2.2). Table 2 contains the notations related to the model $\mathbb{N}$/$\mathbb{M}$. The objective of $\mathbb{N}$/$\mathbb{M}$ is to find optimal values of SONAFD’s parameters, namely the heartbeats’ intervals $\tilde{\delta}_i$ and the suspicion thresholds’ timeouts $\tilde{\Delta}_{\varepsilon_i}$, that guarantee maximal $P_A$ and bounded $T_D$: $\delta_i$ and $\Delta_{\varepsilon_i}$ are the decision variables (optimised values are marked here with a tilde). $\varepsilon_i$ is then obtained from $\tilde{\Delta}_{\varepsilon_i}$ via equation (4), where $\tilde{G}_i$ represents $G_i$ whose characterising parameter (i.e. mean) is replaced by $\tilde{\delta}_i$:

$$\varepsilon_i \overset{\text{def}}{=} -\log_{10}\big(\tilde{G}_i(\tilde{\Delta}_{\varepsilon_i})\big). \tag{4}$$

The pseudo-code of SONAFD is presented in Algorithm 2(a) for $p_i$ and in Algorithm 2(b) for $p_N$. At each $p_i$, the update of $\delta_i$ with its optimised value $\tilde{\delta}_i$ makes SONAFD different from NAFD (i.e. Algorithm 2(a)-line 4). At $p_N$, SONAFD uses all of NAFD’s procedures and additionally uses the function OPTIMISEPARAMETERS (i.e. Algorithm 2(b)-line 7, further detailed in Algorithm G.5). This procedure differentiates SONAFD from NAFD by adding $\mathbb{M}$ to NAFD and computing optimal parameters as its solution. SONAFD runs OPTIMISEPARAMETERS at regular intervals (i.e. every $T_{opt}$). The procedure starts by retrieving heartbeat, network and system information in Step 1 (Algorithm G.5-lines 2-6). $\mathbb{M}$ is solved in Step 2 (Algorithm G.5-line 7) using the IBM ILOG CPLEX Optimizer as a black-box MILP solver, which is based on an implementation of the Branch and Cut algorithm (IBM 2017). In Step 3, the output of solving $\mathbb{M}$ is retrieved. If $\mathbb{M}$ is feasible, all processes are set to use the optimised parameters once it is solved, i.e. $\tilde{\delta}_i$ (Algorithm 2(a)-line 4 and Algorithm G.5-line 10) and $\varepsilon_i$ (Algorithm G.5-line 12). If $\mathbb{M}$ is infeasible, SONAFD proceeds to $\varepsilon_i$ adaptation instead (Algorithm G.5-line 14), as detailed in Algorithm G.4. Performance and scalability evaluation results are presented in Section 5.

Algorithm 2(a) SONAFD (in $p_i$, $\forall i \in [1..N-1]$):

1: $\delta_i = 10\,ms$
2: procedure SENDHEARTBEAT()
3:     while $mod(t_{now}, \delta_i) = 0$ do
4:         if (heartbeat interval update ($\tilde{\delta}_i$)) then $\delta_i = \tilde{\delta}_i$
5:         Send heartbeat

Algorithm 2(b) SONAFD ($p_N$):

1: $\varepsilon_i = 1$; $T_{mon} = 10\,ms$; $T_{opt} = 1200000\,ms$
2: for $i$, $i \in [1..N-1]$ do
3:     $\pi_i^* \leftarrow$ FITHBINTERARRIVALS(𝑖)
4: procedure MAIN
5:     if $t_{now} = t_{start}$ then INITIALISE()
6:     else if $mod(t_{now}, T_{mon}) = 0$ then DETECTFAILURE()
7:     else if $mod(t_{now}, T_{opt}) = 0$ then OPTIMISEPARAMETERS()
8:     while $heartbeat$ do
9:         PROCESSRECEIVEDHEARTBEAT($t_{now}$)

4.2.1. Relationships between FD’s QoS metrics and SONAFD parameters

In this section, we analyse in detail how the heartbeats’ period $\delta_i$ and the suspicion threshold $\varepsilon_i$ (and its corresponding timeout $\Delta_{\varepsilon_i}$) impact the considered QoS metrics of the FD. This is important for modelling the constraints of the SONAFD MILP problem (constraints (19) and (23)). Once $\mathbb{M}$ is solved, the optimised $\delta_i$ and $\varepsilon_i$ (i.e. $\Delta_{\varepsilon_i}$) become the new parameter settings of SONAFD.

Failure Detection time: to have rigorous QoS considerations, we adopt the worst-case scenario for the estimation of $T_{D_i}$ by considering the longest failure detection time. In such a scenario, a crash happens immediately after $p_i$ has successfully sent a heartbeat message to $p_N$. $T_{last_i}$ of SONAFD is therefore updated with the arrival time of this heartbeat message, provided that the message is not lost. Let $t_{D_i}$ be the instant at which SONAFD detects this failure. SONAFD detects the failure when the current $\varepsilon_{s_i}$ exceeds $\varepsilon_i$, i.e. $p_N$ suspects $p_i$ just after $\varepsilon_{s_i}(t_{now}) = \varepsilon_i$. We use this inequality to bound the detection time $T_{D_i}$: at $t_{now} = t_{D_i}$, we have $\varepsilon_{s_i}(t_{D_i}) > \varepsilon_i$. By replacing $\varepsilon_{s_i}$ by its definition from equation (1), and $\varepsilon_i$ by its definition according to equation (2) (with $\tilde{G}_i$ denoting $G_i$ whose mean is replaced by the heartbeat period $\delta_i$), we have:

$$-\log_{10}\big(G_i(t_{D_i} - T_{last_i})\big) > -\log_{10}\big(\tilde{G}_i(\Delta_{\varepsilon_i})\big) \implies G_i(t_{D_i} - T_{last_i}) < \tilde{G}_i(\Delta_{\varepsilon_i}). \tag{5}$$

Table 2
Notations related to the design of $\mathbb{M}$.

| Notation | Description |
|---|---|
| $\delta_i$ | A decision variable representing heartbeats’ period of process $i$. |
| $\Delta_{\varepsilon_i}$ | A decision variable representing the mean equivalent timeout associated to threshold $\varepsilon_i$. |
| $B$ | Network bandwidth (bits/ms). |
| $M$ | Message/Packet size (bits). |
| $O$ | Allowed overhead percentage of transmitted heartbeats. |
| $T_{D_i}$ | Detection time of process $i$ (ms). |
| $T_D^{TS}$ | Tolerated detection time of FD or detection time threshold (ms). |
| $P_{A_i}$ | Query accuracy probability of process $i$. |
| $P_{A_i}^{Req}$ | The minimum required query accuracy probability for process $i$ (e.g., 95%). |
| $T_S$ | The time precision used as the discretisation step in the discretisation of $\mathbb{N}$ (ms). |
| $F_{UB}$ | A multiplication factor to set up an upper-bound on $\Delta_{\varepsilon_i}$ in the discretisation of $\mathbb{N}$ ($>1$). |
| $T_{opt}$ | Time interval used by SONAFD to trigger its MILP initialisation and solving (ms). |

For the probability distribution of heartbeats’ inter-arrivals in the Amazon Elastic Compute Cloud (EC2), the exponential distribution² represents a good trade-off for fitting the different tested monitored processes (i.e. virtual machines), with parameter $\Pi_i = \{\mu_i\}$. The distribution choice is discussed in Section 4.3. This means that $G_i(t) = e^{-t/\mu_i}$, where $\mu_i$ is the average of heartbeats’ inter-arrivals of process $i$:

$$e^{-\frac{t_{D_i} - T_{last_i}}{\mu_i}} < e^{-\frac{\Delta_{\varepsilon_i}}{\delta_i}} \implies \frac{-(t_{D_i} - T_{last_i})}{\mu_i} < \frac{-\Delta_{\varepsilon_i}}{\delta_i} \implies t_{D_i} - T_{last_i} > \frac{\mu_i}{\delta_i} \cdot \Delta_{\varepsilon_i}. \tag{6}$$

As we are considering the worst-case scenario of failure occurrence, let $t_{F_i}$ be the instant at which $p_i$ crashes, which corresponds to the sending time of its last heartbeat before the failure. Then $T_{last_i} = t_{F_i} + D_{F_i}$, where $D_{F_i}$ is the delay of the last heartbeat message sent before the failure. Substituting $T_{last_i}$ into (6), the detection time $T_{D_i}$ satisfies

$$T_{D_i} = t_{D_i} - t_{F_i} > \frac{\mu_i}{\delta_i} \cdot \Delta_{\varepsilon_i} + D_{F_i}. \tag{7}$$

Let $m_i^a$ and $m_i^b$ be two heartbeats of $p_i$ with sequence numbers $a$ and $b$, respectively. $m_i^a$ and $m_i^b$ can be either successive heartbeats (i.e. $b = a + 1$) or not (lost heartbeat in the network). Let $T_i^a$, $T_i^b$, $A_i^a$, $A_i^b$, $D_i^a$ and $D_i^b$ be the transmit times, the arrival times and the delays of $m_i^a$ and $m_i^b$, respectively. Let $\beta_i^{ab}$, $\gamma_i^{ab}$ and $J_i^{ab}$ be the inter-arrival time, the inter-transmit time and the jitter (packet delay variation) between $m_i^a$ and $m_i^b$, respectively. The jitter represents the delay difference between two successive heartbeats. Then:

$$\begin{aligned}
& A_i^a = T_i^a + D_i^a; \quad A_i^b = T_i^b + D_i^b; \quad J_i^{ab} = D_i^b - D_i^a; \quad \beta_i^{ab} = A_i^b - A_i^a; \quad \gamma_i^{ab} = T_i^b - T_i^a \\
& \implies \beta_i^{ab} = A_i^b - A_i^a = T_i^b + D_i^b - (T_i^a + D_i^a) = (T_i^b - T_i^a) + D_i^b - D_i^a = \gamma_i^{ab} + J_i^{ab} \\
& \implies \beta_i^{ab} = \delta_i + J_i^{ab}.
\end{aligned} \tag{8}$$

² Such a distribution can change within different network environments and conditions. Our proposed K-S fitting process can adapt to such distribution changes and identify alternative best-fitted probability distributions.

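Since $p_i$ sends heartbeats at regular intervals, $\gamma_i^{ab} = \delta_i$. Consequently, we have $\mu_i = \mathbb{E}_{ab}(\beta_i^{ab}) = \mathbb{E}_{ab}(\delta_i + J_i^{ab}) \implies \mu_i = \delta_i + \mathbb{E}_{ab}(J_i^{ab}) = \delta_i + J_i$, where $J_i$ is the average jitter collected for the heartbeats of $p_i$:

$$\mu_i = \delta_i + J_i \implies T_{D_i} > \frac{\delta_i + J_i}{\delta_i} \times \Delta_{\varepsilon_i} + D_{F_i}. \tag{9}$$

Generally, the average jitter $J_i$ is small compared to $\delta_i$ and can be neglected in equation (9): $J_i = o(\delta_i) \implies \frac{\delta_i + J_i}{\delta_i} \simeq 1$.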

The average detection time (in the worst case) for $p_i$ is therefore estimated as $T_{D_i} > \Delta_{\varepsilon_i} + D_{F_i}$. Since $D_{F_i}$ is the delay of the last heartbeat sent before the failure, it is subject to the current network delay: like any heartbeat sent over the network, it undergoes a random delay that can be estimated by the average network delay. We therefore replace $D_{F_i}$ by $D_i$ to obtain a more general formula:

$$T_{D_i} > \Delta_{\varepsilon_i} + D_i. \tag{10}$$

Equation (10) describes the direct relation between the SONAFD timeout $\Delta_{\varepsilon_i}$ and its worst-case $T_{D_i}$. Since the timeout is related to the heartbeat interval (see constraints (20) and (21) in Section 4.2.2), a shorter $T_{D_i}$ corresponds to a shorter $\delta_i$ (higher heartbeat frequency) and a shorter $\Delta_{\varepsilon_i}$ (i.e. a smaller $\varepsilon_i$).
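To make the effect of neglecting the jitter concrete, the short sketch below evaluates the right-hand side of equation (9) and its simplification in equation (10). The function name and the numerical values are illustrative assumptions only.

```python
def detection_time_bound(timeout, delay, delta, jitter=0.0):
    """Right-hand side of equation (9); with jitter = 0 it reduces to (10).

    All arguments are durations in ms.
    """
    return (delta + jitter) / delta * timeout + delay
```

With a timeout of 55 ms, an average delay of 20 ms and a heartbeat period of 10 ms, the bound is 75 ms without jitter and about 76.1 ms with an average jitter of 0.2 ms, which illustrates why the jitter term is dropped when $J_i$ is small relative to $\delta_i$.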

Mistake Rate: the mistake rate $\lambda_{M_i}$ is the frequency at which mistakes occur, i.e. the inverse of the average mistake recurrence time $\mathbb{E}(T_{MR_i})$, where $T_{MR_i}$ is the time between two consecutive mistakes. Their relationship is given in equation (11):

$$\lambda_{M_i} = \frac{1}{\mathbb{E}(T_{MR_i})}. \tag{11}$$

Following Chen et al. (2002), the average mistake recurrence time is expressed as $\mathbb{E}(T_{MR_i}) = \frac{\delta_i}{P_{s_i}}$, where $P_{s_i}$ is the probability that $p_N$ suspects $p_i$ (i.e. that an S-Transition occurs). For SONAFD, $P_{s_i}$ is the probability that a heartbeat arrives more than $\Delta_{\varepsilon_i}$ time units later than the previous heartbeat, i.e. $P_{s_i} = G_i(\Delta_{\varepsilon_i})$. Therefore, the mistake rate is given by equation (12):

$$\lambda_{M_i} = \frac{G_i(\Delta_{\varepsilon_i})}{\delta_i}. \tag{12}$$

Query Accuracy Probability: let $\mathbb{E}(T_{M_i})$ be the average mistake duration. The query accuracy probability $P_{A_i}$ is defined by equation (13):

$$P_{A_i} = 1 - \frac{\mathbb{E}(T_{M_i})}{\mathbb{E}(T_{MR_i})}. \tag{13}$$

According to Chen et al. (2002), the average mistake duration is expressed as $\mathbb{E}(T_{M_i}) = \frac{\delta_i}{q_{0_i}}$, where $q_{0_i}$ is the probability that, for any $k \ge 2$, $p_N$ receives heartbeat $m_i^{k-1}$ before time $t + \Delta_{to_i}^{k}$ ($\Delta_{to_i}^{k}$, the timeout corresponding to message $k$, is equivalent to $\Delta_{\varepsilon_i}$). This means that $q_{0_i}$ is the probability that a heartbeat arrives less than $\Delta_{to_i}^{k}$ time units later than the previous heartbeat, i.e. $q_{0_i} = 1 - G_i(\Delta_{to_i}^{k}) = 1 - G_i(\Delta_{\varepsilon_i})$. Therefore, $\mathbb{E}(T_{M_i})$ is as follows:

$$\mathbb{E}(T_{M_i}) = \frac{\delta_i}{1 - G_i(\Delta_{\varepsilon_i})}. \tag{14}$$

By combining equations (12) and (14) with (11) and (13), we obtain $P_{A_i} = 1 - \lambda_{M_i} \cdot \mathbb{E}(T_{M_i})$, which is expressed in equation (15) and used in constraint (19):

$$P_{A_i} = 1 - \frac{G_i(\Delta_{\varepsilon_i})}{1 - G_i(\Delta_{\varepsilon_i})}. \tag{15}$$
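Equations (12), (14) and (15) can be checked numerically with the sketch below, which assumes an exponential $G_i$ whose mean is the heartbeat period $\delta_i$ (i.e. the jitter is neglected). The function names are hypothetical.

```python
import math


def survival(timeout, delta):
    """G_i(timeout) for an exponential distribution with mean delta."""
    return math.exp(-timeout / delta)


def mistake_rate(timeout, delta):
    """Equation (12): mistakes per ms."""
    return survival(timeout, delta) / delta


def mean_mistake_duration(timeout, delta):
    """Equation (14): average mistake duration in ms."""
    return delta / (1.0 - survival(timeout, delta))


def query_accuracy(timeout, delta):
    """Equation (15)."""
    g = survival(timeout, delta)
    return 1.0 - g / (1.0 - g)
```

For a timeout of 30 ms and a period of 10 ms, `query_accuracy` returns about 0.9476. The expression becomes negative when the timeout is shorter than $\delta_i \ln 2$, which is why a non-negativity condition on $P_{A_i}$ is meaningful in the optimisation model.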

4.2.2. MILP Optimisation Modelling

The decision variables of $\mathbb{N}$ are $\delta_i$ and $\Delta_{\varepsilon_i}$ for each process $i$, $\forall i \in [1..N-1]$. These variables represent the failure detection parameters (for each monitored process $i$) that should be set in SONAFD to obtain a better QoS, and hence should be strictly positive real numbers. We model the optimisation of SONAFD’s parameters as an optimisation problem whose objective is to maximise the expected query accuracy probability $\mathbb{E}(P_A)$. $\mathbb{N}$ is formulated as follows:

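$$\max \; \mathbb{E}(P_A) \tag{16}$$

Subject to (S.t.)

$$\forall i \in [1..N-1]: \; T_{D_i} \le T_D^{TS} \tag{17}$$

$$\mathbb{E}(P_A) = \mathbb{E}(P_{A_i})_{i \in [1..N-1]} = \frac{1}{N-1} \sum_{i=1}^{N-1} P_{A_i} \tag{18}$$

$$\forall i \in [1..N-1]: \; P_{A_i} = 1 - \frac{G_i(\Delta_{\varepsilon_i})}{1 - G_i(\Delta_{\varepsilon_i})} \tag{19}$$

$$\forall i \in [1..N-1]: \; \Delta_{\varepsilon_i} - K_i \cdot \delta_i \ge D_i + 3 \cdot S_i \tag{20}$$

$$\forall i \in [1..N-1]: \; \Delta_{\varepsilon_i} + \delta_i \ln\left(\frac{1 - P_{A_i}^{Req}}{2 - P_{A_i}^{Req}}\right) \ge 0 \tag{21}$$

$$\forall i \in [1..N-1]: \; \sum_{i=1}^{N-1} \frac{1}{\delta_i} \le O \cdot \frac{B}{M} \tag{22}$$

$$\forall i \in [1..N-1]: \; \Delta_{\varepsilon_i} \le T_{D_i} - D_i \tag{23}$$

$$\forall i \in [1..N-1]: \; P_{A_i} \ge 0 \tag{24}$$

$$\forall i \in [1..N-1]: \; \delta_i > 0 \tag{25}$$

$$\forall i \in [1..N-1]: \; \Delta_{\varepsilon_i} > 0. \tag{26}$$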

Constraint (17) upper-bounds the $T_{D_i}$ of each process by $T_D^{TS}$ (e.g., $T_D^{TS} = 3000\,ms$), which models SONAFD’s failure detection speed as a constraint. Setting $T_{D_i} \le T_D^{TS}$ for each process $i$ guarantees that its new $\delta_i$ and $\Delta_{\varepsilon_i}$ will not generate a $T_{D_i}$ higher than $T_D^{TS}$. Since $T_D = \mathbb{E}(T_{D_i})_{i \in [1..N-1]}$, $T_D$ is also guaranteed not to exceed the selected threshold $T_D^{TS}$. A lower $T_D^{TS}$ enables faster failure detection; however, a particularly small $T_D^{TS}$ may over-constrain the model and lead to infeasibility.

Constraint (18) computes $\mathbb{E}(P_A)$ as the expected value of the $P_{A_i}$ obtained for each monitored process $p_i$, $i \in [1..N-1]$.

Constraint (19) defines $P_{A_i}$ for each $p_i$, $i \in [1..N-1]$, by applying equation (15) to process $i$ (details are provided in Section 4.2.1).

Constraint (20) states that, for each $p_i$, $i \in [1..N-1]$, the timeout $\Delta_{\varepsilon_i}$ associated with threshold $\varepsilon_i$ should be greater than the expected transmission time of the heartbeat message, so that SONAFD tolerates network delays and losses adequately (see Section 4.1 for NAFD’s suspicion threshold adjustment). The lower bound of $\Delta_{\varepsilon_i}$ is the same as for NAFD and is obtained from equation (D.1) (see Section 4.1 for more details). In summary, for each $p_i$, the optimised $\Delta_{\varepsilon_i}$ should take into account its average delay and three times its delay standard deviation (to cover high delay values). This constraint follows from Chebyshev’s inequality applied to delays as random variables with any given probability distribution: at least 88.89%³ of delays lie within $3 \times S_i$ of the average $D_i$, so at least this fraction of random delays is covered when building the decision timeout $\Delta_{\varepsilon_i}$.
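The two coverage figures quoted for a margin of three standard deviations can be reproduced with the following sketch. The function names are illustrative; the first function is the distribution-free Chebyshev lower bound and the second is the exact coverage under a normal distribution.

```python
import math


def chebyshev_coverage(k):
    """Lower bound on P(|X - mean| < k * std) for any distribution."""
    return 1.0 - 1.0 / k ** 2


def normal_coverage(k):
    """Exact P(|X - mean| < k * std) when X is normally distributed."""
    return math.erf(k / math.sqrt(2.0))
```

For $k = 3$, the Chebyshev bound is $8/9 \approx 88.89\%$ and the normal coverage is about 99.73%.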

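Constraint (21) defines the minimum ratio between $\Delta_{\varepsilon_i}$ and $\delta_i$ needed to satisfy a minimum $P_{A_i}^{Req}$. This inequality is obtained from constraint (19) by bounding $P_{A_i}$ from below by a numerical value $P_{A_i}^{Req}$ (e.g., $P_{A_i}^{Req} = 95\%$) and using the exponential form $G_i(\Delta_{\varepsilon_i}) = e^{-\Delta_{\varepsilon_i}/\delta_i}$. Constraint (21) is the last inequality of the chain in (27):

$$\begin{aligned}
P_{A_i} \ge P_{A_i}^{Req} & \implies 1 - \frac{G_i(\Delta_{\varepsilon_i})}{1 - G_i(\Delta_{\varepsilon_i})} \ge P_{A_i}^{Req} \implies 1 - \frac{e^{-\frac{\Delta_{\varepsilon_i}}{\delta_i}}}{1 - e^{-\frac{\Delta_{\varepsilon_i}}{\delta_i}}} \ge P_{A_i}^{Req} \\
& \implies \frac{\Delta_{\varepsilon_i}}{\delta_i} \ge -\ln\left(\frac{1 - P_{A_i}^{Req}}{2 - P_{A_i}^{Req}}\right) \implies \Delta_{\varepsilon_i} + \delta_i \ln\left(\frac{1 - P_{A_i}^{Req}}{2 - P_{A_i}^{Req}}\right) \ge 0.
\end{aligned} \tag{27}$$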
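As a worked instance of (27), the sketch below computes the minimum ratio $\Delta_{\varepsilon_i} / \delta_i$ implied by a required accuracy. The function name is hypothetical and the exponential form of $G_i$ is assumed, as in the derivation.

```python
import math


def min_timeout_ratio(p_req):
    """Smallest timeout/period ratio satisfying constraint (21)."""
    return -math.log((1.0 - p_req) / (2.0 - p_req))
```

For $P_{A_i}^{Req} = 95\%$, the ratio is $\ln(1.05 / 0.05) = \ln 21 \approx 3.04$, so a heartbeat period of 10 ms requires a timeout of at least about 30.4 ms.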

Constraint (22) represents the limit on heartbeat messages’ overhead in a given network. If each $p_i$, $i \in [1..N-1]$, sends heartbeat messages to $p_N$ at the time interval $\delta_i$, $p_N$ would receive $h_{nb} = \sum_{i=1}^{N-1} \frac{1}{\delta_i}$ messages per time unit if there are no message losses. Assuming that these messages have the same size $M$, if $p_N$ has a bandwidth $B$, it would be possible to transmit $\frac{B}{M}$ messages per time unit. In a real-world network environment, $p_N$ cannot dedicate all of its bandwidth to heartbeat messages, so only a reasonably small percentage of the bandwidth should be allowed for them. This requirement is ensured by applying a percentage $O$ to the total number of possible transmissions $\frac{B}{M}$, which means that $h_{nb} \le O \times \frac{B}{M}$. To simplify $\mathbb{N}$ for implementation, we assume that all monitored processes have the same $\delta_i = \delta$, which is common in practice. The left side of constraint (22) becomes $\forall i \in [1..N-1]: \sum_{i=1}^{N-1} \frac{1}{\delta_i} = \sum_{i=1}^{N-1} \frac{1}{\delta} = \frac{N-1}{\delta_i}$, and constraint (22) becomes

$$\forall i \in [1..N-1]: \; \frac{O \cdot B}{M \cdot (N-1)} \, \delta_i \ge 1. \tag{28}$$

Constraint (23) relates $T_{D_i}$, for each process $i \in [1..N-1]$, to the average delay $D_i$ and the timeout $\Delta_{\varepsilon_i}$. $T_{D_i}$ considers the worst-case scenario, in which process $i$ crashes just after sending a heartbeat message (see Section 4.2.1 for details).

Constraint (24) ensures that $P_{A_i}$ is a non-negative variable.

³ If delays are known to follow a normal distribution, the interval within three standard deviations of the mean covers 99.73% of the possible values of such a random variable.

Constraints (25) and (26) ensure that SONAFD is functional: constraint (25) imposes a strictly positive $\delta_i$ so that all monitored processes send regular heartbeat messages, and constraint (26) imposes a strictly positive $\Delta_{\varepsilon_i}$ so that some waiting time for heartbeat transmission is tolerated.

Due to the objective function (16) and constraints (19) and (22), our proposed $\mathbb{N}$ is a nonlinear problem. The (indirectly) nonlinear objective function and the nonlinear constraints are crucial for representing the application properly in mathematical terms (Rebennack 2016b). We adopt piecewise linearisation to convert our proposed NLP model into a Mixed Integer Linear Problem. An MILP is much easier to solve, and piecewise linearisation is frequently used in various applications to approximate nonlinearities with linear functions (Geißler et al. 2011, 2015; Rebennack and Krasko 2019; Vielma et al. 2010a). A mature body of work provides proofs of, and methods for, applying piecewise linear functions to reduce nonlinear problems to linear ones and control their solving (Burlacu et al. 2020; Geißler et al. 2012, 2015; Rebennack and Krasko 2019; Vielma et al. 2010b). Benders’ decomposition (Rebennack 2016a; Steeger and Rebennack 2017) could be adopted for a further level of simplification but is outside the scope of this paper. The details of the transformation of $\mathbb{N}$ into an MILP (i.e. $\mathbb{M}$) are presented in Appendix F. Constraint (19) is finally replaced by

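$$\forall i \in [1..N-1]: \; 2 \times \sum_{j=1}^{m_i} X_{i,j} \, G_i(\Delta_{min_i} + jT_S) - \sum_{j=1}^{m_i} Z_{i,j} \, G_i(\Delta_{min_i} + jT_S) + P_{A_i} = 1 \tag{29}$$

$$\forall i \in [1..N-1]: \; \sum_{j=1}^{m_i} X_{i,j} = 1 \tag{30}$$

$$\forall i \in [1..N-1], \forall j \in [1..m_i]: \; Z_{i,j} - X_{i,j} \le 0 \tag{31}$$

$$\forall i \in [1..N-1], \forall j \in [1..m_i]: \; Z_{i,j} \ge 0 \tag{32}$$

$$\forall i \in [1..N-1], \forall j \in [1..m_i]: \; Z_{i,j} - P_{A_i} \le 0 \tag{33}$$

$$\forall i \in [1..N-1], \forall j \in [1..m_i]: \; P_{A_i} + X_{i,j} - Z_{i,j} \le 1 \tag{34}$$

$$\forall i \in [1..N-1]: \; \Delta_{\varepsilon_i} - \sum_{j=1}^{m_i} X_{i,j} \, (\Delta_{min_i} + jT_S) = 0. \tag{35}$$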

where $m_i$ is the number of discrete values as obtained from Appendix F-equation (F.3), $X_{i,j}$ is a binary variable defined in Appendix F-equation (F.5), $Z_{i,j}$ is a real variable as in Appendix F-equation (F.11) and $\Delta_{min_i} = D_i + 3 \cdot S_i + K_i \cdot \delta_i$. The discretised $G_i(t)$, i.e. $G_i(\Delta_{min_i} + jT_S)$, is also defined in Appendix F-equation (F.6).

Finally, the formulation of $\mathbb{M}$ that we implement in CPLEX has the objective function (16), subject to constraints (17), (18), (20), (21), (23), (24), (25), (26), (28), (29), (30), (31), (32), (33), (34) and (35).
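To give an intuition for the discretisation, the sketch below enumerates the candidate timeouts $\Delta_{min_i} + jT_S$ for a single process and a fixed heartbeat period, and keeps the feasible candidate with the highest $P_{A_i}$. This exhaustive enumeration is a deliberately simple stand-in and is not the Branch and Cut solution computed by CPLEX. The function names are hypothetical, an exponential $G_i$ with mean $\delta_i$ is assumed, and each grid index $j$ plays the role of the one-hot choice $X_{i,j} = 1$.

```python
import math


def query_accuracy(timeout, delta):
    """Equation (15) with an exponential G_i of mean delta (assumption)."""
    g = math.exp(-timeout / delta)
    return 1.0 - g / (1.0 - g)


def best_timeout_on_grid(delta, d_mean, d_std, k_lost, t_ts, p_req, t_step):
    """Return (timeout, accuracy) for the best grid point, or None if infeasible."""
    delta_min = d_mean + 3.0 * d_std + k_lost * delta          # lower bound, (20)
    ratio_min = -math.log((1.0 - p_req) / (2.0 - p_req))       # accuracy floor, (21)
    upper = t_ts - d_mean                                       # from (17) and (23)
    best = None
    j = 1
    while delta_min + j * t_step <= upper:
        timeout = delta_min + j * t_step                        # grid point j
        if timeout >= ratio_min * delta:
            accuracy = query_accuracy(timeout, delta)
            # Strict comparison keeps the shortest timeout among ties.
            if best is None or accuracy > best[1]:
                best = (timeout, accuracy)
        j += 1
    return best
```

With $\delta_i = 10$ ms, $D_i = 20$ ms, $S_i = 5$ ms, $K_i = 2$, $T_D^{TS} = 120$ ms, $P_{A_i}^{Req} = 95\%$ and $T_S = 5$ ms, the grid runs from 60 ms to 100 ms and the sketch selects 100 ms. With $T_D^{TS} = 70$ ms no grid point is feasible, which mirrors the case in which the model is infeasible.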

4.3. Heartbeats’ inter-arrivals’ Distribution

NAFD uses the heartbeats’ inter-arrivals’ distributions to compute the suspicion levels $\varepsilon_{s_i}$ and to adapt the suspicion thresholds $\varepsilon_i$ of its monitored processes (lines 6 and 7 of Algorithm 1(b), respectively). SONAFD additionally uses these distributions in the modelling of $\mathbb{M}$ and to adapt its suspicion thresholds when $\mathbb{M}$ is infeasible. Constraint (19), which defines $P_{A_i}$ for each monitored process, is expressed as a function of the heartbeats’ inter-arrivals’ CDF. This CDF needs to be identified, and updated while SONAFD is running. For this purpose, we collect heartbeats’ inter-arrivals and perform probability distribution fitting on these data (see Appendix C), particularly when NAFD/SONAFD is first deployed in a real-world system. We consider that the monitored processes are connected to the monitoring process in a star topology, which removes the need to consider complex network topologies. For example, if some processes shared the same physical infrastructure, it would be necessary to consider their joint heartbeats’ inter-arrivals’ probability distribution (Bisnik and Abouzeid 2009), which is out of the scope of this paper.

⁴ The CPLEX solver adopted in our implementation does not allow strict inequalities (i.e. ">"), so we replace them by non-strict ones (i.e. "≥"). For our proposed approach, we retain this transformation since constraints (22) and (20) guarantee that $\delta_i$ and $\Delta_{\varepsilon_i}$ are non-null, respectively, given strictly positive system parameters (message size, delay, etc.).

We also consider that heartbeats’ inter-arrivals follow an exponential distribution, which implies that heartbeats’ arrivals are assumed to form Poisson processes (Ross 1996). Packets’ arrivals are often modelled as Poisson point processes (PPP). Early traffic models were motivated by telephony, where calls are assumed to be independent and identically distributed, and hence their holding times are exponential. Poisson traffic models have been widely adopted in the network analysis and performance evaluation of different applications (Chun Chung Chan and Hanly 2001; Karagiannis et al. 2004; Kheirkhah et al. 2019; Kirichek et al. 2016; Sun et al. 2019; Yu et al. 2006). PPP are characterised by the following important analytical properties: 1) the numbers of arrivals in disjoint intervals are statistically independent, i.e. the process is memoryless; 2) the superposition of independent Poisson processes results in a new Poisson process whose rate is the sum of their rates; 3) a PPP has a unitary coefficient of variation, i.e. its parameter is equal to both its mean and its variance; and 4) under specific conditions, a multiplexing of independent traffic streams is approximately a Poisson process according to the Palm-Khintchine theorem (Heyman and Sobel 2004).
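Property 2 can be illustrated with a small simulation, sketched below under the assumption of two independent streams with rates 0.1 and 0.05 arrivals per ms. The function names and rates are chosen for this example only; the merged stream should show a mean inter-arrival close to $1 / (0.1 + 0.05) \approx 6.67$ ms.

```python
import random


def poisson_arrivals(rate, horizon, rng):
    """Arrival times of a Poisson process of the given rate on [0, horizon]."""
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential inter-arrival times
        if t > horizon:
            return arrivals
        arrivals.append(t)


def mean_interarrival(arrivals):
    """Average gap between consecutive arrival times."""
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return sum(gaps) / len(gaps)
```

Merging the two streams amounts to sorting the concatenation of their arrival times before calling `mean_interarrival`.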

Although Poisson packets’ arrivals’ models were widely adopted, their suitability is contested, as many researchers believe they do not sufficiently mimic real-world traffic data in modern networks, e.g., LAN, MAN, WAN (Paxson and Floyd 1994). These networks experience batch and/or correlated packets’ arrivals and traffic burstiness, which are important factors to consider when modelling their traffic (Becchi 2008). The Poisson model is considered unable to capture these elements, specifically the aspects related to traffic burstiness (Karagiannis et al. 2004). This has oriented the research community towards long-range dependence (LRD), self-similarity (i.e. fractional) and bursty (i.e. Markov, Renewal, Autoregressive) models.

Despite the proliferation of such models, there has been a return to PPP for internet traffic, as the former analyses of network traffic suffer from a lack of accuracy and robustness (Karagiannis et al. 2004). In fact, many research works have shown that the assumption of a Poisson distribution, or a derived version of it, is in accordance with real internet packet arrivals (Cao et al. 2003; Karagiannis et al. 2004; Sukhov et al. 2016; Yu et al. 2006). More specifically, at the edges of the internet (i.e. closer to end users’ devices), links generally have low speeds and their capacity cannot be upgraded easily, which leads to continuously bursty traffic; the internet core, however, benefits from high-speed links and a considerable number of connections, which leads to the absence of burstiness: packets’ arrivals there are hence close to independent and Poisson.

In summary, characterising modern network traffic is highly complex, as it is constantly evolving and extremely dynamic; identifying its features cannot be solved once and for all. Nevertheless, the Poisson assumption represents an attractive way to solve the failure detection performance problem: it has the advantage of analytic simplicity and can be fairly valid in many distributed systems. It also helps us to contain the complexity of the MILP that solves the performance-aware failure detection problem. In this paper, we focus on providing a framework solution for QoS-aware failure detection. Hence, we concentrate on a Proof-of-Concept of SONAFD with a generalised network topology and a realistic probability distribution of heartbeat inter-arrivals. Poisson packet arrivals represent a valuable trade-off in designing our FD and proving its efficiency. Furthermore, to tolerate possible errors in the distribution assumption, we apply Chebyshev's inequality in our constraints by adding three standard deviations; the coverage of random values is then at least 88.89% for any distribution with finite variance, and reaches about 99.73% if the distribution is normal (e.g. as a consequence of the central limit theorem). We believe that such an approach achieves our goal of adopting a reasonably good fit of probability distributions, balancing model complexity and computation costs, while being able to tolerate potential errors in the distribution and in the estimation of its parameters. Our evaluation results (see

Section 5.2) have shown that our adopted approach is satisfactory.
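The coverage figures behind the three-standard-deviation margin can be reproduced with a few lines. This is a minimal sketch; the function names are illustrative and do not come from the paper. Chebyshev's inequality gives the distribution-free lower bound $1 - 1/k^2$, while the normal case uses the error function.

```python
import math

def chebyshev_coverage(k):
    """Lower bound on P(|X - mean| < k * std) for any finite-variance distribution."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1.0 - 1.0 / (k * k))

def normal_coverage(k):
    """Exact P(|X - mean| < k * std) when X is normally distributed."""
    return math.erf(k / math.sqrt(2.0))

def padded_bound(mean, std, k=3.0):
    """Hypothetical helper: a bound of the form mean + k * std."""
    return mean + k * std

print(f"Chebyshev, k=3: {chebyshev_coverage(3):.4f}")
print(f"Normal,    k=3: {normal_coverage(3):.4f}")
```

For $k = 3$ the distribution-free bound is $8/9 \approx 88.89\%$, and the normal coverage is about 99.73%.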

4.4. The sizes of heartbeats’ history and probability distribution ﬁtting sampling

We recall that $W = W_i$ represents the size of the sample of heartbeat inter-arrivals, measured as a number of inter-arrivals. An FD retains the last $W_i$ inter-arrivals to check the state of $p_i$. This means that $p_N$ computes the suspicion level $\varepsilon_{s_i}$ based on $W_i$/$W$. Let $F$ represent the size of the sample of heartbeat inter-arrivals used to perform the probability distribution fitting, measured as a time duration. In our experiments, $W = 1000$, similar to related works such as Chen et al. (2002), Hayashibara et al. (2004) and Satzger et al. (2007). $F$ equals a one-day time window. Across the numerous sets of NAFD/SONAFD experiments in the cloud environment, we noticed that one day is a good setting, as it represents one daily cycle.

It is indisputable that the size of $F$ has a meaningful impact on the quality of the probability distribution fitting, and hence on the performance of NAFD/SONAFD. In general, the larger $F$ is, the more accurate the fitting is. However, network conditions may change quickly and, consequently, the best-fitted probability distribution might become obsolete. It is necessary to choose $F$ carefully so that it provides a fitting that is both sufficiently accurate and up to date. Meanwhile, the main focus of the SONAFD design is to make it driven by performance requirements under an optimal policy, even if this means making assumptions on, or fixing, specific parameters such as $F$. The assumed value of $F$ may not suit all systems, but the overall performance optimisation objectives are still satisfied.

To compensate for this assumption, the parameters of the fitted probability distribution are recomputed each time a new heartbeat arrives (based on the most recent $W$ inter-arrivals). This keeps them up to date with recent heartbeat arrivals, and hence enables a fast response to changes in network conditions. In addition, we have incorporated Chebyshev's inequality into the design of NAFD/SONAFD, to tolerate the effect of distribution assumption errors on its detection accuracy and speed. We believe that all these considerations help to build a fairly solid solution, which strikes a balance between developing the SONAFD MILP model and opening up future work directions.
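The per-heartbeat update over the last $W$ inter-arrivals can be sketched as a bounded sliding window. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name, the use of the sample mean and standard deviation as the "fitted parameters", and the `mean + 3 * std` bound are assumptions made for the example.

```python
from collections import deque
import statistics

class InterArrivalWindow:
    """Keeps the last `size` heartbeat inter-arrivals (W in the text)."""

    def __init__(self, size=1000):
        self.gaps = deque(maxlen=size)  # the oldest gap is dropped automatically
        self.last_arrival = None

    def on_heartbeat(self, arrival_time):
        """Record a heartbeat arrival time and update the window."""
        if self.last_arrival is not None:
            self.gaps.append(arrival_time - self.last_arrival)
        self.last_arrival = arrival_time

    def estimates(self):
        """Return (mean, std, mean + 3 * std), or None with fewer than two gaps."""
        if len(self.gaps) < 2:
            return None
        mean = statistics.mean(self.gaps)
        std = statistics.stdev(self.gaps)
        return mean, std, mean + 3.0 * std

window = InterArrivalWindow(size=3)
for t in [0.0, 1.0, 2.0, 3.0, 10.0]:
    window.on_heartbeat(t)
mean, std, bound = window.estimates()
```

Recomputing the statistics over the whole window on every heartbeat costs time linear in $W$; an incremental update would be a natural refinement if this became a bottleneck.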

In future work, the sample sizes may be set dynamically, for instance with a dynamic method as in Cheung et al. (2017), and be driven by changing conditions (Cai and Hames 2010; Fields et al. 2021; Kutzner et al. 2017). For example, Cheung et al. (2017) use a dynamic pricing model to minimise worst-case regret with unknown demand functions through learning phases. These phases have different sample sizes, which makes the overall pricing model more flexible with respect to demand functions. An improved SONAFD may consider a similar approach, e.g. setting the sample sizes dynamically based on the patterns of network conditions.

4.5. Heuristic Algorithm

If the size of 𝕄 grows (e.g., $N$ increases), solving 𝕄 becomes computationally expensive or even infeasible. Therefore, to address this computational challenge and improve the scalability of SONAFD, we also design a greedy heuristic algorithm, denoted ℍ. ℍ specifically aims to solve the objective function and constraints in 𝕄. Figure 2 introduces the main steps of ℍ. The main idea is to bound the decision variables ($\delta_i$ and $\Delta_{\varepsilon_i}$) using constraints from 𝕄. ℍ starts at the lower bounds of both $\delta_i$ and $\Delta_{\varepsilon_i}$, then evaluates the SONAFD $P_A$ and compares it to $P_A^{target}$. For each $\delta_i$, the algorithm increments $\Delta_{\varepsilon_i}$ in small/unitary steps until $P_A$ reaches $P_A^{target}$. If $P_A^{target}$ is not reached when $\Delta_{\varepsilon_i} = \Delta_{\varepsilon_i}^{upper}$, then $\delta_i$ is incremented and $\Delta_{\varepsilon_i}$ is reset to its lower bound. The algorithm stops when $P_A^{target}$ is reached or when all possible values of $\delta_i$ and $\Delta_{\varepsilon_i}$ have been evaluated. The evaluation of our heuristic algorithm is presented in Section 5.4.
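The search order described above can be sketched for a single monitored process as two nested loops. This is a minimal sketch under stated assumptions: the `accuracy` callable stands in for the SONAFD evaluation of $P_A$, which is not reproduced in this section, and the bounds, step sizes and toy accuracy function are hypothetical. Integer steps (e.g. milliseconds) are used to avoid floating-point drift.

```python
def greedy_search(accuracy, delta_bounds, margin_bounds, target,
                  delta_step=1, margin_step=1):
    """Return the first (delta, margin) pair whose accuracy reaches the target.

    accuracy      -- callable (delta, margin) -> estimated P_A (assumed, user-supplied)
    delta_bounds  -- (lower, upper) bounds on delta_i
    margin_bounds -- (lower, upper) bounds on Delta_eps_i
    Returns None when every candidate has been evaluated without success.
    """
    delta_lo, delta_hi = delta_bounds
    margin_lo, margin_hi = margin_bounds
    delta = delta_lo
    while delta <= delta_hi:
        margin = margin_lo  # reset the margin for each new delta
        while margin <= margin_hi:
            if accuracy(delta, margin) >= target:
                return delta, margin  # stop as soon as the target is met
            margin += margin_step
        delta += delta_step
    return None

def toy_accuracy(delta, margin):
    """Toy accuracy model, used only to exercise the loop structure."""
    return min(1.0, (delta + margin) / 1000)

solution = greedy_search(toy_accuracy, (100, 500), (0, 500), 0.95, 100, 100)
no_solution = greedy_search(toy_accuracy, (100, 500), (0, 500), 1.5, 100, 100)
```

Because the search returns at the first pair that meets the target, it favours small parameter values over the accuracy-maximising ones, which matches the early-stopping behaviour discussed for ℍ in Section 5.4.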

5. NAFD and SONAFD Performance Evaluation

We aim to validate the QoS enhancement achieved by the proposed SONAFD and its 𝕄. Most previous works evaluate such an FD algorithm either in a simple two-process environment (Bertier et al. 2002; Chen et al. 2002), via a wide area network (Hayashibara et al. 2004; Satzger et al. 2011), or simply in a local network (Liu et al. 2017). To achieve rigorous and robust evaluation results, we evaluate our approaches in a real-world application environment (Amazon EC2) that experiences its own real-time constraints. The real-world constraints of interest are packet transmission delays and losses in the cloud network. This allows our proposed approach to minimise the gap between theoretical design and practical application, and hence to demonstrate better failure detection performance in real-world systems.

Figure 2: Flowchart of the greedy heuristic algorithm ℍ.

To tackle the challenge of not being able to alter network conditions in a real-world network environment, we also design and implement a simulator based on CloudSim, which was originally developed at the University of Melbourne (Melbourne University 2018). This allows us to control network conditions and test the robustness of our approaches when network conditions change. To mimic real-world failures, we have also designed and implemented a failure controller, which can randomly fail/stop a machine/process following a given probability distribution (e.g., normal distribution). Together, these test environments allow us to rigorously test the performance of our proposed FD algorithm and optimisation approach.
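A failure controller of this kind can be sketched as a generator of (start, duration) pairs. This is an illustrative sketch, not the authors' controller: the exponential up-times, the normal failure durations and the duration standard deviation of 100 s are assumptions; the defaults of one failure per hour and 900 s mean duration echo the averages reported in Section 5.2.

```python
import random

def failure_schedule(horizon_s, mean_gap_s=3600.0, mean_duration_s=900.0,
                     duration_std_s=100.0, seed=None):
    """Return a list of (start_time, duration) failures within [0, horizon_s)."""
    rng = random.Random(seed)
    schedule, t = [], 0.0
    while True:
        # Assumed: the process stays up for an exponentially distributed time.
        t += rng.expovariate(1.0 / mean_gap_s)
        if t >= horizon_s:
            return schedule
        # Assumed: failure durations are normal, clipped at zero.
        duration = max(0.0, rng.gauss(mean_duration_s, duration_std_s))
        schedule.append((t, duration))
        t += duration  # the next up-time starts after recovery

schedule = failure_schedule(24 * 3600.0, seed=7)
```

Each failure starts only after the previous one has ended, so the injected failures of a given process never overlap.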

5.1. Experiment Setup

To benchmark the QoS enhancement provided by NAFD and SONAFD, we compare them with one of the most representative failure detectors, the Phi detector introduced by Hayashibara et al. (2004). It has been used in many real systems such as Akka OpenDayLight (Akka 2018) and has been benchmarked in a number of studies (Liu et al. 2017; Satzger et al. 2007, 2011; Tomsic et al. 2015; Xiong et al. 2012). Moreover, numerous research papers (see Chan et al. 2015; Fang et al. 2016; Yang and Wang 2016) highlight the importance of the Phi detector in terms of flexibility compared to other FDs. In our performance evaluation, we consider a modified version of the Phi detector, which we call Adaptive Phi: we adjust its threshold as regularly as that of NAFD/SONAFD, and use the same average timeout as NAFD to compute the adjusted threshold. The idea is to achieve a fair comparison between Phi and NAFD/SONAFD in terms of adapting the decision parameter to network conditions. We implement these three failure detectors on the Amazon on-demand cloud service (EC2), which allows us to manage a large number of distributed cloud virtual machines (VMs) and configure them with custom parameters. Amazon EC2 is one of the most widely used cloud computing services. Its network environments are dynamic and depend on realistic connections between VMs.
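For reference, the suspicion level of the original Phi accrual detector (Hayashibara et al. 2004) is $\varphi = -\log_{10} P_{later}(t)$, where $P_{later}(t)$ is the probability that a heartbeat arrives more than $t$ after the previous one, estimated under a normal assumption from the recent inter-arrivals. The sketch below shows that computation only. The rule by which Adaptive Phi derives its threshold from the NAFD timeout is not reproduced in this section, so the `threshold` argument is a plain placeholder.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def phi(elapsed, mean, std):
    """Suspicion level after `elapsed` time without a heartbeat."""
    p_later = 1.0 - normal_cdf(elapsed, mean, std)
    p_later = max(p_later, 1e-300)  # floor to avoid log10(0) for very late heartbeats
    return -math.log10(p_later)

def suspect(elapsed, mean, std, threshold):
    """True when the suspicion level exceeds the (placeholder) threshold."""
    return phi(elapsed, mean, std) > threshold

# Example with assumed inter-arrival statistics: mean 1.0 s, std 0.1 s.
level_on_time = phi(1.0, 1.0, 0.1)  # elapsed equals the mean, so P_later = 0.5
level_late = phi(1.3, 1.0, 0.1)     # three standard deviations late
```

The suspicion level grows monotonically with the elapsed time, so raising the threshold trades a longer detection time for fewer wrong suspicions.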

We consider the following scenario. A monitor cloud VM is responsible for running the FDs and the optimisation model, and for monitoring and recording performance measures. The monitored VMs (50 instances)⁵ send heartbeat messages periodically to notify the FD VM of their liveness. Monitored VMs are connected to the monitor VM in a star topology. The hardware of the VMs used in our experiments is as follows: 1) one 3.3 GHz CPU; 2) 1 GB memory; and 3) 8 GB storage. The monitor VM is based in the Asia Pacific region (Singapore). The monitored VMs are distributed over three different regions: Tokyo, London and North Virginia. This ensures a wide geographical distribution of the deployed VMs and yields a variety of network communication conditions. Consequently, FD performance metrics are evaluated on a regular basis under diversified network conditions. This is a global network setting that has not been explored by previous work (Liu et al. 2017; Satzger et al. 2011; Sleptchenko and Johnson 2014).

5.2. Amazon EC2 Evaluation Results

We have run our evaluation tests using various experimental settings. We set SONAFD with a network bandwidth overhead constraint ($O = 0.1\%$) and a failure detection time threshold ($T_D^{TS} = 3$ s). These settings follow a survey we conducted with a number of technical staff members within one of the largest telecom/network service companies. This threshold represents a trade-off value for different distributed systems and can be adjusted (Bosilca et al. 2016; gigaspaces 2019). In the first evaluation setting, we use an average failure rate of 1 failure/hour and an average failure duration of 900 seconds, to inject enough failures into the evaluation environment for more QoS data to be collected. Figure 3 presents the evaluation results for the failure detection QoS metrics, summarised as follows:

1. SONAFD and NAFD provide better $P_A$ and $\lambda_M$ than the Adaptive Phi FD, even though NAFD and Adaptive Phi use the same timeout to adjust their thresholds. This is because the Adaptive Phi FD applies a different function to compute its suspicion threshold from the timeout. This function depends on the assumed probability distribution of heartbeat inter-arrivals (i.e. normal). This suggests that the normal distribution is less representative of existing cloud network conditions.

⁵ Running a large number of VMs on Amazon is costly. For experiments with more than 50 VMs, we rely on the simulations and on the analytical results associated with the heuristic algorithm; we expect similar performance results for NAFD/SONAFD.

Figure 3: Amazon EC2 results for a simultaneous running of three FDs (solver: CPLEX). Panels: (a) $T_D$, detection time (s) versus test duration; (b) $\lambda_M$, mistake rate per second versus test duration; (c) $P_A$, query accuracy probability versus test duration; (d) $P_A$ versus $T_D$. Each panel compares Adaptive Phi, NAFD and SONAFD.

2. The Adaptive Phi FD has the second shortest $T_D$. However, even though its suspicion threshold is adapted regularly, the Phi FD struggles to find the trade-off between performance metrics that satisfies the required $P_A$ and $T_D$.

3. SONAFD further enhances the accuracy metrics ($P_A$ and $\lambda_M$) over NAFD, at the cost of an increase in $T_D$. SONAFD was designed to maximise $P_A$ by optimising not just one but both FD parameters, $\delta_i$ and $\varepsilon_i$, and thus acts on both to improve performance.

4. To enhance $P_A$, SONAFD increases its $T_D$, as it waits longer before making a suspect decision. Although this increase is necessary by design to achieve the required QoS as a whole, SONAFD provides the best trade-off between these two conflicting metrics.
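The three metrics compared above can be computed from an experiment trace along the following lines. This is a minimal sketch in the spirit of the QoS definitions of Chen et al. (2002), not the measurement code used in the experiments: the trace format (lists of wrong-suspicion intervals and of crash/detection timestamps) is an assumption made for the example.

```python
def qos_metrics(wrong_suspicions, observation_time):
    """Return (mistake_rate, query_accuracy) over a period where the process is up.

    wrong_suspicions -- list of (start, end) intervals of wrong suspicion
    observation_time -- total observed time, in the same unit as the intervals
    """
    if observation_time <= 0:
        raise ValueError("observation_time must be positive")
    total_wrong = sum(end - start for start, end in wrong_suspicions)
    mistake_rate = len(wrong_suspicions) / observation_time  # lambda_M
    query_accuracy = 1.0 - total_wrong / observation_time    # P_A
    return mistake_rate, query_accuracy

def mean_detection_time(crash_times, detection_times):
    """Average delay between each crash and its detection (T_D)."""
    delays = [d - c for c, d in zip(crash_times, detection_times)]
    return sum(delays) / len(delays)

rate, accuracy = qos_metrics([(10.0, 12.0), (50.0, 51.0)], 1000.0)
t_d = mean_detection_time([100.0, 200.0], [101.5, 202.5])
```

In this toy trace, two wrong suspicions lasting three seconds in total over 1000 s give a mistake rate of 0.002 per second and a query accuracy of 0.997.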

5.3. Simulations Validation

Although Amazon EC2 is a real-world test environment, its main limitation is that we cannot control network conditions to evaluate changing network behaviours. In addition, Amazon EC2 is a commercial service, and using it with a large number of machines worldwide comes with major financial costs. Therefore, to achieve a robust performance evaluation that covers changes in both network conditions and failure behaviours with a large number of machines, we also designed and implemented a simulation tool specific to failure detection. We adopted CloudSim as the simulation platform (Melbourne University 2018), as it offers a scenario similar to that of Amazon EC2. CloudSim provides an extensible simulation framework that enables the modelling, simulation and testing of new cloud computing infrastructures and application services. We developed a specific FD simulator that includes heartbeat messaging, failure controllers and failure detection algorithms for VMs.

The simulation setting is similar to that of Amazon EC2. We consider a scenario with 51 VMs: one represents the FD and the remaining 50 are monitored processes. Thirty different parameter configurations are simulated; these settings sweep different failure rates, failure durations, packet delays, packet losses, etc. All scenarios provided results similar to those of the real-environment test in EC2 and demonstrated the robustness of our proposed SONAFD. Due to space limitations, we only present one of the simulated configurations in this paper, as shown in Figure 4. Figures 4a, 4b and 4c show $T_D$, $\lambda_M$ and $P_A$, respectively, for the three FDs. SONAFD performs best in terms of $P_A$ and $\lambda_M$, followed by NAFD and then the Adaptive Phi FD. For $T_D$, however, the Adaptive Phi performs best. These observations are consistent with the performance of SONAFD and NAFD in the real-world test environment (Amazon EC2) presented in Section 5.2.

Figure 4: Simulation results for three FDs. Panels: (a) $T_D$, detection time (s); (b) $\lambda_M$, mistake rate per second; (c) query accuracy probability; each versus simulation time (1h to 24h) for Adaptive Phi, NAFD and SONAFD.

5.4. SONAFD Scalability Evaluation

Large-scale distributed systems, such as cloud computing platforms, can easily scale up to hundreds or thousands of processes, as they are designed to be dynamic for rapid and flexible delivery of services. Maintaining both QoS and scalability for FDs is therefore challenging as the system size increases. SONAFD is designed to ensure efficient failure detection for large-scale systems. However, the time needed to solve the MILP model within SONAFD increases exponentially as the system size increases. Moreover, for the continuous constraint (19), we adopted discretisation and linearisation methods (as detailed in the online Appendix F) to solve the model (Gendron and Gouveia 2016; Kunnumkal and Talluri 2015). These approaches add decision variables to the implemented 𝕄: $X_{i,j}$ and $Z_{i,j}$, where $i \in [1..N-1]$, $j \in [1..m_i]$, and $m_i$ is the number of discrete values for $p_i$ (obtained from equation (F.3) in Appendix F). We recall that $T_S$ is the time precision, i.e. the discretisation step of constraint (19), and is an input of 𝕄. The value of $T_S$ determines the number of these additional variables $X_{i,j}$ and $Z_{i,j}$. As a consequence, the scalability of our proposed 𝕄 is degraded. $T_S$ is inversely proportional to the number of modelled discrete values $m_i$, $\forall i \in [1..N-1]$ (see equation (F.3) in Appendix F). Therefore, the higher $T_S$ is, the lower $m_i$ is, and the fewer variables are modelled. Consequently, it is important to assess this impact by testing different values of $T_S$.
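The effect of $T_S$ on the model size can be illustrated with a rough count of the additional variables. Equation (F.3) is not reproduced in this section, so the sketch below assumes, purely for illustration, that $m_i$ is the width of a per-process time range divided by $T_S$ and rounded up; the 3000 ms range and the 50 processes are likewise hypothetical.

```python
import math

def discrete_values(lower_ms, upper_ms, ts_ms):
    """Assumed form of m_i: number of steps of size T_S covering [lower, upper]."""
    return math.ceil((upper_ms - lower_ms) / ts_ms)

def extra_variables(ranges_ms, ts_ms):
    """Total count of X_{i,j} and Z_{i,j}: two variables per discrete value."""
    return 2 * sum(discrete_values(lo, hi, ts_ms) for lo, hi in ranges_ms)

# Hypothetical system: 50 monitored processes, each with a 0..3000 ms range.
ranges = [(0, 3000)] * 50
for ts in (1, 10, 100):
    print(f"T_S = {ts:>3} ms -> {extra_variables(ranges, ts)} additional variables")
```

Under this assumption, dividing $T_S$ by ten multiplies the number of additional variables by ten, and the count also grows linearly with the number of monitored processes.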

Figure 5: Execution time of ℂ (i.e. solving 𝕄 in the CPLEX solver), in seconds. Panels: (a) execution time; (b) execution time for $T_S = 10$ ms, split into building time and solving time versus the number of monitored processes ($N$).

The overall time to get an output for 𝕄 from the solver (optimised solutions or no solution) can be divided into two parts: 1) the building time, i.e. the time spent by the solver creating variables and setting up the constraints; and 2) the model-solving time, i.e. the time spent by the solver searching for the model solutions. To avoid any confusion, we refer to the overall solving time of 𝕄 as the execution time. Throughout this section, we refer to "𝕄 solved in CPLEX" as ℂ.

Table 3
A set of parameters for 𝕄 scalability evaluation (please refer to Tables 1 and 2 for notations and units).

Parameter           Value
$B$                 49100
$M$                 416
$T_D^{TS}$          3000
$O$                 1%
$F_{UB}$            2
$P_{CL}$            99%
$P_{A_i}^{Req}$     0.95
$\mu_i$             11.5 ± 1
$D_i$               150 ± 100
$S_i$               50 ± 30
$\tau_i$            5% ± 5%

We run ℂ with different input parameters (see Table 3). The intention is to tune these parameters as diversely as possible, which allows us to evaluate numerous combinations and achieve a robust evaluation of SONAFD scalability. A set of parameters consists of fixed values of $B$, $M$, $T_D^{TS}$, $O$, $F_{UB}$, $P_{CL}$ and $P_{A_i}^{Req}$, and random values of $\mu_i$, $D_i$, $S_i$ and $\tau_i$. For a given set of parameters, we collect the execution times of ℂ for different values of the discretisation step, $T_S \in \{1, 2, 5, 10, 20, 50, 100\}$ (in ms), and different system sizes, $N \in [2..1001]$. Such configurations represent real-world systems, where the fixed parameters are often given by application-specific requirements. The parameters with random values represent real-world dynamic network behaviour. $T_S$ could also be application-specific, but we test a set of different values because it has a strong impact on the feasibility and execution time of ℂ. We generate uniform random values of $\mu_i$, $D_i$, $S_i$ and $\tau_i$ as shown in Table 3. Due to space limitations, we only present the execution time for one set of parameters (i.e. Table 3 and Figure 5). We plot the execution time when ℂ is feasible and solved. We also tested other parameter sets, and the results are similar to Figure 5.
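The random part of a parameter set can be generated as in the sketch below. It is an illustration under stated assumptions rather than the authors' generator: each "a ± b" entry of Table 3 is read as a uniform draw on $[a - b, a + b]$, the dictionary keys are hypothetical names for $\mu_i$, $D_i$, $S_i$ and $\tau_i$, and units are left as in Tables 1 and 2.

```python
import random

# (centre, half-width) pairs as read from Table 3; tau is expressed as a fraction.
RANDOM_PARAMETERS = {
    "mu": (11.5, 1.0),
    "D": (150.0, 100.0),
    "S": (50.0, 30.0),
    "tau": (0.05, 0.05),
}

def draw_network_parameters(n_monitored, seed=None):
    """Return one dict of uniformly drawn parameters per monitored process."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(centre - half, centre + half)
         for name, (centre, half) in RANDOM_PARAMETERS.items()}
        for _ in range(n_monitored)
    ]

parameters = draw_network_parameters(100, seed=1)
```

Fixing the seed makes a parameter set reproducible, so the same instance can be solved for every value of $T_S$ and compared fairly.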

Figure 5 illustrates the undesired impact of a large system size $N$ and a small discretisation step $T_S$ on the execution times of ℂ, particularly on the model building phase. Figure 5a shows a three-dimensional view of the overall execution time of ℂ (z-axis) versus $N$ (x-axis) and $T_S$ (y-axis). It only shows execution times when ℂ has solutions: CPLEX cannot solve 𝕄 for $N > 500$ when $T_S = 1$ ms, $N > 500$ when $T_S = 10$ ms, and $N > 540$ when $T_S = 100$ ms. Figure 5a also shows that the execution time increases as the system gets larger (i.e. $N$ grows) and as $T_S$ gets shorter (i.e. more discrete values). A shorter $T_S$ dramatically increases the execution time of ℂ. Figure 5b shows how the building time and solving time of ℂ contribute to its execution time, for $T_S = 10$ ms and different values of $N$. The building time dominates the execution time. This is due to the high number of variables modelled in 𝕄: more time is spent setting up ℂ and its constraints. The actual solving time, however, is hardly affected by $T_S$, and increases noticeably as the system size $N$ grows.

To tackle this scalability issue, we proposed the heuristic algorithm ℍ discussed in Section 4.5. Figure 6 compares the execution times of 𝕄 using the CPLEX solver (i.e. ℂ) with those of the proposed ℍ. Figures 6a, 6b and 6c show that ℍ can shorten the execution time and tackle the scalability issue with a large number of processes. Figure 6a represents the execution times of ℂ (for different $T_S$) and ℍ as a function of

𝑁. Figure 6b presents 𝑃𝐴=𝔼(1 − 𝐺𝑖(Δ𝜀𝑖)

1−𝐺𝑖(Δ𝜀𝑖))𝑖∈[1. .𝑁−1] where 𝛿𝑖and Δ𝜀𝑖are the obtained solutions of ℂ(for

different $T_S$) and ℍ, respectively, versus $N$. Figure 6c depicts the minimal bound of $T_D$ (constraint (23)), $T_D = \mathbb{E}(\Delta_{\varepsilon_i} + D_i)_{i \in [1..N-1]}$, where $\Delta_{\varepsilon_i}$ is taken from the ℂ solutions (for different $T_S$) and the ℍ solutions, respectively, versus $N$.

Figure 6: Comparison between the CPLEX solutions ℂ and the heuristic algorithm ℍ ($P_A^{target} = 95\%$) in terms of execution time of 𝕄 and QoS of SONAFD. Panels: (a) total execution time (s); (b) $P_A$; (c) $T_D$ (ms); each versus the number of monitored processes ($N$), for the MILP with $T_S \in \{1, 2, 5, 10, 20, 50, 100\}$ ms and for the heuristic.

In summary, Figures 6a, 6b and 6c show that: 1) ℍ is much faster than the CPLEX solver, which makes computing the MILP solutions more efficient; 2) ℍ keeps providing feasible solutions as the system scales in size, whereas the CPLEX solver cannot efficiently find a solution within the time available between required updates of the SONAFD parameters; and 3) ℍ provides feasible solutions of similar quality to the CPLEX solver in terms of $P_A$, and of better quality in terms of $T_D$. By design, ℍ stops searching $\delta_i$ and $\Delta_{\varepsilon_i}$ as soon as $P_A \geq P_A^{target}$. This guarantees the minimum required $P_A \geq P_A^{target}$ for SONAFD. On the other hand, it leads to smaller values of $T_D$, as the optimal values of $\delta_i$ and $\Delta_{\varepsilon_i}$ are associated with higher $P_A$ and $T_D$. It is worth recalling that 𝕄 contains constraint (21), which ensures optimal values of $\delta_i$ and $\Delta_{\varepsilon_i}$ for a guaranteed $P_A = P_{A_i}^{Req}$. This comes at the cost of increasing $T_D$ (see constraint (23)). The values of $\delta_i$ and $\Delta_{\varepsilon_i}$ found by ℍ, however, are just good enough to provide the minimal required $P_A^{target}$. For a fair comparison between the ℂ and ℍ solutions, 𝕄 is solved with $P_{A_i}^{Req} = P_A^{target}$ ($P_{A_i}^{Req}$ in ℂ and $P_A^{target}$ in ℍ).

To sum up, ℍ represents a satisfactory alternative for addressing the scalability issue of 𝕄 when solved with the CPLEX solver. It substantially reduces the execution time needed to obtain solutions of 𝕄, while providing comparable failure detection QoS.

6. Conclusion

In this paper, we proposed a novel MILP-based failure detector (SONAFD). SONAFD is capable of guaranteeing the required QoS performance. This is achieved by translating QoS requirements into constraints in an MILP model, and adjusting its parameters accordingly to search for the best possible trade-offs and solutions. The objective of SONAFD is to maximise its accuracy probability while respecting constraints on its failure detection time. To tackle scalability and computational efficiency challenges as the system size grows, we proposed a heuristic algorithm specific to SONAFD, which provides fast approximate solutions. Our results show that the heuristic achieves a good approximate solution compared with the numerical solution obtained from CPLEX, and that it scales to large systems with thousands of nodes much faster. Our experiments are based on a real-world, worldwide distributed system (Amazon EC2), and show superior performance of our proposed solution, with better query accuracy and satisfactory detection speed. Our results highlight that SONAFD can enhance failure detection accuracy and speed while guaranteeing the given QoS requirements and system constraints, by exploring the trade-off between these requirements. We also evaluated our proposed solution via simulations that reproduce changing network conditions. The results are consistent with the Amazon EC2 experiments and demonstrate consistent and robust improvements with our approach.

To the best of our knowledge, our MILP-based SONAFD is the first attempt to combine an adaptive failure detection algorithm with data-driven operational research approaches. No previous work has considered scalability, data-driven constrained performance optimisation and FD algorithm design together. Our tests are based on real-world Amazon global data centres as well as extensive simulations with a simulator we developed to test our proposed solution comprehensively, beyond the scope of previous literature. Both the Amazon and the simulation results demonstrate the stability and robustness of our proposed approach. As networked application systems become larger and more sophisticated (e.g., Cloud, Internet of Things (IoT), 5G networks, Blockchain), such FDs are fundamentally important for achieving QoS guarantees, scalability, real-time monitoring and fault tolerance simultaneously.

Acknowledgements

The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated

support services at the University of Southampton, UK, in the completion of this work.

The authors would like to thank Prof. Tolga Bektas, Professor of Logistics Management at the University

of Liverpool, UK, for assisting this research by providing valuable advice in the optimisation modelling.

Finally, the authors would like to thank the anonymous reviewers for their valuable feedback.

References

Abdel-Aziz, M.K., Samarakoon, S., Bennis, M., Saad, W., 2020. Ultra-Reliable and Low-Latency Vehicular Communication: An Active Learning

Approach. IEEE Communications Letters 24, 367–370. doi:10.1109/LCOMM.2019.2956929.

Aguilera, M.K., Chen, W., Toueg, S., 2000. Failure detection and consensus in the crash-recovery model. Distrib. Comput. 13, 99–125. doi:10.1007/s004460050070.

Akka, 2018. Akka | Akka. URL: https://akka.io/.

Akka, 2021. Phi Accrual Failure Detector • Akka Documentation. URL: https://doc.akka.io/docs/akka/current/typed/failure-detector.html.

Barolli, L., Leu, F.Y., Enokido, T., Chen, H.C., 2018. Advances on Broadband and Wireless Computing, Communication and Applications:

Proceedings of the 13th International Conference on Broadband and Wireless Computing, Communication and Applications (BWCCA-2018).

Springer.

Becchi, M., 2008. From Poisson Processes to Self-Similarity: a Survey of Network Traﬃc Models. Technical Report. Washington University in St.

Louis. URL: https://www.cse.wustl.edu/~jain/cse567-06/ftp/traffic_models1.pdf.

Bertier, M., Marin, O., Sens, P., 2002. Implementation and performance evaluation of an adaptable failure detector, in: Proceedings International

Conference on Dependable Systems and Networks, pp. 354–363. doi:10.1109/DSN.2002.1028920.

Bisnik, N., Abouzeid, A.A., 2009. Queuing network models for delay analysis of multihop wireless ad hoc networks. Ad Hoc Networks 7, 79–97.

doi:10.1016/j.adhoc.2007.12.001.

Bosilca, G., Bouteiller, A., Guermouche, A., Herault, T., Robert, Y., Sens, P., Dongarra, J., 2016. Failure Detection and Propagation in HPC systems,

in: SC ’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 312–322.

doi:10.1109/SC.2016.26.

Burlacu, R., Geißler, B., Schewe, L., 2020. Solving Mixed-Integer Nonlinear Programs using Adaptively Reﬁned Mixed-Integer Linear Programs.

Optimization Methods and Software 35, 37–64. doi:10.1080/10556788.2018.1556661.

Buyya, R., Ranjan, R., Calheiros, R.N., 2010. InterCloud: Utility-Oriented Federation of Cloud Computing Environments for Scaling of Application Services, in: Hsu, C.H., Yang, L.T., Park, J.H., Yeo, S.S. (Eds.), Algorithms and Architectures for Parallel Processing, Springer Berlin Heidelberg. pp. 13–31.

Cai, Y., Hames, D., 2010. Minimum Sample Size Determination for Generalized Extreme Value Distribution. Communications in Statistics -

Simulation and Computation 40, 87–98. doi:10.1080/03610918.2010.530368.

Cao, J., Cleveland, W.S., Lin, D., Sun, D.X., 2003. Internet Traffic Tends Toward Poisson and Independent as the Load Increases, in: Denison, D.D., Hansen, M.H., Holmes, C.C., Mallick, B., Yu, B. (Eds.), Nonlinear Estimation and Classification. Springer, New York, NY. Lecture Notes in Statistics, pp. 83–109. doi:10.1007/978-0-387-21579-2_6.

Chan, Y.C., Wang, K., Hsu, Y.H., 2015. Fast Controller Failover for Multi-domain Software-Deﬁned Networks, in: 2015 European Conference on

Networks and Communications (EuCNC), pp. 370–374. doi:10.1109/EuCNC.2015.7194101.

Chandra, T.D., Toueg, S., 1996. Unreliable failure detectors for reliable distributed systems. Journal of the ACM 43, 225–267. doi:10.1145/226643.226647.

Chen, W., Toueg, S., Aguilera, M.K., 2002. On the quality of service of failure detectors. IEEE Transactions on Computers 51, 561–580. doi:10.1109/TC.2002.1004595.

Cheung, W.C., Simchi-Levi, D., Wang, H., 2017. Technical Note—Dynamic Pricing and Demand Learning with Limited Price Experimentation.

Operations Research 65, 1722–1731. doi:10.1287/opre.2017.1629.

Chan, C.C., Hanly, S.V., 2001. Calculating the outage probability in a CDMA network with spatial Poisson traffic. IEEE Transactions on Vehicular Technology 50, 183–204. doi:10.1109/25.917918.

Conforti, M., Cornuéjols, G., Zambelli, G., 2014. Integer Programming. volume 271 of Graduate Texts in Mathematics. Springer International Publishing, Cham. doi:10.1007/978-3-319-11008-0.

Coulouris, G., Dollimore, J., Kindberg, T., Blair, G., 2001. Time and global state. Distributed Systems Concepts and Design, 385–400.

Coulouris, G.F., Dollimore, J., Kindberg, T., 2005. Distributed Systems: Concepts and Design. Pearson Education.

Du, A.Y., Das, S., Ramesh, R., 2012. Eﬃcient Risk Hedging by Dynamic Forward Pricing: A Study in Cloud Computing. INFORMS Journal on

Computing 25, 625–642. doi:10.1287/ijoc.1120.0526.

Du, D.Z., Pardalos, P.M., Zhang, Z., 2019. Nonlinear Combinatorial Optimization. Springer International Publishing.

Fang, K.C., Wang, K., Wang, J.H., 2016. A fast and load-aware controller failover mechanism for software-defined networks, in: 2016 10th International Symposium on Communication Systems, Networks and Digital Signal Processing (CSNDSP), pp. 1–6. doi:10.1109/CSNDSP.2016.7573944.

Ferreira, P.V.R., Paffenroth, R., Wyglinski, A.M., Hackett, T.M., Bilén, S.G., Reinhart, R.C., Mortensen, D.J., 2018. Multiobjective Reinforcement Learning for Cognitive Satellite Communications Using Deep Neural Network Ensembles. IEEE Journal on Selected Areas in Communications 36, 1030–1041. doi:10.1109/JSAC.2018.2832820.

Fields, E., Osorio, C., Zhou, T., 2021. A Data-Driven Method for Reconstructing a Distribution from a Truncated Sample with an Application to Inferring Car-Sharing Demand. Transportation Science doi:10.1287/trsc.2020.1028. publisher: INFORMS.

Fischer, M.J., Lynch, N.A., Paterson, M.S., 1985. Impossibility of distributed consensus with one faulty process. Journal of the ACM 32, 374–382. doi:10.1145/3149.214121.

Geißler, B., Kolb, O., Lang, J., Leugering, G., Martin, A., Morsi, A., 2011. Mixed integer linear models for the optimization of dynamical transport networks. Mathematical Methods of Operations Research 73, 339–362. doi:10.1007/s00186-011-0354-5.

Geißler, B., Martin, A., Morsi, A., Schewe, L., 2012. Using Piecewise Linear Functions for Solving MINLPs. Mixed Integer Nonlinear Programming, 287–314. doi:10.1007/978-1-4614-1927-3. publisher: Springer Science+Business Media, New York.

Geißler, B., Morsi, A., Schewe, L., Schmidt, M., 2015. Solving power-constrained gas transportation problems using an MIP-based alternating direction method. Computers & Chemical Engineering 82, 303–317. doi:10.1016/j.compchemeng.2015.07.005.

Gendron, B., Gouveia, L., 2016. Reformulations by Discretization for Piecewise Linear Integer Multicommodity Network Flow Problems. Transportation Science 51, 629–649. doi:10.1287/trsc.2015.0634.

gigaspaces, 2019. Failure Detection. URL: https://docs.gigaspaces.com/latest/admin/troubleshooting-failure-detection.html.

Guerraoui, R., Herlihy, M., Kuznetsov, P., Lynch, N., Newport, C., 2009. On the weakest failure detector ever. Distributed Computing 21, 353–366. doi:10.1007/s00446-009-0079-3.

Gullhav, A.N., Cordeau, J.F., Hvattum, L.M., Nygreen, B., 2017. Adaptive large neighborhood search heuristics for multi-tier service deployment problems in clouds. European Journal of Operational Research 259, 829–846. doi:10.1016/j.ejor.2016.11.003.

Gupta, A.K., Smith, K.G., Shalley, C.E., 2006. The Interplay between Exploration and Exploitation. The Academy of Management Journal 49, 693–706. doi:10.2307/20159793. publisher: Academy of Management.

Guthrie, W.F., 2020. NIST/SEMATECH e-Handbook of Statistical Methods (NIST Handbook 151). URL: https://www.itl.nist.gov/div898/handbook/, doi:10.18434/M32189. type: dataset.

Hayashibara, N., Defago, X., Yared, R., Katayama, T., 2004. The φ accrual failure detector, in: Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004., pp. 66–78. doi:10.1109/RELDIS.2004.1353004.

Heilig, L., Voß, S., 2014. Decision Analytics for Cloud Computing: A Classification and Literature Review, in: Bridging Data and Decisions. INFORMS. INFORMS TutORials in Operations Research, pp. 1–26. doi:10.1287/educ.2014.0124.

Heyman, D.P., Sobel, M.J., 2004. Stochastic Models in Operations Research: Stochastic Processes and Operating Characteristics. Courier Corporation.

Hussin, M., Asilah Wati Abdul Hamid, N., Kasmiran, K.A., 2015. Improving reliability in resource management through adaptive reinforcement learning for distributed systems. Journal of Parallel and Distributed Computing 75, 93–100. doi:10.1016/j.jpdc.2014.10.001.

IBM, 2017. CPLEX User’s Manual Version 12 Release 8. URL: https://www.ibm.com/docs/en/SSSA5P_12.8.0/ilog.odms.studio.help/pdf/usrcplex.pdf.

Kaewpuang, R., Niyato, D., Wang, P., Hossain, E., 2013. A Framework for Cooperative Resource Management in Mobile Cloud Computing. IEEE Journal on Selected Areas in Communications 31, 2685–2700. doi:10.1109/JSAC.2013.131209.

Karagiannis, T., Molle, M., Faloutsos, M., 2004. Long-range dependence ten years of Internet traffic modeling. IEEE Internet Computing 8, 57–64. doi:10.1109/MIC.2004.46.

Kheirkhah, M., Wakeman, I., Parisis, G., 2019. Multipath transport and packet spraying for efficient data delivery in data centres. Computer Networks 162, 106852. doi:10.1016/j.comnet.2019.07.008.

Kirichek, R., Golubeva, M., Kulik, V., Koucheryavy, A., 2016. The home network traffic models investigation, in: 2016 18th International Conference on Advanced Communication Technology (ICACT), pp. 97–100. doi:10.1109/ICACT.2016.7423288.

Kolmogorov, A., 1933. Sulla determinazione empirica di una legge di distribuzione. Giorn Dell’inst Ital Degli Att 4, 89–91.

Kunnumkal, S., Talluri, K., 2015. On a Piecewise-Linear Approximation for Network Revenue Management. Mathematics of Operations Research 41, 72–91. doi:10.1287/moor.2015.0716.

Kutzner, F.L., Read, D., Stewart, N., Brown, G., 2017. Choosing the Devil You Don’t Know: Evidence for Limited Sensitivity to Sample Size–Based Uncertainty When It Offers an Advantage. Management Science 63, 1519–1528. doi:10.1287/mnsc.2015.2394.

Lakshman, A., Malik, P., 2010. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review 44, 35. doi:10.1145/1773912.1773922.

Laprie, J.C., 1992. Dependability: Basic concepts and terminology, in: Dependability: Basic Concepts and Terminology. Springer, pp. 3–245.

Li, Y.M., Tan, Y., De, P., 2012. Self-Organized Formation and Evolution of Peer-to-Peer Networks. INFORMS Journal on Computing 25, 502–516. doi:10.1287/ijoc.1120.0517.

Liu, J., Wu, Z., Wu, J., Dong, J., Zhao, Y., Wen, D., 2017. A Weibull distribution accrual failure detector for cloud computing. PLOS ONE 12, e0173666. doi:10.1371/journal.pone.0173666.

Liu, Z., Righter, R., 1998. Optimal Load Balancing on Distributed Homogeneous Unreliable Processors. Operations Research.

Luenberger, D.G., Ye, Y., 2016. Linear and Nonlinear Programming. volume 228 of International Series in Operations Research & Management Science. Springer International Publishing, Cham. doi:10.1007/978-3-319-18842-3.

Ma, T., Hillston, J., Anderson, S., 2010. On the Quality of Service of Crash-Recovery Failure Detectors. IEEE Transactions on Dependable and Secure Computing 7, 271–283. doi:10.1109/TDSC.2009.35.

Madhushani, U., Leonard, N.E., 2021. Heterogeneous Explore-Exploit Strategies on Multi-Star Networks. IEEE Control Systems Letters 5, 1603–1608. doi:10.1109/LCSYS.2020.3042459.

Marouani, H., Dagenais, M.R., 2008. Internal Clock Drift Estimation in Computer Clusters. Journal of Computer Systems, Networks, and Communications 2008, 1–7. doi:10.1155/2008/583162.

Melbourne University, 2018. CloudSim: A Framework For Modeling And Simulation Of Cloud Computing Infrastructures And Services. URL: https://github.com/Cloudslab/cloudsim. original-date: 2015-03-18.

Paxson, V., Floyd, S., 1994. Wide-area traffic: the failure of Poisson modeling. ACM SIGCOMM Computer Communication Review 24, 257–268. doi:10.1145/190809.190338.

Rebennack, S., 2016a. Combining sampling-based and scenario-based nested Benders decomposition methods: application to stochastic dual dynamic programming. Mathematical Programming 156, 343–389. doi:10.1007/s10107-015-0884-3.

Rebennack, S., 2016b. Computing tight bounds via piecewise linear functions through the example of circle cutting problems. Mathematical Methods of Operations Research 84, 3–57. doi:10.1007/s00186-016-0546-0.

Rebennack, S., Krasko, V., 2019. Piecewise Linear Function Fitting via Mixed-Integer Linear Programming. INFORMS Journal on Computing doi:10.1287/ijoc.2019.0890.

Ross, S.M., 1996. Stochastic Processes. Wiley.

Satzger, B., Pietzowski, A., Trumler, W., Ungerer, T., 2007. A New Adaptive Accrual Failure Detector for Dependable Distributed Systems, in: Proceedings of the 2007 ACM Symposium on Applied Computing, ACM, New York, NY, USA. pp. 551–555. doi:10.1145/1244002.1244129.

Satzger, B., Pietzowski, A., Ungerer, T., 2011. Autonomous and scalable failure detection in distributed systems. International Journal of Autonomous and Adaptive Communications Systems 4, 61. doi:10.1504/IJAACS.2011.037749.

Shen, S., Wang, J., 2014. Stochastic Modeling and Approaches for Managing Energy Footprints in Cloud Computing Service. Service Science 6, 15–33. doi:10.1287/serv.2013.0061.

Sleptchenko, A., Johnson, M.E., 2014. Maintaining Secure and Reliable Distributed Control Systems. INFORMS Journal on Computing 27, 103–117. doi:10.1287/ijoc.2014.0613.

Smirnov, N., 1948. Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 279–281. doi:10.1214/aoms/1177730256. publisher: Institute of Mathematical Statistics.

Sotoma, I., Madeira, E.R.M., 2001. ADAPTATION - Algorithms to Adaptive Fault Monitoring and their implementation on CORBA, in: Proceedings 3rd International Symposium on Distributed Objects and Applications, pp. 219–228. doi:10.1109/DOA.2001.954087.

Steeger, G., Rebennack, S., 2017. Dynamic convexification within nested Benders decomposition using Lagrangian relaxation: An application to the strategic bidding problem. European Journal of Operational Research 257, 669–686. doi:10.1016/j.ejor.2016.08.006.

Sukhov, A.M., Astrakhantseva, M.A., Pervitsky, A.K., Boldyrev, S.S., Bukatov, A.A., 2016. Generating a function for network delay. Journal of High Speed Networks 22, 321–333. doi:10.3233/JHS-160552. publisher: IOS Press.

Sun, J., Liu, F., Ahmed, M., Li, Y., 2019. Efficient Virtual Network Function Placement for Poisson Arrived Traffic, in: ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pp. 1–7. doi:10.1109/ICC.2019.8761609. ISSN: 1938-1883.

Tan, K.C., Chiam, S.C., Mamun, A.A., Goh, C.K., 2009. Balancing exploration and exploitation with adaptive variation for evolutionary multi-objective optimization. European Journal of Operational Research 197, 701–713. doi:10.1016/j.ejor.2008.07.025.

Tanenbaum, A.S., Steen, M.v., 2007. Distributed Systems: Principles and Paradigms. Pearson Prentice Hall.

Tomsic, A., Sens, P., Garcia, J., Arantes, L., Sopena, J., 2015. 2w-FD: A Failure Detector Algorithm with QoS, in: 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 885–893. doi:10.1109/IPDPS.2015.74.

Turchetti, R.C., Duarte, E.P., Arantes, L., Sens, P., 2016. A QoS-configurable failure detection service for internet applications. Journal of Internet Services and Applications 7, 9. doi:10.1186/s13174-016-0051-y.

Vielma, J.P., Ahmed, S., Nemhauser, G., 2010a. Mixed-Integer Models for Nonseparable Piecewise-Linear Optimization: Unifying Framework and Extensions. Operations Research 58, 303–315. doi:10.1287/opre.1090.0721. publisher: INFORMS.

Vielma, J.P., Ahmed, S., Nemhauser, G., 2010b. A Note on “A Superior Representation Method for Piecewise Linear Functions”. INFORMS Journal on Computing 22, 493–497. doi:10.1287/ijoc.1100.0379. publisher: INFORMS.

Xiong, N., Defago, X., 2007. ED FD: Improving the Phi Accrual Failure Detector. Research Report. School of Information Science, Japan Advanced Institute of Science and Technology. URL: https://dspace.jaist.ac.jp/dspace/bitstream/10119/4799/1/IS-RR-2007-007.pdf.

Xiong, N., Vasilakos, A.V., Wu, J., Yang, Y.R., Rindos, A., Zhou, Y., Song, W., Pan, Y., 2012. A Self-tuning Failure Detection Scheme for Cloud Computing Service, in: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 668–679. doi:10.1109/IPDPS.2012.126.

Xu, Z., Tang, J., Meng, J., Zhang, W., Wang, Y., Liu, C.H., Yang, D., 2018. Experience-driven Networking: A Deep Reinforcement Learning based Approach, in: IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pp. 1871–1879. doi:10.1109/INFOCOM.2018.8485853.

Yang, T.W., Wang, K., 2016. Failure detection service with low mistake rates for SDN controllers, in: 2016 18th Asia-Pacific Network Operations and Management Symposium (APNOMS), pp. 1–6. doi:10.1109/APNOMS.2016.7737210.

Yu, H., Zheng, D., Zhao, B.Y., Zheng, W., 2006. Understanding user behavior in large-scale video-on-demand systems. ACM SIGOPS Operating Systems Review 40, 333–344. doi:10.1145/1218063.1217968.

