Smart routing: Towards proactive fault handling of software-defined
networks
Ali Malik a,*, Benjamin Aziz b, Mo Adda b, Chih-Heng Ke c

a School of Electrical and Electronic Engineering, Technological University Dublin, Dublin D08 NF82, Ireland
b School of Computing, Buckingham Building, University of Portsmouth, Portsmouth PO1 3HE, United Kingdom
c Department of Computer Science and Information Engineering, National Quemoy University, Kinmen 892, Taiwan
Article info

Article history:
Received 8 January 2019
Revised 26 December 2019
Accepted 13 January 2020
Available online 16 January 2020

Keywords:
OpenFlow
Software-defined networking
Fault management
Risk management
Service availability

Abstract
In recent years, the emerging paradigm of software-defined networking has become a hot and thriving
topic in both the industrial and academic sectors. Software-defined networking offers numerous benefits
against legacy networking systems by simplifying the process of network management through reducing
the cost of network configurations. Currently, data plane fault management is limited to two mecha-
nisms: proactive and reactive . These fault management and recovery techniques are activated only after a
failure occurrence and hence packet loss is highly likely to occur. This is due to convergence time where
new network paths will need to be allocated in order to forward the affected traffic rather than drop it.
Such convergence leads to temporary service disruption and unavailability. Practically, not only the speed
of recovery mechanisms affects the convergence, but also the delay caused by the process of failure de-
tection. In this paper, we define a new approach for data plane fault management in software-defined
networks where the goal is to eliminate the convergence process altogether rather than accelerate the
failure detection and recovery. We propose a new framework, called Smart Routing , which allows the net-
work controller to receive forewarning signs on failures and hence avoid risky paths before the failure
incidents occur. The proposed approach aims to decrease service disruption, which in turn increases net-
work service availability. We validate our framework through a set of experiments that demonstrate how
the underlying model runs and its impact on improving service availability. We demonstrate the applicability
of the new framework on three types of topologies, covering both real and simulated networks.
©2020 Elsevier B.V. All rights reserved.
1. Introduction
Concern about Internet ossification, a consequence of the growing
variety of networks (e.g. IoT, WSN, Cloud, etc.) that serve up to 9 billion
users around the globe, has led to efforts to replace the existing rigid
network infrastructure with a programmable one [1]. In this context,
Software-Defined Networking (SDN) has emerged as a promising solution
for tackling the inflexibility of legacy networking systems.
In fact, SDN is part of a long history of attempts that aim to lower
the barrier of deploying new innovations and make the network
more programmable. For more on the history of programmable
networks, we refer the interested readers to [2] . Unlike traditional
IP networks, SDN architectures consist of three planes: control, data
and application. The control plane, sometimes called the controller,
represents the network brain; it provides the essential functions and
exerts granular control by relying on a global view of the network
topology, a crucial feature that was missing in the past. The data plane
comprises the network forwarding elements that constitute the network
topology. These forwarding elements are dictated by the network
controller, and therefore all nodes need to disclose their status
periodically to the controller, which is how the global view is obtained.
In general, many studies classify SDN into two layers, considering the
application plane as a complementary part of the control layer that
solves various kinds of network issues such as firewalling and load
balancing. So far, OpenFlow [3] is the most widely used protocol that
enables the controller to govern the SDN data plane by carrying the
forwarding rules, as well as to enable the exchange of signals between
the two planes. SDN is no longer confined to the academic field; it has
also attracted industry players such as Google and Microsoft [4].

* Corresponding author.
E-mail addresses: ali.malik@tudublin.ie, up714266@myport.ac.uk (A. Malik),
benjamin.aziz@port.ac.uk (B. Aziz), mo.adda@port.ac.uk (M. Adda),
smallko@gmail.com (C.-H. Ke).
Fig. 1. The compartmentalisation of dependability concept, Laprie [5] .
Nowadays, communication networks play a vital role in everyday human
activities, as they represent the backbone of modern technologies.
Networking equipment is failure-prone and therefore aspects such as
availability, reliability and fault management are necessary. The term
dependability is the umbrella that encompasses the aforementioned
aspects, and Fig. 1 gives an overall picture of the dependability
taxonomy according to [5]. This paper focuses on the availability
attribute, by means of fault tolerance and forecasting of SDN link
failures.
Although SDNs have brought many benefits and significant network
improvements, some new challenges that accompany this innovation, such
as security [6,7] and recovery from failure, still need to be addressed
thoroughly in order to maximise the utility of SDNs [8,9]. A data plane
link failure is a manifold problem, because the controller first needs
to be notified about the failure and then to compute alternative routes
in order to update the affected forwarding elements. Network link
failures are not a recent phenomenon; they occur in everyday operation
with varying durations and causes [10]. However, the new OpenFlow
architecture requires more investigation in order to eliminate the
challenges that hamper its growth.
In order to maximise the service availability in SDNs, we define
a new approach that minimises the percentage of service unavail-
ability by using online failure prediction. This allows the network
controller to perform the necessary reconfiguration prior to the oc-
currence of failure incidents. Although a number of works on SDN
fault management have been proposed, none of them has exploited
the global view of SDNs in the context of failure prediction. With
this context in mind, we can summarise the main contributions of
this paper as follows:
A new network model that allows for the forecasting of link
failures by predicting their characteristics in an online fashion.
This model also demonstrates how link failure prediction can
be integrated into the process of proactive restoration with the
aid of risk analysis.
We provide an implementation of the new model in terms of
a couple of fault tolerance algorithms. We use simulation tech-
niques to test the efficiency of these algorithms. Our simulation
results prove that the proposed model and algorithms improve
the service availability of SDNs.
The rest of the paper is organised as follows. Section 2 in-
troduces various SDN fault management techniques from the lit-
erature. The problem statement is highlighted in Section 3. We
then present our network model in Section 4, the risk analysis in
Section 5, and the proposed framework in Section 6. Sections 7 and 8
present the experimental procedure, the observed results and a
comparison. Finally, the summary of this paper is provided in
Section 9 with some future directions.
2. Related work
Link failures often occur as part of everyday network operation. Due
to their importance and the negative impact they have on network
Quality of Service (QoS), a considerable amount of research has been
conducted to analyse, characterise, evaluate and recover from this
frequent issue. The physical separation of the control plane from the
data plane results in two independent entities, both of which are
susceptible to failure incidents. According to [11], control plane
failures are more severe than other failures, owing to the significant
role of the network controller in managing all network activities. For
more details about control plane failures, we refer the interested
readers to [12,13]. In this paper, however, we focus on data plane link
failures only.
Communication networks are prone either to unintentional (unplanned)
failures, due to various causes such as human errors, natural disasters
like earthquakes, overload, software bugs and cable cuts, or to
intentional (planned) failures caused by maintenance [14,15]. A failure
recovery scheme is a necessary requirement for any networking system to
ensure reliability and service availability. Generally, the failure
recovery mechanisms of carrier-grade networks are categorised into two
types: proactive and reactive. In case of a link failure, the resilience
mechanism of an SDN ought to redirect the affected flows in order to
avoid the location of the failure and keep the system working despite
the abnormal situation. The SDN controller can mask data plane failures
either proactively or reactively [16], and each approach has its pros
and cons. In this section, we discuss current efforts to tackle data
plane link failures.
2.1. Proactive
In the proactive approach, also known as protection, the alternative
paths are preplanned and reserved in advance (before a failure occurs).
According to [17], there are three protection schemes that can be
applied to recover from network failure:

One to One (1:1): one protection path is dedicated to protect exactly
one working path.
One to Many (1:Y): one protection path is dedicated to protect up to Y
working paths.
Many to Many (X:Y): X protection paths are dedicated to protect up to Y
working paths, such that X ≤ Y.

The authors in [18] implemented an OpenFlow monitoring function for
achieving fast data plane recovery. In [19], another protection method
was proposed using the OpenFlow-based Segment Protection (OSP) scheme.
The main disadvantage of this strategy is that it consumes the data
plane storage capability, since the more flow entries (i.e. rules) are
stored, the more space is used; current OpenFlow appliances on the
market can only accommodate up to 8000 flow entries due to the
limitation of Ternary Content-Addressable Memory (TCAM), hence such
solutions are costly [4,16], an issue discussed in studies such as
[20,21]. In addition, installing many attributes in the OpenFlow
forwarding elements can degrade the match-and-action process of the
data plane nodes. Moreover, there is no guarantee that the reserved
backups are failure-free; in other words, the backup path might fail
before the primary one, resulting in a waste of space and time.
2.2. Reactive
In the reactive approach, also called restoration, the alternative
paths are not preplanned and are calculated dynamically when a failure
occurs. The authors in [22,23] presented an OpenFlow restoration method
to recover from single-link failures; however, their experiments were
only conducted on small-scale network topologies not exceeding 14
nodes. In [24], the authors demonstrated through extensive experiments
that OpenFlow restoration is not easily attainable within 50 ms,
especially for large-scale networks, unless the protection technique is
used.
In the same context, some works have utilised the concept of multiple
disjoint paths employed as backups. For example, CORONET [25] is a
fault tolerance system for SDNs in which multiple link failures can be
resolved. The ADaptive Multi-Path Computation Framework (ADMPCF) [26]
for large-scale OpenFlow networks is a traffic engineering tool that
can hold two or more disjoint paths to be utilised when certain network
events occur (e.g. link failure). HiQoS [27] is a traffic engineering
tool aiming at better QoS for SDNs; it computes multiple paths (at
least two constrained paths) between all possible pairs in a network,
so that quick recovery from a link failure is attainable. Most of the
existing works did not take into account the processing time of the
flow entries (e.g. insert, delete and modify) that need to be updated.
Although the performance of OpenFlow devices depends on their
manufacturer specification, the authors in [28] state that a single
flow entry insertion takes between 0.5 ms and 10 ms, while at least
11 ms is required to modify a single rule, since each modification
includes both the deletion of the old rule and the insertion of the new
one [29]. A number of studies, such as [30-32], used the OpenState
mechanism to recover from data plane failures without depending on the
network controller, thereby reducing the load on the controller and
speeding up the recovery process. However, such approaches are still
inapplicable, as existing OpenFlow equipment does not support this
customisation.
Unlike the existing works, the authors in [33] dealt with the problem
of minimising the flow entry update time required to divert traffic
from an affected primary path to a backup one. Although the presented
algorithms do not guarantee the end-to-end shortest path, they open a
new direction that is worth exploring. Within the same context, the
authors in [34] produced new algorithms that minimise the required
update time by reducing the solution search space from source to
destination in the affected path. Similarly, in [35] an approach that
divides the network topology into non-overlapping cliques was proposed
to tackle the failure issue in a local rather than global manner. Both
[34,35] took into account the time required to compute the alternative
route in order to speed up the update operation. However, the main
issue with the last three works is that they do not secure the shortest
path from source to destination.
2.3. Summary
In summary, previous studies have produced different methods to tackle
the problem of data plane recovery from link failure incidents; for
more details about SDN fault management we refer the interested readers
to the recent surveys [11,36]. Ultimately, protection techniques are
not ideal due to TCAM space exhaustion, whereas latency is the major
drawback of the existing restoration methods. As a result, SDN fault
management still needs more research and investigation.
3. Problem statement
Distinctly, the existing SDN fault management techniques get involved
only after a failure has occurred. Thus, they cannot prevent a certain
impact on traffic flows, such as service unavailability. This problem
occurs due to the delay of the convergence scheme T_C. We define T_C as
the time required to amend a path in response to a failure scenario.
Typically, the convergence time in SDN is a combination of three
factors:

Failure detection time (T_D): the time required to detect a failure
incident. Compared with conventional networking systems, the
centralised management and global view of SDN ease this task by
continuously monitoring the network status and receiving notifications
upon failures. However, the speed of receiving a notification sometimes
depends on the nature of the network design and the mode of
communication (in-band or out-of-band) [37,38]. According to [39], the
link failure detection time ranges from tens to hundreds of
milliseconds, and may also depend on the type of commercial OpenFlow
switch.

New route computation time (T_SP): the time spent when the network
controller runs a nominated shortest path routing algorithm (e.g.
Dijkstra [40]) to compute the backup path (usually for reactive fault
tolerance strategies). The T_SP computation time can reach tens of
milliseconds [34], depending on how big the network is.

Flow entries update time (T_Update): the time required to update the
relevant switches, i.e., the nodes involved in the affected path. This
factor depends on how many forwarding rules need to be updated after
the failure scenario, where the time for every single rule can exceed
10 ms.
Accordingly, the resulting convergence time can be calculated through
the following equation:

T_C = T_D + T_SP + Σ_{src}^{dst} T_Update    (1)
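To make Eq. (1) concrete, the following minimal sketch (in Python, the
language the framework itself is built in) estimates the convergence time
for a single affected path. The timing values are illustrative assumptions
taken from the ranges quoted above, not measurements.

```python
# Illustrative sketch of Eq. (1): T_C = T_D + T_SP + sum of per-switch rule updates.
# All timing constants are assumptions drawn from the ranges discussed in the text.

def convergence_time_ms(t_detect_ms, t_sp_ms, per_rule_update_ms, affected_switches):
    """Estimate the convergence time (ms) for one affected path.

    affected_switches: number of switches between src and dst whose
    flow entries must be updated after the failure.
    """
    t_update_total = per_rule_update_ms * affected_switches
    return t_detect_ms + t_sp_ms + t_update_total

# Example: 50 ms detection, 10 ms path computation, 11 ms per rule, 6 switches.
print(convergence_time_ms(50, 10, 11, 6))  # -> 126 ms of service disruption
```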
Currently, the classical SDN fault management methods aim to tackle the
failure after its occurrence; the recovery mechanism is activated after
the moment of failure, and hence all the previous proposals incur a
certain amount of delay according to (1). The only way to completely
overcome the three factors of (1) altogether is to handle the failure
before it occurs. Therefore, failure prediction is required to provide
awareness of potential future incidents and to allow the controller to
perform reconfiguration actions that override failures before they
damage some paths. Although a number of studies have put effort into
the area of failure prediction, none of the traditional (except the
work in [41]) or new-generation networking systems has exploited the
information that can be gained from a prediction method to eliminate
network incidents (e.g. link failures). To the best of our knowledge,
Vidalenc et al. [41] is the only realistic study that discussed the
advantages of failure prediction, by producing a risk-aware routing
method for legacy IP networks. Our work differs from theirs by building
a realistic framework of proactive failure management for SDNs. In this
work, we combine the concept of online failure prediction with risk
analysis towards maximising the network service availability.
4. The proposed model
Anticipating failures before they occur is a promising approach for
further enhancing SDN failure management techniques, i.e., the
proactive and reactive ones, in which the controller responds to
failures only when they take place. The proposed SDN model
Table 1
List of notations.

Symbol   Description
src      Source router
dst      Destination router
A        Service availability
U        Service unavailability
e_ij     Link traversing any two arbitrary routers i and j
Q_ptr    A pointer that points to the first e_ij in the Queue
F        Failed link set
F_R      Failed/affected route set
PF_L     Potential failed link set
PF_R     Potential failed route set
M        Prediction alarm message
CO       Network controller
T_φ      Threshold of failure probability
T_ω      Threshold of risk
OF       OpenFlow instruction
TP       True positive
FN       False negative
FP       False positive
CC       Cable cuts per year
SP_x     Any shortest path algorithm x in terms of hops
for anticipating link failure events is presented in this section. We
start first by outlining some of the notations we use in the rest of
the paper, as shown in Table 1 . The network topology is modelled
as an undirected graph G = (V, E), where V represents the finite set of
vertices (i.e. routers) in G, ranged over by {v_i, v_j, ..., v_z} with
{i, j, ..., z} ⊆ {1, ..., n} for n ∈ N, and E represents the finite set
of bidirectional edges (i.e. links) in G, denoted by {e_ij}, where each
e_ij ∈ E is an edge that connects v_i and v_j. We define the following
operational test function (OP) over a link, which reflects the link
state:

OP(e_ij) = 1 if the link is operational, 0 otherwise

Therefore F can be defined as follows:

F = {e_ij | e_ij ∈ E ∧ OP(e_ij) = 0}

Based on G, we define a path P as a sequence of consecutive vertices
representing routers in the network. Each path starts from a source
router, src, and ends with a destination router, dst:

P = (src, ..., dst)

We define the set Flow to represent all demand traffic flows that need
to be serviced. Each flow ∈ Flow is an instance of P, associated with a
particular traffic demand defined by a unique src and dst. We consider
flow_set to be the set of all possible paths between src and dst that
can be derived from G, defined as follows:

flow_set = {P | (first(P) = src) ∧ (last(P) = dst)}

and the definitions of first and last are given as functions on any
general sequence (a_1, ..., a_n):

first((a_1, ..., a_n)) = a_1
last((a_1, ..., a_n)) = a_n

We also consider P_set as the set containing all the admissible paths
that can be constructed from G; this means P ∈ P_set and therefore
Flow ⊆ P_set. When a link failure is reported in G, we identify the
affected routes as follows:

F_R = {flow | flow ∈ Flow ∧ ∃ v_i, v_j . v_i, v_j ∈ flow ∧ OP(e_ij) = 0}

In the same context, we also consider the case when there is a link
failure prediction message m_i ∈ M, where the set M is denoted by
{m_i}_{i=1..n} and each m_i ∈ M is defined as m_i = (ē_ij, t), where t
is the time at which the system receives m_i. In this context, we
define:

PF_L = {ē_ij | ē_ij ∈ E ∧ ∃ m_i . m_i = (ē_ij, t) ∧ m_i ∈ M}

to characterise the reported link, where the bar notation ē_ij is a
shorthand indicating that e_ij ∈ PF_L has the state of potential to
fail and hence does not belong to F. Now, we can define the potential
to fail route set as follows:

PF_R = {f̄low | f̄low ∈ Flow ∧ (∃ ē_ij . ē_ij ∈ f̄low ∧ ē_ij ∈ PF_L)}

where f̄low is a flow that contains at least one ē_ij; in other words,
f̄low ∩ PF_L ≠ ∅.
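As a minimal sketch of the sets defined above, the snippet below uses
networkx (the same library the framework relies on in Section 6.2) to
derive F_R and PF_R for a toy topology; the topology, the flows and the
failed/predicted links are invented for illustration only.

```python
import networkx as nx

# Toy topology and demand flows (illustrative only).
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 4), (1, 4), (2, 4)])
Flow = [nx.shortest_path(G, 1, 3), nx.shortest_path(G, 1, 4)]  # one path per (src, dst)

def links_of(path):
    """Set of undirected links traversed by a path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

def affected_routes(flows, failed_links):
    """F_R: flows that traverse at least one failed link."""
    failed = {frozenset(l) for l in failed_links}
    return [f for f in flows if links_of(f) & failed]

def potential_failed_routes(flows, predicted_links):
    """PF_R: flows that traverse at least one link named in a prediction message."""
    predicted = {frozenset(l) for l in predicted_links}
    return [f for f in flows if links_of(f) & predicted]

print(affected_routes(Flow, [(2, 3)]))          # flows broken by a failure of e_23
print(potential_failed_routes(Flow, [(1, 4)]))  # flows at risk if e_14 is predicted to fail
```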
4.1. SDN predictive model
All the previous efforts that dealt with data plane failures have
succeeded in mitigating the impact of failures (e.g. reducing the
downtime) rather than attempting to obviate their effect, such as
minimising the service unavailability. Network incidents that cause
routing instability, i.e., flaps, and lead to significant degradation
of network service availability vary [42,43]; in our case, however, we
are only concerned with data link failures. By relying on monitoring
techniques, some failures can be predicted through failure tracking,
syndrome monitoring and error reporting [44]. Consequently, a set of
conditions can be defined as a base to trigger a failure warning when
at least one of the predefined conditions is satisfied. The following
simple form illustrates rule-based failure prediction:

if condition_1 then warning trigger
...
if condition_n then warning trigger
Online failure prediction strategies vary, and include machine learning
techniques (e.g. the k-nearest neighbour algorithm [45]) and statistical
analysis methods (e.g. time series [44], Kalman and Wiener filters
[46]). Such techniques can be used to predict incoming events in the
short term by relying on the past and current state information of a
system. However, in this paper we do not intend to propose a failure
prediction solution, as extensive studies have been conducted in this
field with remarkable achievements. Instead, employing online failure
prediction as a technique to enrich current SDN fault management is one
of the main aims of this work. A generic overview of the time relations
of online failure prediction is presented in Fig. 2, which presumes the
following:
t_d: represents the historical data window upon which the predictor
forecasts the upcoming failure events.
t_l: represents the lead time, the time at which a failure alarm is
generated; it can also be defined as the minimum duration between the
prediction and the failure.
t_w: is the warning time, in which an action may be required to find a
new solution based on the predicted event. Therefore, t_l must be
greater than t_w so that the prediction information is serviceable. In
SDN, t_w should be at least long enough to set up the longest shortest
path in the given G.
t_p: represents the time for which the prediction is assumed to be a
valid case. This should be defined carefully by the network operator so
as to identify the true and false alarms after a certain time window
(i.e. t_p).
The quality of the failure prediction is usually evaluated by two
parameters, FP and FN, whereas Recall and Precision are the two
well-known metrics used to measure the overall performance:

Recall = TP / (TP + FN),    Precision = TP / (TP + FP)    (2)
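The two metrics of Eq. (2) can be computed directly from the TP/FP/FN
counters the framework keeps (cf. Table 2); a minimal sketch with
invented counts:

```python
def recall(tp, fn):
    # Fraction of failures that actually occurred and were captured by the predictor.
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    # Fraction of raised alarms that corresponded to real failures.
    return tp / (tp + fp) if (tp + fp) else 0.0

# Example: 3 captured failures, 1 missed failure, 2 false alarms.
print(recall(3, 1), precision(3, 2))  # 0.75 0.6
```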
Fig. 2. Online failure prediction and time relations, Salfner et al. [44].
Table 2
Controller actions based on prediction.

Prediction   Action
TP           Select an alternative route
FP           Unnecessary/needless action
FN           Call the standard failure recovery
Fig. 3. Relation between prediction and failure sets.
Recall is defined as the ratio of accurately captured failures to the
total number of failures that actually occurred, while Precision is
defined as the ratio of correctly classified failures to the total
number of positive predictions. Correspondingly, SDN controller actions
are now associated with the predicted and unpredicted situations as
listed in Table 2.
On one hand, every false failure alarm leads to an unnecessary
reconfiguration of a particular set of routes in Flow, which causes
unwitting network instability. On the other hand, the controller needs
to deal with undetected failures in a similar way to the classical
methods. Consequently, the more precise the prediction behaviour, the
higher the network stability and service availability that will be
gained. Fig. 3 shows the relation between the network model and the
predictive model.
4.2. Failure event model
We have implemented an approach for generating failure events, as it is
very difficult to find a public network dataset that includes useful
details such as failures; hence, we adopted the alternative of
developing our own failure model. This research intends to enhance SDN
fault tolerance and resilience by maximising the network service
availability. Two basic metrics are exploited in this model: the mean
time between failures (MTBF) and the mean time to recover (MTTR), which
are essential for calculating the availability and reliability of each
repairable network component [5,50]. MTBF is defined as the average
time for which a particular component functions before failing,
calculated as:

MTBF = (start_down_time − start_up_time) / number of failures

while MTTR is the average time required to repair a failed component.
Each component, i.e., link, is characterised by its own values of MTBF
and MTTR, which are commonly independent of other components in the
network. As a consequence of the lack of real data, some metrics (such
as cable length and CC) can be used instead for measuring the two
availability metrics. According to [50], MTBF can be calculated as
follows:

MTBF(hours) = (CC × 365 × 24) / Cable Length    (3)
For instance, when CC is equal to 100 km, it means that per 100 km
there will be on average one cut per year. Besides this, the MTTR of a
link is influenced by its length [51], reflecting the fact that a
longer link has a higher MTTR value. On this basis, we designed the
following formula for calculating the MTTR value of each link in the
network:

MTTR(hours) = γ × Cable Length    (4)

where γ is a parameter indicating the time required to fix the cable,
measured in hours per kilometre. Because links are physically
distributed in different locations and environments, γ differs from one
link to another; even if some links have the same length, their γ could
be different, as it depends on the physical location and the ambient
conditions. Further discussion is given in Section 6.
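A short sketch of Eqs. (3) and (4); the link length, the CC value and the
γ range are illustrative assumptions (the framework itself draws γ from a
uniform distribution per link, Section 6.3).

```python
import random

def mtbf_hours(cable_length_km, cc_km):
    """Eq. (3): mean time between failures in hours, given one cable cut per cc_km per year."""
    return cc_km * 365 * 24 / cable_length_km

def mttr_hours(cable_length_km, gamma_h_per_km):
    """Eq. (4): mean time to recover in hours; gamma is the repair time per kilometre."""
    return gamma_h_per_km * cable_length_km

link_length = 600.0                    # km, illustrative
gamma = random.uniform(0.005, 0.02)    # assumed range of hour/km repair rates
print(mtbf_hours(link_length, cc_km=145.0), mttr_hours(link_length, gamma))
```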
5. Risk analysis
According to [52] , risk can be defined as an attempt to answer
the following three questions:
What scenario could occur?
What is the likelihood that scenario would occur?
What is the consequence if the scenario does occur?
We consider these questions towards formulating the risk of
failure in SDNs.
What scenario could occur? The scenario can be defined as any
undesirable event, such as a failure. According to [53], there are
three main types of failure scenarios that could affect an SDN
networking system: controller failure (including hardware and
software), communication component failure (i.e. node and link) and
application failure (e.g. bugs in application code). This paper
considers the scenario of link failure only. Such a scenario breaks the
service when it occurs; therefore, finding an alternative path is
necessary. We define the set of link failure scenarios as F, ranged
over by variables f_1, f_2, ..., f_n ∈ F.
What is the likelihood that the scenario would occur? The likelihood
that a failure scenario disrupts the network services is conditional on
the occurrence of the scenario. We address this question with the aid
of online failure prediction, which in our case works on the basis of a
scenario's failure probability, p. Each failure scenario is associated
with a p value that by nature ranges between 0 and 1; this is further
discussed in Section 6.
What is the consequence if the scenario does occur? We address this
question by computing the percentage of loss, or consequence, c, that
might potentially result when a failure scenario is predicted at an
early stage. Each failure scenario might lead to some disconnections
and service disruption, so the severity of the adverse effects of each
failure scenario varies. For instance, the consequence c_1 caused by
f_1 might be different from the consequence c_2 caused by f_2, which
would reflect the outage costs resulting from disrupting some of the
network connections.
Table 3
List of failure scenarios.

Scenario   Probability   Consequence
f_1        p_1           c_1
f_2        p_2           c_2
...        ...           ...
f_n        p_n           c_n

Over a period of time, these questions produce a list of outcomes as
exemplified in Table 3, where each ith row in the table can be
represented as a triplet ⟨f_i, p_i, c_i⟩.
Risk can be estimated by using such information as follows:

Risk = {⟨f_i, p_i, c_i⟩}, i = 1, 2, ..., n    (5)
Since we are considering only link failure scenarios, f(e_ij), we shall
refine the definition of risk in (5). Accordingly, we redefine the risk
as the chance of damage, determined by the combination of the
probability of link failure and its consequence:

Risk_f(e_ij) = p(e_ij) × c(e_ij)    (6)
To deduce the risk value, the two factors of (6), i.e., p and c, can be
assessed independently. On one hand, the probability, p, depends on the
efficacy of the online failure predictor at determining the likelihood
of incoming failure scenarios, which is, in this study, governed by a
selective failure probability threshold value, T_φ. On the other hand,
the consequence, c, can be measured based upon the percentage of
affected routes that would result from the anticipated scenario. In our
case, we take as the definition of such consequence one of the global
network topological characteristics, namely Edge Betweenness Centrality
(EBC) [47]. This is due to the fact that EBC is a direct indicator of
the number of paths that would fail as a consequence of the failure of
a particular link, therefore providing a natural measure of risk
consequence. The EBC of a link e_ij is the total number of shortest
paths between pairs of nodes that traverse the edge e_ij [47], which
can be formulated as follows:

EBC_e_ij = Σ_{v_i ∈ V} Σ_{v_j ∈ V} σ_{v_i v_j}(e_ij) / σ_{v_i v_j}    (7)
where σ_{v_i v_j} denotes the number of shortest paths between nodes
v_i and v_j, while σ_{v_i v_j}(e_ij) denotes the number of those
shortest paths that pass through e_ij ∈ E. For instance, Fig. 4
demonstrates an example topology with an EBC value for each link in
the network, calculated using Ulrik Brandes' algorithm [48]. The EBC of
e_12, for example, is the number of shortest paths containing the edge
divided by the number of all possible paths; therefore EBC_e_12 = 0.4,
because there are 20 possible paths in the example topology and 8 of
them pass through the edge e_12.

Fig. 4. Topology example with different EBC values.

Given that the network controller knows the demand traffic matrix
between all pairs in the network, i.e., Flow, Eq. (7) in our case is
congruent with the following:

EBC^M_e_ij = |flow_(e_ij)| / |Flow|    (8)

where |Flow| denotes the total number of paths in the Flow set, while
|flow_(e_ij)| denotes the number of paths in the Flow set that pass
through e_ij ∈ M.
With the above context in mind, the higher the EBC value of e_ij
, which is a normalised value between 0 and 1, the more critical
the link is and therefore, the higher the score indicating the con-
sequences. This is because the outcome of failure for a link with
high EBC will definitely lead to a huge number of path failures and
therefore a higher percentage of negative impacts on the availabil-
ity of network services.
Our goal in this analysis is to gauge the percentage of possi-
ble loss and provide such information to the concerned decision-
making mechanism, i.e., the routing mechanism in our case. For
more details about the existing risk analysis methods that fit SDNs,
we refer the interested readers to [49] .
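The risk score of Eq. (6) combines the predictor's failure probability
with the EBC-based consequence of Eqs. (7) and (8). The sketch below
computes both for a toy topology: networkx's edge_betweenness_centrality
gives the normalised EBC of Eq. (7), while the demand-aware variant of
Eq. (8) is computed over an assumed Flow set; the probability value
stands in for whatever the alarm message carries.

```python
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 4), (1, 4), (2, 4)])  # toy topology

# Eq. (7): topological EBC, normalised over all node pairs.
ebc = nx.edge_betweenness_centrality(G, normalized=True)

# Assumed demand: one shortest path per (src, dst) pair of interest.
Flow = [nx.shortest_path(G, 1, 3), nx.shortest_path(G, 1, 4), nx.shortest_path(G, 2, 4)]

def ebc_demand(link, flows):
    """Eq. (8): fraction of demand paths that traverse the given link."""
    link = frozenset(link)
    hits = sum(1 for p in flows if link in {frozenset(e) for e in zip(p, p[1:])})
    return hits / len(flows)

p_fail = 0.3                              # assumed probability carried by the alarm message
risk = p_fail * ebc_demand((2, 3), Flow)  # Eq. (6): risk = probability x consequence
print(ebc[(2, 3)], risk)
```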
6. The proposed framework
From a high level point of view, Fig. 5 illustrates the main com-
ponents of our proposed framework where the Smart Routing (SR)
and Prediction Model components are the primary contribution of
our work. We discuss next in more detail the components we used
and developed in this framework.
6.1. SDN controller
The proposed framework supports the POX controller [54], an open source
SDN controller written in Python that is more suitable for fast
prototyping than other available controllers such as [55]. The standard
OpenFlow protocol [3] is used for establishing the communication
between the data and control planes, whereas the set of POX APIs can be
used for developing various network control applications.
Fig. 5. Architecture of the proposed framework.
Table 4
Service availability and network flows relation.

Flow             src→dst accessibility   T_C        Notes
flow_x           Yes                     —          Path is working
flow_x ∈ F_R     No                      T_D        Path is not working
flow_x ∈ F_R     No                      T_SP       Search for alternatives
flow_x ∈ F_R     No                      T_Update   Path is restoring
flow_x           Yes                     —          Path is restored
6.2. Smart routing
Firstly, this module is responsible for maintaining and parsing the
underlying network topology. Topology parameters such as the number of
nodes and links, the way they are connected and port status can be
detected via the Link Layer Discovery Protocol (LLDP) [56], which is
one of the vital features of the current OpenFlow specification. The
openflow.discovery component [57], an already developed POX component,
is used to send specially crafted LLDP messages out of the OpenFlow
nodes so that the topological view over the data plane layer can be
built.
This module then converts the discovered network topology into a graph
G representation for efficient management purposes. To do so, we
utilised the Networkx tool [58], a pure Python package with a set of
powerful functions for manipulating network graphs. When the network
starts working, and after shaping the data plane topology, the shortest
path for each flow ∈ Flow is configured by the appointed SP_x algorithm
and thereafter stored in the Operational Routes table, which is
specified to contain all the desired working (healthy) paths.
In order to show how a link failure incident could affect the
configured paths from the perspective of service availability and
convergence time, we provide a simple example in Table 4, in which the
service deterioration of flow_x due to a link failure incident is
highlighted.
To maintain the Operational Routes table, two algorithms have been
implemented, each with its own view of how to keep Flow maintained.
Algorithm 1 depicts the baseline shortest path routing strategy
(henceforth called Baseline Routing (BR)), which is what the SDN
controller currently performs. We specify Dijkstra's algorithm [40],
with complexity O(|E| + |V| log |V|), as the shortest path finder for
Algorithm 1, and denote it by SP_D instead of SP_x. So SP_D is a
Dijkstra function that can be applied on any flow_set to return one
unique shortest path.
When the OpenFlow controller reports a link failure event, every path
suffering from that failure is detected and two operations are issued
by the controller. First, a Remove command, denoted by OF_Remove, is
sent to all the routers that belong to each failed path in Flow as a
step towards weeding out the incorrectly working entries; then an
alternative route is computed for every affected flow. The new flow
entries of the alternative path are then forwarded to the relevant
routers of each flow through the Install command, denoted by
OF_Install. Each modified flow, i.e., one assigned to an alternative,
is stored in a special set called the Labeled Flow (LF), where
LF ⊆ Flow and |LF| = n; this indicates that each flow ∈ LF is in a
sub-optimal state. The recovery from link failure is demonstrated in
lines 1-13. The algorithm also includes the reversion process after a
failure is repaired (lines 15-32), which is no less important than the
recovery process [59], and also accounts for the routing flaps that are
needed for later analysis. In fact, we developed this algorithm for
comparison purposes only, against Algorithm 2; therefore, it does not
reflect a contribution of this paper.
In contrast, Algorithm 2 is one of the main contributions of this work;
it exploits the prediction information to enhance the service
availability and the fault
Algorithm 1: Baseline routing (BR).

On Normal: ∀ flow ∈ Flow: Set Primary Path as flow, flow ← SP_D(flow_set)
On Failure: Do the following procedure
1  if Link failure reported then
2      foreach e_ij ∈ F do
3          Compute: F_R
4      end
5      do
6          OF_Remove(flow)
7          flow_set := flow_set − {flow}
8          flow := SP_D(flow_set)
9          OF_Install(flow)
10         LF ← flow
11         F_R := F_R − {flow}
12     while F_R ≠ ∅;
13 end
14 c := 0
15 if Link repair reported then
16     do
17         if flow_c is currently optimal then
18             Do nothing
19             c := c + 1
20         end
21         if flow_c is currently sub-optimal then
22             OF_Remove(flow_c)
23             flow_c := SP_D(flow_c_set)
24             OF_Install(flow_c)
25             LF := LF − {flow_c}
26             c := c + 1
27         end
28         if number of links = E_len then
29             LF := empty
30         end
31     while c ≤ LF_len;
32 end
tolerance of SDNs. This algorithm depends on Bhandari's algorithm for
finding K edge-disjoint paths [60], which is used as a complement in
building the smart routing strategy. We denote Bhandari's algorithm by
SP_B in place of SP_x.
Thereon, we consider SP_B as a function specified to compute 2
link-disjoint paths with the least total cost for any given pair of
nodes (i.e. src and dst) or flow_set. To distinguish between the two
returned paths of SP_B, we denote the first path by flow_b1 and the
second, disjoint, one by flow_b2. The time complexity of SP_B differs
from that of SP_D; it is polynomial and equivalent to
O((K + 1) · |E| + |V| log |V|). The pseudo code of smart routing is
given in Algorithm 2, in which flow_b1 is initially selected to
represent the primary path for each flow in the network.
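Bhandari's algorithm is not reimplemented here, but the role SP_B plays,
returning two link-disjoint paths for a given (src, dst) pair, can be
approximated with networkx's edge-disjoint path routine. Note that
edge_disjoint_paths maximises the number of disjoint paths rather than
minimising their total cost, so this is only a rough stand-in for SP_B,
for illustration.

```python
from itertools import islice
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 4), (1, 4), (2, 4), (1, 3)])  # toy topology

def sp_b_like(G, src, dst, k=2):
    """Return up to k edge-disjoint paths between src and dst.

    Approximation of SP_B: networkx maximises the number of edge-disjoint
    paths, whereas Bhandari's algorithm minimises their total cost.
    """
    return list(islice(nx.edge_disjoint_paths(G, src, dst), k))

primary, backup = sp_b_like(G, 1, 4)  # flow_b1 (primary) and flow_b2 (disjoint backup)
print(primary, backup)
```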
The network controller will then start listening to the predic-
tion module, which will be discussed in the next section, for the
Algorithm 2: Smart routing (SR).

Input: Network topology G(V, E), M
Output: PF_R ≈ ∅
1  ∀ flow ∈ Flow: Set Primary Path as flow_b1, flow_b1 ← SP_B(flow_set)
2  if M = {m} then
3      PF_L ← ē_ij
4  end
5  foreach ē_ij ∈ PF_L do
6      Compute: PF_R
7  end
8  EBC_ē_ij = PF_R_len / Flow_len
9  Risk_ē_ij = p(ē_ij) × EBC_ē_ij
10 if Risk_ē_ij ≥ Risk_T_ω then
11     do
12         OF_Install(flow_b2), flow_b2 ← SP_B(flow_set)
13         OF_Remove(flow_b1), flow_b1 ← SP_B(flow_set)
14     while PF_R ≠ ∅;
15     Wait: t_p
16     if ē_ij ∈ F then
17         Mark as: TP
18         LF ← PF_R
19     else
20         Mark as: FP
21         do
22             OF_Install(flow_b1), flow_b1 ← SP_B(flow_set)
23             OF_Remove(flow_b2), flow_b2 ← SP_B(flow_set)
24         while PF_R ≠ ∅;
25     end
26 end
27 PF_R = ∅
28 if [e_ij ∈ F ∧ e_ij ∉ M] ∨ [e_ij ∈ F ∧ e_ij ∈ M ∧ Risk_ē_ij < Risk_T_ω] then
29     Mark as: FN
30     Call Algorithm 1
31 end
32 if Link repair reported then
33     Call Algorithm 1
34 end
potential future incidents. When a new message (m) is received, the
controller first constructs the potential failed list, which contains
the link that is expected to fail in the near future, as described in
lines 2-4. Secondly, the route (or routes) that might be affected
according to the predicted failure message is computed as a preparatory
step towards replacing them (lines 5-7). After identifying the routes
that may possibly fail, the EBC of the predicted link is calculated as
a step towards measuring the risk (lines 8-10). If the risk value is
below the risk threshold, the prediction information is ignored and no
action is taken. Otherwise, the flow entries of the newly computed
disjoint path from the second step are installed using the Install
command. This is done by giving the disjoint path rules lower priority
than the primary path, to avoid conflicts in the match-and-action
process. Following this step, the forwarding rules of the risky primary
paths need to be deleted in order to use TCAM resources efficiently.
This is done in a similar way to the installation, but with the Remove
command, as demonstrated in lines 11-14. After swapping the primary
path due to an expected failure, this action is considered the correct
decision for a certain period of time (i.e. t_p), as indicated in line
15. To examine the soundness of the route-changing decision, the link
that was anticipated to go down within t_l is compared against the
failure set F. If the link is found there, the prediction is marked as
TP, each flow ∈ PF_R is labeled as sub-optimal, and the reconfigured
paths are stored in LF (lines 16-18). Otherwise, the prediction is
considered FP, and in such a case it is necessary to reset the primary
path to its initial (i.e. optimal) state, as detailed in lines 19-25.
When a failure occurs that was not captured by the prediction module,
the case is considered FN and the failure is tackled by calling
Algorithm 1, as outlined in lines 28-30. Finally, Algorithm 1 is also
invoked when a failed link is repaired (lines 32-34).
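A condensed sketch of the decision taken in lines 5-14 of Algorithm 2:
given an alarm for link ē_ij with failure probability p, the controller
derives the consequence from the fraction of flows traversing that link,
multiplies it by p, and swaps the affected flows onto their pre-computed
disjoint backups only when the risk reaches the threshold. Function and
variable names are illustrative, not the framework's actual API.

```python
def handle_alarm(predicted_link, p_fail, flows, backups, risk_threshold, install, remove):
    """Risk-gated proactive rerouting (sketch of Algorithm 2, lines 5-14).

    flows:   dict flow_id -> list of (u, v) links of the primary path
    backups: dict flow_id -> list of (u, v) links of the disjoint backup path
    install/remove: callables that push/delete flow entries (e.g. via OpenFlow)
    """
    link = frozenset(predicted_link)
    # PF_R: flows whose primary path traverses the predicted link (lines 5-7).
    pf_r = [fid for fid, path in flows.items()
            if link in {frozenset(e) for e in path}]

    ebc = len(pf_r) / len(flows)   # consequence, Eq. (8) / line 8
    risk = p_fail * ebc            # risk, Eq. (6) / line 9
    if risk < risk_threshold:
        return []                  # low risk: ignore the alarm, keep primary paths

    for fid in pf_r:               # lines 11-14: install backup first, then remove primary
        install(fid, backups[fid])
        remove(fid, flows[fid])
    return pf_r                    # flows now on sub-optimal but safe paths
```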
6.3. Prediction module
In this work, this module is placed on top of the parsed network
topology state obtained from the network controller, as a consequence
of the lack of historical data. We consider each link as an independent
object of a link class. The link class contains a set of attributes,
currently eight, as shown in Fig. 6. The link attributes are used to
control the up and down events. In the current implementation, we use a
priority queue, Q, as a pool that holds all the non-faulty links. On
one hand, Eqs. (3) and (4) are essential for computing the two static
attributes (MTBF and MTTR) of each link. For (3), we rely on the
topology information in Section 7.3 and assume that CC equals the
minimum cable length in the network, while for (4) we use a uniform
distribution to generate γ for each link independently. On the other
hand, the six remaining attributes are described as follows (a
simplified sketch of such a link object is given after the list):
ID: a unique numerical value (i.e. 1, 2, ..., n) assigned to the link
to represent the link identification number.
F_Count: a counter holding the number of times the link has failed.
Length: the link's length in km, derived from the topology
specification.
Next_F: the next time to failure of the link, which controls the
process of moving the link into and out of Q. In other words, this
attribute determines the link's life span in Q; the link is dequeued
when its Next_F reaches zero.
Probability_F: registers the current failure probability, p, of the
link. For instance, the Probability_F of link j is obtained as
F_Count(ID_j) / Σ_{i=1}^{n} F_Count(ID_i) × 100, where n is the length
of Q.
Status: reflects the current state of the link, either operational or
faulty.
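A simplified sketch of such a link object is given below; the class is a
stand-in for the framework's own link class, and probability_f follows
the formula given above for link j (its failure count over the total
failure count of all links in Q).

```python
from dataclasses import dataclass

@dataclass
class Link:
    ID: int                       # unique link identification number
    Length: float                 # link length in km, from the topology specification
    MTBF: float                   # mean time between failures, hours (Eq. 3)
    MTTR: float                   # mean time to recover, hours (Eq. 4)
    F_Count: int = 0              # number of times the link has failed so far
    Next_F: float = 0.0           # remaining time until the next (generated) failure
    Status: str = "operational"   # "operational" or "faulty"

def probability_f(link, queue):
    """Current failure probability (%) of a link relative to all links in Q."""
    total = sum(l.F_Count for l in queue)
    return 100.0 * link.F_Count / total if total else 0.0
```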
On this basis, we place our online predictor scheme, represented by
Algorithm 3, on top of the priority queue in order to send encapsulated
messages about the links that satisfy the following two conditions (as
described in lines 2-9):

The probability of failure is greater than or equal to the threshold
T_φ.
The lead time (i.e. t_l) is less than or equal to the next time to
failure.

Failure Decision (FD) is a Boolean function that randomly generates
True and False values for each link that satisfies the threshold
condition T_φ. When FD is True, a failure event is generated by putting
the current link, i.e., the link at Q_ptr, down once t_l is satisfied;
when FD is False, no failure event is generated. Algorithm 3 is used
only for evaluation purposes, so that True and
Fig. 6. Representation of links in priority queue.
Algorithm 3: Alarm message generator (M).

Input: G(V, E)
Output: M
1  while (Q ≠ ∅) do
2      if Probability_F(Q_ptr) ≥ T_φ then
3          if FD = True then
4              Go To: 8
5          else
6              Go To: 15
7          end
8          Compute: t_l
9          if Next_F(Q_ptr) ≥ t_l then
10             Wait: Next_F(Q_ptr) − t_l
11             Generate: (m, ē_ij(Q_ptr))
12         else
13             t_l is not satisfied
14         end
15     end
16     Wait: Next_F(Q_ptr) = 0
17 end
False alarms can be made. Hence, the actual link failure prediction
method is outside the scope of this paper.
7. Experimental design and implementation
Since smart routing aims to enhance SDN fault tolerance in the context
of network service availability, we have implemented some metrics for a
fair comparison between the traditional SDN and the proposed system. We
also show in this section the network topologies adopted in our
experiments.
7.1. Availability measurements
Consider the convergence time required to shift from a failed or
non-operational path to an alternative or backup one, which conforms
with Eq. (1). This convergence process inevitably damages the network
availability and causes service unavailability, because the affected
path is unavailable to the service for a certain amount of time, as
demonstrated in Table 4. In order to identify the serviceable flows,
denoted by "Yes", and the unserviceable flows, denoted by "No", with
respect to some failure events, we formulate this problem as follows:

(flow ⊆ Q) ⟹ flow → Yes
(flow ⊄ Q) ⟹ flow → No

where "Yes" and "No" can be obtained by intersecting each flow ∈ Flow
with Q. A flow is subject to "Yes" when all its constituent edges
reside in Q; otherwise, the flow is considered unserviceable and
subject to "No". Knowing the numbers of serviceable and unserviceable
flows, the service unavailability, and thus the service availability,
can be measured.
The service unavailability of SDN (U_SDN) over a given time interval
with a certain number of failure events, denoted by ev, can be obtained
as follows:

U_SDN(Flow, G) = ( Σ_{i=1}^{ev} |{flow ∈ Flow → No}| ) / (ev × Flow_len)    (9)

For smart routing it is important to further consider the impact of the
Recall value. Hence, the service unavailability of SR (U_SR) can be
obtained as:

U_SR(Flow, G) = (1 − Recall) × U_SDN(Flow, G)    (10)

Consequently, the availability A_x, with x = SDN or SR, is:

A_x = 1 − U_x    (11)
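A direct transcription of Eqs. (9)-(11): given the number of
unserviceable ("No") flows observed at each failure event, the baseline
unavailability, the SR unavailability discounted by Recall, and the
resulting availability follow immediately. The counts in the example are
invented.

```python
def u_sdn(unserviceable_per_event, total_flows):
    """Eq. (9): mean fraction of unserviceable flows over all failure events."""
    ev = len(unserviceable_per_event)
    return sum(unserviceable_per_event) / (ev * total_flows)

def u_sr(u_sdn_value, recall):
    """Eq. (10): SR unavailability, discounted by the predictor's Recall."""
    return (1.0 - recall) * u_sdn_value

def availability(u):
    """Eq. (11): A_x = 1 - U_x."""
    return 1.0 - u

# Example: 4 failure events over 50 flows, with 5, 8, 3 and 6 flows down respectively.
u_base = u_sdn([5, 8, 3, 6], 50)
print(availability(u_base), availability(u_sr(u_base, recall=0.3)))
```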
7.2. Routing instability measurements
In traditional networks, routing protocols (e.g. IGP [61]) perform two
routing changes in reaction to every single failure: one when the
failure occurs and another when it is repaired. Both changes are
essential for QoS: the first serves service availability, while the
goal of the second is to return from the backup (i.e. sub-optimal) path
to the primary (i.e. optimal) path. In contrast, the SDN architecture
brings centralisation and programmability to the scene, and traditional
distributed protocols are independent of it. Maintaining the optimal
path (e.g. minimum hops in our case) of each flow requires a
continuously adaptive strategy that replaces each sub-optimal flow with
the optimal one after it becomes serviceable. To do so, we assume that
each alternative flow is additionally stored in LF, as mentioned in
Section 6.
For an SDN, the routing flaps (denoted by RF) can be measured by means
of link up (denoted by u_f) and link down (denoted by d_f) events as
follows:

RF_SDN = Σ_{flow ∈ LF} u_f + Σ_{flow ∈ F_R} d_f    (12)
On one hand, according to (12), after each link down event a new route
is required for each flow ∈ F_R, which leads to a first routing change
for each such flow. On the other hand, after each link up announcement,
the controller needs to check the state of each labeled flow in LF to
determine whether it is still the optimal choice. If so, no change is
made; otherwise, rerouting is required, which results in another
routing change.

Fig. 7. Flow chart of routing flaps.
However, for the smart routing mechanism it is also necessary to
consider the three prediction parameters (i.e. FN, TP and FP), as
follows:

RF_SR = Σ_{flow ∈ F_R} FN_f + Σ_{flow ∈ PF_R} TP_f + Σ_{flow ∈ PF_R} FP_f + Σ_{flow ∈ LF} u_f    (13)
According to (13), FN_f is equivalent to d_f in (12), as it reflects
the actual failure events that were not captured by the prediction
module, while the remaining terms are as follows:

Each true prediction leads to a first reroute flap, which gives the
advantage of avoiding an upcoming failure event. The second flap is
similar to the RF_SDN scenario, through inserting the flow into LF, and
the next flap builds upon the link restoration u_f.
Each false prediction leads to two useless flaps: one when the
prediction triggers an alarm, in which case each potential flow is
added to the Temporary Labeled Flow set (TLF) as a transient step
before the prediction is recognised as false, and a second flap
performed when t_p expires.
We provide a detailed overview of the process of measuring the number
of routing flaps in the flow chart of Fig. 7, which also shows how LF
is adjusted under the two algorithms, i.e., Algorithms 1 and 2. Since
all actions are associated with the link state, in this work we utilise
the OpenFlow protocol to reflect the changing state of data plane
links, relying on the Loss of Signal (LoS) that detects link failures
through OpenFlow PORT-STATUS messages. In addition, the proposed
prediction module produces further information about potential
failures. Both the LoS and the prediction information are delivered to
the network controller through the Updater in order to apply the
appropriate action, as illustrated in the above flow chart.

Fig. 8. Experimental topologies.

Table 5
Topologies' characteristics.

Topology     Nodes   Edges   Min len(e_ij)   Max len(e_ij)
janos-us     26      42      145 km          1127 km
germany50    50      88      36 km           236 km
waxman       70      140     15 km           1099 km
7.3. Simulated network topologies
In order to evaluate the proposed method, we have modelled 3
network topologies, as depicted in Fig. 8. Both (a) janos-us and (b)
germany50 represent real network topology instances defined in [62],
while (c) waxman is a synthetic topology created by the Internet
topology generator Brite [63] using the well-known Waxman model [64].
Waxman's model is a geographical approach that connects the distributed
routers in a plane on the basis of the distance among them, which is
given by the following formula:

P({v_i, v_j}) = β exp( −d(v_i, v_j) / (L · α) )    (14)

where 0 < α, β ≤ 1, d(v_i, v_j) represents the distance between v_i and
v_j, and L represents the maximum distance between any two given nodes.
The number of links among the generated nodes is directly proportional
to the value of α, while the edge distance increases when the value of
β is incremented. We used Brite to generate a network topology that is
large-scale in comparison to the others (e.g. number of edges or nodes
≥ 100). The characteristics of all the modelled topologies are detailed
in Table 5.
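The Waxman connection probability of Eq. (14) can be evaluated as below;
the distances and the α, β values are arbitrary illustrations of the
parameters Brite exposes when generating the waxman topology.

```python
import math

def waxman_probability(d, L, alpha, beta):
    """Eq. (14): probability of connecting two routers at distance d,
    where L is the maximum distance between any two nodes."""
    return beta * math.exp(-d / (L * alpha))

# Example: two routers 300 km apart in a plane whose farthest pair is 1100 km apart.
print(waxman_probability(d=300.0, L=1100.0, alpha=0.15, beta=0.2))
```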
7.4. Implementation
In order to validate our approach, the proposed framework is built on top of the POX controller. The implementation code of the current framework is made available on the GitHub platform [65]. The proposed framework is evaluated using the container-based emulator Mininet [66]. Mininet is a widely used emulation system, as evidenced in a recent survey [4], for evaluating and prototyping SDN protocols and applications. It can also be used to create realistic virtual networks, running real kernel, switch and application code, on a single machine (VM, cloud or native).
Fig. 9. Flow diagram of a link's life cycle in the Queue.
Our experiments were designed based on the topologies illustrated in the preceding section. Since one of our experimental topologies was designed via BRITE, we utilised the Fast Network Simulation Setup (FNSS) [67]. FNSS is a Python-based toolchain that facilitates the setup of network experiments. It provides a wide range of functions and adapters that allow network researchers to parse graphs from different topology generators (e.g. BRITE) in order to be compatible and/or to interface with other simulator/emulator tools, such as Mininet.
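A rough sketch of this tool chain is given below; the function names follow the FNSS and Mininet documentation, while the file name, link parameters and controller address are placeholders rather than the settings used in our experiments.

```python
import fnss
from mininet.net import Mininet
from mininet.node import RemoteController

# Parse a BRITE-generated topology and hand it to Mininet; the file name,
# link parameters and controller address are illustrative assumptions.
topology = fnss.parse_brite('waxman.brite')
fnss.set_capacities_constant(topology, 100, 'Mbps')   # uniform link capacities
fnss.set_delays_constant(topology, 2, 'ms')           # uniform propagation delays

topo = fnss.to_mininet(topology, relabel_nodes=True)  # FNSS -> Mininet adapter
net = Mininet(topo=topo,
              controller=lambda name: RemoteController(name, ip='127.0.0.1', port=6633))
net.start()
net.pingAll()
net.stop()
```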
Based on the failure event model (Section 4.2), general reliability theory [68] has been utilised to generate failure events, using the exponential distribution (mean = MTBF) for the next time to failure of each link and a lognormal distribution with

$$\mu = \log(\mathrm{MTTR}) - 0.5\,\log\!\left(1 + \frac{(0.6 \times \mathrm{MTTR})^2}{\mathrm{MTTR}^2}\right), \qquad \sigma = \sqrt{\log\!\left(1 + \frac{(0.6 \times \mathrm{MTTR})^2}{\mathrm{MTTR}^2}\right)}$$

for the time to recover. Regarding failure anticipation, false and true positives have been generated during the simulated time using the uniform distribution
following the specified threshold value. Fig. 9 summarises the sim-
ulated link queuing system that is correlated to the two metrics
of reliability, i.e., MTBF and MTTR. In order to dispatch the prediction information, which is essential to the SR module, the distributed messaging framework ZeroMQ [69] was exploited to carry the alarm messages, M, from the prediction module to the network controller interface; depending on the network flow conditions, this activates the SR module to begin a possible reconfiguration. In the emulation environment, we employed two servers: one acts as the OpenFlow controller and the other simulates the network topologies. Each server runs Ubuntu v.14.04 LTS with an Intel Core-i5 CPU and 8 GB of RAM.
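For completeness, the NumPy sketch below shows how failure traces of the kind described above can be drawn, with exponentially distributed times to failure (mean MTBF) and lognormally distributed repair times whose μ and σ follow the expressions given earlier; the MTBF and MTTR values are placeholders, not the settings used in our experiments.

```python
import numpy as np

# Sketch of the failure-trace generation described above; MTBF and MTTR values
# are placeholders, not the settings used in our experiments.
rng = np.random.default_rng(1)

MTBF = 24 * 3600.0   # assumed mean time between failures, in seconds
MTTR = 600.0         # assumed mean time to repair, in seconds

# Lognormal parameters so that the repair time has mean MTTR and standard
# deviation 0.6 * MTTR, matching the mu and sigma expressions given above.
v = np.log(1 + (0.6 * MTTR) ** 2 / MTTR ** 2)
mu, sigma = np.log(MTTR) - 0.5 * v, np.sqrt(v)

def link_events(horizon_s):
    """Yield (failure_time, restoration_time) pairs for one link."""
    t = 0.0
    while True:
        t += rng.exponential(MTBF)           # next time to failure
        if t >= horizon_s:
            return
        downtime = rng.lognormal(mu, sigma)  # time to recover
        yield t, min(t + downtime, horizon_s)
        t += downtime

for fail, restore in link_events(48 * 3600.0):
    print(f"link down at {fail:,.0f} s, restored at {restore:,.0f} s")
```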
8. Comparison and key advantages of SR
In this section, we present a comparison and evaluation of the proposed method against the default SDN technique (i.e. BR). To do so, the study has been conducted on the three topologies summarised in Table 5. To simulate the three topologies, we ran the emulator for 144 h in total, i.e., each experimental topology was simulated in the system for 48 h. Fig. 10 shows the results obtained from the three topologies based on the parameter settings T = 0.25, T_ω = 0.1, t_l = 120 s and t_p = 30 s. Keep in mind, as discussed earlier, that the T and T_ω values can be selected by the network operator or by using additional algorithms (e.g. machine learning) to identify near-optimal values. Since the main goal of smart routing is to enhance the network service availability, we plot, for each network, the service availability percentage (Y-axis) achieved by the BR and SR mechanisms against the rate of routing flaps (X-axis). Furthermore, for SR, the performance of the online failure predictor, represented by the values of Recall and Precision, is reported for each topology. In fact, the Recall value has a crucial impact on the service availability in the SR scheme, whereas the Precision value affects the number of unnecessary routing changes.
On the one hand, it can be clearly observed that SR outperformed BR in terms of network service availability for all test cases. In spite of the low Recall values (i.e. 0.2–0.3), there is still a gain in service availability. It can also be observed that the janos-us topology gained the highest improvement in service availability, because its Recall value is greater than that of the other topologies.
On the other hand, the rate of routing flaps generated by SR is always higher than that of BR. This disadvantage comes as a trade-off for improving the network service availability. Given that routing instability in the form of unnecessary flaps is correlated with the Precision value, we measured only the useless flaps that were generated during the simulation time for each topology, as shown in Fig. 11. Fig. 11(a) shows the unnecessary routing changes reported based on the FP rate of each topology, where each single FP is associated with two useless flaps, that is, one for the reconfiguration and the other for the reversion. Fig. 11(b) shows the percentage of useless routing flaps for each topology in comparison with the total number of flaps; in the worst-case scenario the useless flaps did not exceed 25%. Although the janos-us topology has the highest Precision value, it yielded a relatively high percentage of useless flaps. This is because the number of links in this topology is low, and hence it is highly likely that each single link is associated with a large number of routes, in contrast to the other two topologies. It is also clearly evident that the online failure prediction plays a significant role in both service availability (through TP) and routing flaps (through FP). Based upon the experiments and simulations, we have some observations, as follows:
• Some alternative routes are considered optimal after receiving an update message, even though the received update does not involve any link on their conforming path. The reason is that the current system defines the optimal path based on the number of hops; therefore, every alternative path that has the same number of hops as the optimal one is also considered optimal (a minimal illustration is sketched after this list). This might not be the case if the adopted routing constraint is not the number of hops, i.e., when a specified cost function with different parameters such as bandwidth, congestion, energy, etc. is used.
• In some cases it is barely possible to find two disjoint paths; therefore, if a path faces two successive failure alarms on its forming links, no change will be made. Hence, the output of Algorithm 2 uses a weaker relation than strict equality (=), to imply that an entirely empty PF_R cannot always be guaranteed.
• It is also possible that a flow in the LF faces one or more risky links; in such a case the state of the entangled flow will remain the same (i.e. sub-optimal).
• In some cases, when Next_F < 2 min, the controller will ignore the prediction if it is generated, as t_l is not satisfied in such cases and therefore the controller will not have enough time for the reconfiguration.

Fig. 10. Routing flaps and service availability.
Fig. 11. Routing instability measurements.
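The snippet below illustrates the first observation with NetworkX (which our prototype already relies on [58]); the toy graph and node names are purely illustrative: with hop count as the metric, several paths tie for optimality, so an update on a link outside the installed path can leave that path "optimal".

```python
import networkx as nx

# Illustration of the first observation: with hop count as the metric, every
# path tying the minimum hop count is treated as optimal, so an update on a
# link outside the installed path can still leave that path "optimal".
G = nx.Graph()
G.add_edges_from([("s", "a"), ("a", "d"),
                  ("s", "b"), ("b", "d"),
                  ("s", "c"), ("c", "d")])

optimal_paths = list(nx.all_shortest_paths(G, "s", "d"))
print(optimal_paths)   # three equal-cost 2-hop alternatives between s and d
```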
9. Conclusion and future directions
This paper has demonstrated the promise of using online failure prediction to enhance SDN service availability. Although network service availability is a well-established research area, its implications for OpenFlow networks have received limited attention. We presented Smart Routing to tackle the problem of data plane link failures in SDNs. The proposed approach differs from existing contributions by allowing the SDN controller a time window in which to reconfigure the network before an anticipated link failure takes place. With such an approach, the interruption of network services caused by link failures can be reduced, which brings significant benefits to network service availability. We demonstrated how the proposed model can be implemented using two new algorithms that extract the risky links from the currently used paths, so that none of these paths is affected when a risky link fails. The performance of the proposed approach was tested and evaluated through extensive simulation experiments on real and synthetic network topologies, driven by the link failure event model. The experimental findings clearly show the effectiveness of the proposed method in enhancing SDN service availability. However, the flap rate that can result from failure prediction may lead to network instability, especially when it reaches a high level. For this reason, we measured the percentage of unnecessary routing flaps; in the worst-case scenario the rate was 25%, which we consider reasonable in practice.
As future work, we will position the study in the setting of machine learning and signal processing, so that the decision is made according to an optimal threshold value for the probability of failure. We are also planning to extend this study to consider disaster situations where drastic failure scenarios can lead to multiple link failures with severe degradation of network availability and high packet loss rates. In such scenarios, one needs to consider different, possibly less predictable, metrics of failure. Additionally, we plan to consider more complex scenarios covering not only link failures but also other forms of failure, e.g. controller, node and application failures.
Declaration of Competing Interest
We have no conflicts of interest to declare.
CRediT authorship contribution statement
Ali Malik: Conceptualization, Data curation, Investigation,
Methodology, Project administration, Software, Validation, Writ-
ing - original draft. Benjamin Aziz: Conceptualization, Supervision,
Writing - review & editing, Investigation, Validation. Mo Adda: Su-
pervision, Data curation, Writing - review & editing, Investigation,
Validation. Chih-Heng Ke: Data curation, Investigation, Validation,
Software, Writing - review & editing.
Acknowledgements
The authors thank the anonymous reviewers for their valuable
and thoughtful comments.
Supplementary material
Supplementary material associated with this article can be
found, in the online version, at doi: 10.1016/j.comnet.2020.107104 .
References
[1] P. Lin , J. Bi , H. Hu , T. Feng , X. Jiang , A quick survey on selected approaches for
preparing programmable networks, in: Proceedings of the 7th Asian Internet
Engineering Conference, ACM, 2011, pp. 160 –16 3 .
[2] N. Feamster , J. Rexford , E. Zegura , The road to SDN: an intellectual history of
programmable networks, ACM SIGCOMM Comput. Commun. Rev. 44 (2) (2014)
87–98 .
[3] N. McKeown , T. Anderson , H. Balakrishnan , G. Parulkar , L. Peterson , J. Rexford ,
J. Turner , OpenFlow: enabling innovation in campus networks, ACM SIGCOMM
Comput.
Commun. Rev. 38 (2) (2008) 69–74 .
[4] D. Kreutz, F.M. Ramos, P.E. Verissimo, C.E. Rothenberg, S. Azodolmolky, S. Uhlig, Software-defined networking: a comprehensive survey, Proc. IEEE 103 (1) (2015) 14–76.
[5] J.C. Laprie , Dependability: basic concepts and terminology, in: Dependability:
Basic Concepts and Terminology, Springer, Vienna, 1992, pp. 3–245 .
[6] J. Ai , Z. Guo , H. Chen , G. Cheng , Improving the routing security in software-de-
fined networks, IEEE Commun. Lett. 23 (5) (2019) 838–841 .
[7] T. Wang, Z. Guo, H. Chen, W. Liu, BWManager: mitigating denial of service attacks in software-defined networks through bandwidth prediction, IEEE Trans. Netw. Serv. Manag. 15 (4) (2018) 1235–1248.
[8] J.A. Wickboldt , W. P. De Jesus , P. H. Isolani , C.B. Both , J. Rochol , L.Z. Granville ,
Software-defined networking: management requirements and challenges, IEEE
Commun. Mag. 53 (1) (2015) 278–285 .
[9] I.F. Akyildiz , A. Lee , P. Wang , M. Luo , W. Chou , Research challenges for traffic
engineering in software defined networks, IEEE Netw. 30 (3) (2016) 52–58 .
[10] G. Iannaccone, C.N. Chuah, R. Mortier, S. Bhattacharyya, C. Diot, Analysis of link failures in an IP backbone, in: Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement, ACM, pp. 237–242.
[11] F. da Rocha , C. Paulo , E.S. Mota , A survey on fault management in software-de-
fined networks, IEEE Commun. Surv. Tutor. 19 (4) (2017) 2284–2321 .
[12] T. Hu , P. Yi , Z. Guo , J. Lan , Y. Hu , Dynamic slave controller assignment for en-
hancing control plane robustness in software-defined networks, Future Gener.
Comput. Syst. 95 (2019) 681–693 .
[13] Z. Guo, W. Feng, S. Liu, W. Jiang, Y. Xu, Z.L. Zhang, Retroflow: maintaining
control resiliency and flow programmability for software-defined WANs, 2019,
arXiv: 1905.03945 .
[14] A. Markopoulou, G. Iannaccone, S. Bhattacharyya, C.N. Chuah, C. Diot, Characterization of failures in an IP backbone, in: INFOCOM 2004. Twenty-Third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 4, IEEE, 2004, pp. 2307–2317.
[15] A. Markopoulou , G. Iannaccone , S. Bhattacharyya , C.N. Chuah , Y. Ganjali ,
C. Diot , Characterization of failures in an operational IP backbone network,
IEEE/ACM Trans. Netw. 16 (4) (2008) 749–762 .
[16] I.F. Akyildiz , A. Lee , P. Wang , M. Luo , W. Chou , A roadmap for traffic engineer-
ing in SDN-OpenFlow networks, Comput. Netw. 71 (2014) 1–30 .
[17] J.P. Vasseur, M. Pickavet, P. Demeester, Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS, Elsevier, 2004.
[18] J. Kempf, E. Bellagamba, A. Kern, D. Jocha, A. Takács, P. Sköldström, Scalable fault management for OpenFlow, in: Communications (ICC), 2012 IEEE International Conference, IEEE, 2012, pp. 6606–6610.
[19] A . Sgambelluri , A . Giorgetti , F. Cugini , F. Paolucci , P. Castoldi , OpenFlow-based
segment protection in ethernet networks, J. Opt. Commun. Netw. 5 (9) (2013)
1066–1075 .
[20] Z. Guo , R. Liu , Y. Xu , A. Gushchin , A. Wal id , H.J. Chao , STAR: preventing
flow-table overflow in software-defined networks, Comput. Netw. 125 (2017)
15–25 .
[21] Z. Guo, Y. Xu, R. Liu, A. Gushchin, K.Y. Chen, A. Walid, H.J. Chao, Balancing flow table occupancy and link utilization in software-defined networks, Future Gener. Comput. Syst. 89 (2018) 213–223.
[22] S. Sharma, D. Staessens, D. Colle, M. Pickavet, P. Demeester, Enabling fast failure recovery in OpenFlow networks, in: Design of Reliable Communication Networks (DRCN), 2011 8th International Workshop on the, IEEE, 2011, pp. 164–171.
[23] D. Staessens, S. Sharma, D. Colle, M. Pickavet, P. Demeester, Software defined networking: meeting carrier grade requirements, in: Local & Metropolitan Area Networks (LANMAN), 2011 18th IEEE Workshop on, IEEE, 2011, pp. 1–6.
[24] S. Sharma , D. Staessens , D. Colle , M. Pickavet , P. Demeester , OpenFlow: meet-
ing carrier-grade recovery requirements, Comput. Commun. 36 (6) (2013)
656–665 .
[25] H. Kim, M. Schlansker, J.R. Santos, J. Tourrilhes, Y. Turner, N. Feamster, Coronet: fault tolerance for software defined networks, in: Network Protocols (ICNP), 2012 20th IEEE International Conference on, IEEE, 2012, pp. 1–2.
[26] M. Luo, Y. Zeng, J. Li, W. Chou, An adaptive multi-path computation framework for centrally controlled networks, Comput. Netw. 83 (2015) 30–44.
[27] Y. Jinyao , Z. Hailong , S. Qianjun , L. Bo , G. Xiao , HiQoS: an SDN-based multipath
QoS solution, China Commun. 12 (5) (2015) 123–133 .
[28] C. Rotsos , N. Sarrar , S. Uhlig , R. Sherwood , A.W. Moore , OFLOPS: an open
framework for OpenFlow switch evaluation, in: International Conference on
Passive and Active Network Measurement, Springer, Berlin Heidelberg, 2012,
pp. 85–95 .
[29] X. Jin, H.H. Liu, R. Gandhi, S. Kandula, R. Mahajan, M. Zhang, R. Wattenhofer, Dynamic scheduling of network updates, in: ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, ACM, 2014, pp. 539–550.
[30] G. Bianchi , M. Bonola , A. Capone , C. Cascone , OpenState: programming plat-
form-independent stateful OpenFlow applications inside the switch, ACM SIG-
COMM Comput. Commun. Rev. 44 (2) (2014) 44–51 .
[31] A. Capone, C. Cascone, A.Q. Nguyen, B. Sanso, Detour planning for fast and reliable failure recovery in SDN with OpenState, in: Design of Reliable Communication Networks (DRCN), 2015 11th International Conference on the, IEEE, 2015, pp. 25–32.
[32] C. Cascone, L. Pollini, D. Sanvito, A. Capone, B. Sanso, SPIDER: fault resilient SDN pipeline with recovery delay guarantees, in: NetSoft Conference and Workshops (NetSoft), 2016 IEEE, IEEE, 2016, pp. 296–302.
[33] S.A. Astaneh , S.S. Heydari , Optimization of SDN flow operations in multi-failure
restoration scenarios, IEEE Trans. Netw. Serv. Manag. 13 (3) (2016) 421–432 .
[34] A. Malik , B. Aziz , M. Adda , C.H. Ke , Optimisation methods for fast restoration
of software-defined networks, IEEE Access 5 (2017) 16111–16123 .
[35] A. Malik, B. Aziz, C.H. Ke, H. Liu, M. Adda, Virtual topology partitioning towards an efficient failure recovery of software defined networks, in: Machine Learning and Cybernetics (ICMLC), 2017 International Conference on, IEEE, pp. 646–651.
[36] A. Malik, B. Aziz, A. Al-Haj, M. Adda, Software-defined networks: a walkthrough guide from occurrence to data plane fault tolerance (No. e27624v1), PeerJ Preprints, 2019.
[37] S.S. Lee, K.Y. Li, K.Y. Chan, G.H. Lai, Y.C. Chung, Software-based fast failure recovery for resilient OpenFlow networks, in: Reliable Networks Design and Modeling (RNDM), 2015 7th International Workshop on, IEEE, 2015, pp. 194–200.
[38] M. Desai , T. Nandagopal , Coping with link failures in centralized control plane
architectures, in: Communication Systems and Networks (COMSNETS), 2010
Second International Conference on, IEEE, 2010, pp. 1–10 .
[39] S.S. Lee, K.Y. Li, K.Y. Chan, G.H. Lai, Y.C. Chung, Path layout planning and software based fast failure detection in survivable OpenFlow networks, in: Design of Reliable Communication Networks (DRCN), 2014 10th International Conference on the, IEEE, 2014, pp. 1–8.
[40] E.W. Dijkstra, A note on two problems in connexion with graphs, Numer. Math. 1 (1) (1959) 269–271.
[41] B. Vidalenc , L. Ciavaglia , L. Noirie , E. Renault , Dynamic risk-aware routing for
OSPF networks, in: Integrated Network Management (IM 2013), 2013 IFIP/IEEE
International Symposium
on, IEEE, 2013, pp. 226–234 .
[42] A. Medem, R. Teixeira, N. Feamster, M. Meulle, Joint analysis of network incidents and intradomain routing changes, in: Network and Service Management (CNSM), 2010 International Conference on, IEEE, 2010, pp. 198–205.
[43] C. Labovitz , G.R. Malan , F. Jahanian , Origins of internet routing instability, in:
INFOCOM’99. Eighteenth Annual Joint Conference of the IEEE Computer and
Communications Societies. Proceedings. IEEE, IEEE, 1999, pp. 218–226 .
[44] F. Salfner , M. Lenk , M. Malek , A survey of online failure prediction methods,
ACM Comput. Surv. (CSUR) 42
(3) (2010) 10 .
[45] A. Medem , R. Teixeira , N. Usunier , Predicting critical intradomain routing
events, in: Global Telecommunications Conference (GLOBECOM 2010), 2010
IEEE, IEEE, 2010, pp. 1–5 .
[46] R.S. Mangoubi , Robust Estimation and Failure Detection: A Concise Treatment,
Springer Science & Business Media, 2012 .
[47] L. Lu , M. Zhang , Edge betweenness centrality, in: Encyclopedia of Systems Bi-
ology, Springer, New York, NY., 2013, pp. 647–648 .
[48] U. Brandes, On variants of shortest-path betweenness centrality and their generic computation, Soc. Netw. 30 (2) (2008) 136–145.
[49] S. Szwaczyk, K. Wrona, M. Amanowicz, Applicability of risk analysis methods to risk-aware routing in software-defined networks, in: 2018 International Conference on Military Communications and Information Systems (ICMCIS), IEEE, 2018, pp. 1–7.
[50] S. De Maesschalck , D. Colle , I. Lievens , M. Pickavet , P. Demeester , C. Mauz ,
J. Derkacz , Pan-european optical transport networks: an availability-based
comparison, Photonic Netw. Commun. 5 (3) (2003) 203–225 .
[51] A.J. Gonzalez, B.E. Helvik, Characterisation of router and link failure processes in UNINETT's IP backbone network, Int. J. Space Based Situated Comput. 7 (1) (2012) 3–11.
[52] S. Kaplan , B.J. Garrick , On the quantitative definition of risk, Risk Anal. 1 (1)
(1981) 11–27 .
[53] B. Chandrasekaran, T. Benson, Tolerating SDN application failures with LegoSDN, in: Proceedings of the 13th ACM Workshop on Hot Topics in Networks, ACM, 2014, p. 22.
[54] POX Wiki, [Online]. Available: https://openflow.stanford.edu/display/ONL/POX+Wiki.
[55] A. Shalimov , D. Zuikov , D. Zimarina , V. Pashkov , R. Smeliansky , Advanced study
of SDN/OpenFlow controllers, in: Proceedings of the 9th Central & Eastern Eu-
ropean Software Engineering Conference in Russia, ACM, 2013, p. 1 .
[56] W.Y. Huang, J.W. Hu, S.C. Lin, T.L. Liu, P.W. Tsai, C.S. Yang, J.J. Mambretti, Design and implementation of an automatic network topology discovery system for the future internet across different domains, in: Advanced Information Networking and Applications Workshops (WAINA), 2012 26th International Conference on, IEEE, 2012, pp. 903–908.
[57] Att/pox, Accessed on July 15, 2019. [Online]. Available: https://github.com/att/pox/blob/master/pox/openflow/discovery.py.
[58] D.A. Schult, P. Swart, Exploring network structure, dynamics, and function using NetworkX, in: Proceedings of the 7th Python in Science Conference (SciPy 2008), 2008, pp. 11–16.
[59] A. Malik, B. Aziz, M. Adda, Towards filling the gap of routing changes in software-defined networks, in: Proceedings of the Future Technologies Conference, Springer, Cham, 2018, pp. 682–693.
[60] R. Bhandari , Survivable Networks: Algorithms for Diverse Routing, Springer
Science & Business Media, 1999 .
[61] S. Poretsky, B. Imhoff, K. Michielsen, Terminology for benchmarking link-state
IGP data-plane route convergence (no. RFC 6412), 2011.
[62] SNDlib library, [Online]. Availa ble: http://sndlib.zib.de .
[63] A. Medina, A. Lakhina, I. Matta, J. Byers, BRITE: an approach to universal topology generation, in: Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2001. Proceedings. Ninth International Symposium on, IEEE, 2001, pp. 346–353.
[64] B.M. Waxman , Routing of multipoint connections, IEEE J. Sel. Areas Commun.
6 (9) (1988) 1617–1622 .
[65] SDN proactive fault handling, Accessed on July 15, 2019. [Online]. Available: https://github.com/Ali00/SDN-Smart-Routing.
[66] B. Lantz, B. Heller, N. McKeown, A network in a laptop: rapid prototyping for software-defined networks, in: Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks, ACM, 2010, p. 19.
[67] L. Saino, C. Cocora, G. Pavlou, A toolchain for simplifying network simulation setup, in: Proceedings of the 6th International ICST Conference on Simulation Tools and Techniques, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2013, pp. 82–91.
[68] M. Ohring , J.R. Lloyd , Reliability and failure of electronic materials and devices,
Academic Press, 2009 .
[69] ZeroMQ, [Online]. Available: http://zeromq.org/ .
Ali Malik is a Postdoctoral Researcher at the School of Electrical and Electronic Engineering, Technological University Dublin, Ireland. He received his B.Sc. degree in computer science from Al-Qadisiyah University, Iraq, in 2009. He also holds an M.Sc. degree in information technology from BAMU University, India, in 2012. He obtained his Ph.D. degree in computer science from the University of Portsmouth, United Kingdom, in 2019. His current research interests include software-defined networks, routing, fault management, risk and cybersecurity.
Benjamin Aziz is a Senior Lecturer in Computer Security at the School of Computing, University of Portsmouth, United Kingdom. He holds a Ph.D. degree in formal verification of computer security from Dublin City University (DCU), Ireland, obtained in 2003. He has worked in the past as a postdoctoral researcher at Imperial College London and Rutherford Appleton Laboratory, in areas related to security engineering of large-scale systems, formal design and analysis, requirements engineering and digital forensics. He serves on the program committees of several conferences and working groups, such as ERCIM's FMICS, STM, Cloud Security Alliance and IFIP WG11.3. His research interests include formal modelling, security, computer forensics, risk management, software engineering, IoT and software-defined networks.
Mo Adda is a Principal Lecturer in computer networks at the School of Computing, University of Portsmouth. He received the Ph.D. degree in distributed systems and parallel processing from the University of Surrey. He was a Senior Lecturer with the University of Richmond, where he taught programming, computer architecture, and networking for ten years. From 1999 to 2002, he was a Senior Software Engineer developing software and managing projects on simulation and modelling. He has been researching parallel and distributed systems since 1987. His research interests include software-defined networks, wireless sensor networks, mobile networks and business process modelling, simulation of mobile agent technology and security.
Chih-Heng Ke is an Associate Professor with the Department of Computer Science and Information Engineering, National Quemoy University, Kinmen, Taiwan. He received the B.S. and Ph.D. degrees in electrical engineering from National Cheng-Kung University in 1999 and 2007, respectively. His current research interests include multimedia communications, wireless networks, and software-defined networks.