Smart routing: Towards proactive fault handling of software-defined
networks
Ali Malik a,*, Benjamin Aziz b, Mo Adda b, Chih-Heng Ke c

a School of Electrical and Electronic Engineering, Technological University Dublin, Dublin D08 NF82, Ireland
b School of Computing, Buckingham Building, University of Portsmouth, Portsmouth PO1 3HE, United Kingdom
c Department of Computer Science and Information Engineering, National Quemoy University, Kinmen 892, Taiwan
Article info

Article history:
Received 8 January 2019
Revised 26 December 2019
Accepted 13 January 2020
Available online 16 January 2020

Keywords:
OpenFlow
Software-defined networking
Fault management
Risk management
Service availability

Abstract
In recent years, the emerging paradigm of software-defined networking has become a hot and thriving
topic in both the industrial and academic sectors. Software-defined networking offers numerous benefits
against legacy networking systems by simplifying the process of network management through reducing
the cost of network configurations. Currently, data plane fault management is limited to two mecha-
nisms: proactive and reactive . These fault management and recovery techniques are activated only after a
failure occurrence and hence packet loss is highly likely to occur. This is due to convergence time where
new network paths will need to be allocated in order to forward the affected traffic rather than drop it.
Such convergence leads to temporary service disruption and unavailability. Practically, not only the speed
of recovery mechanisms affects the convergence, but also the delay caused by the process of failure de-
tection. In this paper, we define a new approach for data plane fault management in software-defined
networks where the goal is to eliminate the convergence process altogether rather than accelerate the
failure detection and recovery. We propose a new framework, called Smart Routing , which allows the net-
work controller to receive forewarning signs on failures and hence avoid risky paths before the failure
incidents occur. The proposed approach aims to decrease service disruption, which in turn increases net-
work service availability. We validate our framework through a set of experiments that demonstrate how
the underlying model runs and its impact on improving service availability. We demonstrate the applicability
of the new framework on three types of topologies, covering both real and simulated networks.
©2020 Elsevier B.V. All rights reserved.
1. Introduction
Concern about Internet ossification, a consequence of the growing
variety of networks (e.g. IoT, WSN, Cloud, etc.) that serve up to 9 billion
users around the globe, has led to efforts to replace the existing rigid
network infrastructure with a programmable one [1]. In this context,
Software-Defined Networking (SDN) has emerged as a promising solution
for tackling the inflexibility of legacy networking systems.
In fact, SDN is part of a long history of attempts that aim to lower
the barrier of deploying new innovations and make the network
more programmable. For more on the history of programmable
networks, we refer the interested readers to [2] . Unlike traditional
IP networks, SDN architectures consist of three planes: control, data
and application. The control plane, sometimes called the controller,
represents the network brain; it provides the essential functions and
exerts granular control by relying on a global view of the network
topology, a crucial feature that was missing in the past. The data plane
comprises the network forwarding elements that constitute the network
topology. These forwarding elements are dictated by the network
controller, and therefore all nodes need to disclose their status
periodically to the controller, which is how the global view is obtained.
In general, many studies classify SDN into two layers, considering the
application plane as a complementary part of the control layer that
solves various kinds of network issues such as firewalling and load
balancing. So far, OpenFlow [3] is the most widely used protocol that
enables the controller to govern the SDN data plane by carrying the
forwarding rules, as well as to enable the exchange of signals between
the two planes. SDN is no longer confined to the academic field; it has
also attracted industry players such as Google and Microsoft [4].

* Corresponding author.
E-mail addresses: ali.malik@tudublin.ie, up714266@myport.ac.uk (A. Malik),
benjamin.aziz@port.ac.uk (B. Aziz), mo.adda@port.ac.uk (M. Adda),
smallko@gmail.com (C.-H. Ke).
Fig. 1. The compartmentalisation of dependability concept, Laprie [5] .
Nowadays, communication networks play a vital role in everyday human
activities, as they represent the backbone of modern technologies.
Networking equipment is failure-prone and therefore aspects such as
availability, reliability and fault management are necessary. The term
dependability is the umbrella that encompasses the aforementioned
aspects, and Fig. 1 gives an overall picture of the dependability
taxonomy according to [5]. This paper focuses on the availability
attribute, by means of fault tolerance and forecasting of SDN link
failures.
Although SDNs have brought many benefits and significant network
improvements, some new challenges that accompany this innovation, such
as security [6,7] and recovery from failure, still need to be addressed
thoroughly in order to maximise the utility of SDNs [8,9]. A data plane
link failure is a manifold problem, because the controller first needs
to be notified about the failure and then to compute alternative routes
in order to update the affected forwarding elements. Network link
failures are not a recent phenomenon; they occur in everyday operation
with varying durations and causes [10]. However, the new OpenFlow
architecture requires more investigation in order to eliminate the
challenges that hamper its growth.
In order to maximise the service availability in SDNs, we define
a new approach that minimises the percentage of service unavail-
ability by using online failure prediction. This allows the network
controller to perform the necessary reconfiguration prior to the oc-
currence of failure incidents. Although a number of works on SDN
fault management have been proposed, none of them has exploited
the global view of SDNs in the context of failure prediction. With
this context in mind, we can summarise the main contributions of
this paper as follows:
A new network model that allows for the forecasting of link
failures by predicting their characteristics in an online fashion.
This model also demonstrates how link failure prediction can
be integrated into the process of proactive restoration with the
aid of risk analysis.
We provide an implementation of the new model in terms of
a couple of fault tolerance algorithms. We use simulation tech-
niques to test the efficiency of these algorithms. Our simulation
results prove that the proposed model and algorithms improve
the service availability of SDNs.
The rest of the paper is organised as follows. Section 2 in-
troduces various SDN fault management techniques from the lit-
erature. The problem statement is highlighted in Section 3. We
then present our network model in Section 4, the risk analysis in
Section 5, and the proposed framework in Section 6. Sections 7 and 8
present the experimental procedure, the observed results and a
comparison. Finally, the summary of this paper is provided in
Section 9 with some future directions.
2. Related work
Link failures often occur as part of everyday network operation. Due
to their importance and the negative impact they have on network
Quality of Service (QoS), a considerable amount of research has been
conducted to analyse, characterise, evaluate and recover from this
frequent issue. The physical separation of the control plane from the
data plane results in two independent entities, both of which are
susceptible to failure incidents. According to [11], control plane
failures are more severe than other failures, owing to the significant
role of the network controller in managing all network activities. For
more details about control plane failures, we refer the interested
readers to [12,13]. In this paper, however, we focus on data plane link
failures only.
Communication networks are prone either to unintentional (unplanned)
failures, due to various causes such as human errors, natural disasters
like earthquakes, overload, software bugs and cable cuts, or to
intentional (planned) failures caused by maintenance [14,15]. A failure
recovery scheme is a necessary requirement for any networking system to
ensure reliability and service availability. Generally, the failure
recovery mechanisms of carrier-grade networks are categorised into two
types: proactive and reactive. In case of a link failure, the resilience
mechanism of an SDN ought to redirect the affected flows in order to
avoid the location of the failure and keep the system working despite
the abnormal situation. The SDN controller can mask data plane failures
either proactively or reactively [16], and each approach has its pros
and cons. In this section, we discuss current efforts to tackle data
plane link failures.
2.1. Proactive
In the proactive approach, also known as protection, the alternative
paths are preplanned and reserved in advance (before a failure occurs).
According to [17], there are three protection schemes that can be
applied to recover from network failure:

One to One (1:1): one protection path is dedicated to protect exactly
one working path.
One to Many (1:Y): one protection path is dedicated to protect up to Y
working paths.
Many to Many (X:Y): X protection paths are dedicated to protect up to Y
working paths, such that X ≤ Y.

The authors in [18] implemented an OpenFlow monitoring function for
achieving fast data plane recovery. In [19], another protection method
was proposed using the OpenFlow-based Segment Protection (OSP) scheme.
The main disadvantage of this strategy is that it consumes the data
plane storage capability, since the more flow entries (i.e. rules) are
stored, the more space is used; current OpenFlow appliances on the
market can only accommodate up to 8000 flow entries due to the
limitation of Ternary Content-Addressable Memory (TCAM), hence such
solutions are costly [4,16], an issue discussed in studies such as
[20,21]. In addition, installing many attributes in the OpenFlow
forwarding elements can degrade the match-and-action process of the
data plane nodes. Moreover, there is no guarantee that the reserved
backups are failure-free; in other words, the backup path might fail
before the primary one, resulting in a waste of space and time.
2.2. Reactive
In the reactive approach, also called restoration, the alternative
paths are not preplanned and are calculated dynamically when a failure
occurs. The authors in [22,23] presented an OpenFlow restoration method
to recover from single-link failures; however, their experiments were
only conducted on small-scale network topologies not exceeding 14
nodes. In [24], the authors demonstrated through extensive experiments
that OpenFlow restoration is not easily attainable within 50 ms,
especially for large-scale networks, unless the protection technique is
used.
In the same context, some works have utilised the concept of multiple
disjoint paths employed as backups. For example, CORONET [25] is a
fault tolerance system for SDNs in which multiple link failures can be
resolved. The ADaptive Multi-Path Computation Framework (ADMPCF) [26]
for large-scale OpenFlow networks is a traffic engineering tool that
can hold two or more disjoint paths to be utilised when certain network
events occur (e.g. link failure). HiQoS [27] is a traffic engineering
tool aiming at better QoS for SDNs; it computes multiple paths (at
least two constrained paths) between all possible pairs in a network,
so that quick recovery from a link failure is attainable. Most of the
existing works did not take into account the processing time of the
flow entries (e.g. insert, delete and modify) that need to be updated.
Although the performance of OpenFlow devices depends on their
manufacturer specification, the authors in [28] state that a single
flow entry insertion takes between 0.5 ms and 10 ms, while at least
11 ms is required to modify a single rule, since each modification
includes both the deletion of the old rule and the insertion of the new
one [29]. A number of studies, such as [30-32], used the OpenState
mechanism to recover from data plane failures without depending on the
network controller, thereby reducing the load on the controller and
speeding up the recovery process. However, such approaches are still
inapplicable, as existing OpenFlow equipment does not support this
customisation.
Unlike the existing works, the authors in [33] dealt with the problem
of minimising the flow entry update time required to divert traffic
from an affected primary path to a backup one. Although the presented
algorithms do not guarantee the end-to-end shortest path, they open a
new direction that is worth exploring. Within the same context, the
authors in [34] produced new algorithms that minimise the required
update time by reducing the solution search space from source to
destination in the affected path. Similarly, in [35] an approach that
divides the network topology into non-overlapping cliques was proposed
to tackle the failure issue in a local rather than global manner. Both
[34,35] took into account the time required to compute the alternative
route in order to speed up the update operation. However, the main
issue with the last three works is that they do not secure the shortest
path from source to destination.
2.3. Summary
In summary, previous studies have produced different methods to tackle
the problem of data plane recovery from link failure incidents; for
more details about SDN fault management we refer the interested readers
to the recent surveys [11,36]. Ultimately, protection techniques are
not ideal due to TCAM space exhaustion, whereas latency is the major
drawback of the existing restoration methods. As a result, SDN fault
management still needs more research and investigation.
3. Problem statement
Distinctly, the existing SDN fault management techniques get involved
only after a failure has occurred. Thus, they cannot prevent a certain
impact on traffic flows, such as service unavailability. This problem
occurs due to the delay of the convergence scheme T_C. We define T_C as
the time required to amend a path in response to a failure scenario.
Typically, the convergence time in SDN is a combination of three
factors:

Failure detection time (T_D): the time required to detect a failure
incident. Compared with conventional networking systems, the
centralised management and global view of SDN ease this task by
continuously monitoring the network status and receiving notifications
upon failures. However, the speed of receiving a notification sometimes
depends on the nature of the network design and the mode of
communication (in-band or out-of-band) [37,38]. According to [39], the
link failure detection time ranges from tens to hundreds of
milliseconds, and may also depend on the type of commercial OpenFlow
switch.

New route computation time (T_SP): the time spent when the network
controller runs a nominated shortest path routing algorithm (e.g.
Dijkstra [40]) to compute the backup path (usually for reactive fault
tolerance strategies). The T_SP computation time can reach tens of
milliseconds [34], depending on how big the network is.

Flow entries update time (T_Update): the time required to update the
relevant switches, i.e., the nodes involved in the affected path. This
factor depends on how many forwarding rules need to be updated after
the failure scenario, where the time for every single rule can exceed
10 ms.
Accordingly, the resulting convergence time can be calculated through
the following equation:

T_C = T_D + T_SP + Σ_{src}^{dst} T_Update    (1)
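To make Eq. (1) concrete, the following minimal sketch (in Python, the
language the framework itself is built in) estimates the convergence time
for a single affected path. The timing values are illustrative assumptions
taken from the ranges quoted above, not measurements.

```python
# Illustrative sketch of Eq. (1): T_C = T_D + T_SP + sum of per-switch rule updates.
# All timing constants are assumptions drawn from the ranges discussed in the text.

def convergence_time_ms(t_detect_ms, t_sp_ms, per_rule_update_ms, affected_switches):
    """Estimate the convergence time (ms) for one affected path.

    affected_switches: number of switches between src and dst whose
    flow entries must be updated after the failure.
    """
    t_update_total = per_rule_update_ms * affected_switches
    return t_detect_ms + t_sp_ms + t_update_total

# Example: 50 ms detection, 10 ms path computation, 11 ms per rule, 6 switches.
print(convergence_time_ms(50, 10, 11, 6))  # -> 126 ms of service disruption
```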
Currently, the classical SDN fault management methods aim to tackle the
failure after its occurrence; the recovery mechanism is activated after
the moment of failure, and hence all the previous proposals incur a
certain amount of delay according to (1). The only way to completely
overcome the three factors of (1) altogether is to handle the failure
before it occurs. Therefore, failure prediction is required to provide
awareness of potential future incidents and to allow the controller to
perform reconfiguration actions that override failures before they
damage some paths. Although a number of studies have put effort into
the area of failure prediction, none of the traditional (except the
work in [41]) or new-generation networking systems has exploited the
information that can be gained from a prediction method to eliminate
network incidents (e.g. link failures). To the best of our knowledge,
Vidalenc et al. [41] is the only realistic study that discussed the
advantages of failure prediction, by producing a risk-aware routing
method for legacy IP networks. Our work differs from theirs by building
a realistic framework of proactive failure management for SDNs. In this
work, we combine the concept of online failure prediction with risk
analysis towards maximising the network service availability.
4. The proposed model
Anticipating failures before they occur is a promising approach for
further enhancing SDN failure management techniques, i.e., the
proactive and reactive ones, in which the controller responds to
failures only when they take place. The proposed SDN model
Table 1
List of notations.

Symbol   Description
src      Source router
dst      Destination router
A        Service availability
U        Service unavailability
e_ij     Link traversing any two arbitrary routers i and j
Q_ptr    A pointer that points to the first e_ij in the Queue
F        Failed link set
F_R      Failed/affected route set
PF_L     Potential failed link set
PF_R     Potential failed route set
M        Prediction alarm message
CO       Network controller
T_φ      Threshold of failure probability
T_ω      Threshold of risk
OF       OpenFlow instruction
TP       True positive
FN       False negative
FP       False positive
CC       Cable cuts per year
SP_x     Any shortest path algorithm x in terms of hops
for anticipating link failure events is presented in this section. We
start first by outlining some of the notations we use in the rest of
the paper, as shown in Table 1 . The network topology is modelled
as an undirected graph G = (V, E), where V represents the finite set of
vertices (i.e. routers) in G, ranged over by {v_i, v_j, ..., v_z} with
{i, j, ..., z} ⊆ {1, ..., n} for n ∈ N, and E represents the finite set
of bidirectional edges (i.e. links) in G, denoted by {e_ij}, where each
e_ij ∈ E is an edge that connects v_i and v_j. We define the following
operational test function (OP) over a link, which reflects the link
state:

OP(e_ij) = 1 if the link is operational, 0 otherwise

Therefore F can be defined as follows:

F = {e_ij | e_ij ∈ E ∧ OP(e_ij) = 0}

Based on G, we define a path P as a sequence of consecutive vertices
representing routers in the network. Each path starts from a source
router, src, and ends with a destination router, dst:

P = (src, ..., dst)

We define the set Flow to represent all demand traffic flows that need
to be serviced. Each flow ∈ Flow is an instance of P, associated with a
particular traffic demand defined by a unique src and dst. We consider
flow_set to be the set of all possible paths between src and dst that
can be derived from G, defined as follows:

flow_set = {P | (first(P) = src) ∧ (last(P) = dst)}

and the definitions of first and last are given as functions on any
general sequence (a_1, ..., a_n):

first((a_1, ..., a_n)) = a_1
last((a_1, ..., a_n)) = a_n

We also consider P_set as the set containing all the admissible paths
that can be constructed from G; this means P ∈ P_set and therefore
Flow ⊆ P_set. When a link failure is reported in G, we identify the
affected routes as follows:

F_R = {flow | flow ∈ Flow ∧ ∃ v_i, v_j . v_i, v_j ∈ flow ∧ OP(e_ij) = 0}

In the same context, we also consider the case when there is a link
failure prediction message m_i ∈ M, where the set M is denoted by
{m_i}_{i=1..n} and each m_i ∈ M is defined as m_i = (ē_ij, t), where t
is the time at which the system receives m_i. In this context, we
define:

PF_L = {ē_ij | ē_ij ∈ E ∧ ∃ m_i . m_i = (ē_ij, t) ∧ m_i ∈ M}

to characterise the reported link, where the bar notation ē_ij is a
shorthand indicating that e_ij ∈ PF_L has the state of potential to
fail and hence does not belong to F. Now, we can define the potential
to fail route set as follows:

PF_R = {f̄low | f̄low ∈ Flow ∧ (∃ ē_ij . ē_ij ∈ f̄low ∧ ē_ij ∈ PF_L)}

where f̄low is a flow that contains at least one ē_ij; in other words,
f̄low ∩ PF_L ≠ ∅.
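As a minimal sketch of the sets defined above, the snippet below uses
networkx (the same library the framework relies on in Section 6.2) to
derive F_R and PF_R for a toy topology; the topology, the flows and the
failed/predicted links are invented for illustration only.

```python
import networkx as nx

# Toy topology and demand flows (illustrative only).
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 4), (1, 4), (2, 4)])
Flow = [nx.shortest_path(G, 1, 3), nx.shortest_path(G, 1, 4)]  # one path per (src, dst)

def links_of(path):
    """Set of undirected links traversed by a path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

def affected_routes(flows, failed_links):
    """F_R: flows that traverse at least one failed link."""
    failed = {frozenset(l) for l in failed_links}
    return [f for f in flows if links_of(f) & failed]

def potential_failed_routes(flows, predicted_links):
    """PF_R: flows that traverse at least one link named in a prediction message."""
    predicted = {frozenset(l) for l in predicted_links}
    return [f for f in flows if links_of(f) & predicted]

print(affected_routes(Flow, [(2, 3)]))          # flows broken by a failure of e_23
print(potential_failed_routes(Flow, [(1, 4)]))  # flows at risk if e_14 is predicted to fail
```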
4.1. SDN predictive model
All the previous efforts that dealt with data plane failures have
succeeded in mitigating the impact of failures (e.g. reducing the
downtime) rather than attempting to obviate their effect, such as
minimising the service unavailability. Network incidents that cause
routing instability, i.e., flaps, and lead to significant degradation
of network service availability vary [42,43]; in our case, however, we
are only concerned with data link failures. By relying on monitoring
techniques, some failures can be predicted through failure tracking,
syndrome monitoring and error reporting [44]. Consequently, a set of
conditions can be defined as a base to trigger a failure warning when
at least one of the predefined conditions is satisfied. The following
simple form illustrates rule-based failure prediction:

if condition_1 then warning trigger
...
if condition_n then warning trigger
Online failure prediction strategies vary, and include machine learning
techniques (e.g. the k-nearest neighbour algorithm [45]) and statistical
analysis methods (e.g. time series [44], Kalman and Wiener filters
[46]). Such techniques can be used to predict incoming events in the
short term by relying on the past and current state information of a
system. However, in this paper we do not intend to propose a failure
prediction solution, as extensive studies have been conducted in this
field with remarkable achievements. Instead, employing online failure
prediction as a technique to enrich current SDN fault management is one
of the main aims of this work. A generic overview of the time relations
of online failure prediction is presented in Fig. 2, which presumes the
following:
t_d: represents the historical data window upon which the predictor
forecasts the upcoming failure events.
t_l: represents the lead time, the time at which a failure alarm is
generated; it can also be defined as the minimum duration between the
prediction and the failure.
t_w: is the warning time, in which an action may be required to find a
new solution based on the predicted event. Therefore, t_l must be
greater than t_w so that the prediction information is serviceable. In
SDN, t_w should be at least long enough to set up the longest shortest
path in the given G.
t_p: represents the time for which the prediction is assumed to be a
valid case. This should be defined carefully by the network operator so
as to identify the true and false alarms after a certain time window
(i.e. t_p).
The quality of the failure prediction is usually evaluated by two
parameters, FP and FN, whereas Recall and Precision are the two
well-known metrics used to measure the overall performance:

Recall = TP / (TP + FN),    Precision = TP / (TP + FP)    (2)
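The two metrics of Eq. (2) can be computed directly from the TP/FP/FN
counters the framework keeps (cf. Table 2); a minimal sketch with
invented counts:

```python
def recall(tp, fn):
    # Fraction of failures that actually occurred and were captured by the predictor.
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    # Fraction of raised alarms that corresponded to real failures.
    return tp / (tp + fp) if (tp + fp) else 0.0

# Example: 3 captured failures, 1 missed failure, 2 false alarms.
print(recall(3, 1), precision(3, 2))  # 0.75 0.6
```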
Fig. 2. Online failure prediction and time relations, Salfner et al. [44].
Table 2
Controller actions based on prediction.

Prediction   Action
TP           Select an alternative route
FP           Unnecessary/needless action
FN           Call the standard failure recovery
Fig. 3. Relation between prediction and failure sets.
Recall is defined as the ratio of accurately captured failures to the
total number of failures that actually occurred, while Precision is
defined as the ratio of correctly classified failures to the total
number of positive predictions. Correspondingly, SDN controller actions
are now associated with the predicted and unpredicted situations as
listed in Table 2.
On one hand, every false failure alarm leads to an unnecessary
reconfiguration of a particular set of routes in Flow, which causes
unwitting network instability. On the other hand, the controller needs
to deal with undetected failures in a similar way to the classical
methods. Consequently, the more precise the prediction behaviour, the
higher the network stability and service availability that will be
gained. Fig. 3 shows the relation between the network model and the
predictive model.
4.2. Failure event model
We have implemented an approach for generating failure events, as it is
very difficult to find a public network dataset that includes useful
details such as failures; hence, we adopted the alternative of
developing our own failure model. This research intends to enhance SDN
fault tolerance and resilience by maximising the network service
availability. Two basic metrics are exploited in this model: the mean
time between failures (MTBF) and the mean time to recover (MTTR), which
are essential for calculating the availability and reliability of each
repairable network component [5,50]. MTBF is defined as the average
time for which a particular component functions before failing,
calculated as:

MTBF = (start_down_time − start_up_time) / number of failures

while MTTR is the average time required to repair a failed component.
Each component, i.e., link, is characterised by its own values of MTBF
and MTTR, which are commonly independent of other components in the
network. As a consequence of the lack of real data, some metrics (such
as cable length and CC) can be used instead for measuring the two
availability metrics. According to [50], MTBF can be calculated as
follows:

MTBF(hours) = (CC × 365 × 24) / Cable Length    (3)
For instance, when CC is equal to 100 km, it means that per 100 km
there will be on average one cut per year. Besides this, the MTTR of a
link is influenced by its length [51], reflecting the fact that a
longer link has a higher MTTR value. On this basis, we designed the
following formula for calculating the MTTR value of each link in the
network:

MTTR(hours) = γ × Cable Length    (4)

where γ is a parameter indicating the time required to fix the cable,
measured in hours per kilometre. Because links are physically
distributed in different locations and environments, γ differs from one
link to another; even if some links have the same length, their γ could
be different, as it depends on the physical location and the ambient
conditions. Further discussion is given in Section 6.
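A short sketch of Eqs. (3) and (4); the link length, the CC value and the
γ range are illustrative assumptions (the framework itself draws γ from a
uniform distribution per link, Section 6.3).

```python
import random

def mtbf_hours(cable_length_km, cc_km):
    """Eq. (3): mean time between failures in hours, given one cable cut per cc_km per year."""
    return cc_km * 365 * 24 / cable_length_km

def mttr_hours(cable_length_km, gamma_h_per_km):
    """Eq. (4): mean time to recover in hours; gamma is the repair time per kilometre."""
    return gamma_h_per_km * cable_length_km

link_length = 600.0                    # km, illustrative
gamma = random.uniform(0.005, 0.02)    # assumed range of hour/km repair rates
print(mtbf_hours(link_length, cc_km=145.0), mttr_hours(link_length, gamma))
```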
5. Risk analysis
According to [52] , risk can be defined as an attempt to answer
the following three questions:
What scenario could occur?
What is the likelihood that scenario would occur?
What is the consequence if the scenario does occur?
We consider these questions towards formulating the risk of
failure in SDNs.
What scenario could occur? The scenario can be defined as any
undesirable event, such as a failure. According to [53], there are
three main types of failure scenarios that could affect an SDN
networking system: controller failure (including hardware and
software), communication component failure (i.e. node and link) and
application failure (e.g. bugs in application code). This paper
considers the scenario of link failure only. Such a scenario breaks the
service when it occurs; therefore, finding an alternative path is
necessary. We define the set of link failure scenarios as F, ranged
over by variables f_1, f_2, ..., f_n ∈ F.
What is the likelihood that the scenario would occur? The likelihood
that a failure scenario disrupts the network services is conditional on
the occurrence of the scenario. We address this question with the aid
of online failure prediction, which in our case works on the basis of a
scenario's failure probability, p. Each failure scenario is associated
with a p value that by nature ranges between 0 and 1; this is further
discussed in Section 6.
What is the consequence if the scenario does occur? We address this
question by computing the percentage of loss, or consequence, c, that
might potentially result when a failure scenario is predicted at an
early stage. Each failure scenario might lead to some disconnections
and service disruption, so the severity of the adverse effects of each
failure scenario varies. For instance, the consequence c_1 caused by
f_1 might be different from the consequence c_2 caused by f_2, which
would reflect the outage costs resulting from disrupting some of the
network connections.
Table 3
List of failure scenarios.

Scenario   Probability   Consequence
f_1        p_1           c_1
f_2        p_2           c_2
...        ...           ...
f_n        p_n           c_n

Over a period of time, these questions produce a list of outcomes as
exemplified in Table 3, where each ith row in the table can be
represented as a triplet ⟨f_i, p_i, c_i⟩.
Risk can be estimated by using such information as follows:

Risk = {⟨f_i, p_i, c_i⟩}, i = 1, 2, ..., n    (5)
Since we are considering only link failure scenarios, f(e_ij), we shall
refine the definition of risk in (5). Accordingly, we redefine the risk
as the chance of damage, determined by the combination of the
probability of link failure and its consequence:

Risk_f(e_ij) = p(e_ij) × c(e_ij)    (6)
To deduce the risk value, the two factors of (6), i.e., p and c, can be
assessed independently. On one hand, the probability, p, depends on the
efficacy of the online failure predictor at determining the likelihood
of incoming failure scenarios, which is, in this study, governed by a
selective failure probability threshold value, T_φ. On the other hand,
the consequence, c, can be measured based upon the percentage of
affected routes that would result from the anticipated scenario. In our
case, we take as the definition of such consequence one of the global
network topological characteristics, namely Edge Betweenness Centrality
(EBC) [47]. This is due to the fact that EBC is a direct indicator of
the number of paths that would fail as a consequence of the failure of
a particular link, therefore providing a natural measure of risk
consequence. The EBC of a link e_ij is the total number of shortest
paths between pairs of nodes that traverse the edge e_ij [47], which
can be formulated as follows:

EBC_e_ij = Σ_{v_i ∈ V} Σ_{v_j ∈ V} σ_{v_i v_j}(e_ij) / σ_{v_i v_j}    (7)
where σ_{v_i v_j} denotes the number of shortest paths between nodes
v_i and v_j, while σ_{v_i v_j}(e_ij) denotes the number of those
shortest paths that pass through e_ij ∈ E. For instance, Fig. 4
demonstrates an example topology with an EBC value for each link in
the network, calculated using Ulrik Brandes' algorithm [48]. The EBC of
e_12, for example, is the number of shortest paths containing the edge
divided by the number of all possible paths; therefore EBC_e_12 = 0.4,
because there are 20 possible paths in the example topology and 8 of
them pass through the edge e_12.

Fig. 4. Topology example with different EBC values.

Given that the network controller knows the demand traffic matrix
between all pairs in the network, i.e., Flow, Eq. (7) in our case is
congruent with the following:

EBC^M_e_ij = |flow_(e_ij)| / |Flow|    (8)

where |Flow| denotes the total number of paths in the Flow set, while
|flow_(e_ij)| denotes the number of paths in the Flow set that pass
through e_ij ∈ M.
With the above context in mind, the higher the EBC value of e_ij
, which is a normalised value between 0 and 1, the more critical
the link is and therefore, the higher the score indicating the con-
sequences. This is because the outcome of failure for a link with
high EBC will definitely lead to a huge number of path failures and
therefore a higher percentage of negative impacts on the availabil-
ity of network services.
Our goal in this analysis is to gauge the percentage of possi-
ble loss and provide such information to the concerned decision-
making mechanism, i.e., the routing mechanism in our case. For
more details about the existing risk analysis methods that fit SDNs,
we refer the interested readers to [49] .
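The risk score of Eq. (6) combines the predictor's failure probability
with the EBC-based consequence of Eqs. (7) and (8). The sketch below
computes both for a toy topology: networkx's edge_betweenness_centrality
gives the normalised EBC of Eq. (7), while the demand-aware variant of
Eq. (8) is computed over an assumed Flow set; the probability value
stands in for whatever the alarm message carries.

```python
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 4), (1, 4), (2, 4)])  # toy topology

# Eq. (7): topological EBC, normalised over all node pairs.
ebc = nx.edge_betweenness_centrality(G, normalized=True)

# Assumed demand: one shortest path per (src, dst) pair of interest.
Flow = [nx.shortest_path(G, 1, 3), nx.shortest_path(G, 1, 4), nx.shortest_path(G, 2, 4)]

def ebc_demand(link, flows):
    """Eq. (8): fraction of demand paths that traverse the given link."""
    link = frozenset(link)
    hits = sum(1 for p in flows if link in {frozenset(e) for e in zip(p, p[1:])})
    return hits / len(flows)

p_fail = 0.3                              # assumed probability carried by the alarm message
risk = p_fail * ebc_demand((2, 3), Flow)  # Eq. (6): risk = probability x consequence
print(ebc[(2, 3)], risk)
```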
6. The proposed framework
From a high level point of view, Fig. 5 illustrates the main com-
ponents of our proposed framework where the Smart Routing (SR)
and Prediction Model components are the primary contribution of
our work. We discuss next in more detail the components we used
and developed in this framework.
6.1. SDN controller
The proposed framework supports the POX controller [54], an open source
SDN controller written in Python that is more suitable for fast
prototyping than other available controllers such as [55]. The standard
OpenFlow protocol [3] is used for establishing the communication
between the data and control planes, whereas the set of POX APIs can be
used for developing various network control applications.
Fig. 5. Architecture of the proposed framework.
Table 4
Service availability and network flows relation.

Flow             src→dst accessibility   T_C        Notes
flow_x           Yes                     —          Path is working
flow_x ∈ F_R     No                      T_D        Path is not working
flow_x ∈ F_R     No                      T_SP       Search for alternatives
flow_x ∈ F_R     No                      T_Update   Path is restoring
flow_x           Yes                     —          Path is restored
6.2. Smart routing
Firstly, this module is responsible for maintaining and parsing the
underlying network topology. Topology parameters such as the number of
nodes and links, the way they are connected and port status can be
detected via the Link Layer Discovery Protocol (LLDP) [56], which is
one of the vital features of the current OpenFlow specification. The
openflow.discovery component [57], an already developed POX component,
is used to send specially crafted LLDP messages out of the OpenFlow
nodes so that the topological view over the data plane layer can be
built.
This module then converts the discovered network topology into a graph
G representation for efficient management purposes. To do so, we
utilised the Networkx tool [58], a pure Python package with a set of
powerful functions for manipulating network graphs. When the network
starts working, and after shaping the data plane topology, the shortest
path for each flow ∈ Flow is configured by the appointed SP_x algorithm
and thereafter stored in the Operational Routes table, which is
specified to contain all the desired working (healthy) paths.
In order to show how a link failure incident could affect the
configured paths from the perspective of service availability and
convergence time, we provide a simple example in Table 4, in which the
service deterioration of flow_x due to a link failure incident is
highlighted.
To maintain the Operational Routes table, two algorithms have been
implemented, each with its own view of how to keep Flow maintained.
Algorithm 1 depicts the baseline shortest path routing strategy
(henceforth called Baseline Routing (BR)), which is what the SDN
controller currently performs. We specify Dijkstra's algorithm [40],
with complexity O(|E| + |V| log |V|), as the shortest path finder for
Algorithm 1, and denote it by SP_D instead of SP_x. So SP_D is a
Dijkstra function that can be applied on any flow_set to return one
unique shortest path.
When the OpenFlow controller reports a link failure event, every path
suffering from that failure is detected and two operations are issued
by the controller. First, a Remove command, denoted by OF_Remove, is
sent to all the routers that belong to each failed path in Flow as a
step towards weeding out the incorrectly working entries; then an
alternative route is computed for every affected flow. The new flow
entries of the alternative path are then forwarded to the relevant
routers of each flow through the Install command, denoted by
OF_Install. Each modified flow, i.e., one assigned to an alternative,
is stored in a special set called the Labeled Flow (LF), where
LF ⊆ Flow and |LF| = n; this indicates that each flow ∈ LF is in a
sub-optimal state. The recovery from link failure is demonstrated in
lines 1-13. The algorithm also includes the reversion process after a
failure is repaired (lines 15-32), which is no less important than the
recovery process [59], and also accounts for the routing flaps that are
needed for later analysis. In fact, we developed this algorithm for
comparison purposes only, against Algorithm 2; therefore, it does not
reflect a contribution of this paper.
In contrast, Algorithm 2 is one of the main contributions of this work;
it exploits the prediction information to enhance the service
availability and the fault
Algorithm 1: Baseline routing (BR).

On Normal: ∀ flow ∈ Flow: Set Primary Path as flow, flow ← SP_D(flow_set)
On Failure: Do the following procedure
1  if Link failure reported then
2      foreach e_ij ∈ F do
3          Compute: F_R
4      end
5      do
6          OF_Remove(flow)
7          flow_set := flow_set − {flow}
8          flow := SP_D(flow_set)
9          OF_Install(flow)
10         LF ← flow
11         F_R := F_R − {flow}
12     while F_R ≠ ∅;
13 end
14 c := 0
15 if Link repair reported then
16     do
17         if flow_c is currently optimal then
18             Do nothing
19             c := c + 1
20         end
21         if flow_c is currently sub-optimal then
22             OF_Remove(flow_c)
23             flow_c := SP_D(flow_c_set)
24             OF_Install(flow_c)
25             LF := LF − {flow_c}
26             c := c + 1
27         end
28         if number of links = E_len then
29             LF := empty
30         end
31     while c ≤ LF_len;
32 end
tolerance of SDNs. This algorithm depends on Bhandari's algorithm for
finding K edge-disjoint paths [60], which is used as a complement in
building the smart routing strategy. We denote Bhandari's algorithm by
SP_B in place of SP_x.
Thereon, we consider SP_B as a function specified to compute 2
link-disjoint paths with the least total cost for any given pair of
nodes (i.e. src and dst) or flow_set. To distinguish between the two
returned paths of SP_B, we denote the first path by flow_b1 and the
second, disjoint, one by flow_b2. The time complexity of SP_B differs
from that of SP_D; it is polynomial and equivalent to
O((K + 1) · |E| + |V| log |V|). The pseudo code of smart routing is
given in Algorithm 2, in which flow_b1 is initially selected to
represent the primary path for each flow in the network.
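Bhandari's algorithm is not reimplemented here, but the role SP_B plays,
returning two link-disjoint paths for a given (src, dst) pair, can be
approximated with networkx's edge-disjoint path routine. Note that
edge_disjoint_paths maximises the number of disjoint paths rather than
minimising their total cost, so this is only a rough stand-in for SP_B,
for illustration.

```python
from itertools import islice
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 4), (1, 4), (2, 4), (1, 3)])  # toy topology

def sp_b_like(G, src, dst, k=2):
    """Return up to k edge-disjoint paths between src and dst.

    Approximation of SP_B: networkx maximises the number of edge-disjoint
    paths, whereas Bhandari's algorithm minimises their total cost.
    """
    return list(islice(nx.edge_disjoint_paths(G, src, dst), k))

primary, backup = sp_b_like(G, 1, 4)  # flow_b1 (primary) and flow_b2 (disjoint backup)
print(primary, backup)
```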
The network controller will then start listening to the predic-
tion module, which will be discussed in the next section, for the
Algorithm 2: Smart routing (SR).

Input: Network topology G(V, E), M
Output: PF_R ≈ ∅
1  ∀ flow ∈ Flow: Set Primary Path as flow_b1, flow_b1 ← SP_B(flow_set)
2  if M = {m} then
3      PF_L ← ē_ij
4  end
5  foreach ē_ij ∈ PF_L do
6      Compute: PF_R
7  end
8  EBC_ē_ij = PF_R_len / Flow_len
9  Risk_ē_ij = p(ē_ij) × EBC_ē_ij
10 if Risk_ē_ij ≥ Risk_T_ω then
11     do
12         OF_Install(flow_b2), flow_b2 ← SP_B(flow_set)
13         OF_Remove(flow_b1), flow_b1 ← SP_B(flow_set)
14     while PF_R ≠ ∅;
15     Wait: t_p
16     if ē_ij ∈ F then
17         Mark as: TP
18         LF ← PF_R
19     else
20         Mark as: FP
21         do
22             OF_Install(flow_b1), flow_b1 ← SP_B(flow_set)
23             OF_Remove(flow_b2), flow_b2 ← SP_B(flow_set)
24         while PF_R ≠ ∅;
25     end
26 end
27 PF_R = ∅
28 if [e_ij ∈ F ∧ e_ij ∉ M] ∨ [e_ij ∈ F ∧ e_ij ∈ M ∧ Risk_ē_ij < Risk_T_ω] then
29     Mark as: FN
30     Call Algorithm 1
31 end
32 if Link repair reported then
33     Call Algorithm 1
34 end
potential future incidents. When a new message (m) is received, the
controller first constructs the potential failed list, which contains
the link that is expected to fail in the near future, as described in
lines 2-4. Secondly, the route (or routes) that might be affected
according to the predicted failure message is computed as a preparatory
step towards replacing them (lines 5-7). After identifying the routes
that may possibly fail, the EBC of the predicted link is calculated as
a step towards measuring the risk (lines 8-10). If the risk value is
below the risk threshold, the prediction information is ignored and no
action is taken. Otherwise, the flow entries of the newly computed
disjoint path from the second step are installed using the Install
command. This is done by giving the disjoint path rules lower priority
than the primary path, to avoid conflicts in the match-and-action
process. Following this step, the forwarding rules of the risky primary
paths need to be deleted in order to use TCAM resources efficiently.
This is done in a similar way to the installation, but with the Remove
command, as demonstrated in lines 11-14. After swapping the primary
path due to an expected failure, this action is considered the correct
decision for a certain period of time (i.e. t_p), as indicated in line
15. To examine the soundness of the route-changing decision, the link
that was anticipated to go down within t_l is compared against the
failure set F. If the link is found there, the prediction is marked as
TP, each flow ∈ PF_R is labeled as sub-optimal, and the reconfigured
paths are stored in LF (lines 16-18). Otherwise, the prediction is
considered FP, and in such a case it is necessary to reset the primary
path to its initial (i.e. optimal) state, as detailed in lines 19-25.
When a failure occurs that was not captured by the prediction module,
the case is considered FN and the failure is tackled by calling
Algorithm 1, as outlined in lines 28-30. Finally, Algorithm 1 is also
invoked when a failed link is repaired (lines 32-34).
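A condensed sketch of the decision taken in lines 5-14 of Algorithm 2:
given an alarm for link ē_ij with failure probability p, the controller
derives the consequence from the fraction of flows traversing that link,
multiplies it by p, and swaps the affected flows onto their pre-computed
disjoint backups only when the risk reaches the threshold. Function and
variable names are illustrative, not the framework's actual API.

```python
def handle_alarm(predicted_link, p_fail, flows, backups, risk_threshold, install, remove):
    """Risk-gated proactive rerouting (sketch of Algorithm 2, lines 5-14).

    flows:   dict flow_id -> list of (u, v) links of the primary path
    backups: dict flow_id -> list of (u, v) links of the disjoint backup path
    install/remove: callables that push/delete flow entries (e.g. via OpenFlow)
    """
    link = frozenset(predicted_link)
    # PF_R: flows whose primary path traverses the predicted link (lines 5-7).
    pf_r = [fid for fid, path in flows.items()
            if link in {frozenset(e) for e in path}]

    ebc = len(pf_r) / len(flows)   # consequence, Eq. (8) / line 8
    risk = p_fail * ebc            # risk, Eq. (6) / line 9
    if risk < risk_threshold:
        return []                  # low risk: ignore the alarm, keep primary paths

    for fid in pf_r:               # lines 11-14: install backup first, then remove primary
        install(fid, backups[fid])
        remove(fid, flows[fid])
    return pf_r                    # flows now on sub-optimal but safe paths
```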
6.3. Prediction module
In this work, this module is placed on top of the parsed network
topology state obtained from the network controller, as a consequence
of the lack of historical data. We consider each link as an independent
object of a link class. The link class contains a set of attributes,
currently eight, as shown in Fig. 6. The link attributes are used to
control the up and down events. In the current implementation, we use a
priority queue, Q, as a pool that holds all the non-faulty links. On
one hand, Eqs. (3) and (4) are essential for computing the two static
attributes (MTBF and MTTR) of each link. For (3), we rely on the
topology information in Section 7.3 and assume that CC equals the
minimum cable length in the network, while for (4) we use a uniform
distribution to generate γ for each link independently. On the other
hand, the six remaining attributes are described as follows (a
simplified sketch of such a link object is given after the list):
ID: a unique numerical value (i.e. 1, 2, ..., n) assigned to the link
to represent the link identification number.
F_Count: a counter holding the number of times the link has failed.
Length: the link's length in km, derived from the topology
specification.
Next_F: the next time to failure of the link, which controls the
process of moving the link into and out of Q. In other words, this
attribute determines the link's life span in Q; the link is dequeued
when its Next_F reaches zero.
Probability_F: registers the current failure probability, p, of the
link. For instance, the Probability_F of link j is obtained as
F_Count(ID_j) / Σ_{i=1}^{n} F_Count(ID_i) × 100, where n is the length
of Q.
Status: reflects the current state of the link, either operational or
faulty.
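A simplified sketch of such a link object is given below; the class is a
stand-in for the framework's own link class, and probability_f follows
the formula given above for link j (its failure count over the total
failure count of all links in Q).

```python
from dataclasses import dataclass

@dataclass
class Link:
    ID: int                       # unique link identification number
    Length: float                 # link length in km, from the topology specification
    MTBF: float                   # mean time between failures, hours (Eq. 3)
    MTTR: float                   # mean time to recover, hours (Eq. 4)
    F_Count: int = 0              # number of times the link has failed so far
    Next_F: float = 0.0           # remaining time until the next (generated) failure
    Status: str = "operational"   # "operational" or "faulty"

def probability_f(link, queue):
    """Current failure probability (%) of a link relative to all links in Q."""
    total = sum(l.F_Count for l in queue)
    return 100.0 * link.F_Count / total if total else 0.0
```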
On this basis, we place our online predictor scheme, represented by
Algorithm 3, on top of the priority queue in order to send encapsulated
messages about the links that satisfy the following two conditions (as
described in lines 2-9):

The probability of failure is greater than or equal to the threshold
T_φ.
The lead time (i.e. t_l) is less than or equal to the next time to
failure.

Failure Decision (FD) is a Boolean function that randomly generates
True and False values for each link that satisfies the threshold
condition T_φ. When FD is True, a failure event is generated by putting
the current link, i.e., the link at Q_ptr, down once t_l is satisfied;
when FD is False, no failure event is generated. Algorithm 3 is used
only for evaluation purposes, so that True and
Fig. 6. Representation of links in priority queue.
Algorithm 3: Alarm message generator (M).

Input: G(V, E)
Output: M
1  while (Q ≠ ∅) do
2      if Probability_F(Q_ptr) ≥ T_φ then
3          if FD = True then
4              Go To: 8
5          else
6              Go To: 15
7          end
8          Compute: t_l
9          if Next_F(Q_ptr) ≥ t_l then
10             Wait: Next_F(Q_ptr) − t_l
11             Generate: (m, ē_ij(Q_ptr))
12         else
13             t_l is not satisfied
14         end
15     end
16     Wait: Next_F(Q_ptr) = 0
17 end
False alarms can be made. Hence, the actual link failure prediction
method is outside the scope of this paper.
7. Experimental design and implementation
Since smart routing aims to enhance SDN fault tolerance in the context
of network service availability, we have implemented some metrics for a
fair comparison between the traditional SDN and the proposed system. We
also show in this section the network topologies adopted in our
experiments.
7.1. Availability measurements
Consider the convergence time required to shift from a failed or
non-operational path to an alternative or backup one, which conforms
with Eq. (1). This convergence process inevitably damages the network
availability and causes service unavailability, because the affected
path is unavailable to the service for a certain amount of time, as
demonstrated in Table 4. In order to identify the serviceable flows,
denoted by "Yes", and the unserviceable flows, denoted by "No", with
respect to some failure events, we formulate this problem as follows:

(flow ⊆ Q) ⟹ flow → Yes
(flow ⊄ Q) ⟹ flow → No

where "Yes" and "No" can be obtained by intersecting each flow ∈ Flow
with Q. A flow is subject to "Yes" when all its constituent edges
reside in Q; otherwise, the flow is considered unserviceable and
subject to "No". Knowing the numbers of serviceable and unserviceable
flows, the service unavailability, and thus the service availability,
can be measured.
The service unavailability of SDN (U_SDN) over a given time interval
with a certain number of failure events, denoted by ev, can be obtained
as follows:

U_SDN(Flow, G) = ( Σ_{i=1}^{ev} |{flow ∈ Flow → No}| ) / (ev × Flow_len)    (9)

For smart routing it is important to further consider the impact of the
Recall value. Hence, the service unavailability of SR (U_SR) can be
obtained as:

U_SR(Flow, G) = (1 − Recall) × U_SDN(Flow, G)    (10)

Consequently, the availability A_x, with x = SDN or SR, is:

A_x = 1 − U_x    (11)
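A direct transcription of Eqs. (9)-(11): given the number of
unserviceable ("No") flows observed at each failure event, the baseline
unavailability, the SR unavailability discounted by Recall, and the
resulting availability follow immediately. The counts in the example are
invented.

```python
def u_sdn(unserviceable_per_event, total_flows):
    """Eq. (9): mean fraction of unserviceable flows over all failure events."""
    ev = len(unserviceable_per_event)
    return sum(unserviceable_per_event) / (ev * total_flows)

def u_sr(u_sdn_value, recall):
    """Eq. (10): SR unavailability, discounted by the predictor's Recall."""
    return (1.0 - recall) * u_sdn_value

def availability(u):
    """Eq. (11): A_x = 1 - U_x."""
    return 1.0 - u

# Example: 4 failure events over 50 flows, with 5, 8, 3 and 6 flows down respectively.
u_base = u_sdn([5, 8, 3, 6], 50)
print(availability(u_base), availability(u_sr(u_base, recall=0.3)))
```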
7.2. Routing instability measurements
In traditional networks, routing protocols (e.g. IGP [61]) perform two
routing changes in reaction to every single failure: one when the
failure occurs and another when it is repaired. Both changes are
essential for QoS: the first serves service availability, while the
goal of the second is to return from the backup (i.e. sub-optimal) path
to the primary (i.e. optimal) path. In contrast, the SDN architecture
brings centralisation and programmability to the scene, and traditional
distributed protocols are independent of it. Maintaining the optimal
path (e.g. minimum hops in our case) of each flow requires a
continuously adaptive strategy that replaces each sub-optimal flow with
the optimal one after it becomes serviceable. To do so, we assume that
each alternative flow is additionally stored in LF, as mentioned in
Section 6.
For an SDN, the routing flaps (denoted by RF) can be measured by means
of link up (denoted by u_f) and link down (denoted by d_f) events as
follows:

RF_SDN = Σ_{flow ∈ LF} u_f + Σ_{flow ∈ F_R} d_f    (12)
On one hand, according to (12), after each link down event a new route
is required for each flow ∈ F_R, which leads to a first routing change
for each such flow. On the other hand, after each link up announcement,
the controller needs to check the state of each labeled flow in LF to
determine whether it is still the optimal choice. If so, no change is
made; otherwise, rerouting is required, which results in another
routing change.

Fig. 7. Flow chart of routing flaps.
However, for the smart routing mechanism it is also necessary to
consider the three prediction parameters (i.e. FN, TP and FP), as
follows:

RF_SR = Σ_{flow ∈ F_R} FN_f + Σ_{flow ∈ PF_R} TP_f + Σ_{flow ∈ PF_R} FP_f + Σ_{flow ∈ LF} u_f    (13)
According to (13), FN_f is equivalent to d_f in (12), as it reflects
the actual failure events that were not captured by the prediction
module, while the remaining terms are as follows:

Each true prediction leads to a first reroute flap, which gives the
advantage of avoiding an upcoming failure event. The second flap is
similar to the RF_SDN scenario, through inserting the flow into LF, and
the next flap builds upon the link restoration u_f.
Each false prediction leads to two useless flaps: one when the
prediction triggers an alarm, in which case each potential flow is
added to the Temporary Labeled Flow set (TLF) as a transient step
before the prediction is recognised as false, and a second flap
performed when t_p expires.
We provide a detailed overview of the process of measuring the number
of routing flaps in the flow chart of Fig. 7, which also shows how LF
is adjusted under the two algorithms, i.e., Algorithms 1 and 2. Since
all actions are associated with the link state, in this work we utilise
the OpenFlow protocol to reflect the changing state of data plane
links, relying on the Loss of Signal (LoS) that detects link failures
through OpenFlow PORT-STATUS messages. In addition, the proposed
prediction module produces further information about potential
failures. Both the LoS and the prediction information are delivered to
the network controller through the Updater in order to apply the
appropriate action, as illustrated in the above flow chart.

Fig. 8. Experimental topologies.

Table 5
Topologies' characteristics.

Topology     Nodes   Edges   Min len(e_ij)   Max len(e_ij)
janos-us     26      42      145 km          1127 km
germany50    50      88      36 km           236 km
waxman       70      140     15 km           1099 km
7.3. Simulated network topologies
In order to evaluate the proposed method, we have modelled 3
network topologies, as depicted in Fig. 8. Both (a) janos-us and (b)
germany50 represent real network topology instances defined in [62],
while (c) waxman is a synthetic topology created by the Internet
topology generator Brite [63] using the well-known Waxman model [64].
Waxman's model is a geographical approach that connects the distributed
routers in a plane on the basis of the distance among them, which is
given by the following formula:

P({v_i, v_j}) = β exp( −d(v_i, v_j) / (L · α) )    (14)

where 0 < α, β ≤ 1, d(v_i, v_j) represents the distance between v_i and
v_j, and L represents the maximum distance between any two given nodes.
The number of links among the generated nodes is directly proportional
to the value of α, while the edge distance increases when the value of
β is incremented. We used Brite to generate a network topology that is
large-scale in comparison to the others (e.g. number of edges or nodes
≥ 100). The characteristics of all the modelled topologies are detailed
in Table 5.
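The Waxman connection probability of Eq. (14) can be evaluated as below;
the distances and the α, β values are arbitrary illustrations of the
parameters Brite exposes when generating the waxman topology.

```python
import math

def waxman_probability(d, L, alpha, beta):
    """Eq. (14): probability of connecting two routers at distance d,
    where L is the maximum distance between any two nodes."""
    return beta * math.exp(-d / (L * alpha))

# Example: two routers 300 km apart in a plane whose farthest pair is 1100 km apart.
print(waxman_probability(d=300.0, L=1100.0, alpha=0.15, beta=0.2))
```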
7.4. Implementation
In order to validate our approach, the proposed framework is built on top of the POX controller. The implementation code of the current framework is made available on the GitHub platform [65]. The proposed framework is evaluated using the container-based emulator Mininet [66]. Mininet is a widely used emulation system, as evidenced in a recent survey [4], for evaluating and prototyping SDN protocols and applications. It can also be used to create realistic virtual networks, running real kernel, switch and application code, on a single machine (VM, cloud or native).
Fig. 9. Flow diagram of a link's life cycle in the Queue.
Our experiments were designed based on the topologies illustrated in the preceding section. Since one of our experimental topologies was designed via BRITE, we utilised the Fast Network Simulation Setup (FNSS) [67]. FNSS is a Python-based toolchain that facilitates the setup of network experiments. It provides a wide range of functions and adapters that allow network researchers to parse graphs from different topology generators (e.g. BRITE) in order to be compatible and/or to interface with other simulator/emulator tools, such as Mininet.
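A rough sketch of this tool chain is given below; the function names follow the FNSS and Mininet documentation, while the file name, link parameters and controller address are placeholders rather than the settings used in our experiments.

```python
import fnss
from mininet.net import Mininet
from mininet.node import RemoteController

# Parse a BRITE-generated topology and hand it to Mininet; the file name,
# link parameters and controller address are illustrative assumptions.
topology = fnss.parse_brite('waxman.brite')
fnss.set_capacities_constant(topology, 100, 'Mbps')   # uniform link capacities
fnss.set_delays_constant(topology, 2, 'ms')           # uniform propagation delays

topo = fnss.to_mininet(topology, relabel_nodes=True)  # FNSS -> Mininet adapter
net = Mininet(topo=topo,
              controller=lambda name: RemoteController(name, ip='127.0.0.1', port=6633))
net.start()
net.pingAll()
net.stop()
```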
Based on the failure event model (Section 4.2), general reliability theory [68] has been utilised to generate failure events, using the exponential distribution (mean = MTBF) for the next time to failure of each link and a lognormal distribution with

$$\mu = \log(\mathrm{MTTR}) - 0.5\,\log\!\left(1 + \frac{(0.6 \times \mathrm{MTTR})^2}{\mathrm{MTTR}^2}\right), \qquad \sigma = \sqrt{\log\!\left(1 + \frac{(0.6 \times \mathrm{MTTR})^2}{\mathrm{MTTR}^2}\right)}$$

for the time to recover. Regarding failure anticipation, false and true positives have been generated during the simulated time using the uniform distribution
following the specified threshold value. Fig. 9 summarises the sim-
ulated link queuing system that is correlated to the two metrics
of reliability, i.e., MTBF and MTTR. In order to dispatch the prediction information, which is essential to the SR module, the distributed messaging framework ZeroMQ [69] was exploited to carry the alarm messages, M, from the prediction module to the network controller interface; depending on the network flow conditions, this activates the SR module to begin a possible reconfiguration. In the emulation environment, we employed two servers: one acts as the OpenFlow controller and the other simulates the network topologies. Each server runs Ubuntu v.14.04 LTS with an Intel Core-i5 CPU and 8 GB of RAM.
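For completeness, the NumPy sketch below shows how failure traces of the kind described above can be drawn, with exponentially distributed times to failure (mean MTBF) and lognormally distributed repair times whose μ and σ follow the expressions given earlier; the MTBF and MTTR values are placeholders, not the settings used in our experiments.

```python
import numpy as np

# Sketch of the failure-trace generation described above; MTBF and MTTR values
# are placeholders, not the settings used in our experiments.
rng = np.random.default_rng(1)

MTBF = 24 * 3600.0   # assumed mean time between failures, in seconds
MTTR = 600.0         # assumed mean time to repair, in seconds

# Lognormal parameters so that the repair time has mean MTTR and standard
# deviation 0.6 * MTTR, matching the mu and sigma expressions given above.
v = np.log(1 + (0.6 * MTTR) ** 2 / MTTR ** 2)
mu, sigma = np.log(MTTR) - 0.5 * v, np.sqrt(v)

def link_events(horizon_s):
    """Yield (failure_time, restoration_time) pairs for one link."""
    t = 0.0
    while True:
        t += rng.exponential(MTBF)           # next time to failure
        if t >= horizon_s:
            return
        downtime = rng.lognormal(mu, sigma)  # time to recover
        yield t, min(t + downtime, horizon_s)
        t += downtime

for fail, restore in link_events(48 * 3600.0):
    print(f"link down at {fail:,.0f} s, restored at {restore:,.0f} s")
```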
8. Comparison and key advantages of SR
In this section, we present a comparison and evaluation of the proposed method against the default SDN technique (i.e. BR). To do so, the study has been conducted on the three topologies summarised in Table 5. To simulate the three topologies, we ran the emulator for 144 h in total, i.e., each experimental topology was simulated in the system for 48 h. Fig. 10 shows the results obtained from the three topologies based on the parameter settings T = 0.25, T_ω = 0.1, t_l = 120 s and t_p = 30 s. Keep in mind, as discussed earlier, that the T and T_ω values can be selected by the network operator or by using additional algorithms (e.g. machine learning) to identify near-optimal values. Since the main goal of smart routing is to enhance the network service availability, we plot, for each network, the service availability percentage (Y-axis) achieved by the BR and SR mechanisms against the rate of routing flaps (X-axis). Furthermore, for SR, the performance of the online failure predictor, represented by the values of Recall and Precision, is reported for each topology. In fact, the Recall value has a crucial impact on the service availability in the SR scheme, whereas the Precision value affects the number of unnecessary routing changes.
On the one hand, it can be clearly observed that SR outperformed BR in terms of network service availability for all test cases. In spite of the low Recall values (i.e. 0.2–0.3), there is still a gain in service availability. It can also be observed that the janos-us topology gained the highest improvement in service availability, because its Recall value is greater than that of the other topologies.
On the other hand, the rate of routing flaps generated by SR is always higher than that of BR. This disadvantage comes as a trade-off for improving the network service availability. Given that routing instability in the form of unnecessary flaps is correlated with the Precision value, we measured only the useless flaps that were generated during the simulation time for each topology, as shown in Fig. 11. Fig. 11(a) shows the unnecessary routing changes reported based on the FP rate of each topology, where each single FP is associated with two useless flaps, that is, one for the reconfiguration and the other for the reversion. Fig. 11(b) shows the percentage of useless routing flaps for each topology in comparison with the total number of flaps; in the worst-case scenario the useless flaps did not exceed 25%. Although the janos-us topology has the highest Precision value, it yielded a relatively high percentage of useless flaps. This is because the number of links in this topology is low, and hence it is highly likely that each single link is associated with a large number of routes, in contrast to the other two topologies. It is also clearly evident that the online failure prediction plays a significant role in both service availability (through TP) and routing flaps (through FP). Based upon the experiments and simulations, we have some observations, as follows:
• Some alternative routes are considered optimal after receiving an update message, even though the received update does not involve any link on their conforming path. The reason is that the current system defines the optimal path based on the number of hops; therefore, every alternative path that has the same number of hops as the optimal one is also considered optimal (a minimal illustration is sketched after this list). This might not be the case if the adopted routing constraint is not the number of hops, i.e., when a specified cost function with different parameters such as bandwidth, congestion, energy, etc. is used.
• In some cases it is barely possible to find two disjoint paths; therefore, if a path faces two successive failure alarms on its forming links, no change will be made. Hence, the output of Algorithm 2 uses a weaker relation than strict equality (=), to imply that an entirely empty PF_R cannot always be guaranteed.
• It is also possible that a flow in the LF faces one or more risky links; in such a case the state of the entangled flow will remain the same (i.e. sub-optimal).
• In some cases, when Next_F < 2 min, the controller will ignore the prediction if it is generated, as t_l is not satisfied in such cases and therefore the controller will not have enough time for the reconfiguration.

Fig. 10. Routing flaps and service availability.
Fig. 11. Routing instability measurements.
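The snippet below illustrates the first observation with NetworkX (which our prototype already relies on [58]); the toy graph and node names are purely illustrative: with hop count as the metric, several paths tie for optimality, so an update on a link outside the installed path can leave that path "optimal".

```python
import networkx as nx

# Illustration of the first observation: with hop count as the metric, every
# path tying the minimum hop count is treated as optimal, so an update on a
# link outside the installed path can still leave that path "optimal".
G = nx.Graph()
G.add_edges_from([("s", "a"), ("a", "d"),
                  ("s", "b"), ("b", "d"),
                  ("s", "c"), ("c", "d")])

optimal_paths = list(nx.all_shortest_paths(G, "s", "d"))
print(optimal_paths)   # three equal-cost 2-hop alternatives between s and d
```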
9. Conclusion and future directions
This paper has demonstrated the promise of using online failure prediction to enhance SDN service availability. Although network service availability is a well-established research area, its implications for OpenFlow networks have received limited attention. We presented Smart Routing to tackle the problem of data plane link failures in SDNs. The proposed approach differs from existing contributions by allowing the SDN controller a time window in which to reconfigure the network before an anticipated link failure takes place. With such an approach, the interruption of network services caused by link failures can be reduced, which brings significant benefits to network service availability. We demonstrated how the proposed model can be implemented using two new algorithms that extract the risky links from the currently used paths, so that none of these paths is affected when a risky link fails. The performance of the proposed approach was tested and evaluated through extensive simulation experiments on real and synthetic network topologies, driven by the link failure event model. The experimental findings clearly show the effectiveness of the proposed method in enhancing SDN service availability. However, the flap rate that can result from failure prediction may lead to network instability, especially when it reaches a high level. For this reason, we measured the percentage of unnecessary routing flaps; in the worst-case scenario the rate was 25%, which we consider reasonable in practice.
As future work, we will position the study in the setting of machine learning and signal processing, so that the decision is made according to an optimal threshold value for the probability of failure. We are also planning to extend this study to consider disaster situations where drastic failure scenarios can lead to multiple link failures with severe degradation of network availability and high packet loss rates. In such scenarios, one needs to consider different, possibly less predictable, metrics of failure. Additionally, we plan to consider more complex scenarios covering not only link failures but also other forms of failure, e.g. controller, node and application failures.
Declaration of Competing Interest
We have no conflicts of interest to declare.
CRediT authorship contribution statement
Ali Malik: Conceptualization, Data curation, Investigation,
Methodology, Project administration, Software, Validation, Writ-
ing - original draft. Benjamin Aziz: Conceptualization, Supervision,
Writing - review & editing, Investigation, Validation. Mo Adda: Su-
pervision, Data curation, Writing - review & editing, Investigation,
Validation. Chih-Heng Ke: Data curation, Investigation, Validation,
Software, Writing - review & editing.
Acknowledgements
The authors thank the anonymous reviewers for their valuable
and thoughtful comments.
Supplementary material
Supplementary material associated with this article can be
found, in the online version, at doi: 10.1016/j.comnet.2020.107104 .
References
[1] P. Lin , J. Bi , H. Hu , T. Feng , X. Jiang , A quick survey on selected approaches for
preparing programmable networks, in: Proceedings of the 7th Asian Internet
Engineering Conference, ACM, 2011, pp. 160 –16 3 .
[2] N. Feamster , J. Rexford , E. Zegura , The road to SDN: an intellectual history of
programmable networks, ACM SIGCOMM Comput. Commun. Rev. 44 (2) (2014)
87–98 .
[3] N. McKeown , T. Anderson , H. Balakrishnan , G. Parulkar , L. Peterson , J. Rexford ,
J. Turner , OpenFlow: enabling innovation in campus networks, ACM SIGCOMM
Comput.
Commun. Rev. 38 (2) (2008) 69–74 .
[4] D. Kreutz, F.M. Ramos, P.E. Verissimo, C.E. Rothenberg, S. Azodolmolky, S. Uhlig, Software-defined networking: a comprehensive survey, Proc. IEEE 103 (1) (2015) 14–76.
[5] J.C. Laprie , Dependability: basic concepts and terminology, in: Dependability:
Basic Concepts and Terminology, Springer, Vienna, 1992, pp. 3–245 .
[6] J. Ai , Z. Guo , H. Chen , G. Cheng , Improving the routing security in software-de-
fined networks, IEEE Commun. Lett. 23 (5) (2019) 838–841 .
[7] T. Wang, Z. Guo, H. Chen, W. Liu, BWManager: mitigating denial of service attacks in software-defined networks through bandwidth prediction, IEEE Trans. Netw. Serv. Manag. 15 (4) (2018) 1235–1248.
[8] J.A. Wickboldt , W. P. De Jesus , P. H. Isolani , C.B. Both , J. Rochol , L.Z. Granville ,
Software-defined networking: management requirements and challenges, IEEE
Commun. Mag. 53 (1) (2015) 278–285 .
[9] I.F. Akyildiz , A. Lee , P. Wang , M. Luo , W. Chou , Research challenges for traffic
engineering in software defined networks, IEEE Netw. 30 (3) (2016) 52–58 .
[10] G. Iannaccone, C.N. Chuah, R. Mortier, S. Bhattacharyya, C. Diot, Analysis of link failures in an IP backbone, in: Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement, ACM, pp. 237–242.
[11] F. da Rocha , C. Paulo , E.S. Mota , A survey on fault management in software-de-
fined networks, IEEE Commun. Surv. Tutor. 19 (4) (2017) 2284–2321 .
[12] T. Hu , P. Yi , Z. Guo , J. Lan , Y. Hu , Dynamic slave controller assignment for en-
hancing control plane robustness in software-defined networks, Future Gener.
Comput. Syst. 95 (2019) 681–693 .
[13] Z. Guo, W. Feng, S. Liu, W. Jiang, Y. Xu, Z.L. Zhang, Retroflow: maintaining
control resiliency and flow programmability for software-defined WANs, 2019,
arXiv: 1905.03945 .
[14] A. Markopoulou, G. Iannaccone, S. Bhattacharyya, C.N. Chuah, C. Diot, Characterization of failures in an IP backbone, in: INFOCOM 2004. Twenty-Third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 4, IEEE, 2004, pp. 2307–2317.
[15] A. Markopoulou , G. Iannaccone , S. Bhattacharyya , C.N. Chuah , Y. Ganjali ,
C. Diot , Characterization of failures in an operational IP backbone network,
IEEE/ACM Trans. Netw. 16 (4) (2008) 749–762 .
[16] I.F. Akyildiz , A. Lee , P. Wang , M. Luo , W. Chou , A roadmap for traffic engineer-
ing in SDN-OpenFlow networks, Comput. Netw. 71 (2014) 1–30 .
[17] J.P. Vasseur, M. Pickavet, P. Demeester, Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS, Elsevier, 2004.
[18] J. Kempf, E. Bellagamba, A. Kern, D. Jocha, A. Takács, P. Sköldström, Scalable fault management for OpenFlow, in: Communications (ICC), 2012 IEEE International Conference, IEEE, 2012, pp. 6606–6610.
[19] A . Sgambelluri , A . Giorgetti , F. Cugini , F. Paolucci , P. Castoldi , OpenFlow-based
segment protection in ethernet networks, J. Opt. Commun. Netw. 5 (9) (2013)
1066–1075 .
[20] Z. Guo , R. Liu , Y. Xu , A. Gushchin , A. Wal id , H.J. Chao , STAR: preventing
flow-table overflow in software-defined networks, Comput. Netw. 125 (2017)
15–25 .
[21] Z. Guo, Y. Xu, R. Liu, A. Gushchin, K.Y. Chen, A. Walid, H.J. Chao, Balancing flow table occupancy and link utilization in software-defined networks, Future Gener. Comput. Syst. 89 (2018) 213–223.
[22] S. Sharma, D. Staessens, D. Colle, M. Pickavet, P. Demeester, Enabling fast failure recovery in OpenFlow networks, in: Design of Reliable Communication Networks (DRCN), 2011 8th International Workshop on the, IEEE, 2011, pp. 164–171.
[23] D. Staessens, S. Sharma, D. Colle, M. Pickavet, P. Demeester, Software defined networking: meeting carrier grade requirements, in: Local & Metropolitan Area Networks (LANMAN), 2011 18th IEEE Workshop on, IEEE, 2011, pp. 1–6.
[24] S. Sharma , D. Staessens , D. Colle , M. Pickavet , P. Demeester , OpenFlow: meet-
ing carrier-grade recovery requirements, Comput. Commun. 36 (6) (2013)
656–665 .
[25] H. Kim, M. Schlansker, J.R. Santos, J. Tourrilhes, Y. Turner, N. Feamster, Coronet: fault tolerance for software defined networks, in: Network Protocols (ICNP), 2012 20th IEEE International Conference on, IEEE, 2012, pp. 1–2.
[26] M. Luo, Y. Zeng, J. Li, W. Chou, An adaptive multi-path computation framework for centrally controlled networks, Comput. Netw. 83 (2015) 30–44.
[27] Y. Jinyao , Z. Hailong , S. Qianjun , L. Bo , G. Xiao , HiQoS: an SDN-based multipath
QoS solution, China Commun. 12 (5) (2015) 123–133 .
[28] C. Rotsos , N. Sarrar , S. Uhlig , R. Sherwood , A.W. Moore , OFLOPS: an open
framework for OpenFlow switch evaluation, in: International Conference on
Passive and Active Network Measurement, Springer, Berlin Heidelberg, 2012,
pp. 85–95 .
[29] X. Jin, H.H. Liu, R. Gandhi, S. Kandula, R. Mahajan, M. Zhang, R. Wattenhofer, Dynamic scheduling of network updates, in: ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, ACM, 2014, pp. 539–550.
[30] G. Bianchi , M. Bonola , A. Capone , C. Cascone , OpenState: programming plat-
form-independent stateful OpenFlow applications inside the switch, ACM SIG-
COMM Comput. Commun. Rev. 44 (2) (2014) 44–51 .
[31] A. Capone, C. Cascone, A.Q. Nguyen, B. Sanso, Detour planning for fast and reliable failure recovery in SDN with OpenState, in: Design of Reliable Communication Networks (DRCN), 2015 11th International Conference on the, IEEE, 2015, pp. 25–32.
[32] C. Cascone, L. Pollini, D. Sanvito, A. Capone, B. Sanso, SPIDER: fault resilient SDN pipeline with recovery delay guarantees, in: NetSoft Conference and Workshops (NetSoft), 2016 IEEE, IEEE, 2016, pp. 296–302.
[33] S.A. Astaneh , S.S. Heydari , Optimization of SDN flow operations in multi-failure
restoration scenarios, IEEE Trans. Netw. Serv. Manag. 13 (3) (2016) 421–432 .
[34] A. Malik , B. Aziz , M. Adda , C.H. Ke , Optimisation methods for fast restoration
of software-defined networks, IEEE Access 5 (2017) 16111–16123 .
[35] A. Malik, B. Aziz, C.H. Ke, H. Liu, M. Adda, Virtual topology partitioning towards an efficient failure recovery of software defined networks, in: Machine Learning and Cybernetics (ICMLC), 2017 International Conference on, IEEE, pp. 646–651.
[36] A. Malik, B. Aziz, A. Al-Haj, M. Adda, Software-defined networks: a walkthrough guide from occurrence to data plane fault tolerance (No. e27624v1), PeerJ Preprints, 2019.
[37] S.S. Lee, K.Y. Li, K.Y. Chan, G.H. Lai, Y.C. Chung, Software-based fast failure recovery for resilient OpenFlow networks, in: Reliable Networks Design and Modeling (RNDM), 2015 7th International Workshop on, IEEE, 2015, pp. 194–200.
[38] M. Desai , T. Nandagopal , Coping with link failures in centralized control plane
architectures, in: Communication Systems and Networks (COMSNETS), 2010
Second International Conference on, IEEE, 2010, pp. 1–10 .
[39] S.S. Lee, K.Y. Li, K.Y. Chan, G.H. Lai, Y.C. Chung, Path layout planning and software based fast failure detection in survivable OpenFlow networks, in: Design of Reliable Communication Networks (DRCN), 2014 10th International Conference on the, IEEE, 2014, pp. 1–8.
[40] E.W. Dijkstra, A note on two problems in connexion with graphs, Numer. Math. 1 (1) (1959) 269–271.
[41] B. Vidalenc , L. Ciavaglia , L. Noirie , E. Renault , Dynamic risk-aware routing for
OSPF networks, in: Integrated Network Management (IM 2013), 2013 IFIP/IEEE
International Symposium
on, IEEE, 2013, pp. 226–234 .
[42] A. Medem, R. Teixeira, N. Feamster, M. Meulle, Joint analysis of network incidents and intradomain routing changes, in: Network and Service Management (CNSM), 2010 International Conference on, IEEE, 2010, pp. 198–205.
[43] C. Labovitz , G.R. Malan , F. Jahanian , Origins of internet routing instability, in:
INFOCOM’99. Eighteenth Annual Joint Conference of the IEEE Computer and
Communications Societies. Proceedings. IEEE, IEEE, 1999, pp. 218–226 .
[44] F. Salfner , M. Lenk , M. Malek , A survey of online failure prediction methods,
ACM Comput. Surv. (CSUR) 42
(3) (2010) 10 .
[45] A. Medem , R. Teixeira , N. Usunier , Predicting critical intradomain routing
events, in: Global Telecommunications Conference (GLOBECOM 2010), 2010
IEEE, IEEE, 2010, pp. 1–5 .
[46] R.S. Mangoubi , Robust Estimation and Failure Detection: A Concise Treatment,
Springer Science & Business Media, 2012 .
[47] L. Lu , M. Zhang , Edge betweenness centrality, in: Encyclopedia of Systems Bi-
ology, Springer, New York, NY., 2013, pp. 647–648 .
[48] U. Brandes, On variants of shortest-path betweenness centrality and their generic computation, Soc. Netw. 30 (2) (2008) 136–145.
[49] S. Szwaczyk, K. Wrona, M. Amanowicz, Applicability of risk analysis methods to risk-aware routing in software-defined networks, in: 2018 International Conference on Military Communications and Information Systems (ICMCIS), IEEE, 2018, pp. 1–7.
[50] S. De Maesschalck , D. Colle , I. Lievens , M. Pickavet , P. Demeester , C. Mauz ,
J. Derkacz , Pan-european optical transport networks: an availability-based
comparison, Photonic Netw. Commun. 5 (3) (2003) 203–225 .
[51] A.J. Gonzalez, B.E. Helvik, Characterisation of router and link failure processes in UNINETT's IP backbone network, Int. J. Space Based Situated Comput. 7 (1) (2012) 3–11.
[52] S. Kaplan , B.J. Garrick , On the quantitative definition of risk, Risk Anal. 1 (1)
(1981) 11–27 .
[53] B. Chandrasekaran, T. Benson, Tolerating SDN application failures with LegoSDN, in: Proceedings of the 13th ACM Workshop on Hot Topics in Networks, ACM, 2014, p. 22.
[54] POX Wiki, [Online]. Available: https://openflow.stanford.edu/display/ONL/POX+Wiki.
[55] A. Shalimov , D. Zuikov , D. Zimarina , V. Pashkov , R. Smeliansky , Advanced study
of SDN/OpenFlow controllers, in: Proceedings of the 9th Central & Eastern Eu-
ropean Software Engineering Conference in Russia, ACM, 2013, p. 1 .
[56] W.Y. Huang, J.W. Hu, S.C. Lin, T.L. Liu, P.W. Tsai, C.S. Yang, J.J. Mambretti, Design and implementation of an automatic network topology discovery system for the future internet across different domains, in: Advanced Information Networking and Applications Workshops (WAINA), 2012 26th International Conference on, IEEE, 2012, pp. 903–908.
[57] Att/pox, Accessed on July 15, 2019. [Online]. Available: https://github.com/att/pox/blob/master/pox/openflow/discovery.py.
[58] D.A. Schult, P. Swart, Exploring network structure, dynamics, and function using NetworkX, in: Proceedings of the 7th Python in Science Conference (SciPy 2008), 2008, pp. 11–16.
[59] A. Malik, B. Aziz, M. Adda, Towards filling the gap of routing changes in software-defined networks, in: Proceedings of the Future Technologies Conference, Springer, Cham, 2018, pp. 682–693.
[60] R. Bhandari , Survivable Networks: Algorithms for Diverse Routing, Springer
Science & Business Media, 1999 .
[61] S. Poretsky, B. Imhoff, K. Michielsen, Terminology for benchmarking link-state
IGP data-plane route convergence (no. RFC 6412), 2011.
[62] SNDlib library, [Online]. Availa ble: http://sndlib.zib.de .
[63] A. Medina, A. Lakhina, I. Matta, J. Byers, BRITE: an approach to universal topology generation, in: Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2001. Proceedings. Ninth International Symposium on, IEEE, 2001, pp. 346–353.
[64] B.M. Waxman , Routing of multipoint connections, IEEE J. Sel. Areas Commun.
6 (9) (1988) 1617–1622 .
[65] SDN proactive fault handling, Accessed on July 15, 2019. [Online]. Available: https://github.com/Ali00/SDN-Smart-Routing.
[66] B. Lantz, B. Heller, N. McKeown, A network in a laptop: rapid prototyping for software-defined networks, in: Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks, ACM, 2010, p. 19.
[67] L. Saino, C. Cocora, G. Pavlou, A toolchain for simplifying network simulation setup, in: Proceedings of the 6th International ICST Conference on Simulation Tools and Techniques, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2013, pp. 82–91.
[68] M. Ohring , J.R. Lloyd , Reliability and failure of electronic materials and devices,
Academic Press, 2009 .
[69] ZeroMQ, [Online]. Available: http://zeromq.org/ .
Ali Malik is a Postdoctoral Researcher at the School of Electrical and Electronic Engineering, Technological University Dublin, Ireland. He received his B.Sc. degree in computer science from Al-Qadisiyah University, Iraq, in 2009. He also holds an M.Sc. degree in information technology from BAMU University, India, in 2012. He obtained his Ph.D. degree in computer science from the University of Portsmouth, United Kingdom, in 2019. His current research interests include software-defined networks, routing, fault management, risk and cybersecurity.
Benjamin Aziz is a Senior Lecturer in Computer Security at the School of Computing, University of Portsmouth, United Kingdom. He holds a Ph.D. degree in formal verification of computer security from Dublin City University (DCU), Ireland, obtained in 2003. He has worked in the past as a postdoctoral researcher at Imperial College London and Rutherford Appleton Laboratory, in areas related to security engineering of large-scale systems, formal design and analysis, requirements engineering and digital forensics. He serves on the program committees of several conferences and working groups, such as ERCIM's FMICS, STM, Cloud Security Alliance and IFIP WG11.3. His research interests include formal modelling, security, computer forensics, risk management, software engineering, IoT and software-defined networks.
Mo Adda is a Principal Lecturer in computer networks at the School of Computing, University of Portsmouth. He received the Ph.D. degree in distributed systems and parallel processing from the University of Surrey. He was a Senior Lecturer with the University of Richmond, where he taught programming, computer architecture, and networking for ten years. From 1999 to 2002, he was a Senior Software Engineer developing software and managing projects on simulation and modelling. He has been researching parallel and distributed systems since 1987. His research interests include software-defined networks, wireless sensor networks, mobile networks and business process modelling, simulation of mobile agent technology and security.
Chih-Heng Ke is an Associate Professor with the Department of Computer Science and Information Engineering, National Quemoy University, Kinmen, Taiwan. He received the B.S. and Ph.D. degrees in electrical engineering from National Cheng-Kung University in 1999 and 2007, respectively. His current research interests include multimedia communications, wireless networks, and software-defined networks.