Fault-Tolerant IoT
A Systematic Mapping Study
Mahyar Tourchi Moghaddam and Henry Muccini
University of L’Aquila, Via Vetoio 1, L’Aquila, Italy
{mahtou,henry.muccini}@univaq.it
Abstract. A failure may occur at any architectural level of an Internet
of Things (IoT) application: sensor and actuator nodes can go missing,
network links can be down, and processing and storage components can
fail to perform properly. For this reason, fault-tolerance (FT) has become
a crucial concern for IoT systems.
Our study aims at identifying and classifying the existing FT mechanisms
that allow IoT systems to tolerate failures. Following a systematic
mapping study selection procedure, we picked out 60 papers among over
2300 candidate studies. We then applied a rigorous classification
and extraction framework to select and analyze the most influential
domain-related information. Our analysis revealed the following main
findings: (i) whilst researchers tend to study fault-tolerant IoT (FT-IoT)
at the cloud level only, several studies extend the application to fog and
edge computing; (ii) there is a growing scientific interest in using the
microservices architecture to address FT in IoT systems; (iii) the
distribution of IoT components, their collaboration and the location of
intelligent elements impact system resiliency. This study gives a foundation
for classifying existing and future approaches to fault-tolerant IoT, by
classifying a set of methods, techniques and architectures that are
potentially capable of reducing IoT system failures.
Keywords: Fault-tolerance · Internet of Things · Software architecture ·
Systematic mapping study
1 Introduction
IoT is the internal/external communication of intelligent elements via the
internet to provide smart services [1]. A dependable IoT system should provide
reliable and fault-free services. A fault is a defect within the hardware or
software that impairs correct functionality. It is particularly difficult to
establish a pattern for FT in IoT, since IoT devices are heterogeneous, highly
distributed, battery powered, reliant on wireless communication and affected by
scalability. The distribution of IoT devices causes the system to suffer from,
e.g., server crash, server omission, incorrect response and arbitrary failure.
The wireless and battery dependency makes IoT devices barely recoverable.
Furthermore, being exposed to new devices and services impacts system
performance.
© Springer Nature Switzerland AG 2019
R. Calinescu and F. Di Giandomenico (Eds.): SERENE 2019, LNCS 11732, pp. 67–84, 2019.
https://doi.org/10.1007/978-3-030-30856-8_5
Although the IoT was introduced more than a decade ago, the research
and industry communities are still trying to define its different aspects and
Quality of Service (QoS) attributes such as FT. Hence, the goal of this research
is to identify and classify the state of the art in the domain and to highlight
the methods, techniques and architectures that are potentially suitable to model
an FT-IoT system. To achieve this goal, a systematic mapping study has been
performed. The primary studies have been chosen based on accurate inclusion and
exclusion criteria and a deep analysis. The main contributions of this study
are: (i) providing an up-to-date classification of the state of the art in
fault-tolerant IoT modeling, which can be used as a reference for future
research and implementation; (ii) investigating an IoT reference architecture
and assessing the impact of such a software design on FT; (iii) identifying
current characteristics, challenges and publication trends with respect to
FT-IoT approaches.
The audience of this study comprises both research and industry communities
interested in improving their knowledge and selecting suitable methods to design
their IoT systems.
The paper is organized as follows. Section 2 presents the design of this
systematic study. Section 3 presents a reference IoT architecture and analyzes
its associated FT aspects. Sections 4, 5, 6, 7 and 8 elaborate on the obtained
results, while Sect. 9 analyzes threats to validity. Section 10 closes the paper
and discusses future work.
2 Research Method
The goal of this research is formulated based on the Goal-Question-Metric
perspectives [2,3] as follows:
Purpose: to provide a deep understanding on Fault-tolerant IoT systems
Issue: by identifying, classifying and analyzing different methods, techniques
and architectures
Object: based on existing IoT systems approaches
Viewpoint: from both research and industry viewpoints.
2.1 Search Strategy
To achieve the aforementioned goal, we arranged for a set of questions:
–RQ1: What IoT architectural styles and patterns are able to make the system
resilient to faults?
–RQ2: What traditional and novel techniques and methods can protect IoT
systems against failure?
–RQ3: What are the quality attributes associated with Fault-tolerance in IoT
systems?
–RQ4: What are the trends and evolution that can be deduced from the scien-
tific publications on FT-IoT?
Furthermore, a good search strategy should provide effective solutions to the
following questions [4]:
Which approaches? The search strategy consists of two phases: (i) an
automatic search on academic databases; and (ii) snowballing. The first step
has been performed using the search string below. Selection criteria have been
subsequently applied to the set of results. Then a snowballing procedure on the
included results of the automatic search has been applied to build the final
set of primary studies.
(IoT OR “Internet of Things” OR “Internet-of-Things”) AND (“Fault tolerant”
OR “Fault-tolerant” OR “Fault tolerance” OR “Fault-tolerance”)
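To make the boolean structure of the search string concrete, the following is a minimal Python sketch that evaluates it against a title or abstract. It is only an illustration: each database applies its own query syntax and matching rules, and the helper names (`_any_term`, `matches_search_string`) as well as the case-insensitive whole-phrase matching are assumptions made here.

```python
import re

# Term groups copied from the search string above.
IOT_TERMS = ["IoT", "Internet of Things", "Internet-of-Things"]
FT_TERMS = ["Fault tolerant", "Fault-tolerant", "Fault tolerance", "Fault-tolerance"]


def _any_term(text: str, terms: list[str]) -> bool:
    """Return True if at least one term occurs in text as a whole word or phrase.

    Case-insensitive matching is an assumption of this sketch.
    """
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
        for term in terms
    )


def matches_search_string(text: str) -> bool:
    """(IoT group) AND (fault-tolerance group)."""
    return _any_term(text, IOT_TERMS) and _any_term(text, FT_TERMS)
```

A record is retained by this predicate only when both groups are matched, which mirrors the AND between the two parenthesized groups.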
Where to search? The electronic databases that we used for the automatic
search (ACM, IEEE, Elsevier, Springer, ISI Web of Science, and Wiley
InterScience) are known as the main sources of literature for potentially
relevant studies on software engineering.
When and what time span to search? We did not consider publication
year as a criterion for the search and selection steps. Thus, all studies
resulting from the selection steps, until May 2019, were included regardless of
their publication time.
2.2 Selection Strategy
A multi-stage selection process (Fig. 1) has been designed to give full control
over the number and characteristics of the studies coming from the different
stages (see footnote 1).
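[Figure 1: flow diagram of the search and selection process. Initial search on
ACM Digital Library, IEEE Xplore, Springer, ISI Web of Science, Wiley
InterScience and ScienceDirect (per-database result counts as extracted from the
figure: 236, 334, 205, 228, 345, 1288); merge and duplicates removal: 2374
studies; selection criteria application: 54 studies; snowballing: 60 studies in
total.]
Fig. 1. Search and selection process.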
Footnote 1: It is worth mentioning that we considered “Software Engineering” as
the Search Topic, since the original search led to 193,000 results.
Afterwards, we considered all the selected studies, and filtered them according
to a set of well-defined inclusion and exclusion criteria (Table 1). In line
with the standards, the definition of inclusion/exclusion criteria has been
guided by two main drivers: (i) keeping the focus of the selected papers on the
scope of the study; and (ii) avoiding gray or non-scientific works. Thus, the
inclusion/exclusion criteria are aligned with the research questions. We
included studies that satisfied all inclusion criteria, and discarded studies
that met any exclusion criterion.
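The selection rule stated above (all inclusion criteria satisfied, no exclusion criterion met) can be sketched as a simple predicate. This is an illustrative sketch, not the tooling used in the study: the `Study` record and its boolean fields are hypothetical stand-ins for the manual judgments made against the criteria of Table 1.

```python
from dataclasses import dataclass


@dataclass
class Study:
    # Hypothetical fields; each one records a manual judgment by a reviewer.
    title: str
    addresses_ft_iot: bool       # proposes/leverages/analyzes FT solutions for IoT
    peer_reviewed: bool
    english_full_text: bool
    secondary_study: bool        # e.g., survey or systematic literature review
    tutorial_or_editorial: bool


INCLUSION = [
    lambda s: s.addresses_ft_iot,
    lambda s: s.peer_reviewed,
    lambda s: s.english_full_text,
]
EXCLUSION = [
    lambda s: not s.addresses_ft_iot,
    lambda s: s.secondary_study,
    lambda s: s.tutorial_or_editorial,
]


def is_selected(study: Study) -> bool:
    """Keep a study only if it meets ALL inclusion and NO exclusion criteria."""
    return all(c(study) for c in INCLUSION) and not any(c(study) for c in EXCLUSION)
```

For instance, a peer-reviewed survey on FT-IoT passes every inclusion criterion but is still discarded, because it meets the exclusion criterion on secondary studies.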
On the 2,374 potentially relevant papers, we performed a first manual step,
applying the selection criteria to the title and abstract of the papers.
Afterwards, a second manual step of reading the full text of the initially
selected papers has been performed, followed by snowballing. The reasons why we
obtained only 60 primary studies out of 2,374 potentially relevant papers are
that: (i) our search string was quite inclusive (to avoid ignoring any
potentially relevant paper); and (ii) the selection criteria have been applied
carefully, so as to avoid including papers that fall outside the scope of the
research. In order to minimize bias, the procedure has been performed by the
first researcher and the results have been double-checked by the other
researcher.
Table 1. Inclusion and exclusion criteria.
Inclusion criteria:
–Studies that propose, leverage, or analyze software and hardware solutions,
methods, techniques and architectures to design fault-tolerant IoT systems
–Studies subject to peer review (e.g., journal papers, papers published as part
of conference proceedings, workshop papers, and book chapters)
–Studies written in English language and available in full-text
Exclusion criteria:
–Studies that, while focusing on IoT, do not focus on its fault-tolerance
aspects (e.g., studies focusing only on technological aspects of IoT) or vice
versa
–Secondary or tertiary studies (e.g., systematic literature reviews, surveys,
etc.)
–Studies in the form of tutorial papers, editorials, etc. because they do not
provide enough information
After selection of a final set of primary studies, the data has been extracted
to answer the research questions.
Study Replicability. A replication package is provided to tackle the page
limits of a workshop paper:
https://www.dropbox.com/s/ansb75ncdoqpc9f/DATA-SERENE-2019.xlsx?dl=0.
The package is available as an Excel file with different
sheets that include all necessary information such as search results, primary
studies distribution, data extraction and validity examination.
3 Background on IoT Architectures
In this section, we present a reference software architecture for Internet of
Things applications [5–7]. IoT applications typically consist of a set of
software components including perception, data processing and storage (P&S) and
actuation, which are distributed across network(s).
[Figure 2: diagram with one CLOUD node, two FOG nodes and six MPU nodes.]
Fig. 2. IoT reference architecture (MPU refers to microprocessor unit).
Since this paper focuses on fault-tolerant data transmission and analysis, we
define our architecture based on the following P&S modeling characteristics:
–Distribution: this aspect specifies whether data analysis software ought to be
deployed on a single node or on several nodes that are distributed across the
IoT system. In other words, distribution refers to the deployment of
the IoT P&S software onto hardware. With a distributed style, latency
will potentially be reduced, since data traffic and bandwidth consumption
are minimized. Such a rapid response time facilitates real-time and
fault-tolerant IoT applications. Furthermore, in distributed systems, a faulty
P&S component does not make the IoT system unavailable, since it can be
replaced by another one.
–Localization: depending on data size and required analysis complexity, P&S
can be executed locally or remotely. This is the point at which the centralized
cloud and the distributed edge and fog concepts become relevant. The advantage
of using a central cloud is that processing on a cloud component facilitates
long-term data analysis for systems that have no constraints on response time.
For applications with massive P&S requirements, executing the task on the
powerful cloud is the only solution.
Fog nodes are the intermediate P&S components, which bring a degree of cloud
functionality to the network edge. Fog is not limited to running on a particular
device, so it can be located freely between the device edge and the cloud. The
analysis capacity of fog is lower than that of the cloud, but it mitigates a
significant point of failure by shifting computation to more than one
component. However, fog only operates locally, so it does not have global
coverage over a large IoT system. It is worth mentioning that some IoT devices
are able to perform simple P&S by themselves. Performing P&S on the IoT device
edge refers to computation capabilities embedded in a smart device so that it
can gather and analyze environmental data.
–Collaboration: the aforementioned computation components may interact to
form and empower IoT services. This collaboration may take the form of
information sharing, coordinated analysis and/or planning, or synchronized
actuation. Each IoT sensor network may provide data for many collaborative
P&S components, both locally and remotely. The advantage is that if
the local P&S node fails, the local service remains accessible.
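The availability argument behind distribution and collaboration can be illustrated with a small failover sketch: a sensor reading is offered to collaborating P&S nodes in order, and the first healthy one serves it. The `PSNode` class, the tier labels and the closest-first ordering are assumptions of this sketch and do not come from any primary study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PSNode:
    """A processing-and-storage (P&S) node; names and tiers are illustrative."""
    name: str
    tier: str            # "edge", "fog" or "cloud"
    healthy: bool = True

    def process(self, reading: float) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} processed {reading}"


def dispatch(reading: float, nodes: list[PSNode]) -> Optional[str]:
    """Try the collaborating P&S nodes in the given order (closest first).

    Returns the first successful result, or None if every node has failed.
    """
    for node in nodes:
        try:
            return node.process(reading)
        except RuntimeError:
            continue  # this node is faulty; fall through to the next one
    return None
```

With a failed local fog node followed by a second fog node and the cloud, the reading is still served by the second fog node, so the service stays available at the cost of a longer path.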
Considering the above definitions, we further design our reference IoT
architecture (Fig. 2). The architecture is composed of a physical layer and
several P&S layers. The physical layer is made up of two sub-layers, namely
perception and application. The perception sub-layer hosts a large number of
heterogeneous sensors and the application sub-layer consists of various types of
actuators. The P&S layers store and analyze data gathered by the perception
components to provide the required IoT service.
Looking through the primary studies, each of them addresses FT for specific
layer(s) of the IoT architecture. As shown in Fig. 3, whilst faults usually
occur in the sense (26/60) and actuation (12/60) sub-layers, the primary studies
recognized the importance of the network (38/60) and P&S (33/60) layers for
FT-IoT systems. The reason is that handling FT is the responsibility of P&S
nodes and is based on the transmitted data coming from the physical layer. In
Sect. 5, we discuss various FT strategies and techniques for IoT systems.
[Figure 3: horizontal bar chart. x-axis: number of primary studies; y-axis: the
focused architectural layer.]
–Network (38): P1, P2, P3, P4, P5, P6, P9, P11, P12, P14, P15, P17, P20, P24, P26, P27, P28, P29, P30, P31, P33, P34, P40, P41, P42, P43, P44, P45, P46, P47, P48, P51, P54, P56, P57, P58, P59, P60
–Processing and Storage (33): P1, P2, P3, P6, P8, P10, P11, P13, P16, P19, P21, P22, P23, P31, P33, P35, P36, P37, P38, P39, P40, P41, P44, P45, P48, P49, P50, P51, P53, P55, P56, P58, P59
–Sense (26): P1, P4, P5, P6, P7, P9, P10, P12, P13, P14, P15, P16, P17, P19, P20, P28, P32, P35, P36, P42, P45, P50, P52, P55, P58, P59
–Actuate (12): P1, P6, P7, P13, P19, P21, P25, P32, P35, P55, P57, P58
Fig. 3. The primary studies focus on each architectural layer.
4 Fault-Tolerant IoT Architectural Patterns and Styles (RQ1)
This section discusses the specific characteristics of primary studies related to
FT-IoT architectural design. The primary studies used one or more overlaid
style(s) to design their software system. However, among the various IoT
architectural styles, layered architecture (32/60) was clearly the most common,
as reported in Fig. 4. In the layered view, the system is viewed as a complex
heterogeneous entity that can be decomposed into interacting parts. The primary
studies designed their layered architecture in different ways, ranging from 3
layers (with a central P&S component only) to 5 layers (including edge and fog)
(see Fig. 2).
Cloud-based architecture (28/60) was the second most common. Fog, which is a
significant extension to the cloud environment, is addressed in 15 studies as
well. Few studies (4/60) used the device edge concept to design their FT-IoT
architecture. Minimizing the impact of a failed component within an integrated
fog-cloud platform needs a common agreement protocol that is able to bring the
system to a consistent state with the minimum number of message exchange rounds.
[Figure 4: horizontal bar chart of the number of primary studies per
architectural style and per architectural pattern.]
Architectural styles:
–Layered (32): P2, P3, P6, P8, P13, P14, P16, P17, P22, P25, P28, P31, P33, P35, P36, P37, P39, P40, P42, P43, P44, P47, P48, P49, P50, P51, P53, P54, P55, P57, P58, P60
–Cloud-based (28): P2, P3, P6, P7, P8, P12, P15, P16, P19, P21, P22, P23, P24, P27, P31, P33, P35, P39, P41, P44, P45, P48, P50, P51, P52, P53, P56, P57
–Service oriented (SOA) (9): P6, P8, P10, P11, P13, P23, P27, P32, P45
–Microservices (4): P7, P21, P40, P56
–Publish/Subscribe (1): P2
Architectural patterns:
–Hybrid (34): P2, P3, P4, P5, P6, P8, P9, P11, P15, P16, P21, P22, P23, P24, P28, P29, P30, P31, P33, P34, P36, P40, P41, P42, P44, P45, P46, P47, P48, P50, P51, P52, P53, P58
–Centralized (13): P7, P13, P14, P19, P25, P26, P27, P36, P37, P39, P55, P56, P57
–Distributed Collaborative (12): P10, P17, P18, P20, P24, P32, P35, P38, P43, P49, P59, P60
Fig. 4. FT-IoT architectural styles and patterns.
Service oriented architectures (SOA) (9/60) put the service at the centre
of their IoT application design. In fact, the core application component makes
the service available to other IoT components over a network. Microservices
(4/60) and SOA have the same goal in IoT systems, that is, building one or
multiple applications from a set of different services. A microservice is a small
application with a single responsibility, which can be deployed, scaled and
tested independently.
P21 proposes a pluggable framework based on a microservices architecture
that implements FT support as two complementary microservices: one that uses
complex event processing for real-time fault detection, and another that uses
online machine learning to detect fault patterns and preemptively mitigate faults
before they are activated. P7 proposes a system based on container virtualisation
that allows IoT clouds to carry out fault-tolerance when a microservice running
on an IoT device fails. P40 presents a reactive microservices architecture and
its application in a fog computing case study to investigate FT challenges at
the edge of the network. P56 presents a microservices-based mobile cloud
platform that exploits containerization, which replaces heavyweight virtual
machines, to guarantee run-time FT.
On the other hand, as explained in Sect. 3, IoT distribution patterns classify
the architectures according to edge intelligence and elements collaboration.
Figure 4 shows the distribution patterns that are used by the primary studies.
Most studies used a Hybrid pattern (34/60), followed by the Centralized
(13/60) and the Distributed Collaborative (12/60) patterns.
In this section we showed that edge/cloud-based distributed architectures
are extensively used by primary studies. The results confirm that a distributed
architecture provides a rapid response time and high availability, and makes the
system resilient to faults.
5 Fault-Tolerance Techniques for Resilient IoT (RQ2)
As shown in Fig. 5, the primary studies adopt various techniques to make their
IoT system fault-tolerant. These techniques are explained below.
5.1 Replication
Replication is the process of sharing the data between redundant IoT HW/SW
components. Replication guarantees the data consistency, so that failure of a
component will not result in system failure. The main replication schemes are
known as active and passive [8].
In the active replication scheme (22/60), processes are replicated on multiple
processors to provide fault-tolerance. In the IoT context, active replication
continuously pushes the group of IoT resources (such as fog or cloud) to execute
the same process concurrently. In case of a fault, failover to other active
resources can happen in a very short period [P33]. In this way, extra processing
occurs and a redundant, duplicated dataset is sent to the endpoint. Although
active replication takes a lot of processing resources, it is failure
transparent and its failure discovery time is deterministic.
In passive replication (24/60), the primary processor performs the task and the
extra IoT components remain idle until a failure occurs. The idle components,
however, contact the primary processor in order to be updated and stay
consistent. The passive replication scheme imposes an additional cost of
resources and suffers from a slow response to failure.
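The difference between the two schemes can be sketched in a few lines of Python. This is a simplified, single-process illustration under stated assumptions: the `Replica` class, its running-total state and the promotion of the first live backup are hypothetical choices made here, and a real deployment would replicate over a network with failure detectors.

```python
class Replica:
    """A toy IoT resource (e.g., a fog or cloud node) keeping a running total."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True
        self.state = 0

    def apply(self, value: int) -> int:
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.state += value
        return self.state


def active_step(replicas: list, value: int) -> int:
    """Active replication: every live replica executes the same request."""
    results = []
    for replica in replicas:
        try:
            results.append(replica.apply(value))
        except RuntimeError:
            pass  # a failed replica is masked by the others
    if not results:
        raise RuntimeError("all replicas failed")
    return results[0]


class PassiveGroup:
    """Passive replication: only the primary works; backups copy its state."""

    def __init__(self, primary: Replica, backups: list):
        self.primary = primary
        self.backups = list(backups)

    def step(self, value: int) -> int:
        if not self.primary.alive:
            # Failover: promote the first live backup (assumed policy).
            live = [b for b in self.backups if b.alive]
            if not live:
                raise RuntimeError("no live backup to promote")
            self.primary = live[0]
            self.backups.remove(self.primary)
        result = self.primary.apply(value)
        for backup in self.backups:
            if backup.alive:
                backup.state = self.primary.state  # backup refreshes its copy
        return result
```

In the active variant every request is paid for on each replica, while in the passive variant the extra cost is the state refresh and the failover step performed when the primary is found to be down.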
5.2 Network Control
In the network control scheme (19/60), the IoT network is generally divided into
various clusters. A chosen cluster head (CH) periodically makes roll call
requests to the other nodes, and if it does not receive a reply message, the
failure is confirmed. However, the CH itself is a single point of failure.
Several cluster-based routing protocols have been proposed by the primary
studies. Some primary studies took advantage of a bio-inspired particle
multi-swarm optimization routing algorithm to construct, recover, and select
disjoint paths that tolerate failures while satisfying the quality of service
parameters. Some other studies used virtual CH formation and flow graph modeling
to efficiently tolerate the failures of CHs. Multiple traveling salesman is also
among the routing algorithms that are addressed by the primary studies.
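The roll call performed by a cluster head can be sketched as follows. This is a minimal illustration, not a protocol from the primary studies: `ping` is a hypothetical transport hook that sends a roll call request to a node and reports whether a reply arrived before a timeout.

```python
from typing import Callable, Iterable


def roll_call(members: Iterable[str], ping: Callable[[str], bool]) -> set[str]:
    """Run one roll call round from the cluster head.

    `ping(node)` is an assumed callback: it returns True if the node replied
    to the roll call request, False otherwise. A missing reply confirms the
    failure of that node, as described in the text above.
    """
    failed = set()
    for node in members:
        if not ping(node):
            failed.add(node)
    return failed
```

The sketch does not address the failure of the cluster head itself, which is the single point of failure that the virtual CH formation approaches mentioned above try to remove.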
5.3 Distributed Recovery Block
In this method (8/60), a single program is concurrently executed on a node
pair, of which one is active and the other is inactive. In a no-fault situation,
the main (active) node performs the task and the other node performs the same
task in shadow. Afterwards, both results are tested, and if the test is
passed, the results associated with the main node are delivered as the output.
If the primary node test fails, the shadow node becomes active and produces the
outputs. This method can protect the system only against a single point of
failure.
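A compact sketch of the distributed recovery block idea is given below. It is a simplification under stated assumptions: the two nodes are modeled as plain callables, they are run one after the other rather than concurrently, and the acceptance test is a caller-supplied predicate; all names are illustrative.

```python
from typing import Any, Callable, Tuple


def _try_run(node: Callable[[Any], Any], task: Any) -> Tuple[bool, Any]:
    """Run the task on a node, turning a crash into a failed result."""
    try:
        return True, node(task)
    except Exception:
        return False, None


def distributed_recovery_block(
    task: Any,
    primary: Callable[[Any], Any],
    shadow: Callable[[Any], Any],
    acceptance_test: Callable[[Any], bool],
) -> Tuple[str, Any]:
    """Deliver the primary result if it passes the test, else the shadow result."""
    # In a real deployment both nodes would execute concurrently.
    primary_ok, primary_out = _try_run(primary, task)
    shadow_ok, shadow_out = _try_run(shadow, task)
    if primary_ok and acceptance_test(primary_out):
        return "primary", primary_out
    if shadow_ok and acceptance_test(shadow_out):
        return "shadow", shadow_out  # the shadow node takes over
    raise RuntimeError("both nodes failed the acceptance test")
```

The final `raise` makes the stated limitation explicit: when both nodes of the pair fail, the method has nothing left to fall back on.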
5.4 Time Redundancy
Time redundancy (1/60) can be performed at both instruction and task levels.
At instruction level, the program is duplicated and subsequently the results are
compared to discover a potential error. At task level, the software is run twice
(or more) to mitigate dynamic faults. Although this method does not impose the
cost of additional hardware, it increases the time needed to assure redundancy.
The method also reduces the computing performance and consumes more energy.
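Task-level time redundancy can be sketched as repeated execution followed by a comparison of the results. The function name, the `runs` and `max_rounds` parameters and the retry-on-disagreement policy are assumptions of this sketch rather than details taken from the primary study.

```python
from typing import Any, Callable


def run_with_time_redundancy(
    task: Callable[..., Any], *args: Any, runs: int = 2, max_rounds: int = 3
) -> Any:
    """Run `task` several times on the same hardware and compare the results.

    If all results of a round agree, the value is accepted. A disagreement
    hints at a transient (dynamic) fault, so the whole round is repeated,
    up to `max_rounds` times.
    """
    for _ in range(max_rounds):
        results = [task(*args) for _ in range(runs)]
        if all(result == results[0] for result in results):
            return results[0]
    raise RuntimeError("results keep disagreeing; the fault may be persistent")
```

The loop makes the trade-off visible: no extra hardware is needed, but every accepted result costs at least `runs` executions in time and energy.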
It is worth mentioning that the whole IoT system can follow a Reactive or
Proactive strategy. Reactive FT starts to recover the system after the detection
of an error (using event processing methods). In proactive FT, the recovery
strategy is started even before the detection of an error (using machine learning
methods).
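The contrast between the two strategies can be sketched on a stream of battery-level readings. This is a deliberately simplified illustration: the threshold rule stands in for event processing, and a linear extrapolation of the last two readings stands in for the machine learning predictor; both functions and their parameters are assumptions made here.

```python
def reactive_should_recover(readings: list, threshold: int) -> bool:
    """Reactive: act only once the latest reading is already below threshold."""
    return readings[-1] < threshold


def proactive_should_recover(readings: list, threshold: int, horizon: int = 3) -> bool:
    """Proactive: act if the level is predicted to fall below threshold soon.

    The prediction is a plain linear extrapolation of the last two readings,
    used here as a stand-in for a learned model.
    """
    if len(readings) < 2:
        return readings[-1] < threshold
    slope = readings[-1] - readings[-2]
    predicted = readings[-1] + slope * horizon
    return min(readings[-1], predicted) < threshold
```

On a steadily draining battery the proactive check fires several steps before the reactive one, which is the window in which recovery can start before an error is observed.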
[Figure 5: horizontal bar chart. x-axis: number of primary studies; y-axis:
fault-tolerance techniques.]
–Passive (24): P3, P7, P10, P13, P14, P18, P19, P20, P21, P22, P23, P25, P26, P29, P34, P35, P39, P40, P41, P43, P45, P48, P50, P56
–Active (22): P2, P3, P6, P8, P11, P15, P16, P17, P21, P24, P27, P32, P36, P37, P41, P46, P47, P51, P54, P55, P57, P58
–Network Control (19): P4, P5, P9, P10, P12, P13, P28, P31, P34, P42, P44, P45, P46, P47, P48, P51, P53, P59, P60
–Distributed Recovery Block (8): P3, P6, P7, P29, P30, P33, P38, P52
–Time Redundancy (1): P52
Fig. 5. Fault-tolerance techniques.
6 Quality of IoT Service Associated with Fault-Tolerance (RQ3)
The standard used to categorize quality attributes comes from ISO 25010 and
some specific IoT attributes derived from the primary studies keywording.
An IoT system raises many challenges from a QoS perspective when FT is taken
into account. As shown in Fig. 6, the most recognized quality challenges
are related to performance (25/60), availability (20/60), security (20/60) and
scalability (16/60), whilst interoperability (8/60) and energy efficiency (2/60)
are of lesser concern.
[Figure 6: horizontal bar chart. x-axis: number of primary studies; y-axis:
quality attributes.]
–Performance (25): P2, P3, P4, P5, P6, P7, P9, P11, P13, P16, P18, P19, P22, P23, P26, P31, P43, P44, P47, P48, P49, P52, P56, P58, P60
–Availability (20): P2, P3, P5, P6, P10, P14, P15, P17, P18, P19, P21, P25, P27, P33, P35, P36, P39, P41, P44, P53
–Security (20): P1, P6, P7, P8, P11, P12, P14, P16, P17, P20, P21, P26, P36, P39, P40, P44, P45, P51, P53, P58
–Scalability (16): P5, P6, P9, P11, P14, P15, P16, P17, P19, P21, P23, P27, P40, P41, P47, P52
–Interoperability (8): P8, P9, P11, P21, P24, P40, P41, P47
–Energy Consumption (2): P5, P41
Fig. 6. QoS associated with FT-IoT.
The level of performance depends on how much the processing and storage
components are pushed to the edge in a decentralized way. Availability is the
ability of a system to be fully or partly operational as and when required.
Clearly, FT and availability are not identical, since a fault-tolerant system is
supposed to keep the system operational without interruption, whereas a highly
available system may have service interruptions. However, a fault-tolerant
system should maintain a high level of system availability and performance as
well.
In IoT systems, in which different components and entities are connected to each
other through a network, security becomes a major concern. Scalability is also
an essential attribute, as IoT systems should be capable of performing properly
with a huge number of heterogeneous devices. Commenting on the scalability of
IoT as a whole system is difficult; however, it depends on how new resources can
be added on demand. A fault-tolerant system also requires enormous computational
effort to be run in distributed P&S components. Device heterogeneity
and the distribution of P&S elements make the system hard to scale.
Interoperability helps heterogeneous IoT components to work together
efficiently. It actually depends on how far large-scale heterogeneous IoT
devices can communicate directly with each other to gather the required data
without having to go through the central/remote components. Since most IoT
devices are battery powered, energy efficiency, which is tied to many other
quality attributes (such as performance), becomes essential. However, wireless
and battery dependency makes IoT devices hard to recover and limits their
scalability and performance.
7 Horizontal Analysis
This section reports the results orthogonal to the vertical analysis presented
in the previous sections. For the purpose of this section, we cross-tabulated
and grouped the data, made comparisons between pairs of concepts of our
classification framework, and identified perspectives of interest.
7.1 FT Techniques vs Architectural Patterns
Here the question is: which architectural pattern is more often used for each FT
technique? As shown in Fig. 7, (11/60) studies used the hybrid pattern to
facilitate their passive FT techniques, whilst (15/60) used hybrid for active
FT. In contrast, centralized and collaborative architectural patterns are more
suitable for addressing passive FT. The network control FT technique is better
addressed by a hybrid architectural pattern. In general, a hybrid architecture
guarantees FT-IoT, since if one fog node fails, the IoT system can shift the
computation to another fog node to avoid the single point of failure.
[Figure 7: bubble chart of study counts per technique and pattern. x-axis:
fault-tolerance techniques (Passive, Active, Network Control, Distributed
Recovery, Time Redundancy); y-axis: architectural patterns (Hybrid, Distributed
Collaborative, Centralized).]
Fig. 7. FT techniques vs patterns.
7.2 FT Techniques vs Quality Attributes
What quality attributes are satisfied when a specific FT technique is adopted?
As shown in Fig. 8, the passive technique mostly takes into account performance
and availability, whilst the active technique gives more weight to security and
scalability. Furthermore, network control enhances performance in addition to
fault-tolerance. Given the rapid development and extension of devices at the
edge of the network, the performance of IoT should be maintained at an
appropriate level. Performance highly depends on how data storage and
application logic are distributed among edge and central servers. As mentioned
before, fog computing can pave the way to improving the performance level of IoT
systems.
[Figure 8: bubble chart of study counts per technique and quality attribute.
x-axis: fault-tolerance techniques (Passive, Active, Network Control,
Distributed Recovery, Time Redundancy); y-axis: quality attributes (Performance,
Availability, Security, Scalability, Interoperability, Energy Consumption).]
Fig. 8. Techniques vs quality attributes.
8 Challenges and Emerging Trends (RQ4)
In this section the emerging trends in resilience for FT-IoT are presented. To
this end, publication year, type and venue are firstly extracted and an overall
discussion is subsequently provided.
8.1 Publication Year
Figure 9 shows the distribution of FT-IoT literature. It clearly indicates that
the number of papers grows over time and that there is just one related paper
published before 2014. This result confirms the scientific interest in, and
research need for, FT-IoT issues in the last few years.
[Figure 9: number of primary studies per publication year (2012 to 2019), broken
down by publication type (Journal, Conference, Workshop).]
Fig. 9. Primary studies distribution by publication type.
8.2 Publication Type
The most common publication type is conference paper (40/60), followed by
journal (17/60), and workshop paper (3/60). Such a high number of journal and
conference papers may indicate that FT-IoT is maturing as a research topic,
although it is still relatively young.
8.3 Publication Venues
From the extracted data we can notice that research on FT-IoT is spread across
many venues, mostly within the IoT (e.g. WF-IoT), computing (e.g. ICAC)
and networking (e.g. ICOIN) communities. The complete list of venues can be
found in the data extraction file. The focus on the aforementioned
areas suggests the significance of distributed computing and networking for
FT-IoT systems.
8.4 Emerging Trends in Resilience for FT-IoT
Our study reveals that some FT-IoT techniques are more rarely
covered than others, specifically, distributed recovery block and time
redundancy. We clarify that this result by no means implies that there is
limited literature or support on such FT techniques, but they appear to have a
more limited application in IoT. At the architectural level, we observed a
significant move toward adopting hybrid architectures, which make the IoT system
resilient to faults. Furthermore, whilst growth in the use of service-oriented
and microservices architectures is perceived, their various aspects need to be
better investigated regarding FT. The study showed that, among FT-IoT
architectural layers, attention goes especially to the network and the
processing and storage components.
Our study also reveals that performance and availability are tied up
with IoT systems fault-tolerance. However, the trade-off between FT
and other IoT quality attributes such as scalability, interoperability and energy
consumption should be further investigated. Another result to be further
evaluated through a state of the practice analysis is that only a few studies
support the interplay between FT techniques and collaborative architectures. The
mentioned aspects are to be considered in future work in the domain.
9 Threats to Validity
According to Petersen et al. [9], the quality rating for this systematic mapping
study was assessed and scored as 73%. This value is the ratio of the number of
actions taken to the total number of actions reported in the quality
checklist. The quality score of our study is well above the scores obtained by
existing systematic mapping studies in the literature, which have a distribution
with a median of 33% and an absolute maximum of 48%. However, threats
to validity are unavoidable. Below we briefly define the main threats to validity
of our study and the way we mitigated them.
External validity: in our study, the most severe threat related to external
validity may consist of having a set of primary studies that is not representative
of the whole research on FT-IoT. We mitigated this potential threat by (i)
following a search strategy including both automatic search and backward-forward
snowballing of selected studies; and (ii) defining a set of inclusion and exclusion
criteria. Along the same lines, gray and non-English literature is not included
in our research, as we want to focus exclusively on the state of the art presented
in high-quality scientific studies in English.
Internal validity: it refers to the level of influence that extraneous variables
may have on the design of the study. We mitigated this potential threat to
validity by (i) rigorously defining and validating the structure of our study, (ii)
defining our classification framework by carefully following the keywording
process, and (iii) conducting a well-structured vertical analysis.
Construct validity: it concerns the validity of the extracted data with respect to
the research questions. We mitigated this potential source of threats in
different ways: (i) performing the automatic search on a couple of databases to
avoid potential biases; (ii) having a strong and tested search string; (iii)
complementing the automatic search with the snowballing activity; and (iv)
rigorously screening the studies according to the inclusion and exclusion
criteria.
Conclusion validity: it concerns the relationship between the extracted data
and the obtained results. We mitigated potential threats to conclusion validity
by applying well-accepted systematic methods and processes throughout our
study and documenting all of them in the Excel package.
10 Conclusion
In this paper we present a systematic mapping study with the goal of identifying
and classifying the state of the art in the domain and extracting a set of FT-IoT
methods and techniques. Starting from over 2300 potentially relevant studies, we
applied a rigorous selection procedure resulting in 60 primary studies. The
results of this study are oriented to both research and industry and are intended
to provide a framework for future research in FT-IoT related fields. As future
work, we will assess the potential integration of existing research into an
industrial level of IoT.
Primary Studies
• P1: Toward a New Approach to IoT Fault Tolerance, https://doi.org/10.1109/MC.2016.238
• P2: CEFIoT: A fault-tolerant IoT architecture for edge and cloud, https://doi.org/10.1109/WF-IoT.2018.8355149
• P3: Reliable and Fault-Tolerant IoT-Edge Architecture, https://doi.org/10.1109/ICSENS.2018.8589624
• P4: Efficient Fault-Tolerant Routing in IoT Wireless Sensor Networks Based on Bipartite-Flow Graph Modeling, https://doi.org/10.1109/ACCESS.2019.2894002
• P5: Optimizing Multipath Routing With Guaranteed Fault Tolerance in Internet of Things, https://doi.org/10.1109/JSEN.2017.2739188
• P6: Brume - A Horizontally Scalable and Fault Tolerant Building Operating System, https://doi.org/10.1109/IoTDI.2018.00018
• P7: A Watchdog Service Making Container-Based Micro-services Reliable in IoT Clouds, https://doi.org/10.1109/FiCloud.2017.57
• P8: Towards Fault Tolerant Fog Computing for IoT-Based Smart City Applications, https://doi.org/10.1109/CCWC.2019.8666447
• P9: Device clustering for fault monitoring in Internet of Things systems, https://doi.org/10.1109/WF-IoT.2015.7389057
• P10: Decentralized fault tolerance mechanism for intelligent IoT/M2M middleware, https://doi.org/10.1109/WF-IoT.2014.6803115
• P11: Application of Blockchain in Collaborative Internet-of-Things Services, https://doi.org/10.1109/TCSS.2019.2913165
• P12: A Review of Aggregation Algorithms for the Internet of Things, https://doi.org/10.1109/ICSEng.2017.43
• P13: Supporting Service Adaptation in Fault Tolerant Internet of Things, https://doi.org/10.1109/SOCA.2015.38
• P14: Fault tolerant and scalable IoT-based architecture for health monitoring, https://doi.org/10.1109/SAS.2015.7133626
• P15: Fault tolerance capability of cloud data center, https://doi.org/10.1109/ICCP.2017.8117053
• P16: Reaching Agreement in an Integrated Fog Cloud IoT, https://doi.org/10.1109/ACCESS.2018.2877609
• P17: Byzantine Resilient Protocol for the IoT, https://doi.org/10.1109/JIOT.2018.2871157
• P18: DRAW: Data Replication for Enhanced Data Availability in IoT-based Sensor Systems, https://doi.org/10.1109/DASC/PiCom/DataCom
• P19: Power efficient, bandwidth optimized and fault tolerant sensor management for IOT in Smart Home, https://doi.org/10.1109/IADCC.2015.7154732
• P20: Energy efficiency and robustness for IoT: Building a smart home security system, https://doi.org/10.1109/ICCP.2016.7737120
• P21: A Microservices Architecture for Reactive and Proactive Fault Tolerance in IoT Systems, https://doi.org/10.1109/WoWMoM.2018.8449789
• P22: Management of solar energy in microgrids using IoT-based dependable control, https://doi.org/10.1109/ICEMS.2017.8056441
• P23: A hierarchical cloud architecture for integrated mobility, service, and trust management of service-oriented IoT systems, https://doi.org/10.1109/INTECH.2016.7845021
• P24: Fault-Tolerant Real-Time Collaborative Network Edge Analytics for Industrial IoT and Cyber Physical Systems with Communication Network Diversity, https://doi.org/10.1109/CIC.2018.00052
• P25: Fault-Tolerant mHealth Framework in the Context of IoT-Based Real-Time Wearable Health Data Sensors, https://doi.org/10.1109/ACCESS.2019.2910411
• P26: SCONN: Design and Implement Dual-Band Wireless Networking Assisted Fault Tolerant Data Transmission in Intelligent Buildings, https://doi.org/10.1109/VTCFall.2018.8690787
• P27: Fault-tolerant application placement in heterogeneous cloud environments, https://doi.org/10.1109/CNSM.2015.7367359
• P28: A reliable and energy efficient IoT data transmission scheme for smart cities based on redundant residue based error correction coding, https://doi.org/10.1109/SECONW.2015.7328141
• P29: Distributed Continuous-Time Fault Estimation Control for Multiple Devices in IoT Networks, https://doi.org/10.1109/ACCESS.2019.2892905
• P30: Trend-adaptive multi-scale PCA for data fault detection in IoT networks, https://doi.org/10.1109/ICOIN.2018.8343217
• P31: Adaptive and Fault-tolerant Data Processing in Healthcare IoT Based on Fog Computing, https://doi.org/10.1109/TNSE.2018.2859307
• P32: Fault-Recovery and Coherence in Internet of Things Choreographies, https://doi.org/10.1109/WF-IoT.2014.6803224
• P33: A Novel Data Reduction Technique with Fault-tolerance for Internet-of-things, https://doi.org/10.1145/3018896.3018971
• P34: Performance Comparisons of Fault-Tolerant Routing Approaches for IoT Wireless Sensor Networks, https://doi.org/10.1145/3195106.3195168
• P35: Rivulet: A Fault-tolerant Platform for Smart-home Applications, https://doi.org/10.1145/3135974.3135988
• P36: Censorship Resistant Decentralized IoT Management Systems, https://doi.org/10.1145/3286978.3286979
• P37: Towards a Foundation for a Collaborative Replicable Smart Cities IoT Architecture, https://doi.org/10.1145/3063386.3063763
• P38: Responsible Objects: Towards Self-Healing Internet of Things Applications, https://doi.org/10.1109/ICAC.2015.60
• P39: A Multi-agent System Architecture for Self-Healing Cloud Infrastructure, https://doi.org/10.1145/2896387.2896392
• P40: Reactive Microservices for the Internet of Things: A Case Study in Fog Computing, https://doi.org/10.1145/3297280.3297402
• P41: Fault Tolerance Techniques and Architectures in Cloud Computing - a Comparative Analysis, https://doi.org/10.1109/ICGCIoT.2015.7380625
• P42: Energy Efficient Fault-tolerant Clustering Algorithm for Wireless Sensor Networks, https://doi.org/10.1109/ICGCIoT.2015.7380464
• P43: Layered Fault Management Scheme for End-to-end Transmission in Internet of Things, https://doi.org/10.1007/s11036-012-0355-5
• P44: An Architectural Mechanism for Resilient IoT Services, https://doi.org/10.1145/3137003.3137010
• P45: Resilience of Stateful IoT Applications in a Dynamic Fog Environment, https://doi.org/10.1145/3286978.3287007
• P46: The Optimal Generalized Byzantine Agreement in Cluster-based Wireless Sensor Networks, https://doi.org/10.1016/j.csi.2014.01.005
• P47: A Reliable IoT System for Personal Healthcare Devices, https://doi.org/10.1016/j.future.2017.04.004
• P48: Reliable Industrial IoT-based Distributed Automation, https://doi.org/10.1145/3302505.3310072
• P49: Low-Cost Memory Fault Tolerance for IoT Devices, https://doi.org/10.1145/3126534
• P50: Idea: A System for Efficient Failure Management in Smart IoT Environments, https://doi.org/10.1145/2906388.2906406
• P51: Patterns for Things That Fail, https://www.hillside.net/plop/2017/papers/proceedings/papers/07-ramadas.pdf
• P52: Fall-curve: A Novel Primitive for IoT Fault Detection and Isolation, https://doi.org/10.1145/3274783.3274853
• P53: Multilevel IoT Model for Smart Cities Resilience, https://doi.org/10.1145/3095786.3095793
• P54: Energy Efficient Device Discovery for Reliable Communication in 5G-based IoT and BSNs Using Unmanned Aerial Vehicles, https://doi.org/10.1016/j.jnca.2017.08.013
• P55: A Programming Framework for Implementing Fault-Tolerant Mechanism in IoT Applications, https://doi.org/10.1007/978-3-319-27137-8_56
• P56: Transient fault aware application partitioning computational offloading algorithm in microservices based mobile cloudlet networks, https://doi.org/10.1007/s00607-019-00733-4
• P57: Channel Dependability of the ATM Communication Network Based on the Multilevel Distributed Cloud Technology, https://doi.org/10.1007/978-3-319-67642-5_49
• P58: Design of compressed sensing fault-tolerant encryption scheme for key sharing in IoT Multi-cloudy environment(s), https://doi.org/10.1016/j.jisa.2019.04.004
• P59: Fault-Tolerant Temperature Control Algorithm for IoT Networks in Smart Buildings, https://doi.org/10.3390/en11123430
• P60: Virtualization in Wireless Sensor Networks: Fault Tolerant Embedding for Internet of Things, https://doi.org/10.1109/JIOT.2017.2717704
References
1. Muccini, H., Moghaddam, M.T.: IoT architectural styles. In: Cuesta, C.E., Garlan,
D., Pérez, J. (eds.) ECSA 2018. LNCS, vol. 11048, pp. 68–85. Springer, Cham (2018).
https://doi.org/10.1007/978-3-030-00761-4_5
2. Kitchenham, B., Brereton, P.: A systematic review of systematic review process
research in software engineering. Inf. Softw. Technol. 55(12), 2049–2075 (2013)
3. Kitchenham, B.A., Charters, S.: Guidelines for performing systematic literature
reviews in software engineering. Technical report, EBSE-2007-01 (2007)
4. Zhang, H., Babar, M.A., Tell, P.: Identifying relevant studies in software engineering.
Inf. Softw. Technol. 53(6), 625–637 (2011).
https://doi.org/10.1016/j.infsof.2010.12.010
5. Muccini, H., Spalazzese, R., Moghaddam, M.T., Sharaf, M.: Self-adaptive IoT archi-
tectures: an emergency handling case study. In: Proceedings of the 12th European
Conference on Software Architecture: Companion Proceedings, p. 19. ACM (2018)
6. Muccini, H., Arbib, C., Davidsson, P., Tourchi Moghaddam, M.: An IoT software
architecture for an evacuable building architecture. In: Proceedings of the 52nd
Hawaii International Conference on System Sciences (2019)
7. Arbib, C., Arcelli, D., Dugdale, J., Moghaddam, M., Muccini, H.: Real-time emer-
gency response through performant IoT architectures. In: International Conference
on Information Systems for Crisis Response and Management (ISCRAM) (2019)
8. Fayyaz, M., Vladimirova, T.: Survey and future directions of fault-tolerant dis-
tributed computing on board spacecraft. Adv. Space Res. 58(11), 2352–2375 (2016)
9. Petersen, K., Vakkalanka, S., Kuzniarz, L.: Guidelines for conducting systematic
mapping studies in software engineering: an update. Inf. Softw. Technol. 64, 1–18
(2015)