Handbook of Research on
Security Considerations in
Cloud Computing
Kashif Munir
King Fahd University of Petroleum & Minerals, Saudi Arabia
Mubarak S. Al-Mutairi
King Fahd University of Petroleum & Minerals, Saudi Arabia
Lawan A. Mohammed
King Fahd University of Petroleum & Minerals, Saudi Arabia
A volume in the Advances in Information Security,
Privacy, and Ethics (AISPE) Book Series
Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA, USA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@igi-global.com
Web site: http://www.igi-global.com
Copyright © 2015 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the
authors, but not necessarily of the publisher.
For electronic access to this publication, please contact: eresources@igi-global.com.
Handbook of research on security considerations in cloud computing / Kashif Munir, Mubarak S. Al-Mutairi, and Lawan A.
Mohammed, editors.
pages cm
Includes bibliographical references and index.
ISBN 978-1-4666-8387-7 (hardcover) -- ISBN 978-1-4666-8388-4 (ebook) 1. Cloud computing--Security measures--
Handbooks, manuals, etc. I. Munir, Kashif, 1976- editor.
QA76.585.H3646 2015
004.67’82--dc23
2015008172
This book is published in the IGI Global book series Advances in Information Security, Privacy, and Ethics (AISPE) (ISSN:
1948-9730; eISSN: 1948-9749)
Managing Director: Lindsay Johnston
Managing Editor: Austin DeMarco
Director of Intellectual Property & Contracts: Jan Travers
Acquisitions Editor: Kayla Wolfe
Production Editor: Christina Henning
Development Editor: Caitlyn Martin
Cover Design: Jason Mull
Chapter 18
DOI: 10.4018/978-1-4666-8387-7.ch018
Reliability, Fault Tolerance, and Quality-of-Service in Cloud Computing: Analysing Characteristics
Piyush Kumar Shukla
University Institute of Technology RGPV, India
Gaurav Singh
Motilal Nehru National Institute of Technology, India
ABSTRACT
In this chapter we focus on reliability, fault tolerance, and quality of service in cloud computing. Owing to the flexible and scalable property of dynamically acquiring and relinquishing computing resources in a cost-effective and device-independent manner, with minimal management effort or service-provider interaction, demand for the cloud computing paradigm has increased dramatically in the last few years. Despite many enhancements, the cloud computing paradigm is still subject to a large number of system failures; as a result, there is increasing concern in the community regarding the reliability and availability of cloud computing services. Dynamic provisioning of resources allows a cloud computing environment to meet the continually varying resource and service requirements of cloud customer applications. Quality of Service (QoS) plays an important role in the effective allocation of resources and has been widely investigated in the cloud computing paradigm.
1. INTRODUCTION
Cloud computing can be viewed as a model for delivering computing resources such as hardware, system software, and applications as a reliable service over the Internet in a convenient, flexible, and scalable manner. These computing resources, that is, hardware, system software, and applications, are referred to as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), respectively. Cloud computing (Buyya, Yeo, Venugopal, Broberg, & Brandic, 2009; Expósito et al., 2013) offers cost-effective and effortless outsourcing of resources in dynamic service environments
to consumers, and also facilitates the construction of service-based applications that build on the latest advances in diverse research areas such as Grid computing, service-oriented computing, business processes, and virtualization.
Cloud computing providers often employ two different models to offer these services: the utility computing model and the pay-per-use model. The utility computing model is similar to the way traditional utility services (such as water and electricity) are consumed, whereas in the pay-per-use model users pay on the basis of the type of service they use, characterized by parameters such as CPU cores, memory, and disk capacity (Vouk, 2008; Randles, Lamb, & Taleb-Bendiab, 2010). The pay-per-use model is useful in cloud resource provisioning to satisfy the SaaS user's needs while reducing cost and maximizing the profit of the SaaS provider. Another major concern for cloud resource providers is how to reduce energy consumption, thereby decreasing operating costs and maximizing provider revenue (Berl et al., 2010; Kim, Beloglazov, & Buyya, 2009; Srikantaiah, Kansal, & Zhao, 2008). Therefore, how to serve cloud service users' requests so as to meet Quality of Service (QoS) needs, provide fault-resistant, reliable services, and maximize the profit of both the SaaS provider and the cloud resource provider is a concern that must be addressed urgently in the cloud computing environment (Li, 2012).
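As an illustration of the pay-per-use model just described, the following sketch bills users per hour for service types characterized by CPU cores, memory, and disk capacity. The instance-type names and rates are hypothetical assumptions, not real provider prices.

```python
# Hypothetical pay-per-use billing sketch: instance types are characterized
# by CPU cores, memory and disk capacity, and users pay per hour of use.
# All type names and rates below are illustrative assumptions.

INSTANCE_TYPES = {
    # name: (cpu_cores, memory_gb, disk_gb, rate_per_hour_usd)
    "small":  (2, 4, 100, 0.05),
    "medium": (4, 16, 500, 0.20),
    "large":  (16, 64, 2000, 0.80),
}

def usage_charge(instance_type: str, hours: float) -> float:
    """Charge only for what was actually consumed (pay-per-use)."""
    _, _, _, rate = INSTANCE_TYPES[instance_type]
    return round(rate * hours, 2)

def invoice(usage: list) -> float:
    """Total bill for a list of (instance_type, hours) records."""
    return round(sum(usage_charge(t, h) for t, h in usage), 2)

print(invoice([("small", 100), ("large", 12)]))  # 5.0 + 9.6 = 14.6
```

Under utility computing, by contrast, the consumer would pay for metered consumption of a shared utility rather than for discrete instance-hours.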
In order to achieve these goals, clouds require a novel infrastructure that incorporates a high-level monitoring approach to support autonomous, on-demand deployment and decommissioning of service instances. For this, clouds rely greatly on virtualization of resources to provide management combined with separation of users. Virtual appliances are employed to encapsulate a complete software system (e.g., operating system, software libraries, and the deployable services themselves) prepared for execution in virtual machines (VMs) (Kertész et al., 2013). Cloud management is responsible for all resources used by all the applications deployed in the cloud.
Cloud computing and networking can be viewed as two key pillars of the Future Internet (FI) vision, in which the connection of objects to the Internet and the federation of infrastructures become highly important (Papagianni et al., 2013). For many applications, network performance is a key factor in overall cloud computing performance: meeting QoS targets is directly linked to the network performance and to the provisioning model adopted for computational resources. Thus, the convergence between cloud computing and networking is more a requirement than a desire for the efficient realization of the cloud computing paradigm. Providers need to consider the dynamic provisioning, configuration, reconfiguration, and optimization of both computing resources (e.g., servers) and networking resources to meet their objectives.
2. CLOUD COMPUTING
ARCHITECTURE
A cloud computing environment is supposed to furnish its huge pool of computing resources, encompassing processing power, memory, storage, and development platforms, to its users. This demand for sharing drives the architecture of cloud computing to support convenient, efficient, and flexible on-demand services.
The architecture of a cloud system comprises different, loosely connected components, which can be broadly categorized into two parts: a front end and a back end. Generally, the users' input and output devices, including PCs, smartphones, tablets, etc., are referred to as the front end.
Applications and interfaces, e.g., a web browser, that are required to access cloud services are also components of the front end. The traditional cloud computing architecture is depicted in Figure 1.
The cloud itself, and all the resources required to provide cloud computing services, are referred to as the back end. The cloud back end comprises four distinct layers, as illustrated in Figure 1 (Fox et al., 2009). Physical resources such as servers, storage, and network switches comprise the lowest layer in the stack. On top of the physical layer is the Infrastructure-as-a-Service (IaaS) layer, where virtualization and system management tools are embedded. The front and back ends are linked through a network, which may be the Internet, an intranet, or an inter-cloud.
Typically, in a cloud deployment, data centers and virtualization technology are employed to maximize utilization of the physical resources. The layer above IaaS is Platform-as-a-Service (PaaS), which contains all user-level middleware tools that provide an environment to simplify application development and deployment (e.g., Web 2.0 interfaces, libraries, and programming languages). The layer on top of the PaaS layer, where user-level applications (e.g., social networks and scientific models) are built and hosted, is referred to as Software-as-a-Service (SaaS). Security, protocols, and control mechanisms are also implemented at the back end.
The fundamental model of cloud computing architecture is to separate powerful computation from user devices, so that users can enjoy almost all services with simple, lightweight devices whose input/output and communication capacities are sufficient to access the cloud system for the services they demand, as shown in Figure 2.
Figure 1. Cloud Computing Architecture
A user demands a service using an input device; the demand is then transmitted to the cloud system over the network. After receiving the demand, the cloud system processes it using its powerful resources and efficiently returns the result to the user's output device. Similarly, users can store their data in the cloud, use applications embedded in the cloud system to process that data, and retrieve it when needed, regardless of their location. Cloud systems make it possible for users to enjoy complex, varied, and novel services without being limited by their own equipment.
Challenges
In a general cloud service architecture, when a service is demanded, the cloud assigns resources to serve the user, taking into account server load, service type, user location, and so on. In a memory-oriented service, the current usage depends on the previous one; hence it is essential to load the service status of the previous usage, stored on the previously hosting server, before proceeding. To support this, the server stores the image file of the virtual machine so that the user can resume the service with the same settings. If the user wants to access the virtual machine from a location far away from that server, the connection is forced to point to the server which stores the image file; the cloud system must therefore assign the previous server, where the service was processed, to host the user. Consequently, the results of user operations are transmitted along a very long path to the user's output device. QoS may then be affected: as more and more streams are transmitted through the backbone, its bandwidth would be almost exhausted.
To address this, researchers have proposed multiple-cloud environments, cloud interoperability, resource allocation schemes, service migration, and more. A multiple-cloud environment was proposed in (Houidi, Mechtri, Louati, & Zeghlache, 2011), focusing on cloud service provisioning from multiple cloud providers. A five-level model to assess the maturity of cloud-to-cloud interoperability was presented
Figure 2. Users access the cloud computing environment using output device
by (Dowell, Barreto, Michael, & Shing, 2011). A resource allocation scheme for efficient on-demand resource allocation was proposed in (Marshall, Keahey, & Freeman, 2011). There has been some research on service migration (Oikonomou & Stavrakakis, 2010), but it lacks consideration of the characteristics of cloud computing. A scheme called dual migration, which monitors users' locations and moves contents onto the server closest to the user, was proposed in (Lai, Yang, Lin, & Shieh, 2012); a user can thereby enjoy services by means of the great capacity of the closest server.
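The core idea of dual migration, tracking the user's location and moving content onto the closest server, can be sketched as follows. The server names, coordinates, and plain Euclidean distance metric are illustrative assumptions; a real system would use geographic distance and measured network latency.

```python
import math

# Sketch of the "move content to the closest server" idea behind dual
# migration: track the user's location and pick the server with minimum
# distance. Server names and coordinates are illustrative assumptions.

SERVERS = {
    "dc-east": (40.7, -74.0),
    "dc-west": (37.8, -122.4),
    "dc-eu":   (50.1, 8.7),
}

def closest_server(user_loc, servers=SERVERS):
    """Return the server name with minimum Euclidean distance to the user."""
    def dist(loc):
        return math.hypot(loc[0] - user_loc[0], loc[1] - user_loc[1])
    return min(servers, key=lambda name: dist(servers[name]))

def migrate_if_needed(current, user_loc):
    """Migrate the VM image when another server is now closer to the user."""
    target = closest_server(user_loc)
    return target if target != current else current

print(migrate_if_needed("dc-east", (37.0, -120.0)))  # dc-west
```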
3. TYPES OF CLOUDS
Cloud environments (Sotomayor, Montero, Llorente, & Foster, 2009) can be categorized into private clouds, public clouds, hybrid clouds, and community clouds on the basis of the way in which they are deployed. The different types of cloud are explained briefly in the following subsections.
Private Cloud
As the name implies, the infrastructure of a private cloud is an internal data centre of an organization; it is privately owned and not available to the public. Computing resources are pooled and managed internally, which leads to greater efficiency, and they can be applied dynamically according to demand. A private cloud provides internal users with fundamental computing resources as well as high-level security and control mechanisms. Being privately owned, it allows the enterprise to continue to follow its own workflow and security procedures, ensuring that the correct level of "code" is executing. Private clouds are also not burdened by the network bandwidth and availability issues, or the potential security exposures, that may be associated with public clouds. Overall, private clouds can offer the provider and user greater control, security, and resilience.
Public Cloud
A public cloud is one in which a third-party provider makes resources, such as applications and other computing resources, available to the general public or a wide industry group via the Internet. The cloud service provider is responsible for setting up the hardware, software, applications, and networking resources. A public cloud service has advantages such as flexibility, extensibility, pay-per-use pricing, and low cost of entry, but it is often more expensive than a private data center if resources are used for several years. A public cloud does not imply that the user's data is public; in many cases, access control mechanisms are required before the user can make use of cloud resources.
Hybrid Cloud
The private cloud platform owned by an enterprise integrates various resources, such as computing and storage in a server, which can be reconfigured as and when required. This flexibility of the private cloud shows how powerful and valuable it is when deployed in combination with a public cloud. In a hybrid cloud, one can use private as well as public cloud resources to capitalize on investments by catering for specific application requirements in terms of data confidentiality, security, performance, and latency.
In a private cloud environment, it is the responsibility of the organization that purchased it to maintain and manage all resources. According to research in (Kang et al., 2008), the peak load of a private cloud is much larger than the average, but transient, and the big spikes are not predictable. If a private cloud attempts to satisfy all the workload constraints, the transient peak load would force the
owner to invest in more hardware resources for the private cloud. This leads to over-provisioning and a waste of hardware resources most of the time. The pay-as-you-go public cloud model can be utilized in such a scenario without adding redundant resources to the private cloud. To deal with the spike-workload problem in a cost-conscious way, public cloud resources are dynamically added to the private cloud, forming a hybrid cloud environment in which public cloud resources can be moved in and out according to requirements. Extra money is spent only for the period during which the public cloud handles the overload, which is far less than investing in purchasing more resources. Therefore, the hybrid cloud model helps reduce hardware and operating costs when a private cloud already exists. To achieve this, the workload has to be split and distributed across the private cloud and the public cloud, or simply across the hybrid cloud (Bittencourt & Madeira, 2011). A hybrid cloud is thus a combination of public and private clouds bound together by either standardized or proprietary technology that enables data and application portability. With the hybrid cloud deployment model, users benefit from a lower over-provisioning factor, more efficient provisioning, better performance, and lower hardware cost (Subramanian, 2011).
One result of the evolution of the hybrid cloud is cloud federation (Casola, Rak, & Villano, 2010), which aims at cost-effective optimization of assets and resources across heterogeneous environments, where clouds cooperate with the goal of obtaining unbounded computational resources and hence new business opportunities. Federation brings together different cloud flavors, and external and internal resources. Thus, any organization can select a public computing environment on demand when its private cloud reaches a particular workload threshold.
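A minimal sketch of the bursting behavior described above: the private cloud serves demand up to its capacity, and only the overflow is sent to the pay-as-you-go public cloud, so extra cost is incurred only during the spike. The capacities, demand units, and per-unit public price are assumptions for illustration.

```python
# Minimal cloud-bursting sketch for a hybrid cloud: the private cloud serves
# requests up to a capacity threshold; overflow is routed to a pay-as-you-go
# public cloud. Capacities and the per-unit public price are assumptions.

def split_workload(demand_units, private_capacity, public_rate_per_unit):
    """Split demand between private and public clouds; return the split
    and the extra cost incurred only while bursting."""
    private_units = min(demand_units, private_capacity)
    public_units = demand_units - private_units
    return {
        "private": private_units,
        "public": public_units,
        "extra_cost": public_units * public_rate_per_unit,
    }

# Normal load fits in the private cloud; a transient spike bursts out.
print(split_workload(80, 100, 0.10))   # {'private': 80, 'public': 0, 'extra_cost': 0.0}
print(split_workload(150, 100, 0.10))  # {'private': 100, 'public': 50, 'extra_cost': 5.0}
```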
Community Cloud
A community cloud can be a private cloud purchased by a single user to support a community of users, or a hybrid cloud with the costs spread over a few users of the cloud; fees may be charged to subsidiaries. A community cloud is often set up as a sandbox environment where community users can test their applications or access cloud resources. Community clouds are used and controlled by a group of organizations with a shared interest.
4. RELIABILITY AND
FAULT-TOLERANCE
With the flexibility and scalability of dynamically obtaining and releasing computing resources in a cost-effective and device-independent manner, with minimal management effort or service-provider interaction, demand for the cloud computing paradigm has increased dramatically in the last few years. While many improvements have taken place, the cloud computing paradigm is still subject to a large number of system failures, and as a result there is increasing concern in the community regarding the reliability and availability of cloud computing services. Moreover, the highly complex nature of the underlying resources makes clouds vulnerable to a large number of failures, even in carefully engineered data centres (Barroso, Clidaras, & Hölzle, 2013). These failures have an impact on the overall reliability and availability of the cloud computing service. An effective means of countering failures, even those unknown and unpredictable in number, has therefore become an urgent need for both users and service providers in order to ensure correct and continuous system operation. Fault tolerance serves as a technique to assure users of reliability and availability.
In general, a failure refers to an error or condition in which the system fails to achieve its intended functionality or expected behavior. A failure may happen for various reasons, such as reaching an invalid system state or a network failure. The underlying cause of an error is a fault, which represents a fundamental impairment in the system. Fault tolerance is thus the ability of the system to perform its function even in the presence of failures. It serves as one means of improving the system's overall dependability; in particular, it contributes significantly to increasing the system's reliability and availability. In a cloud computing environment, faults that appear as failures to the end users can be categorized into two types, as in other distributed systems (Piuri, 2013):
• Crash faults cause system components to completely stop functioning or remain inactive during failures (e.g., power outage, hard disk crash).
• Byzantine faults lead system components to behave arbitrarily or maliciously during failure, causing the system to behave in unpredictably incorrect ways.
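One common way of masking both fault types, which the replication techniques discussed later in this section rely on, is voting over replica responses. In this illustrative sketch, a crashed replica returns no answer and a byzantine replica may return an arbitrary wrong value; the majority answer is accepted only when a strict majority of all replicas agree.

```python
from collections import Counter

# Sketch: masking faults by voting over replica responses. A crashed replica
# returns None (no answer); a byzantine replica may return an arbitrary wrong
# value. With enough correct replicas, the majority answer is still correct.

def vote(responses):
    """Return the majority value among non-crashed replica responses,
    or None when no value reaches a strict majority of all replicas."""
    counts = Counter(r for r in responses if r is not None)
    if not counts:
        return None
    value, n = counts.most_common(1)[0]
    return value if n > len(responses) // 2 else None

print(vote([42, 42, None, 42]))   # crash fault masked -> 42
print(vote([42, 42, 7, 42]))      # byzantine replica outvoted -> 42
```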
To implement a fault-tolerant system, the most important step is to clearly understand and determine what constitutes correct system behavior, so that specifications of its failure characteristics can be provided. A failure in any layer of the cloud architecture at a particular instant has an impact on the services offered by the layers above it. If failures occur in the IaaS layer or the physical hardware, their impact is significantly high; hence it is especially important to characterize typical hardware faults and develop corresponding fault tolerance techniques.
The key observations derived from studies of the failure behaviour of various server components and of hardware repair behaviour, based on statistical information (Gill, Jain, & Nagappan, 2011; Vishwanath & Nagappan, 2010), are as follows:
• 8% of the machines are subject to repair events, with an average of 2 repairs per machine. The annual failure rate (AFR) is therefore around 8%.
• The amount spent on repairs for an 8% AFR was approximately 2.5 million dollars.
• About 78% of total faults/replacements were detected on hard disks, 5% on RAID controllers, and 3% were due to memory failures; 13% of replacements were due to a collection of components. This implies that hard disks are clearly the most failure-prone hardware components and the most significant cause of server failures.
• About 5% of servers experience a disk failure less than 1 year from the purchase date, 12% when the machines are 1 year old, and 25% of the servers see hard disk failures by the time they are 2 years old.
• Interestingly, factors such as the age of the server, its configuration, its location within the rack, and the workload run on the machine were not found to be significant indicators of failure.
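The repair statistics above can be restated as simple arithmetic. The fleet size below is an assumption for illustration; the 8% AFR, the average of 2 repairs per affected machine, and the component shares come from the cited studies.

```python
# Restating the repair statistics above as arithmetic. The fleet size is an
# assumed figure for illustration; the 8% AFR, 2 repairs per affected machine,
# and the component shares come from the studies cited in the text.

fleet_size = 100_000              # assumption: number of servers in the fleet
afr = 0.08                        # ~8% of machines see a repair event per year
repairs_per_machine = 2           # average repairs per affected machine

machines_repaired = fleet_size * afr
total_repairs = machines_repaired * repairs_per_machine

# Share of faults/replacements by component (disks dominate).
component_share = {"hard_disk": 0.78, "raid_controller": 0.05,
                   "memory": 0.03, "other": 0.13}
disk_repairs = total_repairs * component_share["hard_disk"]

print(round(machines_repaired), round(total_repairs), round(disk_repairs))
# 8000 16000 12480
```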
It can be inferred from these statistics that robust fault tolerance mechanisms must be employed to improve the reliability of hard disks (assuming independent component failures) in order to reduce the number of failures. Furthermore, the use of hard disks that have already experienced a failure should be reduced to meet high availability and reliability requirements.
To model the failure behavior of cloud computing, it is also important to consider the failure behavior of the network; and to characterize network failure behavior it is important to understand the overall network topology and the various network components involved in constructing a data center. Similarly to the study of server failure behavior, a large-scale study of network failures in data centers was performed in (Gill et al., 2011). A link failure happens when the connection between two devices on a specific
interface is down, and a device failure happens when the device is not routing/forwarding packets correctly (e.g., due to a power outage or hardware crash). Key observations derived from this study are as follows:
• With a failure probability of 1 in 5, load balancers (LBs) are the least reliable of all network devices, while top-of-rack switches (ToRs) are the most reliable, with a failure rate of less than 5%. The root causes of LB failures are mainly software bugs and configuration errors, as opposed to the hardware errors seen in other devices.
• Links forwarding traffic from LBs have the highest failure rates, while links higher in the topology and links connecting redundant devices have the second-highest failure rates.
• The estimated median number of packets lost during a failure is 59K, and the median number of bytes is 25MB.
• Network redundancy reduces the median impact of failures (in terms of the number of lost bytes) by only 40%. This observation contradicts the common belief that network redundancy completely masks failures from applications.
• Overall, data center network reliability is about 99.99% for 80% of the links and 60% of the devices.
The most widely adopted methods to achieve fault tolerance against crash faults and byzantine faults are as follows:
• Checking and monitoring: The system is constantly monitored at runtime to validate, verify, and ensure that correct system specifications are being met. This technique plays an important role in failure detection and subsequent reconfiguration, and is easy to implement.
• Checkpoint and restart: When the system undergoes a failure, it is restored to the previously known correct state, captured and saved on the basis of pre-defined parameters, using the latest checkpoint information instead of restarting the system from scratch.
• Replication: Critical system components are mirrored using additional hardware, software, and network resources in such a manner that a copy of the critical components is available even after a failure happens. Replication mechanisms are mainly applied in two forms: active and passive. In active replication, all the replicas are simultaneously active and each replica processes the same request at the same time. This ensures that all the replicas have the same system state at any given point in time, so the system can continue to deliver its service even in the case of a single replica failure. In passive replication, only one primary replica processes the requests, while the backup replicas merely save the system state during normal execution periods; the backups are invoked only when the primary replica fails.
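The checkpoint-and-restart technique above can be sketched with a toy computation: state is saved periodically, and after a simulated crash execution rolls back to the latest checkpoint rather than restarting from scratch. The counter workload, checkpoint interval, and single simulated failure are illustrative assumptions.

```python
import copy

# Minimal checkpoint-and-restart sketch: the system state is periodically
# saved, and after a failure execution resumes from the latest checkpoint
# instead of restarting from scratch. The "work" here is a toy counter.

class Checkpointer:
    def __init__(self):
        self._saved = None

    def save(self, state):
        self._saved = copy.deepcopy(state)   # capture a consistent snapshot

    def restore(self):
        return copy.deepcopy(self._saved)

def run(total_steps, checkpoint_every, fail_at):
    cp = Checkpointer()
    state = {"step": 0}
    cp.save(state)
    while state["step"] < total_steps:
        state["step"] += 1
        if state["step"] == fail_at:         # simulated crash fault
            state = cp.restore()             # roll back, do not restart at 0
            fail_at = None                   # fail only once in this sketch
            continue
        if state["step"] % checkpoint_every == 0:
            cp.save(state)
    return state["step"]

print(run(total_steps=10, checkpoint_every=3, fail_at=8))  # 10
```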
Fault tolerance mechanisms vary in how successfully they tolerate faults, according to the study in (Ayari, Barbaron, Lefevre & Primet, 2008). For example, a passively replicated system can handle only crash faults, whereas an actively replicated system using 3f+1 replicas is capable of overcoming f byzantine faults. In general, mechanisms that handle failures at a finer granularity offer higher performance guarantees, but at the cost of a greater amount of resources (Jhawar, Piuri, & Santambrogio, 2012). Therefore, in the design of fault tolerance mechanisms one must take into account factors such as implementation complexity, resource costs, resilience, and performance metrics, and achieve a fine balance among the following parameters:
• Fault tolerance model: This factor measures the resilience level of the fault tolerance technique, that is, to what extent it can tolerate failures or errors in the system. It can also be understood as the robustness of the failure detection protocols, the strength of the failover level, and the state synchronization method.
• Resource consumption: This factor takes into account the amount and cost of the resources required to realize a fault tolerance mechanism. It is normally subject to the depth of the failure detection and recovery mechanisms involved, in terms of CPU, memory, bandwidth, I/O, and so on.
• Performance: This factor measures the impact of the fault tolerance procedure on the end-to-end quality of service (QoS), both during failures and during failure-free periods. The impact is often characterized using replica launch latency, fault detection latency, and failure recovery latency, as well as other application-dependent metrics such as bandwidth, latency, and loss rate.
5. QUALITY OF SERVICE
Dynamic provisioning of resources allows a cloud computing environment to meet the continually varying resource and service requirements of cloud customer applications. Quality of Service (QoS) plays an important role in the effective allocation of resources and has been widely investigated in the cloud computing paradigm. Indeed, QoS has been an issue not only in the cloud but in many distributed computing paradigms, such as Grid computing and high-performance computing. QoS provides a level of assurance against the application requirements: it ensures a certain level of reliability, availability, and performance of a service, and can also cover other aspects of service quality such as security and dependability. In cloud computing, QoS is primarily concerned with the management and performance of resources such as processors, memory, storage, and networks. QoS is sometimes used as a quality measure, with many different definitions, instead of referring to the ability to reserve resources. QoS models are associated with End-Users and Providers (and often Brokers); they include resource capacity planning via the use of schedulers and load balancers, and utilize Service Level Agreements (SLAs) (Armstrong & Djemame, 2009). An SLA is a legally binding contract on QoS between an End-User and a Provider; it defines End-User resource requirements and execution-environment guarantees, providing End-Users assurance that they are receiving exactly the services they have paid for.
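An SLA check of the kind just described can be sketched as a comparison of measured metrics against the agreed terms. The specific terms (99.9% availability, a 200 ms response-time bound) and the metric names are assumptions for illustration.

```python
# Sketch of checking measured service quality against an SLA. The SLA terms
# (99.9% availability, 200 ms response time) are illustrative assumptions.

SLA = {"availability": 0.999, "max_response_ms": 200}

def sla_violations(measured):
    """Compare measured metrics against the agreed SLA terms."""
    violations = []
    if measured["availability"] < SLA["availability"]:
        violations.append("availability")
    if measured["response_ms"] > SLA["max_response_ms"]:
        violations.append("response_time")
    return violations

print(sla_violations({"availability": 0.9995, "response_ms": 150}))  # []
print(sla_violations({"availability": 0.995,  "response_ms": 250}))
# ['availability', 'response_time']
```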
Multiple cloud providers offer different services on their own terms, employing their own security levels, system platforms, and management systems. Users often face difficulty in finding the best services to meet their objectives. Cloud service brokers are specialized experts who play an intermediary role between providers and consumers of the cloud, assisting purchasers of cloud services in finding the appropriate cloud offering as well as in deploying and managing applications on the cloud. Cloud brokers help negotiate the best deals and relationships between cloud consumers and cloud providers. Brokers use specialized tools to identify the most appropriate cloud resources and map application requirements to them. They can also act dynamically, automatically routing data, applications, and infrastructure needs based on QoS criteria such as availability, reliability, latency, and price. In an attempt to provide broker solutions, researchers (Salehi & Buyya, 2010) proposed a user-level broker using two market-oriented scheduling policies; the proposed broker increases the computational capacity of the local resources by employing resources from an IaaS provider. Researchers (Yang, Zhou, Liang, He, & Sun, 2010) introduced a service-oriented broker that claims to guarantee data transmission and a uniform mechanism for arranging resources via the broker, so as to maintain a certain level of service
to users. A Cloud Quality-of-Service Management
Strategy (C-QoSMS) framework has been proposed
(Ganghishetti & Wankar, 2011) for inclusion in the
cloud service broker. Adding a C-QoSMS component
to the cloud broker gives the customer the capacity
to select a cloud provider based on the QoS criteria
specified in the SLA, with minimal searching time.
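In the spirit of such broker frameworks, a broker's provider-selection step can be sketched as filtering a catalogue of providers against the SLA-specified QoS criteria and then choosing the cheapest qualifying offer. All provider data, field names and thresholds below are hypothetical:

```python
def broker_select(providers, requirements):
    """Pick the cheapest provider whose advertised QoS satisfies every
    SLA requirement (availability as a minimum, latency as a maximum)."""
    qualifying = [
        p for p in providers
        if p["availability"] >= requirements["min_availability"]
        and p["latency_ms"] <= requirements["max_latency_ms"]
    ]
    # Among qualifying offers, price breaks the tie.
    return min(qualifying, key=lambda p: p["price"]) if qualifying else None

providers = [
    {"name": "A", "availability": 0.9990, "latency_ms": 120, "price": 0.12},
    {"name": "B", "availability": 0.9950, "latency_ms": 80,  "price": 0.08},
    {"name": "C", "availability": 0.9999, "latency_ms": 60,  "price": 0.20},
]

best = broker_select(providers, {"min_availability": 0.999,
                                 "max_latency_ms": 150})
print(best["name"])  # B fails availability; A is cheaper than C -> "A"
```

A production broker would, as the text notes, also handle dynamic routing and renegotiation, but the filter-then-optimize structure is the core of the selection step.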
It is also important to recognize that there are
different types of cloud users, with different types
of applications and different sets of personalized
preferences or Quality of Cloud Service (QoCS)
requirements. Some applications require considerable
computing and storage power, while others have
strict execution-time needs. Reliability,
Availability, Execution Time, Reputation and Tariff
are the commonly used QoS criteria for service
selection (Lin, Sheu, Chang, & Yuan, 2011). The
goal of cloud users is to have their services
processed successfully while meeting their
performance, security, deadline and cost targets.
This implies that the success of the cloud users'
underlying business model relies on determining the
best-fitting cloud service for a personalized
application.
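A personalized selection over these criteria can be sketched as a weighted-sum utility, with each attribute pre-normalized so that 1 is best. This is a simplified illustration, not the algorithm of Lin et al.; the services, attribute values and weights are hypothetical:

```python
def weighted_score(service, weights):
    """Personalized utility: weighted sum of QoS attributes, each already
    normalized to [0, 1] with 1 best (for execution time and tariff the
    normalization is assumed to invert 'lower is better')."""
    return sum(weights[k] * service[k] for k in weights)

# Hypothetical candidate services with normalized attribute values.
services = {
    "S1": {"reliability": 0.9, "availability": 0.95, "exec_time": 0.6,
           "reputation": 0.8, "tariff": 0.7},
    "S2": {"reliability": 0.8, "availability": 0.90, "exec_time": 0.9,
           "reputation": 0.7, "tariff": 0.9},
}
# A deadline-sensitive user weights execution time most heavily.
weights = {"reliability": 0.2, "availability": 0.2, "exec_time": 0.4,
           "reputation": 0.1, "tariff": 0.1}

best = max(services, key=lambda name: weighted_score(services[name], weights))
print(best)  # S2's fast execution outweighs S1's reliability -> "S2"
```

Changing the weight vector models a different user: shifting weight onto reliability and availability would favor S1 instead, which is exactly the "personalized preferences" point made above.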
Cloud services typically come with various levels
of service and performance characteristics, and
fulfilling the promise of high-quality, on-demand
services within a service-oriented architecture
often makes the Quality of Cloud Service (QoCS)
highly variable. This makes it difficult for users
to compare cloud services and to select those that
meet their QoCS requirements. To address this,
researchers (Wang, Liu, Sun, Zou, & Yang, 2014)
proposed an accurate evaluation approach for QoCS
in service-oriented cloud computing, employing
fuzzy synthetic decision to assess cloud service
providers according to cloud users' preferences.
Moreover, a cloud service with consistently good
QoCS performance is usually more recommendable
than one whose QoCS performance shows large
variance: unpredictable network behavior, such as
fluctuations in bandwidth and response time, among
many other factors, may impact the quality of
these cloud services. Hence, performance consistency
should be taken into account as an important
criterion in the evaluation of cloud services.
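The consistency argument can be illustrated with a toy scoring rule that rewards a high average QoCS while penalizing variance across repeated observations. This is a deliberately simple stand-in for the fuzzy synthetic decision approach; the sample values and penalty weight are hypothetical:

```python
from statistics import mean, pstdev

def consistency_aware_score(samples, penalty=1.0):
    """Score a service from repeated QoCS observations: reward a high
    average quality, penalize variance (inconsistency)."""
    return mean(samples) - penalty * pstdev(samples)

steady  = [0.80, 0.82, 0.81, 0.79, 0.80]  # consistent mid-range quality
erratic = [0.99, 0.55, 0.95, 0.60, 0.97]  # higher peaks, large variance

# Despite erratic's slightly higher mean, the steady service wins.
print(consistency_aware_score(steady) > consistency_aware_score(erratic))  # True
```

A mean-only comparison would rank the erratic service first; adding the variance penalty reverses the ranking, which is the point the paragraph above makes.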
The deadline-constraint problem is another
important QoS criterion; it can be viewed as a
resource selection problem fitting the user's
demand in terms of execution time. Such a problem
is similar to the service discovery problem in the
domain of web service composition. Researchers
introduced a linear integer programming (LIP)
model to address the service selection problem by
maximizing a utility value that is a weighted sum
of user-defined QoS attributes (Ardagna & Pernici,
2007), and applied LIP-based approaches to service
matching, ranking and selection. LIP-based
approaches, however, suffer from high computational
complexity as the number of web services grows.
This kind of scheduling problem with QoS constraints
can be modeled as a variation of the multi-dimension
multi-choice knapsack problem (MMKP)
(Parra-Hernandez & Dimopoulos, 2005), which has
been proven to be NP-complete (Martello & Toth,
1990). To address this, researchers (Wang, Chang,
Lo, & Lee, 2013) proposed an adaptive scheduling
algorithm called Adaptive Scheduling with QoS
Satisfaction (AsQ) for hybrid cloud environments.
The AsQ aims to meet the deadline constraints of
submitted jobs and to reduce the cost of renting
public cloud resources when using a public cloud
is necessary: it attempts to maximize the
utilization rate of the private cloud and to
minimize the renting cost of the public cloud.
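A greatly simplified sketch of an AsQ-style dispatch decision, not the published algorithm itself: keep a job on the private cloud when its estimated finish time meets the deadline, otherwise rent the cheapest public option that is fast enough. All job data, prices and speedups are hypothetical:

```python
def schedule_hybrid(jobs, private_free_at, public_options):
    """Greedy sketch: jobs are (name, private_runtime, deadline) tuples;
    public_options are (price_per_time_unit, speedup) pairs, so a job
    runs in private_runtime / speedup on that public resource."""
    plan, t = [], private_free_at
    for name, runtime, deadline in jobs:
        if t + runtime <= deadline:        # private cloud meets the deadline
            plan.append((name, "private", 0.0))
            t += runtime                   # private queue advances
        else:                              # must rent public capacity
            feasible = [(price * runtime / speedup, speedup)
                        for price, speedup in public_options
                        if runtime / speedup <= deadline]
            if not feasible:
                plan.append((name, "rejected", 0.0))
                continue
            cost, _ = min(feasible)        # cheapest fast-enough option
            plan.append((name, "public", round(cost, 2)))
    return plan

jobs = [("j1", 4, 10), ("j2", 5, 8), ("j3", 6, 9)]
print(schedule_hybrid(jobs, private_free_at=0,
                      public_options=[(1.0, 1), (2.5, 2)]))
# j1 stays private; j2 and j3 would miss their deadlines locally and are
# dispatched to the cheaper public option at costs 5.0 and 6.0.
```

The real AsQ additionally uses runtime estimation and several fast scheduling strategies for near-optimal private-cloud allocation, but the private-first, cheapest-public-fallback structure captures its cost-and-deadline trade-off.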
6. CONCLUSION
In this chapter, the current consensus on what
Cloud Computing is, the confusion surrounding the
different cloud computing deployment models, viz.
Public, Private, Hybrid and Community clouds, the
traditional cloud computing architecture, and the
relevance of reliability, fault tolerance and QoS
in Clouds have been discussed. Fault
tolerance is a critical means of assuring the
reliability QoS criterion in cloud computing. In
other words, fault tolerance is concerned with all
the methods required to enable a system to tolerate
faults at runtime; it ensures the correct
functioning and continuous operation of a cloud
system. We discussed the failure attributes of
typical Cloud-based services, caused largely by
crash faults and Byzantine faults, and the fault
tolerance mechanisms used to counter these failures.
We also described various studies that address
users' reliability and availability concerns.
Quality of Service (QoS) is the ability to give
different priority to different applications, users
or data flows, or to guarantee a certain level of
performance to a data flow. To meet the requirements
of both cloud users and service providers, the role
of the resource broker was discussed. We also
discussed QoS measures, covering the associated
research challenges and the tools used for
implementing QoS in cloud computing. The motivation
behind, and the concepts, technology, research and
state of, QoS in Cloud Computing have been reviewed.
REFERENCES
Ardagna, D., & Pernici, B. (2007). Adaptive
service composition in flexible processes. Soft-
ware Engineering, IEEE Transactions on, 33(6),
369-384.
Armstrong, D., & Djemame, K. (2009). Towards
quality of service in the cloud.
Ayari, N., Barbaron, D., Lefevre, L., & Primet,
P. (2008). Fault tolerance for highly available
internet services: concepts, approaches, and is-
sues. Communications Surveys & Tutorials, IEEE,
10(2), 34-46.
Barroso, L. A., Clidaras, J., & Hölzle, U. (2013).
The datacenter as a computer: An introduction to
the design of warehouse-scale machines. Synthesis
lectures on computer architecture, 8(3), 1-15.
Berl, A., Gelenbe, E., Di Girolamo, M., Giuliani,
G., De Meer, H., Dang, M. Q., & Pentikousis, K.
(2010). Energy-efficient cloud computing. The
computer journal, 53(7), 1045-1051.
Bittencourt, L. F., & Madeira, E. R. M. (2011).
HCOC: a cost optimization algorithm for workflow
scheduling in hybrid clouds. Journal of Internet
Services and Applications, 2(3), 207-227.
Buyya, R., Yeo, C. S., Venugopal, S., Broberg,
J., & Brandic, I. (2009). Cloud computing and
emerging IT platforms: Vision, hype, and reality
for delivering computing as the 5th utility. Future
Generation computer systems, 25(6), 599-616.
Casola, V., Rak, M., & Villano, U. (2010). Identity
federation in cloud computing.
Dowell, S., Barreto, A., Michael, J. B., & Shing,
M.-T. (2011). Cloud to cloud interoperability.
Expósito, R. R., Taboada, G. L., Ramos, S.,
González-Domínguez, J., Touriño, J., & Doallo,
R. (2013). Analysis of I/O performance on an
amazon EC2 cluster compute and high I/O plat-
form. Journal of grid computing, 11(4), 613-631.
Fox, A., Griffith, R., Joseph, A., Katz, R., Konwin-
ski, A., Lee, G., et al. (2009). Above the clouds:
A Berkeley view of cloud computing. University
of California, Berkeley, Rep. UCB/EECS, 28, 13.
Ganghishetti, P., & Wankar, R. (2011). Quality of
Service Design in Clouds. CSI Communications,
35(2), 12–15.
Gill, P., Jain, N., & Nagappan, N. (2011). Un-
derstanding network failures in data centers:
measurement, analysis, and implications.
Houidi, I., Mechtri, M., Louati, W., & Zeghlache,
D. (2011). Cloud service delivery across multiple
cloud platforms.
Jhawar, R., Piuri, V., & Santambrogio, M. (2012).
A comprehensive conceptual system-level ap-
proach to fault tolerance in cloud computing.
Kang, X., Zhang, H., Jiang, G., Chen, H., Meng,
X., & Yoshihira, K. (2008). Measurement, mod-
eling, and analysis of internet video sharing site
workload: A case study.
Kertész, A., Kecskemeti, G., Oriol, M., Kotcauer,
P., Acs, S., Rodríguez, M., . . . Franch, X. (2013).
Enhancing federated cloud management with an
integrated service monitoring approach. Journal
of grid computing, 11(4), 699-720.
Kim, K. H., Beloglazov, A., & Buyya, R. (2009).
Power-aware provisioning of cloud resources for
real-time services.
Lai, W. K., Yang, K.-T., Lin, Y.-C., & Shieh, C.-S.
(2012). Dual migration for improved efficiency in
cloud service. In Intelligent Information and
Database Systems (pp. 216-225). Springer.
Li, C. (2012). Optimal resource provisioning for
cloud computing environment. The Journal of
Supercomputing, 62(2), 989-1022.
Lin, C.-F., Sheu, R.-K., Chang, Y.-S., & Yuan,
S.-M. (2011). A relaxable service selection al-
gorithm for QoS-based web service composition.
Information and Software Technology, 53(12),
1370-1381.
Marshall, P., Keahey, K., & Freeman, T. (2011).
Improving utilization of infrastructure clouds.
Martello, S., & Toth, P. (1990). Knapsack problems:
Algorithms and computer implementations. Hoboken,
NJ: Wiley-Interscience.
Oikonomou, K., & Stavrakakis, I. (2010). Scalable
service migration in autonomic network environ-
ments. Selected Areas in Communications, IEEE
Journal on, 28(1), 84-94.
Papagianni, C., Leivadeas, A., Papavassiliou, S.,
Maglaris, V., Cervello-Pastor, C., & Monje, A.
(2013). On the optimal allocation of virtual re-
sources in cloud computing networks. Computers,
IEEE Transactions on, 62(6), 1060-1071.
Parra-Hernandez, R., & Dimopoulos, N. J. (2005).
A new heuristic for solving the multichoice mul-
tidimensional knapsack problem. Systems, Man
and Cybernetics, Part A: Systems and Humans,
IEEE Transactions on, 35(5), 708-717.
Piuri, R. J. V. (2013). Fault Tolerance and Resil-
ience in Cloud Computing Environments. In J.
Vacca (Ed.), Computer and information Security
Handbook (2nd Ed.). Morgan Kaufmann.
Randles, M., Lamb, D., & Taleb-Bendiab, A.
(2010). A comparative study into distributed load
balancing algorithms for cloud computing.
Salehi, M. A., & Buyya, R. (2010). Adapting
market-oriented scheduling policies for cloud
computing. In Algorithms and Architectures for
Parallel Processing (pp. 351-362). Springer.
Sotomayor, B., Montero, R. S., Llorente, I. M.,
& Foster, I. (2009). Virtual infrastructure man-
agement in private and hybrid clouds. Internet
computing, IEEE, 13(5), 14-22.
Srikantaiah, S., Kansal, A., & Zhao, F. (2008).
Energy aware consolidation for cloud computing.
Subramanian, K. (2011). Hybrid clouds. Retrieved
from http://emea.trendmicro.com/imperia/md/content/uk/cloud-security/wp01_hybridcloud-krish_110624us.pdf
Vishwanath, K. V., & Nagappan, N. (2010). Char-
acterizing cloud computing hardware reliability.
Vouk, M. A. (2008). Cloud computing – issues,
research and implementations. CIT. Journal of
Computing and Information Technology, 16(4),
235-246.
Wang, S., Liu, Z., Sun, Q., Zou, H., & Yang, F.
(2014). Towards an accurate evaluation of qual-
ity of cloud service in service-oriented cloud
computing. Journal of Intelligent Manufacturing,
25(2), 283-291.
Wang, W.-J., Chang, Y.-S., Lo, W.-T., & Lee, Y.-K.
(2013). Adaptive scheduling for parallel tasks with
QoS satisfaction for hybrid cloud environments.
The Journal of Supercomputing, 66(2), 783-811.
Yang, Y., Zhou, Y., Liang, L., He, D., & Sun, Z.
(2010). A service-oriented broker for bulk data
transfer in cloud computing.
KEY TERMS AND DEFINITIONS
Cloud Computing: A model for delivering IT
services in which resources are retrieved from the
internet through web-based tools and applications
rather than a direct connection to a server.
Fault Tolerance: The property that enables a
system to continue operating properly in the event
of the failure of some of its components.
Quality of Service (QoS): Quality of service
(QoS) generally refers to a network’s capability to
achieve maximum bandwidth and deal with other
network performance elements like latency, error
rate and uptime. Quality of service also involves
controlling and managing network resources by
setting priorities for explicit types of data (files,
audio and video) on the network or cloud.
Virtualization: Virtualization, in computing,
is the creation of a virtual (rather than actual) ver-
sion of something, such as a hardware platform,
operating system, a storage device or network
resources.