Handbook of Research on
Security Considerations in
Cloud Computing
Kashif Munir
King Fahd University of Petroleum & Minerals, Saudi Arabia
Mubarak S. Al-Mutairi
King Fahd University of Petroleum & Minerals, Saudi Arabia
Lawan A. Mohammed
King Fahd University of Petroleum & Minerals, Saudi Arabia
A volume in the Advances in Information Security,
Privacy, and Ethics (AISPE) Book Series
Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA, USA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@igi-global.com
Web site: http://www.igi-global.com
Copyright © 2015 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the
authors, but not necessarily of the publisher.
For electronic access to this publication, please contact: eresources@igi-global.com.
Handbook of research on security considerations in cloud computing / Kashif Munir, Mubarak S. Al-Mutairi, and Lawan A.
Mohammed, editors.
pages cm
Includes bibliographical references and index.
ISBN 978-1-4666-8387-7 (hardcover) -- ISBN 978-1-4666-8388-4 (ebook) 1. Cloud computing--Security measures--
Handbooks, manuals, etc. I. Munir, Kashif, 1976- editor.
QA76.585.H3646 2015
004.67’82--dc23
2015008172
This book is published in the IGI Global book series Advances in Information Security, Privacy, and Ethics (AISPE) (ISSN:
1948-9730; eISSN: 1948-9749)
Managing Director: Lindsay Johnston
Managing Editor: Austin DeMarco
Director of Intellectual Property & Contracts: Jan Travers
Acquisitions Editor: Kayla Wolfe
Production Editor: Christina Henning
Development Editor: Caitlyn Martin
Cover Design: Jason Mull
Chapter 18
DOI: 10.4018/978-1-4666-8387-7.ch018
Reliability, Fault Tolerance, and Quality-of-Service in Cloud Computing: Analysing Characteristics
Piyush Kumar Shukla
University Institute of Technology RGPV, India
Gaurav Singh
Motilal Nehru National Institute of Technology, India
ABSTRACT
In this chapter we focus on reliability, fault tolerance, and quality of service in cloud computing. Owing to the flexible and scalable property of dynamically acquiring and relinquishing computing resources in a cost-effective and device-independent manner, with minimal management effort or service-provider interaction, demand for the cloud computing paradigm has increased dramatically in the last few years. Despite many enhancements, the cloud computing paradigm is still subject to a large number of system failures; as a result, there is increasing concern in the community regarding the reliability and availability of cloud computing services. Dynamic provisioning of resources allows a cloud computing environment to meet the continually varying resource and service requirements of cloud customer applications. Quality of Service (QoS) plays an important role in the effective allocation of resources and has been widely investigated in the cloud computing paradigm.
1. INTRODUCTION
Cloud computing can be viewed as a model for delivering computing resources such as hardware, system software, and applications as a reliable service over the Internet in a convenient, flexible, and scalable manner. These computing resources, that is, hardware, system software, and applications, are referred to as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), respectively. Cloud computing (Buyya, Yeo, Venugopal, Broberg, & Brandic, 2009; Expósito et al., 2013) offers cost-effective and effortless outsourcing of resources in dynamic service environments
to consumers, and also facilitates the construction of service-based applications that build on the latest advances in diverse research areas such as Grid computing, service-oriented computing, business processes, and virtualization.
Cloud computing providers often employ two different models to offer these services: the utility computing model and the pay-per-use model. The utility computing model is similar to the way traditional utility services (such as water and electricity) are consumed, whereas in the pay-per-use model users pay on the basis of the type of service they use, characterized by parameters such as CPU cores, memory, and disk capacity (Vouk, 2008; Randles, Lamb, & Taleb-Bendiab, 2010). The pay-per-use model is useful in cloud resource provisioning to satisfy the SaaS user's needs while reducing cost and maximizing the profit of the SaaS provider. Another major concern for cloud resource providers is how to reduce energy consumption, thereby decreasing operating costs and maximizing provider revenue (Berl et al., 2010; Kim, Beloglazov, & Buyya, 2009; Srikantaiah, Kansal, & Zhao, 2008). Therefore, how to serve cloud service users' requests so as to meet Quality of Service (QoS) needs, provide fault-resistant, reliable services, and maximize the profit of both the SaaS provider and the cloud resource provider is a concern that must be addressed urgently in the cloud computing environment (Li, 2012).
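As an illustration of the pay-per-use model just described, the following sketch bills users per hour for service types characterized by CPU cores, memory, and disk capacity. The instance-type names and rates are hypothetical assumptions, not real provider prices.

```python
# Hypothetical pay-per-use billing sketch: instance types are characterized
# by CPU cores, memory and disk capacity, and users pay per hour of use.
# All type names and rates below are illustrative assumptions.

INSTANCE_TYPES = {
    # name: (cpu_cores, memory_gb, disk_gb, rate_per_hour_usd)
    "small":  (2, 4, 100, 0.05),
    "medium": (4, 16, 500, 0.20),
    "large":  (16, 64, 2000, 0.80),
}

def usage_charge(instance_type: str, hours: float) -> float:
    """Charge only for what was actually consumed (pay-per-use)."""
    _, _, _, rate = INSTANCE_TYPES[instance_type]
    return round(rate * hours, 2)

def invoice(usage: list) -> float:
    """Total bill for a list of (instance_type, hours) records."""
    return round(sum(usage_charge(t, h) for t, h in usage), 2)

print(invoice([("small", 100), ("large", 12)]))  # 5.0 + 9.6 = 14.6
```

Under utility computing, by contrast, the consumer would pay for metered consumption of a shared utility rather than for discrete instance-hours.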
In order to achieve these goals, clouds require a novel infrastructure that incorporates a high-level monitoring approach to support autonomous, on-demand deployment and decommissioning of service instances. For this, clouds rely greatly on virtualization of resources to provide management combined with separation of users. Virtual appliances are employed to encapsulate a complete software system (e.g., operating system, software libraries, and the deployable services themselves) prepared for execution in virtual machines (VMs) (Kertész et al., 2013). Cloud management is responsible for all resources used by all the applications deployed in the cloud.
Cloud computing and networking can be viewed as two key pillars of the Future Internet (FI) vision, in which the connection of objects to the Internet and the federation of infrastructures become highly important (Papagianni et al., 2013). For many applications, network performance is a key factor in overall cloud computing performance: meeting QoS targets is directly linked to the network performance and to the provisioning model adopted for computational resources. Thus, the convergence between cloud computing and networking is more a requirement than a desire for the efficient realization of the cloud computing paradigm. Providers need to consider the dynamic provisioning, configuration, reconfiguration, and optimization of both computing resources (e.g., servers) and networking resources to meet their objectives.
2. CLOUD COMPUTING
ARCHITECTURE
A cloud computing environment is supposed to furnish its huge pool of computing resources, encompassing processing power, memory, storage, and development platforms, to its users. This demand for sharing drives the architecture of cloud computing to support convenient, efficient, and flexible on-demand services.
The architecture of a cloud system comprises different, loosely connected components, which can be broadly categorized into two parts: a front end and a back end. Generally, the users' input and output devices, including PCs, smartphones, tablets, etc., are referred to as the front end.
Applications and interfaces, e.g., a web browser, that are required to access cloud services are also components of the front end. The traditional cloud computing architecture is depicted in Figure 1.
The cloud itself, and all the resources required to provide cloud computing services, are referred to as the back end. The cloud back end comprises four distinct layers, as illustrated in Figure 1 (Fox et al., 2009). Physical resources such as servers, storage, and network switches comprise the lowest layer in the stack. On top of the physical layer is the Infrastructure-as-a-Service (IaaS) layer, where virtualization and system management tools are embedded. The front and back ends are linked through a network, which may be the Internet, an intranet, or an inter-cloud.
Typically, in a cloud deployment, data centers and virtualization technology are employed to maximize utilization of the physical resources. The layer above IaaS is Platform-as-a-Service (PaaS), which contains all user-level middleware tools that provide an environment to simplify application development and deployment (e.g., Web 2.0 interfaces, libraries, and programming languages). The layer on top of the PaaS layer, where user-level applications (e.g., social networks and scientific models) are built and hosted, is referred to as Software-as-a-Service (SaaS). Security, protocols, and control mechanisms are also implemented at the back end.
The fundamental model of cloud computing architecture is to separate powerful computation from user devices, so that users can enjoy almost all services with simple, lightweight devices whose input/output and communication capacities are sufficient to access the cloud system for the services they demand, as shown in Figure 2.
Figure 1. Cloud Computing Architecture
A user demands a service using an input device; the demand is then transmitted to the cloud system over the network. After receiving the demand, the cloud system processes it using its powerful resources and efficiently returns the result to the user's output device. Similarly, users can store their data in the cloud, use applications embedded in the cloud system to process that data, and retrieve it when needed, regardless of their location. Cloud systems make it possible for users to enjoy complex, varied, and novel services without being limited by their own equipment.
Challenges
In a general cloud service architecture, when a service is demanded, the cloud assigns resources to serve the user, taking into account server load, service type, user location, and so on. In a memory-oriented service, the current usage depends on the previous one; hence it is essential to load the service status of the previous usage, stored on the previously hosting server, before proceeding. To support this, the server stores the image file of the virtual machine so that the user can resume the service with the same settings. If the user wants to access the virtual machine from a location far away from that server, the connection is forced to point to the server which stores the image file; the cloud system must therefore assign the previous server, where the service was processed, to host the user. Consequently, the results of user operations are transmitted along a very long path to the user's output device. QoS may then be affected: as more and more streams are transmitted through the backbone, its bandwidth would be almost exhausted.
To address this, researchers have proposed multiple-cloud environments, cloud interoperability, resource allocation schemes, service migration, and more. A multiple-cloud environment was proposed in (Houidi, Mechtri, Louati, & Zeghlache, 2011), focusing on cloud service provisioning from multiple cloud providers. A five-level model to assess the maturity of cloud-to-cloud interoperability was presented
Figure 2. Users access the cloud computing environment using output device
by (Dowell, Barreto, Michael, & Shing, 2011). A resource allocation scheme for efficient on-demand resource allocation was proposed in (Marshall, Keahey, & Freeman, 2011). There has been some research on service migration (Oikonomou & Stavrakakis, 2010), but it lacks consideration of the characteristics of cloud computing. A scheme called dual migration, which monitors users' locations and moves contents onto the server closest to the user, was proposed in (Lai, Yang, Lin, & Shieh, 2012); a user can thereby enjoy services by means of the great capacity of the closest server.
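The core idea of dual migration, tracking the user's location and moving content onto the closest server, can be sketched as follows. The server names, coordinates, and plain Euclidean distance metric are illustrative assumptions; a real system would use geographic distance and measured network latency.

```python
import math

# Sketch of the "move content to the closest server" idea behind dual
# migration: track the user's location and pick the server with minimum
# distance. Server names and coordinates are illustrative assumptions.

SERVERS = {
    "dc-east": (40.7, -74.0),
    "dc-west": (37.8, -122.4),
    "dc-eu":   (50.1, 8.7),
}

def closest_server(user_loc, servers=SERVERS):
    """Return the server name with minimum Euclidean distance to the user."""
    def dist(loc):
        return math.hypot(loc[0] - user_loc[0], loc[1] - user_loc[1])
    return min(servers, key=lambda name: dist(servers[name]))

def migrate_if_needed(current, user_loc):
    """Migrate the VM image when another server is now closer to the user."""
    target = closest_server(user_loc)
    return target if target != current else current

print(migrate_if_needed("dc-east", (37.0, -120.0)))  # dc-west
```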
3. TYPES OF CLOUDS
Cloud environments (Sotomayor, Montero, Llorente, & Foster, 2009) can be categorized into private clouds, public clouds, hybrid clouds, and community clouds on the basis of the way in which they are deployed. The different types of cloud are explained briefly in the following subsections.
Private Cloud
As the name implies, the infrastructure of a private cloud is an internal data centre of an organization; it is privately owned and not available to the public. Computing resources are pooled and managed internally, which leads to greater efficiency, and they can be applied dynamically according to demand. A private cloud provides internal users with fundamental computing resources as well as high-level security and control mechanisms. Being privately owned, it allows the enterprise to continue to follow its own workflow and security procedures, ensuring that the correct level of "code" is executing. Private clouds are also not burdened by the network bandwidth and availability issues, or the potential security exposures, that may be associated with public clouds. Overall, private clouds can offer the provider and user greater control, security, and resilience.
Public Cloud
A public cloud is one in which a third-party provider makes resources, such as applications and other computing resources, available to the general public or a wide industry group via the Internet. The cloud service provider is responsible for setting up the hardware, software, applications, and networking resources. A public cloud service has advantages such as flexibility, extensibility, pay-per-use pricing, and low cost of entry, but it is often more expensive than a private data center if resources are used for several years. A public cloud does not imply that the user's data is public; in many cases, access control mechanisms are required before the user can make use of cloud resources.
Hybrid Cloud
The private cloud platform owned by an enterprise integrates various resources, such as computing and storage in a server, which can be reconfigured as and when required. This flexibility of the private cloud shows how powerful and valuable it is when deployed in combination with a public cloud. In a hybrid cloud, one can use private as well as public cloud resources to capitalize on investments by catering for specific application requirements in terms of data confidentiality, security, performance, and latency.
In a private cloud environment, it is the responsibility of the organization that purchased it to maintain and manage all resources. According to research in (Kang et al., 2008), the peak load of a private cloud is much larger than the average, but transient, and the big spikes are not predictable. If a private cloud attempts to satisfy all the workload constraints, the transient peak load would force the
owner to invest in more hardware resources for the private cloud. This leads to over-provisioning and a waste of hardware resources most of the time. The pay-as-you-go public cloud model can be utilized in such a scenario without adding redundant resources to the private cloud. To deal with the spike-workload problem in a cost-conscious way, public cloud resources are dynamically added to the private cloud, forming a hybrid cloud environment in which public cloud resources can be moved in and out according to requirements. Extra money is spent only for the period during which the public cloud handles the overload, which is far less than investing in purchasing more resources. Therefore, the hybrid cloud model helps reduce hardware and operating costs when a private cloud already exists. To achieve this, the workload has to be split and distributed across the private cloud and the public cloud, or simply across the hybrid cloud (Bittencourt & Madeira, 2011). A hybrid cloud is thus a combination of public and private clouds bound together by either standardized or proprietary technology that enables data and application portability. With the hybrid cloud deployment model, users benefit from a lower over-provisioning factor, more efficient provisioning, better performance, and lower hardware cost (Subramanian, 2011).
One result of the evolution of the hybrid cloud is cloud federation (Casola, Rak, & Villano, 2010), which aims at cost-effective optimization of assets and resources across heterogeneous environments, where clouds cooperate with the goal of obtaining unbounded computational resources and hence new business opportunities. Federation brings together different cloud flavors, and external and internal resources. Thus, any organization can select a public computing environment on demand when its private cloud reaches a particular workload threshold.
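A minimal sketch of the bursting behavior described above: the private cloud serves demand up to its capacity, and only the overflow is sent to the pay-as-you-go public cloud, so extra cost is incurred only during the spike. The capacities, demand units, and per-unit public price are assumptions for illustration.

```python
# Minimal cloud-bursting sketch for a hybrid cloud: the private cloud serves
# requests up to a capacity threshold; overflow is routed to a pay-as-you-go
# public cloud. Capacities and the per-unit public price are assumptions.

def split_workload(demand_units, private_capacity, public_rate_per_unit):
    """Split demand between private and public clouds; return the split
    and the extra cost incurred only while bursting."""
    private_units = min(demand_units, private_capacity)
    public_units = demand_units - private_units
    return {
        "private": private_units,
        "public": public_units,
        "extra_cost": public_units * public_rate_per_unit,
    }

# Normal load fits in the private cloud; a transient spike bursts out.
print(split_workload(80, 100, 0.10))   # {'private': 80, 'public': 0, 'extra_cost': 0.0}
print(split_workload(150, 100, 0.10))  # {'private': 100, 'public': 50, 'extra_cost': 5.0}
```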
Community Cloud
A community cloud can be a private cloud purchased by a single user to support a community of users, or a hybrid cloud with the costs spread over a few users of the cloud; fees may be charged to subsidiaries. A community cloud is often set up as a sandbox environment where community users can test their applications or access cloud resources. Community clouds are used and controlled by a group of organizations with a shared interest.
4. RELIABILITY AND
FAULT-TOLERANCE
With the flexibility and scalability of dynamically obtaining and releasing computing resources in a cost-effective and device-independent manner, with minimal management effort or service-provider interaction, demand for the cloud computing paradigm has increased dramatically in the last few years. While many improvements have taken place, the cloud computing paradigm is still subject to a large number of system failures, and as a result there is increasing concern in the community regarding the reliability and availability of cloud computing services. Moreover, the highly complex nature of the underlying resources makes clouds vulnerable to a large number of failures, even in carefully engineered data centres (Barroso, Clidaras, & Hölzle, 2013). These failures have an impact on the overall reliability and availability of the cloud computing service. An effective means of countering failures, even those unknown and unpredictable in number, has therefore become an urgent need for both users and service providers in order to ensure correct and continuous system operation. Fault tolerance serves as a technique to assure users of reliability and availability.
In general, a failure refers to an error or condition in which the system fails to achieve its intended functionality or expected behavior. A failure may happen for various reasons, such as reaching an invalid system state or a network failure. The underlying cause of an error is a fault, which represents a fundamental impairment in the system. Fault tolerance is thus the ability of the system to perform its function even in the presence of failures. It serves as one means of improving the system's overall dependability; in particular, it contributes significantly to increasing the system's reliability and availability. In a cloud computing environment, faults that appear as failures to the end users can be categorized into two types, as in other distributed systems (Piuri, 2013):
• Crash faults cause system components to completely stop functioning or remain inactive during failures (e.g., power outage, hard disk crash).
• Byzantine faults lead system components to behave arbitrarily or maliciously during failure, causing the system to behave in unpredictably incorrect ways.
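One common way of masking both fault types, which the replication techniques discussed later in this section rely on, is voting over replica responses. In this illustrative sketch, a crashed replica returns no answer and a byzantine replica may return an arbitrary wrong value; the majority answer is accepted only when a strict majority of all replicas agree.

```python
from collections import Counter

# Sketch: masking faults by voting over replica responses. A crashed replica
# returns None (no answer); a byzantine replica may return an arbitrary wrong
# value. With enough correct replicas, the majority answer is still correct.

def vote(responses):
    """Return the majority value among non-crashed replica responses,
    or None when no value reaches a strict majority of all replicas."""
    counts = Counter(r for r in responses if r is not None)
    if not counts:
        return None
    value, n = counts.most_common(1)[0]
    return value if n > len(responses) // 2 else None

print(vote([42, 42, None, 42]))   # crash fault masked -> 42
print(vote([42, 42, 7, 42]))      # byzantine replica outvoted -> 42
```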
To implement a fault-tolerant system, the most important step is to clearly understand and determine what constitutes correct system behavior, so that specifications of its failure characteristics can be provided. A failure in any layer of the cloud architecture at a particular instant has an impact on the services offered by the layers above it. If failures occur in the IaaS layer or the physical hardware, their impact is significantly high; hence it is especially important to characterize typical hardware faults and develop corresponding fault tolerance techniques.
The key observations derived from studies of the failure behaviour of various server components and of hardware repair behaviour, based on statistical information (Gill, Jain, & Nagappan, 2011; Vishwanath & Nagappan, 2010), are as follows:
• 8% of the machines are subject to repair events, with an average of 2 repairs per machine. The annual failure rate (AFR) is therefore around 8%.
• The amount spent on repairs for an 8% AFR was approximately 2.5 million dollars.
• About 78% of total faults/replacements were detected on hard disks, 5% on RAID controllers, and 3% were due to memory failures; 13% of replacements were due to a collection of components. This implies that hard disks are clearly the most failure-prone hardware components and the most significant cause of server failures.
• About 5% of servers experience a disk failure less than 1 year from the purchase date, 12% when the machines are 1 year old, and 25% of the servers see hard disk failures by the time they are 2 years old.
• Interestingly, factors such as the age of the server, its configuration, its location within the rack, and the workload run on the machine were not found to be significant indicators of failure.
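The repair statistics above can be restated as simple arithmetic. The fleet size below is an assumption for illustration; the 8% AFR, the average of 2 repairs per affected machine, and the component shares come from the cited studies.

```python
# Restating the repair statistics above as arithmetic. The fleet size is an
# assumed figure for illustration; the 8% AFR, 2 repairs per affected machine,
# and the component shares come from the studies cited in the text.

fleet_size = 100_000              # assumption: number of servers in the fleet
afr = 0.08                        # ~8% of machines see a repair event per year
repairs_per_machine = 2           # average repairs per affected machine

machines_repaired = fleet_size * afr
total_repairs = machines_repaired * repairs_per_machine

# Share of faults/replacements by component (disks dominate).
component_share = {"hard_disk": 0.78, "raid_controller": 0.05,
                   "memory": 0.03, "other": 0.13}
disk_repairs = total_repairs * component_share["hard_disk"]

print(round(machines_repaired), round(total_repairs), round(disk_repairs))
# 8000 16000 12480
```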
It can be inferred from these statistics that robust fault tolerance mechanisms must be employed to improve the reliability of hard disks (assuming independent component failures) in order to reduce the number of failures. Furthermore, the use of hard disks that have already experienced a failure should be reduced to meet high availability and reliability requirements.
To model the failure behavior of cloud computing, it is also important to consider the failure behavior of the network; and to characterize network failure behavior it is important to understand the overall network topology and the various network components involved in constructing a data center. Similarly to the study of server failure behavior, a large-scale study of network failures in data centers was performed in (Gill et al., 2011). A link failure happens when the connection between two devices on a specific
interface is down, and a device failure happens when the device is not routing/forwarding packets correctly (e.g., due to a power outage or hardware crash). Key observations derived from this study are as follows:
• With a failure probability of 1 in 5, load balancers (LBs) are the least reliable of all network devices, while top-of-rack switches (ToRs) are the most reliable, with a failure rate of less than 5%. The root causes of LB failures are mainly software bugs and configuration errors, as opposed to the hardware errors seen in other devices.
• Links forwarding traffic from LBs have the highest failure rates, while links higher in the topology and links connecting redundant devices have the second-highest failure rates.
• The estimated median number of packets lost during a failure is 59K, and the median number of bytes is 25MB.
• Network redundancy reduces the median impact of failures (in terms of the number of lost bytes) by only 40%. This observation contradicts the common belief that network redundancy completely masks failures from applications.
• Overall, data center network reliability is about 99.99% for 80% of the links and 60% of the devices.
The most widely adopted methods to achieve fault tolerance against crash faults and byzantine faults are as follows:
• Checking and monitoring: The system is constantly monitored at runtime to validate, verify, and ensure that correct system specifications are being met. This technique plays an important role in failure detection and subsequent reconfiguration, and is easy to implement.
• Checkpoint and restart: When the system undergoes a failure, it is restored to the previously known correct state, captured and saved on the basis of pre-defined parameters, using the latest checkpoint information instead of restarting the system from scratch.
• Replication: Critical system components are mirrored using additional hardware, software, and network resources in such a manner that a copy of the critical components is available even after a failure happens. Replication mechanisms are mainly applied in two forms: active and passive. In active replication, all the replicas are simultaneously active and each replica processes the same request at the same time. This ensures that all the replicas have the same system state at any given point in time, so the system can continue to deliver its service even in the case of a single replica failure. In passive replication, only one primary replica processes the requests, while the backup replicas merely save the system state during normal execution periods; the backups are invoked only when the primary replica fails.
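The checkpoint-and-restart technique above can be sketched with a toy computation: state is saved periodically, and after a simulated crash execution rolls back to the latest checkpoint rather than restarting from scratch. The counter workload, checkpoint interval, and single simulated failure are illustrative assumptions.

```python
import copy

# Minimal checkpoint-and-restart sketch: the system state is periodically
# saved, and after a failure execution resumes from the latest checkpoint
# instead of restarting from scratch. The "work" here is a toy counter.

class Checkpointer:
    def __init__(self):
        self._saved = None

    def save(self, state):
        self._saved = copy.deepcopy(state)   # capture a consistent snapshot

    def restore(self):
        return copy.deepcopy(self._saved)

def run(total_steps, checkpoint_every, fail_at):
    cp = Checkpointer()
    state = {"step": 0}
    cp.save(state)
    while state["step"] < total_steps:
        state["step"] += 1
        if state["step"] == fail_at:         # simulated crash fault
            state = cp.restore()             # roll back, do not restart at 0
            fail_at = None                   # fail only once in this sketch
            continue
        if state["step"] % checkpoint_every == 0:
            cp.save(state)
    return state["step"]

print(run(total_steps=10, checkpoint_every=3, fail_at=8))  # 10
```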
Fault tolerance mechanisms vary in how successfully they tolerate faults, according to the study in (Ayari, Barbaron, Lefevre & Primet, 2008). For example, a passively replicated system can handle only crash faults, whereas an actively replicated system using 3f+1 replicas is capable of overcoming f byzantine faults. In general, mechanisms that handle failures at a finer granularity offer higher performance guarantees, but at the cost of a greater amount of resources (Jhawar, Piuri, & Santambrogio, 2012). Therefore, in the design of fault tolerance mechanisms one must take into account factors such as implementation complexity, resource costs, resilience, and performance metrics, and achieve a fine balance among the following parameters:
• Fault tolerance model: This factor measures the resilience level of the fault tolerance technique, that is, to what extent it can tolerate failures or errors in the system. It can also be understood as the robustness of the failure detection protocols, the strength of the failover level, and the state synchronization method.
• Resource consumption: This factor takes into account the amount and cost of the resources required to realize a fault tolerance mechanism. It is normally subject to the depth of the failure detection and recovery mechanisms involved, in terms of CPU, memory, bandwidth, I/O, and so on.
• Performance: This factor measures the impact of the fault tolerance procedure on the end-to-end quality of service (QoS), both during failures and during failure-free periods. The impact is often characterized using replica launch latency, fault detection latency, and failure recovery latency, as well as other application-dependent metrics such as bandwidth, latency, and loss rate.
5. QUALITY OF SERVICE
Dynamic provisioning of resources allows a cloud computing environment to meet the continually varying resource and service requirements of cloud customer applications. Quality of Service (QoS) plays an important role in the effective allocation of resources and has been widely investigated in the cloud computing paradigm. Indeed, QoS has been an issue not only in the cloud but in many distributed computing paradigms, such as Grid computing and high-performance computing. QoS provides a level of assurance against the application requirements: it ensures a certain level of reliability, availability, and performance of a service, and can also cover other aspects of service quality such as security and dependability. In cloud computing, QoS is primarily concerned with the management and performance of resources such as processors, memory, storage, and networks. QoS is sometimes used as a quality measure, with many different definitions, instead of referring to the ability to reserve resources. QoS models are associated with End-Users and Providers (and often Brokers); they include resource capacity planning via the use of schedulers and load balancers, and utilize Service Level Agreements (SLAs) (Armstrong & Djemame, 2009). An SLA is a legally binding contract on QoS between an End-User and a Provider; it defines End-User resource requirements and execution-environment guarantees, providing End-Users assurance that they are receiving exactly the services they have paid for.
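An SLA check of the kind just described can be sketched as a comparison of measured metrics against the agreed terms. The specific terms (99.9% availability, a 200 ms response-time bound) and the metric names are assumptions for illustration.

```python
# Sketch of checking measured service quality against an SLA. The SLA terms
# (99.9% availability, 200 ms response time) are illustrative assumptions.

SLA = {"availability": 0.999, "max_response_ms": 200}

def sla_violations(measured):
    """Compare measured metrics against the agreed SLA terms."""
    violations = []
    if measured["availability"] < SLA["availability"]:
        violations.append("availability")
    if measured["response_ms"] > SLA["max_response_ms"]:
        violations.append("response_time")
    return violations

print(sla_violations({"availability": 0.9995, "response_ms": 150}))  # []
print(sla_violations({"availability": 0.995,  "response_ms": 250}))
# ['availability', 'response_time']
```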
Multiple cloud providers offer different services on their own terms, employing their own security levels, system platforms, and management systems. Users often face difficulty in finding the best services to meet their objectives. Cloud service brokers are specialized experts who play an intermediary role between providers and consumers of the cloud, assisting purchasers of cloud services in finding the appropriate cloud offering as well as in deploying and managing applications on the cloud. Cloud brokers help negotiate the best deals and relationships between cloud consumers and cloud providers. Brokers use specialized tools to identify the most appropriate cloud resources and map application requirements to them. They can also act dynamically, automatically routing data, applications, and infrastructure needs based on QoS criteria such as availability, reliability, latency, and price. In an attempt to provide broker solutions, researchers (Salehi & Buyya, 2010) proposed a user-level broker using two market-oriented scheduling policies; the proposed broker increases the computational capacity of the local resources by employing resources from an IaaS provider. Researchers (Yang, Zhou, Liang, He, & Sun, 2010) introduced a service-oriented broker that claims to guarantee data transmission and a uniform mechanism for arranging resources via the broker, so as to maintain a certain level of service
to users. A Cloud Quality-of-Service Management
Strategy (C-QoSMS) framework has been proposed
(Ganghishetti & Wankar, 2011) for inclusion in the
cloud service broker. Adding a C-QoSMS component
to the cloud broker gives the customer the capacity
to select a cloud provider based on the QoS criteria
specified in the SLA, with minimal searching time.
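In the spirit of such broker frameworks, a broker's provider-selection step can be sketched as filtering a catalogue of providers against the SLA-specified QoS criteria and then choosing the cheapest qualifying offer. All provider data, field names and thresholds below are hypothetical:

```python
def broker_select(providers, requirements):
    """Pick the cheapest provider whose advertised QoS satisfies every
    SLA requirement (availability as a minimum, latency as a maximum)."""
    qualifying = [
        p for p in providers
        if p["availability"] >= requirements["min_availability"]
        and p["latency_ms"] <= requirements["max_latency_ms"]
    ]
    # Among qualifying offers, price breaks the tie.
    return min(qualifying, key=lambda p: p["price"]) if qualifying else None

providers = [
    {"name": "A", "availability": 0.9990, "latency_ms": 120, "price": 0.12},
    {"name": "B", "availability": 0.9950, "latency_ms": 80,  "price": 0.08},
    {"name": "C", "availability": 0.9999, "latency_ms": 60,  "price": 0.20},
]

best = broker_select(providers, {"min_availability": 0.999,
                                 "max_latency_ms": 150})
print(best["name"])  # B fails availability; A is cheaper than C -> "A"
```

A production broker would, as the text notes, also handle dynamic routing and renegotiation, but the filter-then-optimize structure is the core of the selection step.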
It is also important to recognize that there are
different types of cloud users, with different types
of applications and different sets of personalized
preferences or Quality of Cloud Service (QoCS)
requirements. Some applications require considerable
computing and storage power, while others have
strict execution-time needs. Reliability,
Availability, Execution Time, Reputation and Tariff
are the commonly used QoS criteria for service
selection (Lin, Sheu, Chang, & Yuan, 2011). The
goal of cloud users is to have their services
processed successfully while meeting their
performance, security, deadline and cost targets.
This implies that the success of the cloud users'
underlying business model relies on determining the
best-fitting cloud service for a personalized
application.
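A personalized selection over these criteria can be sketched as a weighted-sum utility, with each attribute pre-normalized so that 1 is best. This is a simplified illustration, not the algorithm of Lin et al.; the services, attribute values and weights are hypothetical:

```python
def weighted_score(service, weights):
    """Personalized utility: weighted sum of QoS attributes, each already
    normalized to [0, 1] with 1 best (for execution time and tariff the
    normalization is assumed to invert 'lower is better')."""
    return sum(weights[k] * service[k] for k in weights)

# Hypothetical candidate services with normalized attribute values.
services = {
    "S1": {"reliability": 0.9, "availability": 0.95, "exec_time": 0.6,
           "reputation": 0.8, "tariff": 0.7},
    "S2": {"reliability": 0.8, "availability": 0.90, "exec_time": 0.9,
           "reputation": 0.7, "tariff": 0.9},
}
# A deadline-sensitive user weights execution time most heavily.
weights = {"reliability": 0.2, "availability": 0.2, "exec_time": 0.4,
           "reputation": 0.1, "tariff": 0.1}

best = max(services, key=lambda name: weighted_score(services[name], weights))
print(best)  # S2's fast execution outweighs S1's reliability -> "S2"
```

Changing the weight vector models a different user: shifting weight onto reliability and availability would favor S1 instead, which is exactly the "personalized preferences" point made above.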
Cloud services typically come with various levels
of service and performance characteristics, and
fulfilling the promise of high-quality, on-demand
services within a service-oriented architecture
often makes the Quality of Cloud Service (QoCS)
highly variable. This makes it difficult for users
to compare cloud services and to select those that
meet their QoCS requirements. To address this,
researchers (Wang, Liu, Sun, Zou, & Yang, 2014)
proposed an accurate evaluation approach for QoCS
in service-oriented cloud computing, employing
fuzzy synthetic decision to assess cloud service
providers according to cloud users' preferences.
Moreover, a cloud service with consistently good
QoCS performance is usually more recommendable
than one whose QoCS performance shows large
variance: unpredictable network behavior, such as
fluctuations in bandwidth and response time, among
many other factors, may impact the quality of
these cloud services. Hence, performance consistency
should be taken into account as an important
criterion in the evaluation of cloud services.
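The consistency argument can be illustrated with a toy scoring rule that rewards a high average QoCS while penalizing variance across repeated observations. This is a deliberately simple stand-in for the fuzzy synthetic decision approach; the sample values and penalty weight are hypothetical:

```python
from statistics import mean, pstdev

def consistency_aware_score(samples, penalty=1.0):
    """Score a service from repeated QoCS observations: reward a high
    average quality, penalize variance (inconsistency)."""
    return mean(samples) - penalty * pstdev(samples)

steady  = [0.80, 0.82, 0.81, 0.79, 0.80]  # consistent mid-range quality
erratic = [0.99, 0.55, 0.95, 0.60, 0.97]  # higher peaks, large variance

# Despite erratic's slightly higher mean, the steady service wins.
print(consistency_aware_score(steady) > consistency_aware_score(erratic))  # True
```

A mean-only comparison would rank the erratic service first; adding the variance penalty reverses the ranking, which is the point the paragraph above makes.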
The deadline-constraint problem is another
important QoS criterion; it can be viewed as a
resource selection problem fitting the user's
demand in terms of execution time. Such a problem
is similar to the service discovery problem in the
domain of web service composition. Researchers
introduced a linear integer programming (LIP)
model to address the service selection problem by
maximizing a utility value that is a weighted sum
of user-defined QoS attributes (Ardagna & Pernici,
2007), and applied LIP-based approaches to service
matching, ranking and selection. LIP-based
approaches, however, suffer from high computational
complexity as the number of web services grows.
This kind of scheduling problem with QoS constraints
can be modeled as a variation of the multi-dimension
multi-choice knapsack problem (MMKP)
(Parra-Hernandez & Dimopoulos, 2005), which has
been proven to be NP-complete (Martello & Toth,
1990). To address this, researchers (Wang, Chang,
Lo, & Lee, 2013) proposed an adaptive scheduling
algorithm called Adaptive Scheduling with QoS
Satisfaction (AsQ) for hybrid cloud environments.
The AsQ aims to meet the deadline constraints of
submitted jobs and to reduce the cost of renting
public cloud resources when using a public cloud
is necessary: it attempts to maximize the
utilization rate of the private cloud and to
minimize the renting cost of the public cloud.
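A greatly simplified sketch of an AsQ-style dispatch decision, not the published algorithm itself: keep a job on the private cloud when its estimated finish time meets the deadline, otherwise rent the cheapest public option that is fast enough. All job data, prices and speedups are hypothetical:

```python
def schedule_hybrid(jobs, private_free_at, public_options):
    """Greedy sketch: jobs are (name, private_runtime, deadline) tuples;
    public_options are (price_per_time_unit, speedup) pairs, so a job
    runs in private_runtime / speedup on that public resource."""
    plan, t = [], private_free_at
    for name, runtime, deadline in jobs:
        if t + runtime <= deadline:        # private cloud meets the deadline
            plan.append((name, "private", 0.0))
            t += runtime                   # private queue advances
        else:                              # must rent public capacity
            feasible = [(price * runtime / speedup, speedup)
                        for price, speedup in public_options
                        if runtime / speedup <= deadline]
            if not feasible:
                plan.append((name, "rejected", 0.0))
                continue
            cost, _ = min(feasible)        # cheapest fast-enough option
            plan.append((name, "public", round(cost, 2)))
    return plan

jobs = [("j1", 4, 10), ("j2", 5, 8), ("j3", 6, 9)]
print(schedule_hybrid(jobs, private_free_at=0,
                      public_options=[(1.0, 1), (2.5, 2)]))
# j1 stays private; j2 and j3 would miss their deadlines locally and are
# dispatched to the cheaper public option at costs 5.0 and 6.0.
```

The real AsQ additionally uses runtime estimation and several fast scheduling strategies for near-optimal private-cloud allocation, but the private-first, cheapest-public-fallback structure captures its cost-and-deadline trade-off.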
6. CONCLUSION
In this chapter, the current consensus on what
Cloud Computing is, the confusion surrounding the
different cloud computing deployment models, viz.
Public, Private, Hybrid and Community clouds, the
traditional cloud computing architecture, and the
relevance of reliability, fault tolerance and QoS
in Clouds have been discussed. Fault
tolerance is a critical means of assuring the
reliability QoS criterion in cloud computing. In
other words, fault tolerance is concerned with all
the methods required to enable a system to tolerate
faults at runtime; it ensures the correct
functioning and continuous operation of a cloud
system. We discussed the failure attributes of
typical Cloud-based services, caused largely by
crash faults and Byzantine faults, and the fault
tolerance mechanisms used to counter these failures.
We also described various studies that address
users' reliability and availability concerns.
Quality of Service (QoS) is the ability to give
different priority to different applications, users
or data flows, or to guarantee a certain level of
performance to a data flow. To meet the requirements
of both cloud users and service providers, the role
of the resource broker was discussed. We also
discussed QoS measures, covering the associated
research challenges and the tools used for
implementing QoS in cloud computing. The motivation
behind, and the concepts, technology, research and
state of, QoS in Cloud Computing have been reviewed.
REFERENCES
Ardagna, D., & Pernici, B. (2007). Adaptive
service composition in flexible processes. Soft-
ware Engineering, IEEE Transactions on, 33(6),
369-384.
Armstrong, D., & Djemame, K. (2009). Towards
quality of service in the cloud.
Ayari, N., Barbaron, D., Lefevre, L., & Primet,
P. (2008). Fault tolerance for highly available
internet services: concepts, approaches, and is-
sues. Communications Surveys & Tutorials, IEEE,
10(2), 34-46.
Barroso, L. A., Clidaras, J., & Hölzle, U. (2013).
The datacenter as a computer: An introduction to
the design of warehouse-scale machines. Synthesis
lectures on computer architecture, 8(3), 1-15.
Berl, A., Gelenbe, E., Di Girolamo, M., Giuliani,
G., De Meer, H., Dang, M. Q., & Pentikousis, K.
(2010). Energy-efficient cloud computing. The
computer journal, 53(7), 1045-1051.
Bittencourt, L. F., & Madeira, E. R. M. (2011).
HCOC: a cost optimization algorithm for workflow
scheduling in hybrid clouds. Journal of Internet
Services and Applications, 2(3), 207-227.
Buyya, R., Yeo, C. S., Venugopal, S., Broberg,
J., & Brandic, I. (2009). Cloud computing and
emerging IT platforms: Vision, hype, and reality
for delivering computing as the 5th utility. Future
Generation computer systems, 25(6), 599-616.
Casola, V., Rak, M., & Villano, U. (2010). Identity
federation in cloud computing.
Dowell, S., Barreto, A., Michael, J. B., & Shing,
M.-T. (2011). Cloud to cloud interoperability.
Expósito, R. R., Taboada, G. L., Ramos, S.,
González-Domínguez, J., Touriño, J., & Doallo,
R. (2013). Analysis of I/O performance on an
amazon EC2 cluster compute and high I/O plat-
form. Journal of grid computing, 11(4), 613-631.
Fox, A., Griffith, R., Joseph, A., Katz, R., Konwin-
ski, A., Lee, G., et al. (2009). Above the clouds:
A Berkeley view of cloud computing. University
of California, Berkeley, Rep. UCB/EECS, 28, 13.
Ganghishetti, P., & Wankar, R. (2011). Quality of
Service Design in Clouds. CSI Communications,
35(2), 12–15.
Gill, P., Jain, N., & Nagappan, N. (2011). Un-
derstanding network failures in data centers:
measurement, analysis, and implications.
Houidi, I., Mechtri, M., Louati, W., & Zeghlache,
D. (2011). Cloud service delivery across multiple
cloud platforms.
Jhawar, R., Piuri, V., & Santambrogio, M. (2012).
A comprehensive conceptual system-level ap-
proach to fault tolerance in cloud computing.
Kang, X., Zhang, H., Jiang, G., Chen, H., Meng,
X., & Yoshihira, K. (2008). Measurement, mod-
eling, and analysis of internet video sharing site
workload: A case study.
Kertész, A., Kecskemeti, G., Oriol, M., Kotcauer,
P., Acs, S., Rodríguez, M., . . . Franch, X. (2013).
Enhancing federated cloud management with an
integrated service monitoring approach. Journal
of grid computing, 11(4), 699-720.
Kim, K. H., Beloglazov, A., & Buyya, R. (2009).
Power-aware provisioning of cloud resources for
real-time services.
Lai, W. K., Yang, K.-T., Lin, Y.-C., & Shieh, C.-S.
(2012). Dual migration for improved efficiency in
cloud service. In Intelligent Information and
Database Systems (pp. 216-225). Springer.
Li, C. (2012). Optimal resource provisioning for
cloud computing environment. The Journal of
Supercomputing, 62(2), 989-1022.
Lin, C.-F., Sheu, R.-K., Chang, Y.-S., & Yuan,
S.-M. (2011). A relaxable service selection al-
gorithm for QoS-based web service composition.
Information and Software Technology, 53(12),
1370-1381.
Marshall, P., Keahey, K., & Freeman, T. (2011).
Improving utilization of infrastructure clouds.
Martello, S., & Toth, P. (1990). Knapsack problems:
Algorithms and computer implementations. Hoboken,
NJ: Wiley-Interscience.
Oikonomou, K., & Stavrakakis, I. (2010). Scalable
service migration in autonomic network environ-
ments. Selected Areas in Communications, IEEE
Journal on, 28(1), 84-94.
Papagianni, C., Leivadeas, A., Papavassiliou, S.,
Maglaris, V., Cervello-Pastor, C., & Monje, A.
(2013). On the optimal allocation of virtual re-
sources in cloud computing networks. Computers,
IEEE Transactions on, 62(6), 1060-1071.
Parra-Hernandez, R., & Dimopoulos, N. J. (2005).
A new heuristic for solving the multichoice mul-
tidimensional knapsack problem. Systems, Man
and Cybernetics, Part A: Systems and Humans,
IEEE Transactions on, 35(5), 708-717.
Piuri, R. J. V. (2013). Fault Tolerance and Resil-
ience in Cloud Computing Environments. In J.
Vacca (Ed.), Computer and information Security
Handbook (2nd Ed.). Morgan Kaufmann.
Randles, M., Lamb, D., & Taleb-Bendiab, A.
(2010). A comparative study into distributed load
balancing algorithms for cloud computing.
Salehi, M. A., & Buyya, R. (2010). Adapting
market-oriented scheduling policies for cloud
computing. In Algorithms and Architectures for
Parallel Processing (pp. 351-362). Springer.
Sotomayor, B., Montero, R. S., Llorente, I. M.,
& Foster, I. (2009). Virtual infrastructure man-
agement in private and hybrid clouds. Internet
computing, IEEE, 13(5), 14-22.
Srikantaiah, S., Kansal, A., & Zhao, F. (2008).
Energy aware consolidation for cloud computing.
Subramanian, K. (2011). Hybrid clouds. Retrieved
from http://emea.trendmicro.com/imperia/md/content/uk/cloud-security/wp01_hybridcloud-krish_110624us.pdf
Vishwanath, K. V., & Nagappan, N. (2010). Char-
acterizing cloud computing hardware reliability.
Vouk, M. A. (2008). Cloud computing – issues,
research and implementations. CIT. Journal of
Computing and Information Technology, 16(4),
235-246.
Wang, S., Liu, Z., Sun, Q., Zou, H., & Yang, F.
(2014). Towards an accurate evaluation of qual-
ity of cloud service in service-oriented cloud
computing. Journal of Intelligent Manufacturing,
25(2), 283-291.
Wang, W.-J., Chang, Y.-S., Lo, W.-T., & Lee, Y.-K.
(2013). Adaptive scheduling for parallel tasks with
QoS satisfaction for hybrid cloud environments.
The Journal of Supercomputing, 66(2), 783-811.
Yang, Y., Zhou, Y., Liang, L., He, D., & Sun, Z.
(2010). A service-oriented broker for bulk data
transfer in cloud computing.
KEY TERMS AND DEFINITIONS
Cloud Computing: A model for delivering IT
services in which resources are retrieved from the
internet through web-based tools and applications
rather than a direct connection to a server.
Fault Tolerance: The property that enables a
system to continue operating properly in the event
of the failure of some of its components.
Quality of Service (QoS): Quality of service
(QoS) generally refers to a network’s capability to
achieve maximum bandwidth and deal with other
network performance elements like latency, error
rate and uptime. Quality of service also involves
controlling and managing network resources by
setting priorities for explicit types of data (files,
audio and video) on the network or cloud.
Virtualization: Virtualization, in computing,
is the creation of a virtual (rather than actual) ver-
sion of something, such as a hardware platform,
operating system, a storage device or network
resources.