Adaptive Resource Provisioning for Read Intensive
Multi-tier Applications in the Cloud
Waheed Iqbal (a), Matthew N. Dailey (a), David Carrera (b), Paul Janecek (a)
(a) Computer Science and Information Management, Asian Institute of Technology, Thailand
(b) Technical University of Catalonia (UPC), Barcelona Supercomputing Center (BSC), Spain
Abstract
A Service-Level Agreement (SLA) provides surety for specific quality at-
tributes to the consumers of services. However, current SLAs offered by cloud
infrastructure providers do not address response time, which, from the user’s
point of view, is the most important quality attribute for Web applications.
Satisfying a maximum average response time guarantee for Web applications
is difficult for two main reasons: first, traffic patterns are highly dynamic
and difficult to predict accurately; second, the complex nature of multi-tier
Web applications increases the difficulty of identifying bottlenecks and resolv-
ing them automatically. This paper proposes a methodology and presents a
working prototype system for automatic detection and resolution of bottle-
necks in a multi-tier Web application hosted on a cloud in order to satisfy
specific maximum response time requirements. It also proposes a method
for identifying and retracting over-provisioned resources in multi-tier cloud-
hosted Web applications. We demonstrate the feasibility of the approach in
an experimental evaluation with a testbed EUCALYPTUS-based cloud and
a synthetic workload. Automatic bottleneck detection and resolution under
dynamic resource management has the potential to enable cloud infrastruc-
ture providers to provide SLAs for Web applications that guarantee specific
response time requirements while minimizing resource utilization.
Keywords: Cloud computing, Adaptive resource management, Quality of
Service, Multi-tier applications, Service-level agreement, Scalability
Preprint submitted to Future Generation Computer Systems October 21, 2010
1. Introduction
Cloud providers [1] use the Infrastructure as a Service model to allow
consumers to rent computational and storage resources on demand and pay ac-
cording to their usage. Cloud infrastructure providers maximize their profits
by fulfilling their obligations to consumers with minimal infrastructure and
maximal resource utilization.
Although most cloud infrastructure providers provide service-level agree-
ments (SLAs) for availability or other quality attributes, the most important
quality attribute for Web applications from the user’s point of view, response
time, is not addressed by current SLAs. Guaranteeing response time is a dif-
ficult problem for two main reasons. First, Web application traffic is highly
dynamic and difficult to predict accurately. Second, the complex nature
of multi-tier Web applications, in which bottlenecks can occur at multiple
points, means response time violations may not be easy to diagnose or rem-
edy. It is also difficult to determine an optimal static resource allocation for
multi-tier Web applications manually for certain workloads, due to the dynamic
nature of incoming requests and the exponential number of possible allo-
cation strategies. Therefore, if a cloud infrastructure provider is to guarantee
cation strategies. Therefore, if a cloud infrastructure provider is to guarantee
a particular maximum response time for any traffic level, it must automati-
cally detect bottleneck tiers and allocate additional resources to those tiers
as traffic grows.
In this paper, we take steps toward eliminating this limitation of cur-
rent cloud-based Web application hosting SLAs. We propose a methodology
and present a working prototype system running on a EUCALYPTUS-based
[2] cloud that actively monitors the response times for requests to a multi-
tier Web application, gathers CPU usage statistics, and uses heuristics to
identify the bottlenecks. When bottlenecks are identified, the system dy-
namically allocates the resources required by the application to resolve the
identified bottlenecks and maintain response time requirements. The system
furthermore predicts the optimal configuration for the dynamically varying
workload and scales down the configuration whenever possible to minimize
resource utilization.
The bottleneck resolution method is purely reactive. Reactive bottleneck
resolution has the benefit of avoiding inaccurate a priori performance models
and pre-deployment profiling. In contrast, the scale down method is neces-
sarily predictive, since we must avoid premature release of busy resources.
However, the predictive model is built using application performance statis-
tics acquired while the application is running under real-world traffic loads,
so it neither suffers from the inaccuracy of a priori models nor requires pre-
deployment profiling.
In this paper, we describe our prototype, the heuristics we have developed
for reactive scale-up of multi-tier Web applications, the predictive models
we have developed for scale-down, and an evaluation of the prototype on
a testbed cloud. The evaluation uses a specific two-tier Web application
consisting of a Web server tier and a database tier. In this context, the
resources to be minimized are the number of Web servers in the Web server
tier and the number of replicas in the database tier. We find that the system
is able to detect bottlenecks, resolve them using adaptive resource allocation,
satisfy the SLA, and free up over-provisioned resources as soon as they are
not required.
There are a few limitations to this preliminary work. We only address
scaling of the Web server tier and a read-only database tier. Our system
only performs hardware and virtual resource management for applications.
In particular, we do not address software configuration management; for
example, we assume that the number of connections from each server in
the Web server tier to the database tier is sufficient for the given workload.
Additionally, real-world cloud infrastructure providers using our approach to
response time-driven SLAs would need to protect themselves with detailed
contracts (imagine for example the rogue application owner who purposefully
inserts delays in order to force SLA violations). We plan to address some of
these limitations in future work.
In the rest of this paper, we review related work, then describe our
approach, the prototype implementation, and an experimental evaluation of
the prototype.
2. Related Work
There has been a great deal of research on dynamic resource allocation
for physical and virtual machines and clusters of virtual machines [3]. In [4]
and [5], a two-level control loop is proposed to make resource allocation deci-
sions within a single physical machine. This work does not address integrated
management of a collection of physical machines. The authors of [6] study
the overhead of a dynamic allocation scheme that relies on virtualization as
opposed to static resource allocation. None of these techniques provide a
technology to dynamically adjust allocation based on SLA objectives in the
presence of resource contention.
VMware DRS [7] provides technology to automatically adjust the amount
of physical resources available to VMs based on defined policies. This is
achieved using the live-migration automation mechanism provided by VMo-
tion. VMware DRS adopts a VM-centric view of the system: policies and
priorities are configured on a VM-level.
An approach similar to VMware DRS is presented in [8], which proposes a
dynamic adaptation technique based on rearranging VMs so as to minimize
the number of physical machines used. The application awareness is limited
to configuring physical machine utilization thresholds based on off-line anal-
ysis of application performance as a function of machine utilization. In all of
this work, runtime requirements of VMs are taken as a given and there is no
explicit mechanism to tune resource consumption by any given VM.
Foster et al. [9] address the problem of deploying a cluster of virtual
machines with given resource configurations across a set of physical machines.
Czajkowski et al. [10] define a Java API permitting developers to monitor
and manage a cluster of Java VMs and to define resource allocation policies
for such clusters.
Unlike [7] and [8], our system takes an application-centric approach; the
virtual machine is considered only as a container in which an application is
deployed. Using knowledge of application workload and performance goals,
we can utilize a more versatile set of automation mechanisms than [7], [8],
[9], or [10].
Network bandwidth allocation in the deployment of clusters of
virtual machines has also been studied in [11]. The problem there is to place
virtual machines interconnected by virtual networks on physical servers
interconnected by a wide area network. VMs may be migrated, but the
emphasis is on allocating network bandwidth for the virtual networks rather
than on resource scaling. In contrast, our focus is on data center environments,
in which network bandwidth is of lesser concern.
There have been several efforts to perform dynamic scaling of Web ap-
plications based on workload monitoring. Amazon Auto Scaling [12] allows
consumers to scale up or down according to criteria such as average CPU
utilization across a group of compute instances. [13] presents the design of
an auto-scaling solution based on incoming traffic analysis for Axis2 Web
services running on Amazon EC2. [14] presents a statistical machine learn-
ing approach to predict system performance and minimize the number of
resources required to maintain the performance of an application hosted on
a cloud. [15] monitors the CPU and bandwidth usage of virtual machines
hosted on an Amazon EC2 cloud, identifies the resource requirements of ap-
plications, and dynamically switches between different virtual machine con-
figurations to satisfy the changing workloads. However, none of these solu-
tions address the issues of multi-tier Web applications or database scalability,
a crucial step to dynamically manage multi-tier workloads.
Thus far, only a few researchers have addressed the problem of resource
provisioning for multi-tier applications. [16] presents an analytical model
using queuing networks to capture the behavior of each tier. The model is
able to predict the mean response time for a specific workload given several
parameters such as the visit ratio, service time, and think time. However, the
authors do not apply their approach toward dynamic resource management
on clouds. [17] presents a predictive and reactive approach using queuing
theory to address dynamic provisioning for multi-tier applications. The pre-
dictive approach is to allocate resources to applications on large time scales
such as days and hours, while the reactive approach is used for short time
scales such as seconds and minutes. This allows the system to overcome the
“flash crowd” phenomenon and correct prediction mistakes made by the pre-
dictive model. The technique assumes knowledge of the resource demands of
each tier. In addition to the queuing model, the authors also provide a sim-
ple black-box approach for dynamic provisioning that scales up all replicable
tiers when bottlenecks are detected. However, this work does not address
database scalability or the release of application resources when they are not
required. In contrast, our system classifies requests as either dynamic or
static and uses a black box heuristic technique to scale up and scale down
only one tier at a time. Our scale-up system is reactive in resolving bottle-
necks and our scale-down system is predictive in releasing resources.
The most recent work in this area [18] presents a technique to model
dynamic workloads for multi-tier Web applications using k-means cluster-
ing. The method uses queuing theory to model the system’s reaction to the
workload and to identify the number of instances required for an Amazon
EC2 cloud to perform well under a given workload. Although this work does
model system behavior on a per-tier basis, it does not perform multi-tier
dynamic resource provisioning. In particular, database tier scaling is not
considered.
In our own recent work [19], we consider single-tier Web applications, use
log-based monitoring to identify SLA violations, and use dynamic resource
allocation to satisfy SLAs. In [20], we consider multi-tier Web applications
and propose an algorithm based on heuristics to identify the bottlenecks.
This work uses a simple reactive technique to scale up multi-tier Web ap-
plications to satisfy SLAs. The work described in the current paper is an
extension of this work. We aim to solve the problem of dynamic resource
provisioning for multi-tier Web applications to satisfy a response time SLA
with minimal resource utilization. Our method is reactive for scale-up de-
cisions and predictive for scale-down decisions. Our method uses heuristics
and predictive models to scale each tier of a given application, with the goal
of requiring minimal knowledge of and minimal modification of the existing
application. To the best of our knowledge, our system is the first SLA-
driven resource manager for clouds based on open source technology. Our
working prototype, built on top of a EUCALYPTUS-based compute cloud,
provides dynamic resource allocation and load balancing for multi-tier Web
applications in order to satisfy a SLA that enforces specific response time
requirements.
3. System Design and Implementation Details
3.1. Dynamic provisioning for multi-tier Web applications
Here we describe our methodology for dynamic provisioning of resources
for multi-tier Web applications, including the algorithms, system design, and
implementation. A high-level flow diagram for bottleneck detection, scale-up
decision making, and scale-down decision making in our prototype system is
shown in Figure 1.
3.1.1. Reactive model for scale-up
We use heuristics and active profiling of the CPUs of virtual machine-
hosted application tiers for identification of bottlenecks. Our system reads
the Web server proxy logs for t seconds and clusters the log entries into dy-
namic content requests and static content requests. Requests to resources
(Web pages) containing server-side scripts (PHP, JSP, ASP, etc.) are consid-
ered as dynamic content requests. Requests to the static resources (HTML,
JPG, PNG, TXT, etc.) are considered as static content requests. Dynamic
resources are generated through utilization of the CPU and may depend on
other tiers, while static resources are pre-generated flat files available in the
Web server tier. Each type of request has different characteristics and is
Figure 1: Flow diagram for the prototype system, which detects the bottleneck tier in a two-tier Web application hosted on a heterogeneous cloud, dynamically scales that tier to satisfy a SLA that defines response time requirements, and releases over-provisioned resources.
monitored separately for purposes of bottleneck detection. The system cal-
culates the 95th percentile of the average response time. When static content
response time indicates saturation, the system scales the Web server tier.
When the system determines that dynamic content response time indicates
saturation, it obtains the CPU utilization across the Web server tier. If the
CPU utilization of any instance in the Web server tier has reached a satura-
tion threshold, the system scales up the Web server tier; otherwise, it scales
up the database tier. Each scale up operation adds exactly one server to a
specific tier. Our focus is on read-intensive applications, and we assume that
a mechanism such as [21] exists to ensure consistent reads after updates to a
master database. Before initiating a scale operation, the system ensures that
the effect of the last scale operation has been realized. If the system satisfies
the response time requirements for k consecutive intervals, it uses the pre-
dictive model to identify any over-provisioned resources and if appropriate,
scales down the over-provisioned tier(s). The predictive model is explained
next.
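Before turning to the predictive model, the decision logic above can be summarized in a short sketch. The Python code below is illustrative rather than the prototype implementation: the log-entry format, helper names, and the way CPU statistics are passed in are assumptions, while the one-second SLA threshold and the 85% CPU saturation threshold are the values used in the experiments described later.

```python
import numpy as np

# Values used in the experiments; both are configuration choices, not fixed constants.
SLA_THRESHOLD_MS = 1000      # maximum average response time allowed by the SLA
CPU_SATURATION = 85.0        # CPU saturation threshold (%)

# Requests for server-side scripts are treated as dynamic content.
DYNAMIC_EXTENSIONS = (".php", ".jsp", ".asp")

def classify(path):
    """Cluster a request as 'dynamic' or 'static' by its resource extension."""
    return "dynamic" if path.lower().endswith(DYNAMIC_EXTENSIONS) else "static"

def percentile_95(entries):
    """95th percentile of the response times (ms) observed in this window."""
    return float(np.percentile([e["response_ms"] for e in entries], 95)) if entries else 0.0

def scale_up_decision(log_entries, web_tier_cpu):
    """Decide which tier (if any) to scale up for the last t-second window.

    log_entries: parsed proxy log entries, each with 'path' and 'response_ms'.
    web_tier_cpu: recent CPU utilization (%) of every VM in the Web server tier.
    """
    rt_s = percentile_95([e for e in log_entries if classify(e["path"]) == "static"])
    rt_d = percentile_95([e for e in log_entries if classify(e["path"]) == "dynamic"])

    if rt_s > SLA_THRESHOLD_MS:
        return "web"                              # static content saturating the Web tier
    if rt_d > SLA_THRESHOLD_MS:
        if any(c >= CPU_SATURATION for c in web_tier_cpu):
            return "web"                          # Web tier CPU is the bottleneck
        return "db"                               # otherwise scale the database tier
    return None                                   # SLA satisfied; consider scale-down
```

Each returned decision corresponds to adding exactly one server to the indicated tier, and no new decision is taken until the effect of the previous scale operation has been observed.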
3.1.2. Predictive model for scale down
To determine when to initiate scale-down operations, we use a regression model that predicts, for each time interval $t$, the number of Web server instances $n^{web}_t$ and the number of database server instances $n^{db}_t$ required for the currently observed workload. We use polynomial regression with polynomial degree two. Our reactive scale-up algorithm feeds training observations to the model as appropriate. We retain a training observation for every interval of time that satisfies the response time requirements. Each observation contains the observed workload for each type of request and the existing configuration of the tiers for the last 60-second interval. We can express the model as follows:

$$n^{web}_t = a_0 + a_1 (h^s_t + h^d_t) + a_2 (h^s_t + h^d_t)^2 + \epsilon^{web}_t \qquad (1)$$
$$n^{db}_t = b_0 + b_1 h^d_t + b_2 (h^d_t)^2 + \epsilon^{db}_t, \qquad (2)$$

where $h^s_t$ and $h^d_t$ are the numbers of static and dynamic requests received during interval $t$. We assume the noise terms $\epsilon^{web}_t \sim N(0, (\sigma^{web})^2)$ and $\epsilon^{db}_t \sim N(0, (\sigma^{db})^2)$.

Since both static and dynamic resource requests hit the Web server tier, we assume that $n^{web}_t$ (the number of Web server instances required, Equation 1) depends on both $h^s_t$ and $h^d_t$. To keep the number of model parameters to be estimated small, we use a single parameter for the sum of the two load levels. Since the database server only handles database queries, which are normally invoked only by dynamic pages, we assume that $n^{db}_t$ (the number of database server instances required, Equation 2) depends only on $h^d_t$.

The regression coefficients $a_0$, $a_1$, $a_2$, $b_0$, $b_1$, and $b_2$ are recalculated, after updating the sufficient statistics for all of the historical data, every time a new observation is received. (The sufficient statistics are the sums and sums of squares of the variables $n^{web}_t$, $n^{db}_t$, $h^s_t$, and $h^d_t$ over the training set up to the current point in time.) The most recent predictive model is used, as shown in the flow diagram of Figure 1, to identify over-provisioned resources for the current workload and retract them from the current configuration.
3.2. System components and implementation
To manage cloud resources dynamically based on response time require-
ments, we developed three components: VLBCoordinator, VLBManager, and
VMProfiler. We use Nginx [22] as a load balancer because it offers detailed
logging and allows reloading of its configuration file without termination of
existing client sessions. VLBCoordinator and VLBManager are our service
management [23] components.
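For illustration, a scale operation can be propagated to the load balancer by rewriting the relevant upstream block in the Nginx configuration and reloading it; `nginx -s reload` applies the new configuration without terminating established client sessions. The sketch below is only an assumption about how such an update might be scripted; the configuration path and upstream name are hypothetical and not taken from the prototype.

```python
import subprocess

UPSTREAM_TEMPLATE = """upstream webtier {{
{servers}
}}
"""

def write_upstream(backend_ips, conf_path="/etc/nginx/conf.d/webtier.conf"):
    """Rewrite the upstream block with the current Web tier instances and reload Nginx."""
    servers = "\n".join(f"    server {ip}:80;" for ip in backend_ips)
    with open(conf_path, "w") as f:
        f.write(UPSTREAM_TEMPLATE.format(servers=servers))
    # The reload picks up the new configuration without dropping existing sessions.
    subprocess.run(["nginx", "-s", "reload"], check=True)
```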
VLBCoordinator interacts with a EUCALYPTUS cloud using Typica
[24]. Typica is a simple API written in Java to access a variety of Ama-
zon Web services such as EC2, SimpleDB, and DevPay. The core functions
of VLBCoordinator are instantiateVirtualMachine and getVMIP, which
are accessible through XML-RPC. VLBManager monitors the traces of the
load balancer and detects violations of response time requirements. It clus-
ters the requests into static and dynamic resource requests and calculates the
average response time for each type of request. VMProfiler is used to log the
CPU utilization of each virtual machine. It exposes XML-RPC functions to
obtain the CPU utilization of a specific virtual machine for the last n minutes.
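Because these functions are exposed over XML-RPC, other components (or scripts) can call them with a standard client. The snippet below is a hedged sketch: the endpoint hosts and ports are hypothetical, while the method names instantiateVirtualMachine, getVMIP, and getCPUusage are those described in the text and in Figure 2; the arguments shown (an image identifier, a VM identifier, and a duration in minutes) are our assumptions about their signatures.

```python
from xmlrpc.client import ServerProxy

# Hypothetical endpoints; the actual hosts and ports depend on the deployment.
coordinator = ServerProxy("http://cloud-frontend:8000/RPC2")    # VLBCoordinator
profiler = ServerProxy("http://cloud-frontend:8001/RPC2")       # VMProfiler

# Ask VLBCoordinator to start a new Web server instance and look up its IP address.
vm_id = coordinator.instantiateVirtualMachine("emi-webserver")  # image id is assumed
vm_ip = coordinator.getVMIP(vm_id)

# Ask VMProfiler for the instance's CPU utilization over the last 5 minutes.
cpu_history = profiler.getCPUusage(vm_id, 5)
```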
Every Web application has an application-specific interface between the
Web server tier and the database tier. We assume that database writes are
handled by a single master MySQL instance and that database reads can
be handled by a cluster of MySQL slaves. Under this assumption, we have
developed a component for load balancing and scaling the database tier that
requires minimal modification of the application.
Our prototype is based on the RUBiS [25] open-source benchmark Web
application for auctions. It provides core functionality of an auction site
such as browsing, selling, and bidding for items, and provides three user
roles: visitor, buyer, and seller. Visitors are not required to register and are
allowed to browse items that are available for auction. We used the PHP
implementation of RUBiS as a sample Web application for our experimental
evaluation.
To enable RUBiS to support load balancing over the database tier, we
modified it to use round-robin balancing over a set of database servers listed
in a database connection settings file, and we developed a server-side compo-
nent, DbConfigAgent, to update the database connection settings file after
a scaling operation has modified the configuration of the database tier. The
entire benchmark system consists of the physical machines supporting the
EUCALYPTUS cloud, a virtual Web server acting as a proxying load bal-
ancer for the entire Web application, a tier of virtual Web servers running the
RUBiS application software, and a tier of virtual database servers. Figure 2
shows the deployment of our components along with the main interactions.
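The database-tier load balancing and the role of DbConfigAgent can be summarized with the following sketch. The actual modification lives inside the PHP application; this Python version only illustrates the idea of round-robin selection over a list of read replicas kept in a settings file that DbConfigAgent rewrites after each scaling operation. The file name, its JSON format, and the function names are assumptions.

```python
import itertools
import json

SETTINGS_FILE = "db_settings.json"   # assumed format: {"read_replicas": ["10.0.0.5", ...]}

_replica_cycle = None                # round-robin iterator over the read replicas

def _reload_cycle(path=SETTINGS_FILE):
    global _replica_cycle
    with open(path) as f:
        _replica_cycle = itertools.cycle(json.load(f)["read_replicas"])

def next_read_host():
    """Pick the database host for the next read-only query (the master handles writes)."""
    if _replica_cycle is None:
        _reload_cycle()
    return next(_replica_cycle)

def db_config_agent_update(new_replicas, path=SETTINGS_FILE):
    """DbConfigAgent-style update after the database tier has been scaled up or down."""
    with open(path, "w") as f:
        json.dump({"read_replicas": new_replicas}, f)
    _reload_cycle(path)
```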
Figure 2: Component deployment diagram for the system components, including their main interactions.
4. Experimental Setup
In this section we describe the setup for an experimental evaluation of
our prototype based on a testbed cloud using the RUBiS Web application
and a synthetic workload generator.
4.1. Testbed cloud
We built a small private heterogeneous compute cloud on seven physical
machines (Front-end, Node1, Node2, Node3, Node4, Node5, and Node6)
using EUCALYPTUS. Figure 3 shows the design of our testbed cloud. Front-
end and Node1 are Intel Pentium 4 machines with 2.84 GHz and 2.66 GHz
CPUs, respectively. Node2 is an Intel Celeron machine with a 2.4 GHz CPU.
Node3 is an Intel Core 2 Duo machine with a 2.6 GHz CPU. Node4, Node5,
and Node6 are Intel Pentium Dual Core machines with 2.8 GHz CPUs. Front-
end, Node2, Node3, Node4, Node5, and Node6 have 2 GB RAM while Node1
and Node4 have 1.5 GB RAM.
We used EUCALYPTUS to establish a cloud architecture comprised of
one Cloud Controller (CLC), one Cluster Controller (CC), and six Node Con-
trollers (NCs). We installed the CLC and CC on a front-end node attached
to both our main LAN and the cloud’s private network. We installed the
NCs on six separate machines (Node1, Node2, Node3, Node4, Node5, and
Node6) connected to the private network.
Figure 3: EUCALYPTUS-based testbed cloud using seven physical machines. We installed the CLC and CC on a front-end node attached to both our main LAN and the cloud's private network. We installed the NCs on six separate machines (Node1, Node2, Node3, Node4, Node5, and Node6) connected to the private network. Each physical machine has the capacity to spawn a maximum number of virtual machines, as shown (highlighted in red) in the figure, based on its number of cores.
4.2. Workload generation
We use httperf [26] to generate synthetic workloads for RUBiS. We gen-
erate workloads for specific durations with a required number of user sessions
per second. A user session emulates a visitor that browses items up for auc-
tion in specific categories and geographical regions and also bids on items
up for auction. In a first cycle, every five minutes, we increment the load
level by 6, from load level 6 up to load level 108, and then we decrement
the load level by 6 from load level 108 down to load level 6. In a second
cycle, we increment the load level by 6, from load level 6 up to load level 60,
and then we decrement the load level by 6 from load level 60 down to load
Figure 4: Workload generation profile for all experiments.
level 6. Each load level represents the number of user sessions per second;
each user session makes six requests to static resources and five requests to
dynamic resources including five pauses to simulate user think time. The dy-
namic resources consist of PHP pages that make read-only database queries.
Note that while each session is closed loop (the workload generator waits
for a response before submitting the next request), session creation is open
loop: new sessions are created independently of the system’s ability to handle
them. This means that many requests may queue up, leading to exponen-
tial increases in response times. Figure 4 shows the workload levels we use
for our experiments over time. We use three workload generators distributed
over three separate machines during the experiments to ensure that workload
generation machines never reach saturation.
We performed all of our experiments based on this workload generator
and RUBiS benchmark Web application.
5. Experimental Design
To evaluate our proposed system, we performed three experiments. Ex-
periments 1 and 2 profile the system’s behavior using specific static alloca-
tions. Experiment 3 profiles the system’s behavior under adaptive resource
allocation using the proposed algorithm for bottleneck detection and resolu-
tion. Experiments 1 and 2 demonstrate system behavior using current in-
dustry practices, whereas Experiment 3 shows the strength of the proposed
alternative methodology. Table 1 summarizes the experiments, and details
follow.
5.1. Experiment 1: Simple static allocation
In Experiment 1, we statically allocate one virtual machine to the Web
server tier and one virtual machine to the database tier, and then we profile
system behavior over the synthetic workload described previously. The single
Web server/single database server configuration is the most common initial
allocation strategy used by application deployment engineers.
5.2. Experiment 2: Static over-allocation
In Experiment 2, we over-allocate resources, using a maximal static con-
figuration sufficient to process the workload. We statically allocate a cluster
of four Web server instances and four database server instances, and then
we profile the system behavior over the synthetic workload described
previously. Since it is quite difficult to determine an optimal allocation for a
multi-tier application manually, we actually derived this configuration from
the behavior of the adaptive system profiled in Experiment 3.
5.3. Experiment 3: Adaptive allocation under proposed system
In Experiment 3, we use our proposed system to adapt to changing work-
loads. Initially, we started two virtual machines on our testbed cloud. The
Nginx-based Web server farm was initialized with one virtual machine host-
ing the Web server tier, and another single virtual machine was used to host
the database tier. As discussed earlier, we modified RUBiS to perform load
balancing across the instances in the database server cluster. The system’s
goal was to satisfy a SLA that enforces a one-second maximum average re-
sponse time requirement for the RUBiS application regardless of load level
using our proposed algorithm for bottleneck detection and resolution. The
threshold for CPU saturation (refer to the flow diagram in Figure 1) was
set to 85% utilization. This gives the system a chance to handle unexpected
spikes in CPU activity, and it is a reasonable threshold for efficient use of
the server [27].
To determine good values for the important parameters t (the time to read
proxy traces) and k (the number of consecutive intervals required to satisfy
response time constraints before a scale-down operation is attempted), we
performed a grid search over a set of reasonable values for t and k.
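Such a grid search can be expressed compactly. In the sketch below, run_experiment is a hypothetical helper that replays the workload with the given parameters and returns the three metrics summarized in Table 2; the ranking criterion is our own choice and not part of the prototype.

```python
import itertools

def grid_search(run_experiment, t_values=(30, 60, 120), k_values=(4, 8)):
    """Evaluate each (t, k) pair and return the results sorted by SLA misses."""
    results = []
    for t, k in itertools.product(t_values, k_values):
        # run_experiment is assumed to return:
        #   (% requests missing the SLA, scale-down mistakes, total scale operations)
        missed, mistakes, operations = run_experiment(t=t, k=k)
        results.append({"t": t, "k": k, "missed_pct": missed,
                        "scale_down_mistakes": mistakes, "operations": operations})
    return sorted(results, key=lambda r: (r["missed_pct"], r["operations"]))
```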
Table 1: Summary of experiments.

Exp.  Description
1     Static allocation using one VM for the Web server tier and one VM for the database tier
2     Static over-allocation using a cluster of four VMs for the Web server tier and four VMs for the database tier
3     Adaptive allocation using the proposed methodology
6. Experimental Results
6.1. Experiment 1: Simple static allocation
This section describes the results we obtained in Experiment 1. Figure 5
shows the throughput of the system during the experiment. After load level
30, we do not observe any growth in the system’s throughput because one
or both of the tiers have reached their saturation points. Although the load
level increases with time, the system is unable to serve all requests, and it
either rejects or queues the remaining requests.
Figure 6 shows the 95th percentile of average response time during Ex-
periment 1. From load level 6 to load level 24, we observe a nearly constant
response time, but after load level 24, the arrival rate exceeds the limits of
the system’s processing capacity. One of the virtual machines hosting the
application tiers becomes a bottleneck, then requests begin to spend more
time in the queue and request processing time increases. From that point we
observe rapid growth in the response time. After load level 30, however, the
queue also becomes saturated, and the system rejects most requests. There-
fore, we do not observe further growth in the average response time. Clearly,
the system only works efficiently from load level 6 to load level 24.
Figure 7 shows the CPU utilization of the two virtual machines hosting
the application tiers during Experiment 1. The downward spikes at the be-
ginning of each load level occur because all user sessions are cleared between
load level increments, and it takes some time for the system to return to
a steady state. We do not observe any tier saturating its CPU during this
experiment; after load level 30, the CPU utilization remains nearly constant,
indicating that the CPU was not a bottleneck for this application with the
given workload.
Figure 5: Throughput of the system during Experiment 1.
Figure 6: 95th percentile of mean response time during Experiment 1.
6.2. Experiment 2: Static over-allocation
In Experiment 2, to observe the system’s behavior under a static al-
location policy using the maximal configuration observed during adaptive
experiments, we allocated four virtual machines to the Web server tier and
four virtual machines to the database tier, and generated the same work-
load described in Section 4.2. Figure 8 shows the throughput of the system
during Experiment 2. We observe the expected linear relationship between
load level and throughput; as load level increases, the system throughput
increases, and as load level decreases, the system throughput decreases.
Figure 9 shows the 95th percentile of average response times during Ex-
periment 2. We do not observe any response time violations during the
Figure 7: CPU utilization of virtual machines used during Experiment 1.
Figure 8: Throughput of the system during Experiment 2.
experiment. We observe a slight increase in response time during load lev-
els 80 to 100 because, during this interval, the system is serving the peak
workload and utilizing all of the allocated resources to satisfy the workload
requirements. This experiment shows that the maximal configuration identi-
fied by our adaptive resource allocation system would never lead to violations
of the response time requirements under the same load.
6.3. Experiment 3: Bottleneck detection and resolution under adaptive allo-
cation
This section describes the results of Experiment 3 using our proposed
algorithm for bottleneck detection and resolution. We first identified appro-
priate values and the impact of the important parameters (t and k) in our proposed
Figure 9: 95th percentile of mean response time during Experiment 2.
Table 2: Summary of grid search to find good values for the important parameters of the proposed system.

t (s)   k   % requests missing SLA   Scale-down mistakes   Total operations
30      4   3.228                    12                    38
30      8   2.002                    3                     22
60      4   2.413                    4                     22
60      8   2.034                    2                     20
120     4   3.227                    2                     18
120     8   3.312                    0                     15
algorithm using a grid search. We then examined the results from the best
configuration in more detail.
6.3.1. Parameter value identification
We used t = 30, 60, and 120 and k = 4 and 8 for the grid search. For
each combination of t and k, the percentage of requests missing SLA requirements,
scale-down decision mistakes, and total number of scale operations (scale-up
and scale-down) are shown in Table 2.
Figure 10 compares the percentage of requests missing SLA requirements,
scale-down decision mistakes, and total number of scale (scale-up and scale-
down) operations over different values of t and k.
We observe that a large number of requests exceed the required response
time when we use small values (t = 30, k = 4) for both parameters or a
large value (t = 120) for t. The parameter k is the number of consecutive
Figure 10: Grid search comparison for determining appropriate values of t and k for the system. (a) Percentage of requests missing the SLA; (b) scale-down decision mistakes; (c) total number of scale operations.
intervals of length t required to satisfy response time constraints before a
scale-down operation is attempted. Because a scale-down requires k consecutive
intervals of length t, using small values for both t and k lets the system react
quickly but also make hasty scale-down decisions, which increases the number
of scale-down mistakes. The system requires some time to recover from such
mistakes, so we observe additional response time violations during the recovery.
A large value of t increases the system's reaction time; this is why we also
observe a large number of requests exceeding the required response time with
t = 120. We can further observe that as t increases, the number of scale-down
mistakes decreases, since scale-down decisions are made less frequently. However,
the slower response with high values of t also means that the system takes
more time to respond to long traffic spikes and to release over-provisioned
resources. Smaller values of t with larger values of k reduce the occurrence of
scale-down mistakes without negatively affecting the system's responsiveness to
traffic spikes.
We selected the values t = 60 and k = 8 for further examination, as these
values provide a good trade-off between the percentage of requests missing
the SLA, the number of scale-down decision mistakes, and the total number
of operations. Figure 11 shows the 95th percentile of the average response
time during Experiment 3 using automatic bottleneck detection and adaptive
resource allocation under this parameter regime. The bottom graph shows
the adaptive addition and retraction of instances in each tier after a bot-
tleneck or over-provisioning is detected during the experiment. Whenever
the system detects a violation of the response time requirements, it uses the
proposed reactive algorithm to identify the bottleneck tier and then dynamically
adds another virtual machine to the server farm for that bottleneck tier.
We observe temporary violations of the required response time for short pe-
Figure 11: 95th percentile of mean response time during Experiment 3 using t = 1 and k = 8 under the proposed system.
riods of time due to the latency of virtual machine boot-up and the time
required to observe the effects of previous scale operations. Whenever the
system identifies over-provisioning of virtual machines for specific tiers us-
ing the predictive model, it scales down the specific tiers adaptively. In the
beginning, the prediction model makes some mistakes; we can observe two in-
correctly predicted scale-down decisions, at roughly 146 and 252 minutes into the experiment.
However, the reactive scale-up algorithm quickly brings the system back to a
configuration that satisfies the response time requirements. Occasional mis-
takes such as these are expected due to noise, since the predictive approach
is statistical. We could in principle reduce the occurrence of these mistakes
by incorporating traffic pattern prediction as part of the decision model.
Figure 12 shows the system throughput during the experiment. We ob-
serve linear growth in the system throughput through the full range of load
levels. The throughput increases and decreases as required with the load
level.
Figure 13 shows the CPU utilization of all virtual machines during the
experiment. Initially, the system is configured with one VM in each tier. The
system adaptively adds and removes virtual machines to each tier over time.
The differing steady-state levels of CPU utilization for the different VMs
reflect the use of round-robin balancing across differing processor speeds for
the physical nodes. We observe the same downward spike at the beginning
of each load level as in the earlier experiments due to the time for the system
to return to steady state after all user sessions are cleared.
Figure 12: Throughput of the system during Experiment 3 using t = 1 and k = 8 under the proposed system.
Figure 13: CPU utilization of all VMs during Experiment 3 using t = 1 and k = 8 under the proposed system.
The experiments demonstrate, first, that insufficient static resource allocation
policies lead to system failure; second, that maximal static resource allocation
policies lead to over-provisioning of resources; and third, that our proposed
adaptive resource allocation method is able to maintain a maximum response
time SLA while utilizing minimal resources.
7. Conclusion
In this paper, we have proposed a methodology and described a proto-
type system for automatic identification and resolution of bottlenecks and
automatic identification and resolution of overprovisioning in multi-tier ap-
plications hosted on a cloud. Our experimental results show that while we
clearly cannot provide a SLA guaranteeing a specific response time with an
undefined load level for a multi-tier Web application using static resource al-
location, our adaptive resource provisioning method could enable us to offer
such SLAs.
It is very difficult to identify a minimally resource intensive configuration
of a multi-tier Web application that satisfies given response time require-
ments for a given workload, even using pre-deployment training and testing.
However, our system is capable of identifying the minimum resources re-
quired using heuristics, a predictive model, and automatic adaptive resource
provisioning. Cloud infrastructure providers can adopt our approach not
only to offer their customers SLAs with response time guarantees but also
to minimize the resources allocated to the customers’ applications, reducing
their costs.
We are currently extending our system to support n-tier clustered ap-
plications hosted on a cloud, and we are planning to extend our prediction
model, which is currently only used to retract over-provisioned resources,
to also perform bottleneck prediction in advance, in order to overcome the
virtual machine boot-up latency problem. We are developing more sophis-
ticated methods to classify URLs into static and dynamic content requests,
rather than relying on filename extensions. Finally, we intend to incorpo-
rate the effects of heterogeneous physical machines on the prediction model
and also address issues related to best utilization of physical machines for
particular tiers.
Acknowledgments
This work was supported by graduate fellowships from the Higher Edu-
cation Commission (HEC) of Pakistan and the Asian Institute of Technology
to WI and by the Ministry of Science and Technology of Spain under contract
TIN2007-60625. We thank Faisal Bukhari, Irshad Ali, and Kifayat Ullah for
valuable discussions related to this work.
References
[1] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, I. Brandic, Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility, Future Generation Computer Systems 25 (2009) 599–616.

[2] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, D. Zagorodnov, The EUCALYPTUS open-source cloud-computing system, in: CCA '08: Proceedings of the Cloud Computing and Its Applications Workshop, Chicago, IL, USA.

[3] P. Anedda, S. Leo, S. Manca, M. Gaggero, G. Zanetti, Suspending, migrating and resuming HPC virtual clusters, Future Generation Computer Systems 26 (2010) 1063–72.

[4] X. Zhu, Z. Wang, S. Singhal, Utility-driven workload management using nested control design, in: ACC '06: American Control Conference, Minneapolis, Minnesota, USA.

[5] P. Padala, K. G. Shin, X. Zhu, M. Uysal, Z. Wang, S. Singhal, A. Merchant, K. Salem, Adaptive control of virtualized resources in utility computing environments, in: EuroSys '07: Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, ACM, New York, NY, USA, 2007, pp. 289–302.

[6] Z. Wang, X. Zhu, P. Padala, S. Singhal, Capacity and performance overhead in dynamic resource allocation to virtual containers, in: IM '07: 10th IEEE International Symposium on Integrated Network Management, Dublin, Ireland, pp. 149–58.

[7] VMware, VMware Distributed Resource Scheduler (DRS), 2010. Available at http://www.vmware.com/products/drs/.

[8] G. Khanna, K. Beaty, G. Kar, A. Kochut, Application performance management in virtualized server environments, in: NOMS '06: Network Operations and Management Symposium, Vancouver, BC, pp. 373–81.

[9] I. Foster, T. Freeman, K. Keahy, D. Scheftner, B. Sotomayer, X. Zhang, Virtual clusters for grid communities, in: CCGRID '06: Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid, IEEE Computer Society, Washington, DC, USA, 2006, pp. 513–20.

[10] G. Czajkowski, M. Wegiel, L. Daynes, K. Palacz, M. Jordan, G. Skinner, C. Bryce, Resource management for clusters of virtual machines, in: CCGRID '05: Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid, Volume 1, IEEE Computer Society, Washington, DC, USA, 2005, pp. 382–9.

[11] A. Sundararaj, M. Sanghi, J. Lange, P. Dinda, Hardness of approximation and greedy algorithms for the adaptation problem in virtual environments, in: ICAC '06: 7th IEEE International Conference on Autonomic Computing and Communications, Washington, DC, USA, pp. 291–2.

[12] Amazon Inc., Amazon Web Services Auto Scaling, 2009. Available at http://aws.amazon.com/autoscaling/.

[13] A. Azeez, Auto-scaling Web services on Amazon EC2, 2008. Available at http://people.apache.org/~azeez/autoscaling-web-services-azeez.pdf.

[14] P. Bodik, R. Griffith, C. Sutton, A. Fox, M. Jordan, D. Patterson, Statistical machine learning makes automatic control practical for internet datacenters, in: HotCloud '09: Proceedings of the Workshop on Hot Topics in Cloud Computing.

[15] H. Liu, S. Wee, Web server farm in the cloud: Performance evaluation and dynamic architecture, in: CloudCom '09: Proceedings of the 1st International Conference on Cloud Computing, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 369–80.

[16] B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, A. Tantawi, An analytical model for multi-tier internet services and its applications, in: SIGMETRICS '05: Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, volume 33, ACM, 2005, pp. 291–302.

[17] B. Urgaonkar, P. Shenoy, A. Chandra, P. Goyal, T. Wood, Agile dynamic provisioning of multi-tier internet applications, ACM Transactions on Autonomous and Adaptive Systems 3 (2008) 1–39.

[18] R. Singh, U. Sharma, E. Cecchet, P. Shenoy, Autonomic mix-aware provisioning for non-stationary data center workloads, in: ICAC '10: Proceedings of the 7th IEEE International Conference on Autonomic Computing and Communication, IEEE Computer Society, Washington, DC, USA, 2010.

[19] W. Iqbal, M. Dailey, D. Carrera, SLA-driven adaptive resource management for web applications on a heterogeneous compute cloud, in: CloudCom '09: Proceedings of the 1st International Conference on Cloud Computing, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 243–53.

[20] W. Iqbal, M. N. Dailey, D. Carrera, P. Janecek, SLA-driven automatic bottleneck detection and resolution for read intensive multi-tier applications hosted on a cloud, in: GPC '10: Proceedings of the 5th International Conference on Advances in Grid and Pervasive Computing, pp. 37–46.

[21] xkoto, Gridscale, 2009. http://www.xkoto.com/products/.

[22] I. Sysoev, Nginx, 2002. Available at http://nginx.net/.

[23] L. Rodero-Merino, L. M. Vaquero, V. Gil, F. Galán, J. Fontán, R. S. Montero, I. M. Llorente, From infrastructure delivery to service management in clouds, Future Generation Computer Systems 26 (2010) 1226–40.

[24] Google Code, Typica: A Java client library for a variety of Amazon Web Services, 2008. Available at http://code.google.com/p/typica/.

[25] OW2 Consortium, RUBiS: An auction site prototype, 1999. http://rubis.ow2.org/.

[26] D. Mosberger, T. Jin, httperf: A tool for measuring web server performance, in: First Workshop on Internet Server Performance, ACM, 1998, pp. 59–67.

[27] J. Allspaw, The Art of Capacity Planning, O'Reilly Media, Inc., Sebastopol, CA, USA, 2008.
... In fact an autoscaler should consider the architecture of the application to select the adequate microservices for scaling. Actually, classic autoscaling solutions focus on monolithic applications 15,16,17,18,19,20,21,22,23 or in the best case three tier applications 24,25 . In such context, selecting the component which should be scaled is not complicated. ...
... In literature many solutions propose autoscalers treating the elasticity on VM level 15,16,17,18,19,20,21,22,23,24,25 . Whereas, few solutions propose autoscalers focusing on container level 4,5,6,7,8,9,10,11,12,13 , and these studies are dedicated for microservicesbased applications. ...
... The response time of the workload is the sum of the response time of each tier. Iqbal et al. 20 present an elastic approach dedicated for intensive multi-tier applications. This study uses reactive and predictive strategies. ...
Preprint
Full-text available
Microservices is an architectural style of development consisting of a collection of small independent and loosely coupled components. Microservices-based applications are deployed in form of many containers deployed in a pool of virtual machines (VMs). When there is a rise on the workload, microservice resources are overloaded and then it should have additional resources. One of the most important issues in cloud environment is to minimize computing resources so as to reduce the deployment cost of the application. We can resolve this issue and optimize computing resources using autoscaling techniques. An autoscaler automatically provisions essential resources at real time. Existing autoscalers have many issues which reduce their efficiency. The first issue is that existing autoscalers suppose that threshold exceeding is always caused by the rise of the workload. However, exceeding thresholds may not be caused by the increase in the workload, but may be caused by other problems such as specific requests, VM or container issues. The second issue is that in resource provisioning, many autoscalers do not select the appropriate microservices for scaling resources. The third issue is that existing autoscalers do not calculate needed resources to be allocated for each microservice
... -Modeling algorithms whose main objective is to model elastic rules, resource usage, etc. In literature, we found sybl rules employed in [34] and resource profiling used in [62] and [19]. -Graph models such as Directed Acyclic Graph (DAG) [88] used in [24] [26], Completed Partially Directed Acyclic Graph (CPDAG) [89] employed in [24] [26], Causal Bayesian Networks (CBN) [90] illustrated in [23], Graphical Variational Auto-Encoder (GVAE) [91] deployed in [23], and PC-algorithm [92] used in [24]. ...
... Iqbal et al. [62] foregrounded an elastic approach dedicated for intensive multi-tier applications. This study uses reactive and predictive strategies. ...
Preprint
Full-text available
Elasticity is an essential treatment in Cloudenvironment employed in academic and industrial contexts. The main purpose of elasticity is to reduce thedeployment cost while optimizing computing resources.Multiple studies were conducted to tackle classic applications using monolithic architecture deployed withvirtual machines (VMs). However, with the spread ofmicroservice pattern, recent studies have been investigating this new trend using containers. This paperclassifies and discusses existing approaches dealing withcloud elasticity. It provides a novel taxonomy for elasticapproaches while focusing on microservices-based solutions. We additionally specify the strength and theshortcomings of each class of works. As a conclusion,we report the challenges for microservices-based applications elasticity and provide requirements for futureinvestigations.
... In another paper [5], the authors propose another option for integrating the reactive and proactive approach, namely, scaling up is reactive, and scaling down is proactive. The reactive component constantly analyzes the current metrics of the speed of response to requests. ...
... Under dynamic workloads, resource provisioning allows a system to scale out and in resources [42]. Improved scalability is the result of efficient resource provisioning. ...
Article
Full-text available
Microservices are being used by businesses to split monolithic software into a set of small services whose instances run independently in containers. Load balancing and auto-scaling are important cloud features for cloud-based container microservices because they control the number of resources available. The current issues concerning load balancing and auto-scaling techniques in Cloud-based container microservices were investigated in this paper. Server overloaded, service failure and traffic spikes were the key challenges faced during the microservices communication phase, making it difficult to provide better Quality of Service (QoS) to users. The aim is to critically investigate the addressed issues related to Load balancing and Auto-scaling in Cloud-based Container Microservices (CBCM) in order to enhance performance for better QoS to the users.
Article
Workload characterization and subsequent prediction are significant steps in maintaining the elasticity and scalability of resources in Cloud Data Centers. Due to the high variance in cloud workloads, designing a prediction algorithm that models the variations in the workload is a non-trivial task. If the workload predictor is unable to handle the dynamism in the workloads, then the result of the predictor may lead to over-provisioning or under-provisioning of cloud resources. To address this problem, we have created a Super Markov Prediction Model (SMPM) whose behaviour changes as per the change in the workload patterns. As the time progresses, based on the workload pattern SMPM uses different sequence models to predict the future workload. To evaluate the proposed model, we have experimented with Alibaba trace 2018, Google Cluster Trace (GCT), Alibaba trace 2020 and TPC-W workload trace. We have compared SMPM's prediction results with existing state-of-the-art prediction models and empirically verified that the proposed prediction model achieves a better accuracy as quantified using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).
Article
A cluster of transcoding servers is essential for transcoding many on-demand videos. Cloud computing presents a scalable framework for online video transcoding, and the infrastructure as a service (IaaS) cloud provides heterogeneous virtual machines (VMs) for creating a dynamically scalable cluster of servers. Heterogeneous VMs consist of small or big cores, which are assigned dynamically to allocate varying sizes of videos to the appropriate VMs for transcoding. Earlier research has proposed cloud-based heterogeneous scheduling for allocating different types of videos to different types of VMs so that the quality of service is maintained by reducing video rejection. In this paper, we propose a heterogeneous multi-core video scheduling model that additionally estimates the number of VMs and cores per VM with the variation of the number of videos to optimize the resources and cost of a cloud-based transcoding system. We further estimate the model's overhead concerning the variation in the number of videos. We conducted experiments on random videos, and experimental results reveal that the proposed model provides an excellent estimation of the number of VMs and cores. The proposed model reduces the average cost by 5% and requires almost 10% fewer cores for processing video tasks than the existing work in average cases.
Article
Full-text available
Cloud computing emphasizes using the underlying infrastructure much more efficiently, which is why it is gaining importance in today's industry. Like every other field, cloud computing has key features by which the quality of a cloud provider's service can be judged, and elasticity is one of them. Elasticity in cloud computing is directly related to the response time a server exhibits toward user requests while resources are being provisioned and de-provisioned. With increasing demand and a large shift of industry toward the cloud, the problem of handling user requests has also grown. For a long time, virtualization, with all its merits and demerits, was the prevailing technology for handling multiple requests in the cloud. Its biggest disadvantage is the heavy load it places on the underlying kernel or server, but in recent years an alternative technology, containerization, has emerged and quickly become popular due to its efficiency. In this paper we discuss elasticity in the cloud and the working of containers to see how they can help improve elasticity, using several tools to analyze the two technologies, i.e., virtualization and containerization. We observe whether containers show a lower response time than virtual machines; if so, elasticity can be improved in the cloud at larger scale, which may improve cloud efficiency considerably and make the cloud more attractive.
Article
Full-text available
Microservices are containerized, loosely coupled, interactive smaller units of an application that can be deployed, reused, and maintained independently. In a microservices-based application, allocating the right computing resources to each containerized microservice is important to meet specific performance requirements while minimizing the infrastructure cost. Microservices-based applications are easy to scale automatically based on incoming workload and resource demand. However, it is challenging to identify the right amount of resources for the containers hosting microservices and then allocate them dynamically during auto-scaling. Existing auto-scaling solutions for microservices focus on identifying the appropriate time and number of containers to be added or removed dynamically for an application. However, they do not address the issue of selecting the right amount of resources, such as CPU cores, for individual containers during each scaling event. This paper presents a novel approach to dynamically allocate CPU resources to containerized microservices during auto-scaling events. Our proposed approach is based on a machine learning method that can identify the right amount of CPU resources for each container spawned dynamically for the microservices over time to satisfy the application's response time requirements. The proposed solution is evaluated using a benchmark microservices-based application driven by real-world workloads on a Kubernetes cluster. The experimental results show that the proposed solution outperforms the state-of-the-art baseline methods, yielding a 40% to 60% reduction in response time requirement violations at 0.5x to 1.5x lower cost.
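To make the idea concrete, the sketch below trains a small regressor on hypothetical profiling data (request rate and response time target versus the CPU cores that met the target) and rounds its prediction up to a scheduler granularity. It uses scikit-learn's RandomForestRegressor purely as a stand-in for whatever model the authors trained; the features, data, and rounding step are illustrative assumptions.

    # Sketch of learning a CPU-cores predictor for a container from profiling data;
    # scikit-learn is a stand-in here, and all numbers are made up for illustration.
    import math
    from sklearn.ensemble import RandomForestRegressor

    # (request_rate_rps, target_response_ms) -> CPU cores that met the target
    X = [[50, 200], [100, 200], [200, 200], [100, 100], [200, 100], [400, 100]]
    y = [0.5, 1.0, 2.0, 1.5, 3.0, 5.0]

    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

    def cores_for(request_rate, target_ms, step=0.25):
        raw = model.predict([[request_rate, target_ms]])[0]
        return math.ceil(raw / step) * step   # round up to the scheduler's granularity

    print("allocate", cores_for(300, 100), "cores to the new container")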
Chapter
Resource scaling is widely employed in cloud computing to adapt system operation to internal (i.e., application) and external (i.e., environment) changes. We present a quantitative approach for coordinated vertical scaling of resources in cloud computing workflows, aimed at satisfying an agreed Service Level Objective (SLO) by improving the workflow end-to-end (e2e) response time distribution. Workflows consist of IaaS services running on dedicated clusters, statically reserved before execution. Services are composed through sequence, choice/merge, and balanced split/join blocks, and have generally distributed (i.e., non-Markovian) durations, possibly over bounded supports, facilitating the fitting of analytical distributions from observed data. Resource allocation is performed through an efficient heuristic guided by the mean makespans of sub-workflows. The heuristic performs a top-down visit of the hierarchy of services and exploits an efficient compositional method to derive the response time distribution and the mean makespan of each sub-workflow. Experimental results on a workflow with a high degree of concurrency appear promising for the feasibility and effectiveness of the approach.
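The chapter composes full response time distributions; the crude Python sketch below only combines mean makespans for sequence and balanced split/join blocks, which is enough to convey how a top-down heuristic might be guided. Note that taking the maximum of branch means underestimates the true mean makespan of a join, so this is an assumption-laden approximation, not the chapter's method.

    # Crude approximation (assumption, not the compositional method of the chapter):
    # combine mean service times of sub-workflows for sequence and split/join blocks.
    def seq(*means):
        return sum(means)       # sequence: mean makespans add up

    def join(*means):
        return max(means)       # split/join: bounded below by the slowest branch

    # hypothetical workflow: A ; (B || C) ; D, with means in seconds
    makespan = seq(2.0, join(3.5, 4.2), 1.3)
    print("estimated mean makespan:", makespan)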
Chapter
Energy reduction has become a necessity for modern datacentres, with CPU being a key contributor to the energy consumption of nodes. Increasing the utilization of CPU resources on active nodes is a key step towards energy efficiency. However, this is a challenging undertaking, as the workload can vary significantly among the nodes and over time, exposing operators to the risk of overcommitting the CPU. In this paper, we explore the trade-off between energy efficiency and node overloads, to drive virtual machine (VM) consolidation in a cost-aware manner. We introduce a model that uses runtime information to estimate the target utilization of the nodes to control their load, identifying and considering correlated behavior among collocated workloads. Moreover, we introduce a VM allocation and node management policy that exploits the model to increase the profit of datacentre operators considering the trade-off between energy reduction and potential SLA violation costs. We evaluate our work through simulations using node profiles derived from real machines and workloads from real datacentre traces. The results show that our policy adapts the nodes’ target utilization in a highly effective way, converging to a target utilization that is statically optimal for the workload at hand. Moreover, we show that our policy closely matches, or even outperforms two state-of-the-art policies that combine VM consolidation with VFS – the second one, also operating the CPU at reduced voltage margins – even when these are configured to use a static, workload- and architecture-specific target utilization derived through offline characterization of the workload.
Article
Full-text available
Horizontally-scalable Internet services on clusters of commodity computers appear to be a great fit for automatic control: there is a target output (the service-level agreement), an observed output (actual latency), and a gain controller (adjusting the number of servers). Yet few datacenters are automated this way in practice, due in part to well-founded skepticism about whether the simple models often used in the research literature can capture complex real-life workload/performance relationships and keep up with changing conditions that might invalidate the models. We argue that these shortcomings can be fixed by importing modeling, control, and analysis techniques from statistics and machine learning. In particular, we apply rich statistical models of the application's performance, simulation-based methods for finding an optimal control policy, and change-point methods to find abrupt changes in performance. Preliminary results running a Web 2.0 benchmark application driven by real workload traces on Amazon's EC2 cloud show that our method can effectively control the number of servers, even in the face of performance anomalies.
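The target-output/observed-output/gain-controller framing in the abstract can be sketched in a few lines of Python; the proportional adjustment and gain value below are assumptions for illustration and deliberately ignore the statistical modeling, simulation-based policy search, and change-point detection that the paper actually proposes.

    # Toy gain controller (illustrative assumption, not the paper's policy):
    # nudge the server count in proportion to the latency-SLA violation.
    def next_server_count(current, observed_latency_ms, target_latency_ms,
                          gain=0.02, min_servers=1, max_servers=50):
        error = observed_latency_ms - target_latency_ms
        adjustment = round(gain * error)      # positive error -> add servers
        return max(min_servers, min(max_servers, current + adjustment))

    servers = 4
    for latency in [180, 260, 420, 300, 190]:   # hypothetical observed latencies
        servers = next_server_count(servers, latency, target_latency_ms=200)
        print("latency", latency, "ms -> run", servers, "servers")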
Conference Paper
Full-text available
As businesses have grown, so has the need to deploy IT applications rapidly to support the expanding business processes. Often, this growth was achieved in an unplanned way: each time a new application was needed, a new server along with the application software was deployed and new storage elements were purchased. In many cases this has led to what is often referred to as "server sprawl", resulting in low server utilization and high system management costs. An architectural approach that is becoming increasingly popular to address this problem is known as server virtualization. In this paper we introduce the concept of server consolidation using virtualization and point out associated issues that arise in the area of application performance. We show how some of these problems can be solved by monitoring key performance metrics and using the data to trigger migration of virtual machines among physical servers. The algorithms we present attempt to minimize the cost of migration and maintain acceptable application performance levels.
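A minimal sketch of the monitoring-triggered migration idea might look like the Python below: a host is flagged for migration only when its CPU utilization stays above a threshold for several consecutive samples. The threshold, window, and decision logic are assumptions; the paper's algorithms additionally weigh migration cost.

    # Hypothetical trigger logic, not the paper's algorithm: flag a host for VM
    # migration only after its CPU utilization stays above a threshold for several
    # consecutive monitoring intervals, to avoid reacting to transient spikes.
    def needs_migration(cpu_samples, threshold=0.85, sustained=3):
        recent = cpu_samples[-sustained:]
        return len(recent) == sustained and all(u > threshold for u in recent)

    host_cpu = [0.70, 0.88, 0.91, 0.93]    # last few utilization samples
    if needs_migration(host_cpu):
        print("select the cheapest-to-move VM on this host and migrate it")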
Conference Paper
Full-text available
Virtualization and consolidation of IT resources have created a need for more effective workload management tools that dynamically control resource allocation to a hosted application in order to achieve quality of service (QoS) goals. These goals can in turn be driven by the utility of the service, typically based on the application's service level agreement (SLA) as well as the cost of the resources allocated. In this paper, we build on our earlier work on dynamic CPU allocation to applications on shared servers and present a feedback control system consisting of two nested integral control loops for managing the QoS metric of the application along with the utilization of the allocated CPU resource. The control system was implemented on a lab testbed running an Apache Web server and using the 90th percentile of the response times as the QoS metric. Experiments using a synthetic workload based on an industry benchmark validated two important features of the nested control design. First, compared to a single loop controlling response time only, the nested design is less sensitive to the bimodal behavior of the system, resulting in more robust performance. Second, compared to a single loop controlling CPU utilization only, the new design provides a framework for dealing with the trade-off between better QoS and lower resource cost, therefore resulting in better overall utility of the service.
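The nested-loop structure can be sketched as two integral controllers, with the outer loop mapping the response-time error to a CPU-utilization set point and the inner loop adjusting the CPU entitlement to track that set point. The gains, limits, and units in the Python below are assumptions, not the values used in the paper.

    # Two nested integral controllers (assumed gains and units, for illustration):
    # outer loop: response-time error -> target CPU utilization
    # inner loop: utilization error   -> CPU entitlement (cores)
    class IntegralController:
        def __init__(self, gain, output, lo, hi):
            self.gain, self.output, self.lo, self.hi = gain, output, lo, hi

        def step(self, error):
            self.output = min(self.hi, max(self.lo, self.output + self.gain * error))
            return self.output

    outer = IntegralController(gain=-0.002, output=0.6, lo=0.3, hi=0.9)  # target utilization
    inner = IntegralController(gain=0.5, output=1.0, lo=0.2, hi=4.0)     # CPU entitlement

    def control_step(resp_time_p90_ms, target_ms, measured_utilization):
        target_util = outer.step(resp_time_p90_ms - target_ms)
        return inner.step(measured_utilization - target_util)

    print("new CPU entitlement:", control_step(450, 300, measured_utilization=0.95))

Integral (rather than purely proportional) action lets the entitlement settle at whatever value keeps the utilization on target without a steady-state error, which is the usual reason for choosing it in this kind of loop.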
Conference Paper
Full-text available
Since many Internet applications employ a multi-tier architecture, in this paper, we focus on the problem of analytically modeling the behavior of such applications. We present a model based on a network of queues, where the queues represent different tiers of the application. Our model is sufficiently general to capture (i) the behavior of tiers with significantly different performance characteristics and (ii) application idiosyncrasies such as session-based workloads, concurrency limits, and caching at intermediate tiers. We validate our model using real multi-tier applications running on a Linux server cluster. Our experiments indicate that our model faithfully captures the performance of these applications for a number of workloads and configurations. For a variety of scenarios, including those with caching at one of the application tiers, the average response times predicted by our model were within the 95% confidence intervals of the observed average response times. Our experiments also demonstrate the utility of the model for dynamic capacity provisioning, performance prediction, bottleneck identification, and session policing. In one scenario, where the request arrival rate increased from less than 1500 to nearly 4200 requests/min, a dynamic provisioning technique employing our model was able to maintain response time targets by increasing the capacity of two of the application tiers by factors of 2 and 3.5, respectively.
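A much simpler model than the closed queueing network used in the paper still conveys the core intuition: each tier behaves roughly like a queue whose mean response time grows as 1/(mu - lambda), and the end-to-end response time is the sum over tiers. The Python sketch below computes this for an open tandem of M/M/1 queues with hypothetical tier capacities; it is an illustration of the modeling idea, not the paper's model.

    # Open tandem of M/M/1 queues (a simplification of the paper's closed network):
    # per-tier mean response time is 1/(mu - lambda); end-to-end time is the sum.
    def end_to_end_response_time(arrival_rate, service_rates):
        total = 0.0
        for mu in service_rates:
            if arrival_rate >= mu:
                raise ValueError("tier saturated: arrival rate exceeds capacity")
            total += 1.0 / (mu - arrival_rate)
        return total

    # hypothetical tiers: web, app, database (requests/s each tier can serve)
    print(end_to_end_response_time(arrival_rate=40, service_rates=[120, 80, 60]), "s")

In this toy configuration the database tier (mu = 60) runs at the highest utilization and saturates first, which is exactly the kind of bottleneck the full model is used to identify and provision against.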
Conference Paper
Full-text available
A Service-Level Agreement (SLA) provides surety for specific quality attributes to the consumers of services. However, the current SLAs offered by cloud providers do not address response time, which, from the user’s point of view, is the most important quality attribute for Web applications. Satisfying a maximum average response time guarantee for Web applications is difficult for two main reasons: first, traffic patterns are unpredictable; second, the complex nature of multi-tier Web applications increases the difficulty of identifying bottlenecks and resolving them automatically. This paper presents a working prototype system that automatically detects and resolves bottlenecks in a multi-tier Web application hosted on a EUCALYPTUS-based cloud in order to satisfy specific maximum response time requirements. We demonstrate the feasibility of the approach in an experimental evaluation with a testbed cloud and a synthetic workload. Automatic bottleneck detection and resolution under dynamic resource management has the potential to enable cloud providers to provide SLAs for Web applications that guarantee specific response time requirements.
Conference Paper
Full-text available
Data centers are often under-utilized due to over-provisioning as well as time-varying resource demands of typical enterprise applications. One approach to increase resource utilization is to consolidate applications in a shared infrastructure using virtualization. Meeting application-level quality of service (QoS) goals becomes a challenge in a consolidated environment as application resource needs differ. Furthermore, for multi-tier applications, the amount of resources needed to achieve their QoS goals might be different at each tier and may also depend on availability of resources in other tiers. In this paper, we develop an adaptive resource control system that dynamically adjusts the resource shares to individual tiers in order to meet application-level QoS goals while achieving high resource utilization in the data center. Our control system is developed using classical control theory, and we used a black-box system modeling approach to overcome the absence of first principle models for complex enterprise applications and systems. To evaluate our controllers, we built a testbed simulating a virtual data center using Xen virtual machines. We experimented with two multi-tier applications in this virtual data center: a two-tier implementation of RUBiS, an online auction site, and a two-tier Java implementation of TPC-W. Our results indicate that the proposed control system is able to maintain high resource utilization and meets QoS goals in spite of varying resource demands from the applications.
Conference Paper
Full-text available
Current service-level agreements (SLAs) offered by cloud providers make guarantees about quality attributes such as availability. However, although one of the most important quality attributes from the perspective of the users of a cloud-based Web application is its response time, current SLAs do not guarantee response time. Satisfying a maximum average response time guarantee for Web applications is difficult due to unpredictable traffic patterns, but in this paper we show how it can be accomplished through dynamic resource allocation in a virtual Web farm. We present the design and implementation of a working prototype built on a EUCALYPTUS-based heterogeneous compute cloud that actively monitors the response time of each virtual machine assigned to the farm and adaptively scales up the application to satisfy a SLA promising a specific average response time. We demonstrate the feasibility of the approach in an experimental evaluation with a testbed cloud and a synthetic workload. Adaptive resource management has the potential to increase the usability of Web applications while maximizing resource utilization.
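The prototype's behaviour can be summarized as a monitor-and-scale loop: periodically sample the farm's average response time and request another VM whenever it exceeds the promised value. The Python skeleton below is only a hedged sketch of that loop; get_avg_response_time and launch_instance are hypothetical placeholders for the monitoring and cloud API calls, not functions from the actual system.

    # Skeleton of a monitor-and-scale loop (assumed thresholds and placeholder
    # callbacks, not the prototype's code).
    import time

    SLA_MS = 1000          # promised average response time
    COOLDOWN_S = 120       # let a newly launched VM warm up before re-checking

    def autoscale_loop(get_avg_response_time, launch_instance, interval_s=30):
        while True:
            avg_ms = get_avg_response_time(window_s=60)
            if avg_ms is not None and avg_ms > SLA_MS:
                launch_instance()          # add a VM to the web farm
                time.sleep(COOLDOWN_S)
            else:
                time.sleep(interval_s)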
Conference Paper
Web applications’ traffic demand fluctuates widely and unpredictably. The common practice of provisioning a fixed capacity would either result in unsatisfied customers (under-provisioning) or waste valuable capital investment (over-provisioning). By leveraging an infrastructure cloud’s on-demand, pay-per-use capabilities, we can finally match the capacity with the demand in real time. This paper investigates how we can build a web server farm in the cloud. We first present a benchmark performance study of various cloud components, which not only shows their performance results but also reveals their limitations. Because of these limitations, no single configuration of cloud components can excel in all traffic scenarios. We then propose a dynamic switching architecture which dynamically switches among several configurations depending on the workload and traffic pattern.
Article
With the significant advances in Information and Communications Technology (ICT) over the last half century, there is an increasingly perceived vision that computing will one day be the 5th utility (after water, electricity, gas, and telephony). This computing utility, like all other four existing utilities, will provide the basic level of computing service that is considered essential to meet the everyday needs of the general community. To deliver this vision, a number of computing paradigms have been proposed, of which the latest one is known as Cloud computing. Hence, in this paper, we define Cloud computing and provide the architecture for creating Clouds with market-oriented resource allocation by leveraging technologies such as Virtual Machines (VMs). We also provide insights on market-based resource management strategies that encompass both customer-driven service management and computational risk management to sustain Service Level Agreement (SLA)-oriented resource allocation. In addition, we reveal our early thoughts on interconnecting Clouds for dynamically creating global Cloud exchanges and markets. Then, we present some representative Cloud platforms, especially those developed in industries, along with our current work towards realizing market-oriented resource allocation of Clouds as realized in Aneka enterprise Cloud technology. Furthermore, we highlight the difference between High Performance Computing (HPC) workload and Internet-based services workload. We also describe a meta-negotiation infrastructure to establish global Cloud exchanges and markets, and illustrate a case study of harnessing ‘Storage Clouds’ for high performance content delivery. Finally, we conclude with the need for convergence of competing IT paradigms to deliver our 21st century vision.
Article
A systematic study of issues related to suspending, migrating, and resuming virtual clusters for data-driven HPC applications is presented. The interest is focused on nontrivial virtual clusters, that is, clusters in which the running computation is expected to be coordinated and strongly coupled. It is shown that this requires all cluster-level operations, such as start and save, to be performed as synchronously as possible on all nodes, introducing the need for barriers at the virtual-cluster computing meta-level. Once a synchronization mechanism is provided and appropriate transport strategies have been set up, it is possible to suspend, migrate, and resume whole virtual clusters composed of “heavy” (4 GB RAM, 6 GB disk images) virtual machines in times on the order of a few minutes without disrupting the parallel computation (albeit of the MapReduce type) running inside them. The approach is intrinsically parallel and should scale without problems to larger virtual clusters.